Data Preparation with Python & Pandas
0.0 Setup
This guide was written in Python 3.6.
0.1 Python and Pip
This tutorial assumes you already have Python 3 and pip3 installed and available from your terminal.
0.2 Other
Let's install the modules we'll need for this tutorial. Open up your terminal and enter the following commands to install the needed python modules:
pip3 install pandas==0.20.1
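If you'd like to verify the install before moving on, a quick sanity check from Python is enough (the exact version string should match the pinned release above):
import pandas as pd
print(pd.__version__)  # should print 0.20.1 if the install above succeeded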
Lastly, you can find all the needed data for this tutorial at this link.
1.0 Introduction
Once you have the data, it might not be in the best shape for further processing or analysis. You might have scraped a bunch of data from a website, but need it in the form of a DataFrame to work with it in an easier manner. This process is called data preparation - preparing your data in a format that's easiest to work with.
1.1 Overview
Data Acquisition: Reading and writing with a variety of file formats and databases.
Preparation: Cleaning, munging, combining, normalizing, reshaping, slicing and dicing, and transforming data for analysis.
Transformation: Applying mathematical and statistical operations to groups of data sets to derive new data sets. For example, aggregating a large table by group variables.
Modeling and computation: Connecting your data to statistical models, machine learning algorithms, or other computational tools.
Presentation: Creating interactive or static graphical visualizations or textual summaries.
1.2 Glossary
Here is some common terminology that we'll encounter throughout the workshop:
Munging/Wrangling: This refers to the overall process of manipulating unstructured or messy data into a structured or clean form.
2.0 Pandas
Pandas allows us to deal with data in a way that we humans can understand - with labelled columns and indexes. It lets us effortlessly import data from files such as CSVs, quickly apply complex transformations and filters to our data, and much more. Along with NumPy and Matplotlib, it helps create a really strong base for data exploration and analysis in Python.
import pandas as pd
from pandas import Series, DataFrame
2.1 Series
A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data:
obj = Series([4, 7, -5, 3])
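Since we didn't pass an index, pandas assigns a default integer index. Both the data and the index are exposed as attributes - a small sketch of what to expect:
obj.values  # array([ 4,  7, -5,  3])
obj.index   # a default integer index running from 0 to 3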
Often it will be desirable to create a Series with an index identifying each data point:
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
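With labels in place, you can select one or several values by label rather than by position. A minimal sketch using obj2 from above:
obj2['a']              # returns -5
obj2[['c', 'a', 'd']]  # returns a sub-Series in the requested order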
You can also take a dictionary and convert it to a Series:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
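If you also pass an index when converting a dict, the Series is reordered to match it, and any label that isn't a key in the dict comes through as NaN. A short sketch reusing sdata, with 'California' added purely as an illustration:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)  # California is NaN; Utah is dropped since it isn't in the index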
2.2 DataFrames
A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).
There are numerous ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002], 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
Then we take this and convert it to a DataFrame:
frame = DataFrame(data)
This gets us:
pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002
You can also specify the sequence of columns by:
DataFrame(data, columns=['year', 'state', 'pop'])
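Once the DataFrame exists, columns can be pulled out as Series and new columns can be added by assignment. A quick sketch using the frame built above (the 'debt' column is just a made-up example):
frame['state']        # retrieve the 'state' column as a Series
frame['debt'] = 16.5  # add a new column; the scalar is broadcast to every row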
2.2.1 Apply
Let's generate a DataFrame of random numbers:
import numpy as np
frame = DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
With this, we can apply a NumPy function element-wise to the DataFrame:
np.abs(frame)
We can also apply functions with the apply() method:
f = lambda x: x.max() - x.min()
frame.apply(f)
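By default, apply() calls the function once per column. If you pass axis=1, it runs once per row instead - a small sketch using the same lambda:
frame.apply(f, axis=1)  # max minus min computed across each row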
2.2.2 Sorting
To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object:
frame.sort_index()
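You can also sort by column labels rather than row labels, or by the values in a particular column, as sketched below:
frame.sort_index(axis=1)   # sort the columns lexicographically
frame.sort_values(by='b')  # sort the rows by the values in column 'b'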
2.3 Naive Bayes
Naive Bayes works on Bayes' Theorem of probability to predict the class of a given data point. Naive Bayes is extremely fast compared to other classification algorithms and works with an assumption of independence among predictors.
The Naive Bayes model is easy to build and particularly useful for very large data sets. Along with simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.
Recall Bayes' Theorem, which provides a way of calculating the posterior probability. Its formula is as follows:
P(A|B) = (P(B|A)*P(A))/P(B)
Let's go through an example of how the Naive Bayes algorithm works using pandas. We'll work through a classification problem that determines whether a sports team will play or not based on the weather.
Let's load the data:
import pandas as pd
f1 = pd.read_csv("./weather.csv")
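Before going further, it helps to take a quick look at what we just read in. Based on how the columns are used below, weather.csv is assumed to have a 'Weather' column and a 'Play' column:
print(f1.head())  # first few rows; expect 'Weather' and 'Play' columns
print(len(f1))    # total number of observations - the denominator used for the likelihoods below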
2.3.1 Frequency Table
The first actual step of this process is converting the dataset into a frequency table. Using the groupby() function, we get the frequencies:
df = f1.groupby(['Weather','Play']).size()
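Note that the result is a Series with a (Weather, Play) MultiIndex rather than a DataFrame, so an individual count can be pulled out by indexing with both labels - which is exactly how it's used later on:
df['Sunny']['Yes']  # number of sunny days on which the team played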
Now let's split the frequencies by weather and yes/no. Let's start with the three weather frequencies:
df2 = f1.groupby('Weather').count()
Now let's get the frequencies of yes and no:
df1 = f1.groupby('Play').count()
2.3.2 Likelihood Table
Next, you would create a likelihood table by finding the probabilities of each weather condition and of yes/no. This requires adding a new column that takes each group's frequency and divides it by the total number of data occurrences.
df1['Likelihood'] = df1['Weather']/len(f1)
df2['Likelihood'] = df2['Play']/len(f1)
This gets us a DataFrame (df2) that looks like:
Play Likelihood
Weather
Overcast 4 0.285714
Rainy 5 0.357143
Sunny 5 0.357143
Now, we're able to use the Naive Bayesian equation to calculate the posterior probability for each class. The highest posterior probability is the outcome of prediction.
2.3.3 Calculation
So now we need a question. Let's propose the following: "Players will play if the weather is sunny. Is this true?"
From this question, we can set up Bayes' Theorem. So what's our P(A|B)? It's P(Yes|Sunny), which gives us:
P(Yes|Sunny) = (P(Sunny|Yes)*P(Yes))/P(Sunny)
Based on the likelihood tables we created, we can just grab P(Sunny) and P(Yes):
ps = df2['Likelihood']['Sunny']
py = df1['Likelihood']['Yes']
That leaves us with P(Sunny|Yes). This is the probability that the weather is sunny given that the players played that day. In df, we see that the total number of yes days under sunny is 3. We take this number and divide it by the total number of yes days, which we can get from df1.
psy = df['Sunny']['Yes']/df1['Weather']['Yes']
Now we just have to plug these variables into Bayes' Theorem:
p = (psy*py)/ps
And we get:
0.59999999999999998
Since P(Yes|Sunny) = 0.6 is greater than P(No|Sunny) = 0.4, the answer to our original question is yes!
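As a quick sanity check on that decision, we can run the same calculation for the other class (this sketch assumes the Play column is labelled 'Yes'/'No', matching the lookups above); the two posteriors should sum to 1:
psn = df['Sunny']['No']/df1['Weather']['No']  # P(Sunny|No)
pn = df1['Likelihood']['No']                  # P(No)
p_no = (psn*pn)/ps                            # P(No|Sunny), roughly 0.4
print(p + p_no)                               # the two posteriors sum to 1.0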