Exploratory data analysis with Pandas: Grouping (Part~3)
Grouping
In general, grouping data in Pandas works as follows:
df.groupby(by=grouping_columns)[columns_to_show].function()
-
First, the groupby method divides the grouping_columns by their values. They become a new index in the resulting dataframe.
-
Then, columns of interest are selected (columns_to_show). If columns_to_show is not included, all non groupby clauses will be included.
-
Finally, one or several functions are applied to the obtained groups per selected columns.
Here is an example where we group the data according to the values of theChurn
variable and display statistics of three columns in each group:
columns_to_show = ["Total day minutes", "Total eve minutes", "Total night minutes"]
df.groupby(["Churn"])[columns_to_show].describe(percentiles=[])
Total Day, Evening, and Night Minutes Statistics by Churn
Churn | Total day minutes | Total eve minutes | Total night minutes |
---|---|---|---|
count | mean | std | |
------- | ------- | -------- | ------- |
0 | 2850.0 | 175.18 | 50.18 |
1 | 483.0 | 206.91 | 69.00 |
Summary tables
Suppose we want to see how the observations in our dataset are distributed in the context of two variables – Churn
and International plan
. To do so, we can build a contingency table using the crosstab
method:
pd.crosstab(df["Churn"], df["International plan"])
International Plan and Churn
International plan | False | True |
---|---|---|
Churn | ||
0 | 2664 | 186 |
1 | 346 | 137 |
pd.crosstab(df["Churn"], df["Voice mail plan"], normalize=True)
Voice Mail Plan and Churn
Voice mail plan | False | True |
---|---|---|
Churn | ||
0 | 0.60 | 0.25 |
1 | 0.12 | 0.02 |
We can see that most of the users are loyal and do not use additional services (International Plan/Voice mail). |
This will resemble pivot tables to those familiar with Excel. And, of course, pivot tables are implemented in Pandas: the pivot_table
method takes the following parameters:
-
values
– a list of variables to calculate statistics for, -
index
– a list of variables to group data by, -
aggfunc
– what statistics we need to calculate for groups, e.g. sum, mean, maximum, minimum or something else.
Let’s take a look at the average number of day, evening, and night calls by area code:
df.pivot_table(
["Total day calls", "Total eve calls", "Total night calls"],
["Area code"],
aggfunc="mean",
)
Total Calls by Area Code
Area code | Total day calls | Total eve calls | Total night calls |
---|---|---|---|
408 | 100.50 | 99.79 | 99.04 |
415 | 100.58 | 100.50 | 100.40 |
510 | 100.10 | 99.67 | 100.60 |
DataFrame transformations:
Like many other things in Pandas, adding columns to a DataFrame
is doable in many ways.
For example, if we want to calculate the total number of calls for all users, let’s create the total_calls
Series and paste it into the DataFrame:
total_calls = (
df["Total day calls"]
+ df["Total eve calls"]
+ df["Total night calls"]
+ df["Total intl calls"]
)
df.insert(loc=len(df.columns), column="Total calls", value=total_calls)
# loc parameter is the number of columns after which to insert the Series object
# we set it to len(df.columns) to paste it at the very end of the dataframe
df.head()
It is possible to add a column more easily without creating an intermediate Series instance:
df["Total charge"] = (
df["Total day charge"]
+ df["Total eve charge"]
+ df["Total night charge"]
+ df["Total intl charge"]
)
df.head()
To delete columns or rows, use the drop
method, passing the required indexes and the axis
parameter (1 if you delete columns, and nothing or 0 if you delete rows). The inplace
argument tells whether to change the original DataFrame. With inplace=False
, the drop method doesn’t change the existing DataFrame and returns a new one with dropped rows or columns. With inplace=True
, it alters the DataFrame.
# get rid of just created columns
df.drop(["Total charge", "Total calls"], axis=1, inplace=True)
# and here’s how you can delete rows
df.drop([1, 2]).head()
Thanks for reading and If you have any issues, please let me know at this email.
ujjwalpaliwal35@gmail.com
~Ujjwal Paliwal
6 Reactions
1 Bookmarks