Posted on Dec 21, 2024 | Backend

Exploratory data analysis with Pandas: Grouping (Part~3)

Grouping

In general, grouping data in Pandas works as follows:

df.groupby(by=grouping_columns)[columns_to_show].function()

First, the groupby method divides the grouping_columns by their values. They become a new index in the resulting dataframe.
Then, columns of interest are selected (columns_to_show). If columns_to_show is not included, all non groupby clauses will be included.
Finally, one or several functions are applied to the obtained groups per selected columns.
Here is an example where we group the data according to the values of the Churn variable and display statistics of three columns in each group:

columns_to_show = ["Total day minutes", "Total eve minutes", "Total night minutes"]

df.groupby(["Churn"])[columns_to_show].describe(percentiles=[])

Total Day, Evening, and Night Minutes Statistics by Churn

Churn	Total day minutes	Total eve minutes	Total night minutes
	count	mean	std
-------	-------	--------	-------
0	2850.0	175.18	50.18
1	483.0	206.91	69.00

Summary tables

Suppose we want to see how the observations in our dataset are distributed in the context of two variables – Churn and International plan. To do so, we can build a contingency table using the crosstab method:

pd.crosstab(df["Churn"], df["International plan"])

International Plan and Churn

International plan	False	True
Churn
0	2664	186
1	346	137

pd.crosstab(df["Churn"], df["Voice mail plan"], normalize=True)

Voice Mail Plan and Churn

Voice mail plan	False	True
Churn
0	0.60	0.25
1	0.12	0.02
We can see that most of the users are loyal and do not use additional services (International Plan/Voice mail).

This will resemble pivot tables to those familiar with Excel. And, of course, pivot tables are implemented in Pandas: the pivot_table method takes the following parameters:

values – a list of variables to calculate statistics for,
index – a list of variables to group data by,
aggfunc – what statistics we need to calculate for groups, e.g. sum, mean, maximum, minimum or something else.
Let’s take a look at the average number of day, evening, and night calls by area code:

df.pivot_table(
    ["Total day calls", "Total eve calls", "Total night calls"],
    ["Area code"],
    aggfunc="mean",
)

Total Calls by Area Code

Area code	Total day calls	Total eve calls	Total night calls
408	100.50	99.79	99.04
415	100.58	100.50	100.40
510	100.10	99.67	100.60

DataFrame transformations:

Like many other things in Pandas, adding columns to a DataFrame is doable in many ways.

For example, if we want to calculate the total number of calls for all users, let’s create the total_calls Series and paste it into the DataFrame:

total_calls = (
    df["Total day calls"]
    + df["Total eve calls"]
    + df["Total night calls"]
    + df["Total intl calls"]
)
df.insert(loc=len(df.columns), column="Total calls", value=total_calls)
# loc parameter is the number of columns after which to insert the Series object
# we set it to len(df.columns) to paste it at the very end of the dataframe
df.head()

It is possible to add a column more easily without creating an intermediate Series instance:

df["Total charge"] = (
    df["Total day charge"]
    + df["Total eve charge"]
    + df["Total night charge"]
    + df["Total intl charge"]
)
df.head()

To delete columns or rows, use the drop method, passing the required indexes and the axis parameter (1 if you delete columns, and nothing or 0 if you delete rows). The inplace argument tells whether to change the original DataFrame. With inplace=False, the drop method doesn’t change the existing DataFrame and returns a new one with dropped rows or columns. With inplace=True, it alters the DataFrame.

# get rid of just created columns
df.drop(["Total charge", "Total calls"], axis=1, inplace=True)
# and here’s how you can delete rows
df.drop([1, 2]).head()

Thanks for reading and If you have any issues, please let me know at this email.
ujjwalpaliwal35@gmail.com
~Ujjwal Paliwal

10 Reactions

1 Bookmarks