Exploratory data analysis with Pandas: Part 1
Ujjwal Paliwal

Posted on Dec 15, 2024 | Backend

Machine Learning (ML) is an exciting field, especially when you get to see mathematical models in action, solving real-world problems. But here's the thing many don’t talk about: most of the work in ML isn’t about building or fine-tuning models. It’s about preparing and cleaning the data, a process that is often said to consume 70-80% of the time spent on a project.

This is where Pandas steps in as an absolute lifesaver. Whether you’re cleaning messy datasets, wrangling data for insights, or performing exploratory analysis, Pandas is the ultimate tool. It’s a library I rely on every single day, and it plays a pivotal role in making data analysis not only efficient but enjoyable.

In this article, we’ll focus on key Pandas methods for preliminary data analysis. But it won’t just be a theoretical overview. To keep things practical, we’ll analyze a telecom customer churn dataset, exploring churn predictions using just Pandas and common sense—no complex ML models involved.

You might wonder: is it possible to gain meaningful insights without advanced algorithms? Absolutely! By relying on logic and an intuitive understanding of the data, we can uncover patterns and make powerful predictions. Sometimes, simplicity is all you need to solve real-world problems.

1. Demonstration of the main Pandas methods

Introduction to Pandas: The Go-To Library for Data Analysis

Pandas is a powerful Python library designed to make data analysis simple and efficient. It’s especially useful for working with datasets stored in table formats such as .csv, .tsv, or .xlsx.

With Pandas, you can:

  • Load data seamlessly.
  • Process and clean messy datasets.
  • Analyze data using SQL-like commands.
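As a rough illustration of what "SQL-like" means here, the sketch below filters and aggregates a tiny table. The column names and values are invented for the example and are not taken from the churn dataset.

```python
import pandas as pd

# Toy data invented for illustration; not the telecom dataset.
calls = pd.DataFrame(
    {
        "state": ["NY", "NY", "CA", "CA", "TX"],
        "minutes": [120.0, 80.0, 200.0, 50.0, 90.0],
    }
)

# Roughly: SELECT * FROM calls WHERE minutes > 85
long_calls = calls[calls["minutes"] > 85]

# Roughly: SELECT state, SUM(minutes) FROM calls GROUP BY state
minutes_by_state = calls.groupby("state")["minutes"].sum()

print(long_calls)
print(minutes_by_state)
```

Boolean indexing plays the role of WHERE, and groupby followed by an aggregation plays the role of GROUP BY.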

When paired with Matplotlib and Seaborn, Pandas becomes even more versatile, offering endless possibilities for visualizing and analyzing tabular data. Whether you’re just starting your data science journey or working on advanced projects, Pandas is an indispensable tool in your Python toolkit.

import numpy as np
import pandas as pd

pd.set_option("display.precision", 2)
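The display.precision option only changes how floats are printed; the stored values are untouched. A small sketch with made-up numbers, using option_context so the setting stays local to the example:

```python
import pandas as pd

values = pd.Series([3.14159, 2.71828])

# option_context applies the setting only inside the with-block.
with pd.option_context("display.precision", 2):
    shown = repr(values)

print(shown)           # floats are rendered with two decimals
print(values.iloc[0])  # the stored value keeps its full precision
```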

Demonstrating Pandas in Action: Telecom Churn Analysis

To showcase Pandas’ main methods, we’ll analyze a dataset focused on the churn rate of telecom operator clients. Let’s start by reading the dataset using the read_csv method and then preview the first 5 lines with the head method:

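import pandas as pd

# DATA_URL is the base URL or folder that holds the dataset.
# The article does not give its value; "" means the current directory.
DATA_URL = ""

# Load the dataset
data = pd.read_csv(DATA_URL + "telecom_churn.csv")

# Display the first 5 rows
print(data.head())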
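If you don't have the dataset at hand, the same two calls can be tried on an in-memory CSV. The rows below are invented for illustration; only the column names mirror the article's dataset.

```python
import io

import pandas as pd

# A few invented rows standing in for a CSV file on disk.
csv_text = """State,Account length,Churn
NY,10,False
CA,25,False
TX,40,True
"""

# read_csv accepts a path, a URL, or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))  # first two rows
print(sample.head())   # default is five rows; only three exist here
```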

Each row corresponds to one client (an instance), and each column is a feature of that instance. Let’s have a look at data dimensionality, feature names, and feature types.

print(data.shape)
(3333, 20)

From the output, we can see that the table contains 3333 rows and 20 columns. Now let’s print the column names using the columns attribute:

print(data.columns)
Index(['State', 'Account length', 'Area code', 'International plan',
       'Voice mail plan', 'Number vmail messages', 'Total day minutes',
       'Total day calls', 'Total day charge', 'Total eve minutes',
       'Total eve calls', 'Total eve charge', 'Total night minutes',
       'Total night calls', 'Total night charge', 'Total intl minutes',
       'Total intl calls', 'Total intl charge', 'Customer service calls',
       'Churn'],
      dtype='object')
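Both shape and columns are plain attributes, so they are easy to use in ordinary Python code. A minimal sketch on an invented frame:

```python
import pandas as pd

# Toy frame invented for illustration.
df = pd.DataFrame(
    {"State": ["NY", "CA", "TX"], "Total day calls": [90, 110, 75]}
)

n_rows, n_cols = df.shape           # shape is a (rows, columns) tuple
column_names = df.columns.tolist()  # Index converted to a plain list

print(n_rows, n_cols)
print(column_names)
print("Total day calls" in df.columns)  # membership test on column labels
```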

We can use the info() method to output some general information about the dataframe:

data.info()  # info() prints its report itself and returns None
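Because info() writes its report directly and returns None, wrapping it in print adds a stray "None" line. The sketch below, on an invented frame, shows that behaviour along with two related ways to get at the same information: the buf parameter and the dtypes attribute.

```python
import io

import pandas as pd

df = pd.DataFrame(
    {
        "State": ["NY", "CA", "TX"],
        "Total day minutes": [180.5, None, 95.0],  # one missing value
        "Churn": [False, True, False],
    }
)

# info() writes its report to stdout by default and returns None.
result = df.info()

# Passing a buffer captures the same report as text instead.
buffer = io.StringIO()
df.info(buf=buffer)
report = buffer.getvalue()

print(result)
print(df.dtypes)  # column types alone, as a Series
```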

The describe method shows basic statistical characteristics of each numerical feature (int64 and float64 types): the number of non-missing values, mean, standard deviation, minimum, maximum, median (the 50% row), and the 0.25 and 0.75 quartiles.

data.describe()

Dataset Summary Statistics: Telecom Churn

Here’s a summary of the telecom churn dataset. It includes key statistics for each feature, providing insights such as the count, mean, standard deviation, and range (min, max) for numerical columns.

Feature                   Count    Mean     Std     Min     25%     50%     75%     Max
Account length          3333.00  101.06   39.82    1.00   74.00  101.00  127.00  243.00
Area code               3333.00  437.18   42.37  408.00  408.00  415.00  510.00  510.00
Number vmail messages   3333.00    8.10   13.69    0.00    0.00    0.00   20.00   51.00
Total day minutes       3333.00  179.78   54.47    0.00  143.70  179.40  216.40  350.80
Total day calls         3333.00  100.44   20.07    0.00   87.00  101.00  114.00  165.00
Total day charge        3333.00   30.56    9.26    0.00   24.43   30.50   36.79   59.64
Total eve minutes       3333.00  200.98   50.71    0.00  166.60  201.40  235.30  363.70
Total eve calls         3333.00  100.11   19.92    0.00   87.00  100.00  114.00  170.00
Total eve charge        3333.00   17.08    4.31    0.00   14.16   17.12   20.00   30.91
Total night minutes     3333.00  200.87   50.57   23.20  167.00  201.20  235.30  395.00
Total night calls       3333.00  100.11   19.57   33.00   87.00  100.00  113.00  175.00
Total night charge      3333.00    9.04    2.28    1.04    7.52    9.05   10.59   17.77
Total intl minutes      3333.00   10.24    2.79    0.00    8.50   10.30   12.10   20.00
Total intl calls        3333.00    4.48    2.46    0.00    3.00    4.00    6.00   20.00
Total intl charge       3333.00    2.76    0.75    0.00    2.30    2.78    3.27    5.40
Customer service calls  3333.00    1.56    1.32    0.00    1.00    1.00    2.00    9.00
Churn                   3333.00    0.14    0.35    0.00    0.00    0.00    0.00    1.00
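The table above has one row per feature and includes Churn, which describe only reports by default when the column is numeric. Here is a minimal sketch of how such a layout could be produced, assuming Churn is stored as a boolean in the raw file (an assumption; the article does not show this step). The frame is invented for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Total day calls": [90, 110, 75, 101],
        "Churn": [False, True, False, False],  # assumed boolean in the raw data
    }
)

# By default describe() skips boolean columns.
default_columns = df.describe().columns.tolist()
print(default_columns)

# Casting to int64 turns False/True into 0/1 so the column is summarised too.
df["Churn"] = df["Churn"].astype("int64")

# .T transposes the result so that each feature becomes a row.
summary = df.describe().T
print(summary)
```

With Churn encoded as 0/1, its mean is simply the share of churned clients.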

In order to see statistics on non-numerical features, one has to explicitly indicate data types of interest in the include parameter.

data.describe(include=["object", "bool"])

Categorical Features Summary

A summary of the categorical features in the telecom churn dataset. It highlights the count of entries, number of unique values, and the most frequent (top) value along with its frequency for each feature.

Feature             Count  Unique  Top  Frequency
State                3333      51  WV         106
International plan   3333       2  No        3010
Voice mail plan      3333       2  No        2411
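To see how the include parameter selects columns, here is a small sketch on an invented frame. The column "Has plan" is hypothetical, and the string column is given dtype=object explicitly so the example does not depend on how a particular pandas version infers string columns.

```python
import pandas as pd

df = pd.DataFrame(
    {
        # Explicit object dtype; see the note above.
        "State": pd.Series(["NY", "CA", "NY", "TX"], dtype=object),
        "Has plan": [True, False, False, False],
        "Total day calls": [90, 110, 75, 101],
    }
)

summary = df.describe(include=["object", "bool"])
print(summary)

# The numeric column is left out; each remaining column gets
# count, unique, top (most frequent value) and freq (its count).
```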

For categorical (type object) and boolean (type bool) features we can use the value_counts method. Let’s take a look at the distribution of Churn:

data["Churn"].value_counts()

The count of clients who have not churned (0) versus those who have churned (1) is as follows:

Churn Status      Count
0 (Not Churned)    2850
1 (Churned)         483

This shows that out of 3333 clients, the majority (2850) have not churned, while 483 clients have. This distribution highlights the imbalance in churn status, which is a key consideration when analyzing and addressing churn rates.
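One way to see why the imbalance matters: a rule that labels every client as "not churned" is already right most of the time. The sketch below rebuilds a Churn column from the counts reported above (the individual rows are synthetic) and computes that baseline.

```python
import pandas as pd

# Rebuild a Churn column with the counts reported in the article.
churn = pd.Series([0] * 2850 + [1] * 483, name="Churn")

counts = churn.value_counts()  # sorted by frequency, most common first
majority_class = counts.index[0]

# Accuracy of a naive rule that predicts the majority class for everyone.
baseline_accuracy = counts.iloc[0] / counts.sum()

print(counts)
print(majority_class, round(baseline_accuracy, 3))
```

Any churn rule worth keeping should do better than this naive baseline, so plain accuracy is a weak yardstick on data like this.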

To calculate fractions instead of raw counts, pass normalize=True to the value_counts method.

data["Churn"].value_counts(normalize=True)

Churn Status      Proportion
0 (Not Churned)         0.86
1 (Churned)             0.14
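As a quick check of what normalize=True does, the sketch below compares it with dividing the counts by the number of rows. The values are invented for the example.

```python
import pandas as pd

# Toy 0/1 column invented for illustration.
churn = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="Churn")

fractions = churn.value_counts(normalize=True)
manual = churn.value_counts() / len(churn)  # same result computed by hand

print(fractions)
print(churn.mean())  # for a 0/1 column the mean equals the share of 1s
```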

Thanks for reading! Part 2 coming soon ~ UJJWAL PALIWAL
