Exploratory data analysis with Pandas: Part 1
Machine Learning (ML) is an exciting field, especially when you get to see mathematical models in action, solving real-world problems. But here's the thing many don’t talk about: most of the work in ML isn’t about building or fine-tuning models; it’s about preparing and cleaning the data. By common estimates, this process consumes 70-80% of the time spent on a project.
This is where Pandas steps in as an absolute lifesaver. Whether you’re cleaning messy datasets, wrangling data for insights, or performing exploratory analysis, Pandas is the ultimate tool. It’s a library I rely on every single day, and it plays a pivotal role in making data analysis not only efficient but enjoyable.
In this article, we’ll focus on key Pandas methods for preliminary data analysis. But it won’t just be a theoretical overview. To keep things practical, we’ll analyze a telecom customer churn dataset, exploring churn predictions using just Pandas and common sense—no complex ML models involved.
You might wonder: is it possible to gain meaningful insights without advanced algorithms? Absolutely! By relying on logic and an intuitive understanding of the data, we can uncover patterns and make powerful predictions. Sometimes, simplicity is all you need to solve real-world problems.
1. Demonstration of the main Pandas methods
Introduction to Pandas: The Go-To Library for Data Analysis
Pandas is a powerful Python library designed to make data analysis simple and efficient. It’s especially useful for working with datasets stored in table formats such as .csv, .tsv, or .xlsx.
With Pandas, you can:
- Load data seamlessly.
- Process and clean messy datasets.
- Analyze data using SQL-like commands.
When paired with Matplotlib and Seaborn, Pandas becomes even more versatile, offering endless possibilities for visualizing and analyzing tabular data. Whether you’re just starting your data science journey or working on advanced projects, Pandas is an indispensable tool in your Python toolkit.
import numpy as np
import pandas as pd
pd.set_option("display.precision", 2)
Demonstrating Pandas in Action: Telecom Churn Analysis
To showcase Pandas’ main methods, we’ll analyze a dataset focused on the churn rate of telecom operator clients. Let’s start by reading the dataset with the read_csv function and then preview the first 5 rows with the head method:
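import pandas as pd
# Base path or URL of the folder that holds telecom_churn.csv (placeholder; set it for your environment)
DATA_URL = ""
# Load the dataset
data = pd.read_csv(DATA_URL + "telecom_churn.csv")
# Display the first 5 rows
print(data.head())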
Recall that each row corresponds to one client, an instance, and columns are features of this instance. Let’s have a look at data dimensionality, feature names, and feature types.
print(data.shape)
(3333, 20)
From the output, we can see that the table contains 3333 rows and 20 columns. Now let’s try printing out column names using columns:
print(data.columns)
Index(['State', 'Account length', 'Area code', 'International plan',
'Voice mail plan', 'Number vmail messages', 'Total day minutes',
'Total day calls', 'Total day charge', 'Total eve minutes',
'Total eve calls', 'Total eve charge', 'Total night minutes',
'Total night calls', 'Total night charge', 'Total intl minutes',
'Total intl calls', 'Total intl charge', 'Customer service calls',
'Churn'],
dtype='object')
We can use the info() method to output some general information about the dataframe:
data.info()  # info() prints its summary directly and returns None
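As a rough sketch of what info() reports, the snippet below builds a tiny frame and recovers the same facts by hand: the dtype of each column and the number of non-missing values. The values are invented for illustration and are not taken from the telecom dataset; only the column names are borrowed from it.

```python
import pandas as pd

# Toy frame with invented values; column names borrowed from the telecom dataset
toy = pd.DataFrame(
    {
        "State": ["KS", "OH", "NJ", "OH"],
        "Total day minutes": [265.1, 161.6, None, 299.4],
        "Churn": [False, False, True, False],
    }
)

toy.info()  # prints the summary itself and returns None

# The same facts, available as regular objects you can inspect or filter
dtypes = toy.dtypes            # dtype of each column
non_null = toy.notna().sum()   # non-missing values per column
missing = toy.isna().sum()     # missing values per column
print(missing)
```

Having the missing-value counts as a Series is handy when a table has many columns, since it can be sorted or filtered instead of read by eye.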
The describe method shows basic statistical characteristics of each numerical feature (int64 and float64 types): number of non-missing values, mean, standard deviation, minimum and maximum, median, and the 0.25 and 0.75 quartiles.
data.describe()
Dataset Summary Statistics: Telecom Churn
Here’s a summary of the telecom churn dataset. It includes key statistics for each feature, providing insights such as the count, mean, standard deviation, and range (min, max) for numerical columns.
Feature | Count | Mean | Std | Min | 25% | 50% | 75% | Max |
---|---|---|---|---|---|---|---|---|
Account length | 3333.00 | 101.06 | 39.82 | 1.00 | 74.00 | 101.00 | 127.00 | 243.00 |
Area code | 3333.00 | 437.18 | 42.37 | 408.00 | 408.00 | 415.00 | 510.00 | 510.00 |
Number vmail messages | 3333.00 | 8.10 | 13.69 | 0.00 | 0.00 | 0.00 | 20.00 | 51.00 |
Total day minutes | 3333.00 | 179.78 | 54.47 | 0.00 | 143.70 | 179.40 | 216.40 | 350.80 |
Total day calls | 3333.00 | 100.44 | 20.07 | 0.00 | 87.00 | 101.00 | 114.00 | 165.00 |
Total day charge | 3333.00 | 30.56 | 9.26 | 0.00 | 24.43 | 30.50 | 36.79 | 59.64 |
Total eve minutes | 3333.00 | 200.98 | 50.71 | 0.00 | 166.60 | 201.40 | 235.30 | 363.70 |
Total eve calls | 3333.00 | 100.11 | 19.92 | 0.00 | 87.00 | 100.00 | 114.00 | 170.00 |
Total eve charge | 3333.00 | 17.08 | 4.31 | 0.00 | 14.16 | 17.12 | 20.00 | 30.91 |
Total night minutes | 3333.00 | 200.87 | 50.57 | 23.20 | 167.00 | 201.20 | 235.30 | 395.00 |
Total night calls | 3333.00 | 100.11 | 19.57 | 33.00 | 87.00 | 100.00 | 113.00 | 175.00 |
Total night charge | 3333.00 | 9.04 | 2.28 | 1.04 | 7.52 | 9.05 | 10.59 | 17.77 |
Total intl minutes | 3333.00 | 10.24 | 2.79 | 0.00 | 8.50 | 10.30 | 12.10 | 20.00 |
Total intl calls | 3333.00 | 4.48 | 2.46 | 0.00 | 3.00 | 4.00 | 6.00 | 20.00 |
Total intl charge | 3333.00 | 2.76 | 0.75 | 0.00 | 2.30 | 2.78 | 3.27 | 5.40 |
Customer service calls | 3333.00 | 1.56 | 1.32 | 0.00 | 1.00 | 1.00 | 2.00 | 9.00 |
Churn | 3333.00 | 0.14 | 0.35 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
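One detail worth noting: with default arguments, describe() summarizes only numeric columns, so a Churn row like the one in the table above appears only when Churn is stored as a number. If the column loads as bool, a cast such as astype("int64") would make it show up. The sketch below illustrates this on invented values; whether the cast is needed for your copy of the dataset is an assumption to check with info().

```python
import pandas as pd

# Toy frame with invented values, not rows from the telecom dataset
toy = pd.DataFrame(
    {
        "Customer service calls": [1, 0, 4, 2],
        "Churn": [False, False, True, False],
    }
)

before = toy.describe()                      # the bool column is left out by default
toy["Churn"] = toy["Churn"].astype("int64")  # False -> 0, True -> 1
after = toy.describe()                       # Churn is now summarized as well

print(list(before.columns))
print(list(after.columns))
print(after.loc["mean", "Churn"])  # for a 0/1 column, the mean is the share of 1s
```

This also explains why the mean of Churn can be read directly as the churn rate once the column is numeric.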
In order to see statistics on non-numerical features, one has to explicitly indicate data types of interest in the include parameter.
data.describe(include=["object", "bool"])
Categorical Features Summary
A summary of the categorical features in the telecom churn dataset. It highlights the count of entries, number of unique values, and the most frequent (top) value along with its frequency for each feature.
Feature | Count | Unique | Top | Frequency |
---|---|---|---|---|
State | 3333 | 51 | WV | 106 |
International plan | 3333 | 2 | No | 3010 |
Voice mail plan | 3333 | 2 | No | 2411 |
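To make the count, unique, top, and freq rows easier to read, here is a minimal sketch on a toy frame. The values are invented, and the text columns are created with an explicit object dtype so that include=["object"] selects them regardless of how your pandas version stores strings by default.

```python
import pandas as pd

# Toy frame with invented values; text columns are forced to object dtype
toy = pd.DataFrame(
    {
        "State": pd.Series(["OH", "KS", "OH", "NJ", "OH"], dtype="object"),
        "International plan": pd.Series(["No", "No", "Yes", "No", "No"], dtype="object"),
        "Account length": [128, 107, 137, 84, 75],
    }
)

# Only the object columns are summarized; the numeric column is skipped
summary = toy.describe(include=["object"])
print(summary)

# Individual cells can be pulled out by row label and column name
most_common_state = summary.loc["top", "State"]
its_frequency = summary.loc["freq", "State"]
```

Here top is the most frequent value in a column and freq is how many times it occurs, which is how the State row in the table above should be read.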
For categorical (type object) and boolean (type bool) features we can use the value_counts method. Let’s take a look at the distribution of Churn:
data["Churn"].value_counts()
The count of clients who have not churned (0) versus those who have churned (1) is as follows:
Churn Status | Count |
---|---|
0 (Not Churned) | 2850 |
1 (Churned) | 483 |
This shows that out of 3333 clients, the majority (2850) have not churned, while 483 clients have. This distribution highlights the imbalance in churn status, which is a key consideration when analyzing and addressing churn rates.
To calculate fractions instead of raw counts, pass normalize=True to the value_counts method.
data["Churn"].value_counts(normalize=True)
Churn Status | Proportion |
---|---|
0 (Not Churned) | 0.86 |
1 (Churned) | 0.14 |
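The proportions can be checked by hand. The sketch below rebuilds a 0/1 column with the same counts reported in the article (2850 zeros and 483 ones) rather than loading the real file, and shows that normalize=True amounts to dividing each count by the number of rows. The series name and the use of 0/1 integers are assumptions made for the illustration.

```python
import pandas as pd

# Rebuild a 0/1 column with the counts reported in the article
churn = pd.Series([0] * 2850 + [1] * 483, name="Churn")

counts = churn.value_counts()
shares = churn.value_counts(normalize=True)

# normalize=True is the same as dividing each count by the number of rows
manual = counts / len(churn)

print(shares.round(2))
print(round(churn.mean(), 2))  # for a 0/1 column, the mean equals the share of 1s
```

In other words, 483 / 3333 is roughly 0.14, so always guessing "not churned" would already be right for about 86% of clients. Any rule derived from the data should be judged against that baseline.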
Thanks for reading! Part 2 coming soon ~ UJJWAL PALIWAL