Exploratory data analysis with Pandas: Sorting (Part 2)
Sorting
A DataFrame can be sorted by the values of one of its variables (i.e., columns). For example, we can sort by Total day charge (use ascending=False to sort in descending order):
df.sort_values(by="Total day charge", ascending=False).head()
State | Account length | Area code | International plan | Voice mail plan | Number vmail messages | Total day minutes | Total day calls | Total day charge | Total eve minutes | Total eve calls | Total eve charge | Total night minutes | Total night calls | Total night charge | Total intl minutes | Total intl calls | Total intl charge | Customer service calls | Churn |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CO | 154 | 415 | No | No | 0 | 350.8 | 75 | 59.64 | 216.5 | 94 | 18.40 | 253.9 | 100 | 11.43 | 10.1 | 9 | 2.73 | 1 | 1 |
NY | 64 | 415 | Yes | No | 0 | 346.8 | 55 | 58.96 | 249.5 | 79 | 21.21 | 275.4 | 102 | 12.39 | 13.3 | 9 | 3.59 | 1 | 1 |
OH | 115 | 510 | Yes | No | 0 | 345.3 | 81 | 58.70 | 203.4 | 106 | 17.29 | 217.5 | 107 | 9.79 | 11.8 | 8 | 3.19 | 1 | 1 |
OH | 83 | 415 | No | No | 0 | 337.4 | 120 | 57.36 | 227.4 | 116 | 19.33 | 153.9 | 114 | 6.93 | 15.8 | 7 | 4.27 | 0 | 1 |
MO | 112 | 415 | No | No | 0 | 335.5 | 77 | 57.04 | 212.5 | 109 | 18.06 | 265.0 | 132 | 11.93 | 12.7 | 8 | 3.43 | 2 | 1 |
We can also sort by multiple columns:
df.sort_values(by=["Churn", "Total day charge"], ascending=[True, False]).head()
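To make the effect of the two ascending flags visible, here is a minimal sketch on a small made-up frame. The name toy and its values are hypothetical and are not taken from the dataset; only the column names mirror it.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the article's dataset.
toy = pd.DataFrame(
    {
        "Churn": [1, 0, 1, 0],
        "Total day charge": [35.2, 48.1, 59.6, 12.3],
    }
)

# Sort by Churn ascending first, then by charge descending within each Churn group.
sorted_toy = toy.sort_values(
    by=["Churn", "Total day charge"], ascending=[True, False]
)
print(sorted_toy)
```

The rows with Churn equal to 0 come first, and within each Churn group the largest charge is on top. The original row labels travel with the rows, so the index is no longer in increasing order after sorting.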
Indexing and retrieving data
A DataFrame can be indexed in a few different ways.
To get a single column, you can use the DataFrame['Name'] construction. Let’s use this to answer a question about that column alone: what is the proportion of churned users in our DataFrame?
df["Churn"].mean()
np.float64(0.14491449144914492)
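The mean works as a proportion here because Churn is coded as 0 and 1, so the average is simply the share of ones. A small sketch, assuming a made-up four-row frame (toy, churn_rate and manual_rate are illustrative names, not part of the article):

```python
import pandas as pd

# Hypothetical toy data: one churned user out of four.
toy = pd.DataFrame({"Churn": [1, 0, 0, 0]})

# Selecting with a single label returns a Series.
churn_column = toy["Churn"]

# The mean of a 0/1 column equals the fraction of rows equal to 1.
churn_rate = churn_column.mean()

# The same quantity computed by counting.
manual_rate = (churn_column == 1).sum() / len(toy)

print(type(churn_column).__name__, churn_rate, manual_rate)
```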
Boolean indexing with one column is also very convenient. The syntax is df[P(df['Name'])], where P is some logical condition that is checked for each element of the Name column. The result of such indexing is a DataFrame consisting only of the rows that satisfy condition P on the Name column.
Let’s use it to answer the question: what are the average values of numerical features for churned users?
df.select_dtypes(include=np.number)[df["Churn"] == 1].mean()
Account length 102.66
Area code 437.82
Number vmail messages 5.12
Total day minutes 206.91
Total day calls 101.34
Total day charge 35.18
Total eve minutes 212.41
Total eve calls 100.56
Total eve charge 18.05
Total night minutes 205.23
Total night calls 100.40
Total night charge 9.24
Total intl minutes 10.70
Total intl calls 4.16
Total intl charge 2.89
Customer service calls 2.23
Churn 1.00
dtype: float64
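The condition df["Churn"] == 1 is itself a boolean Series that can be stored and reused. The sketch below, on hypothetical toy data, shows the mask and compares the select_dtypes approach with an alternative that filters first and then passes numeric_only=True to mean; the alternative is offered as an option, not as the article's method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one non-numeric column.
toy = pd.DataFrame(
    {
        "State": ["KS", "OH", "NJ", "OK"],
        "Total day minutes": [100.0, 200.0, 300.0, 400.0],
        "Churn": [0, 1, 0, 1],
    }
)

# The condition yields one True/False value per row.
is_churned = toy["Churn"] == 1

# Approach used in the text: keep numeric columns, then filter rows.
via_select = toy.select_dtypes(include=np.number)[is_churned].mean()

# Alternative: filter rows first, then skip non-numeric columns in mean().
via_flag = toy[is_churned].mean(numeric_only=True)

print(via_select)
print(via_flag)
```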
How much time (on average) do churned users spend on the phone during daytime?
df[df["Churn"] == 1]["Total day minutes"].mean()
np.float64(206.91407867494823)
What is the maximum length of international calls among loyal users (Churn == 0) who do not have an international plan?
df[(df["Churn"] == 0) & (df["International plan"] == "No")]["Total intl minutes"].max()
np.float64(18.9)
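When two conditions are combined, each one needs its own parentheses, because & binds more tightly than == in Python. The following sketch uses made-up values (toy, mask and max_minutes are illustrative names) and builds the mask in a separate step so it can be inspected:

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the article's dataset.
toy = pd.DataFrame(
    {
        "Churn": [0, 0, 1, 0],
        "International plan": ["No", "Yes", "No", "No"],
        "Total intl minutes": [10.1, 18.9, 12.0, 13.5],
    }
)

# Element-wise AND of two boolean Series; the parentheses are required.
mask = (toy["Churn"] == 0) & (toy["International plan"] == "No")

# Keep the matching rows, then take the maximum of one column.
max_minutes = toy[mask]["Total intl minutes"].max()

print(mask.tolist(), max_minutes)
```

Only the first and last rows satisfy both conditions, so the maximum is taken over 10.1 and 13.5.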
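DataFrames can be indexed by column name (label), by row name (index label), or by the serial number of a row. The loc indexer is used for indexing by name, while iloc is used for indexing by number; both take square brackets rather than parentheses. Note that loc slices include both endpoints, so the example below returns the rows labeled 0 through 5 and the columns from State through Area code: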
df.loc[0:5, "State":"Area code"]
State | Account length | Area code |
---|---|---|
KS | 128 | 415 |
OH | 107 | 415 |
NJ | 137 | 415 |
OH | 84 | 408 |
OK | 75 | 415 |
AL | 118 | 510 |
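The difference between label-based and position-based slicing is easiest to see side by side. In this sketch the first six rows repeat the values shown above, while the seventh row and the International plan values are made up for illustration:

```python
import pandas as pd

# First six rows follow the table above; the seventh row and the
# "International plan" values are hypothetical.
toy = pd.DataFrame(
    {
        "State": ["KS", "OH", "NJ", "OH", "OK", "AL", "XX"],
        "Account length": [128, 107, 137, 84, 75, 118, 100],
        "Area code": [415, 415, 415, 408, 415, 510, 415],
        "International plan": ["No", "No", "No", "Yes", "Yes", "Yes", "No"],
    }
)

# loc slices by label and includes both endpoints: rows 0..5, three columns.
by_label = toy.loc[0:5, "State":"Area code"]

# iloc slices by position and excludes the stop: rows 0..4, columns 0..2.
by_position = toy.iloc[0:5, 0:3]

print(by_label.shape, by_position.shape)
```

With the default integer index, the same slice 0:5 therefore yields six rows through loc but five rows through iloc.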
If we need the first or the last row of the DataFrame, we can use the df[:1] or df[-1:] construction:
df[-1:]
State | Account length | Area code | International plan | Voice mail plan | Number vmail messages | Total day minutes | Total day calls | Total day charge | Total eve minutes | Total eve calls | Total eve charge | Total night minutes | Total night calls | Total night charge | Total intl minutes | Total intl calls | Total intl charge | Customer service calls | Churn |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TN | 74 | 415 | No | Yes | 25 | 234.4 | 113 | 39.85 | 265.9 | 82 | 22.60 | 241.4 | 77 | 10.86 | 13.7 | 4 | 3.70 | 0 | 0 |
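The slice form returns a one-row DataFrame, which is not the same object type as a single positional lookup. A brief sketch on hypothetical toy data, comparing it with tail and iloc:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"State": ["KS", "OH", "TN"], "Churn": [0, 0, 0]})

# Three ways to get the last row as a one-row DataFrame.
last_by_slice = toy[-1:]
last_by_tail = toy.tail(1)
last_by_iloc = toy.iloc[[-1]]

# A scalar position with iloc returns a Series instead of a DataFrame.
last_as_series = toy.iloc[-1]

print(type(last_by_slice).__name__, type(last_as_series).__name__)
```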