Geoff Ruddock

Exploratory data analysis (EDA)

import numpy as np
import pandas as pd
Package    Version
---------  ---------
python     3.9.12
numpy      1.21.5
pandas     1.4.2

General

Context manager

If you are working with very long or wide dataframes, it makes sense to set your pandas options such as max_rows and max_columns so that the output from a cell is not miles long.
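Setting these options globally might look something like this (the specific limits here are arbitrary, not a recommendation):

```python
import pandas as pd

# Raise the display limits for the whole session
# (pick values that suit your screen).
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 50)
```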

If you need to temporarily override these settings, you can use the pd.option_context context manager to do so.

df = pd.DataFrame([np.random.randint(1, 100, 100)])
df

[1 rows × 100 columns, values truncated to pandas' default display width]

with pd.option_context('display.max_rows', 999, 'display.max_colwidth', 100):
    display(df)

[1 rows × 100 columns, values truncated for display]

Value counts

By default, srs.value_counts() sorts by frequency descending. If you want the counts ordered by index instead, chain on .sort_index().

srs = pd.Series(np.random.choice(list('ABCDEF'), size=100))

srs.value_counts()
D    19
C    18
B    17
A    16
F    15
E    15
dtype: int64

srs.value_counts().sort_index()
A    16
B    17
C    18
D    19
E    15
F    15
dtype: int64
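Relatedly, value_counts(normalize=True) returns relative frequencies rather than raw counts, which is often what you actually want when eyeballing a distribution. The srs definition is repeated here so the snippet runs standalone:

```python
import numpy as np
import pandas as pd

srs = pd.Series(np.random.choice(list('ABCDEF'), size=100))

# Proportions instead of counts; the values sum to 1.0
srs.value_counts(normalize=True).sort_index()
```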

Frequency

df = pd.DataFrame({
    'dim_A': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'dim_B': ['X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y'],
    'num_': 1
})

df

  dim_A dim_B  num_
0     A     X     1
1     A     Y     1
2     A     Z     1
3     B     X     1
4     B     Y     1
5     B     Z     1
6     C     X     1
7     C     Y     1

unstack

You can technically do a simple frequency count using either df.groupby() or df.set_index() in combination with df.unstack(), but this becomes cumbersome for all but the most simple of use-cases.

(
    df
    .groupby(['dim_A', 'dim_B']) 
    ['num_'].count()
    .unstack()
    .fillna(0)
    .astype(int)
)

dim_B  X  Y  Z
dim_A
A      1  1  1
B      1  1  1
C      1  1  0

(
    df
    .set_index(['dim_A', 'dim_B'])
    .iloc[:, 0]
    .unstack()
    .fillna(0)
    .astype(int)
)

dim_B  X  Y  Z
dim_A
A      1  1  1
B      1  1  1
C      1  1  0

pivot

Using df.pivot() is, I suppose, a couple lines shorter than using df.unstack(), but it does not offer any of the additional options that pd.pivot_table() below does.
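A sketch of the df.pivot() version, reusing the same df as above (repeated here so the snippet runs standalone). Note that df.pivot() performs no aggregation, so it requires each (index, columns) pair to appear at most once:

```python
import pandas as pd

# Same df as above, repeated so this snippet is self-contained
df = pd.DataFrame({
    'dim_A': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'dim_B': ['X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y'],
    'num_': 1
})

# Reshape only -- no aggregation -- then fill the missing (C, Z) cell
(
    df
    .pivot(index='dim_A', columns='dim_B', values='num_')
    .fillna(0)
    .astype(int)
)
```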

pivot_table

This achieves the same result as above, but with just a single line.

df.pivot_table(index='dim_A', columns='dim_B', values='num_', fill_value=0)

dim_B  X  Y  Z
dim_A
A      1  1  1
B      1  1  1
C      1  1  0

And it also has a handful of parameters that come in useful for more complex use-cases.

df.pivot_table(
    index='dim_A',
    columns='dim_B',
    values='num_',
    aggfunc='sum',
    fill_value=0,
    margins=True
).astype(int)

dim_B  X  Y  Z  All
dim_A
A      1  1  1    3
B      1  1  1    3
C      1  1  0    2
All    3  3  2    8

crosstab

pd.crosstab is structurally similar to df.pivot_table, except:

- it is a top-level function that takes array-likes (e.g. Series) directly, rather than a method on an existing DataFrame,
- it counts frequencies by default, so values and aggfunc are optional, and
- it accepts a normalize parameter, which converts the counts into proportions.

pd.crosstab(
    index=df.dim_A,
    columns=df.dim_B,
    values=df.num_,
    aggfunc='sum',
    margins=True,
    normalize=True
)

dim_B      X      Y      Z    All
dim_A
A      0.125  0.125  0.125  0.375
B      0.125  0.125  0.125  0.375
C      0.125  0.125  0.000  0.250
All    0.375  0.375  0.250  1.000

Misc

Manually adding a totals row

def add_total_row(df: pd.DataFrame, name='Total') -> pd.DataFrame:
    """ Append a sum row to a DataFrame. Useful when table is final dataviz. """
    total_row = df.sum(numeric_only=True).rename(name).to_frame().T
    return pd.concat([df, total_row])


df = pd.DataFrame({
    'X': np.random.randint(1, 100, size=5),
    'Y': np.random.randint(1, 10, size=5)
}, index=['A', 'B', 'C', 'D', 'E'])

df.pipe(add_total_row)

         X   Y
A       28   1
B       66   8
C       90   2
D       28   6
E       40   3
Total  252  20

df.pipe(add_total_row).style.set_table_styles([{
    'selector': 'tbody > tr:last-child',
    'props': [('border-top', '2px solid #333;')]
}])
         X   Y
A       28   1
B       66   8
C       90   2
D       28   6
E       40   3
Total  252  20

Parsimonious display

Diff rows

df = pd.DataFrame({
    'A': [1, 1, 1],
    'B': [1, 2, 3],
    'C': [1, 1, 2],
    'D': [None, None, None]
})

df

   A  B  C     D
0  1  1  1  None
1  1  2  1  None
2  1  3  2  None

def diff_rows(df: pd.DataFrame) -> pd.DataFrame:
    """ Drop columns for which all values across rows are identical. """
    non_unique_cols = df.astype(str).nunique().loc[lambda x: x > 1].index
    return df[non_unique_cols]

df.pipe(diff_rows)

   B  C
0  1  1
1  2  1
2  3  2

# One-liner version

df.loc[:, lambda x: x.astype(str).nunique().loc[lambda y: y > 1].index]

   B  C
0  1  1
1  2  1
2  3  2

Drop always-null columns

def drop_all_null_cols(df: pd.DataFrame) -> pd.DataFrame:
    """ Drop columns in which every value is null, logging which were dropped. """
    null_cols = df.isnull().mean().loc[lambda x: x == 1].index.tolist()
    null_cols_str = ', '.join(null_cols)
    print(f'Dropping {len(null_cols)} columns that are always null: {null_cols_str}\n')
    return df.drop(null_cols, axis=1)

df.pipe(drop_all_null_cols)
Dropping 1 columns that are always null: D

   A  B  C
0  1  1  1
1  1  2  1
2  1  3  2
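If you don't need the log line, pandas has this built in via df.dropna(axis=1, how='all'). The df definition is repeated here so the snippet runs standalone:

```python
import pandas as pd

# Same df as above, repeated so this snippet is self-contained
df = pd.DataFrame({
    'A': [1, 1, 1],
    'B': [1, 2, 3],
    'C': [1, 1, 2],
    'D': [None, None, None]
})

# Drop columns in which every value is null
df.dropna(axis=1, how='all')
```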

Pairwise percentages

Suppose you have two dimensions (A, B) and are interested in the % of some numeric value for which the dimensions match.

df = pd.DataFrame({
    'dim_A': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'dim_B': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'num_': np.arange(1, 10)
})

df

  dim_A dim_B  num_
0     A     A     1
1     A     B     2
2     A     C     3
3     B     A     4
4     B     B     5
5     B     C     6
6     C     A     7
7     C     B     8
8     C     C     9

A good order to handle this is: compute → compare (and filter) → aggregate.

This has the benefit of making the intermediate outputs useful to inspect/debug.

(
    df
    .assign(pct_=lambda x: x['num_'] / x['num_'].sum())
    .loc[lambda x: x['dim_A'] == x['dim_B'], 'pct_']
    .sum()
)
0.33333333333333337

Alternatively, you could: compare → aggregate → filter.

(
    df
    .assign(is_eq=lambda x: x['dim_A'] == x['dim_B'])
    .groupby('is_eq')['num_'].sum()
    .pipe(lambda x: x[True] / x.sum())
)
0.3333333333333333

If you have a series w/ dimensions as MultiIndex, you can combine compare/filter/aggregate steps into the groupby call itself using a lambda function, which receives a tuple representing the index values of each row.

srs = df.set_index(['dim_A', 'dim_B'])['num_']

(
    srs
    .groupby(lambda x: x[0] == x[1]).sum()
    .pipe(lambda x: x[True] / x.sum())
)
0.3333333333333333