Geoff Ruddock

Misc

May 07, 2019
import numpy as np
import pandas as pd

from platform import python_version

print(f'python: {python_version()}')
print(f'pandas: {pd.__version__}')

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'  # display all output cells
python: 3.8.10
pandas: 1.3.1

Input

Inferring dtypes automatically

Let’s make a toy df that contains a variety of different dtypes.

df = pd.DataFrame({
    'Int': [1, 2, 3],
    'Date': ['2021-01-01', '2018-05-07', '2011-12-12'],
    'Float': [1.2, 2.3, 3.4],
    'Bool': [True, False, True],
    'Str': ['A', 'B', 'C']
})

df.dtypes
Int        int64
Date      object
Float    float64
Bool        bool
Str       object
dtype: object

If we write to a csv file and read it back, pd.read_csv() actually does a great job of inferring the original dtypes for each column—with the exception of Date.

import tempfile

with tempfile.TemporaryFile() as f:
    df.to_csv(f, index=False)
    _ = f.seek(0)
    df2 = pd.read_csv(f)
    
df2.dtypes
Int        int64
Date      object
Float    float64
Bool        bool
Str       object
dtype: object
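
If you also want the Date column back as an actual datetime rather than object, pd.read_csv() takes a parse_dates argument. A quick sketch, reusing the temp-file round trip from above (df_with_dates is just an illustrative name):

with tempfile.TemporaryFile() as f:
    df.to_csv(f, index=False)
    _ = f.seek(0)
    df_with_dates = pd.read_csv(f, parse_dates=['Date'])  # Date comes back as datetime64[ns]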

Suppose we are dealing with a df whose columns are all dtype object—perhaps the output from some other package or function that isn’t quite so diligent as pd.read_csv().

df_as_all_obj = df2.astype(object)
df_as_all_obj.dtypes
Int      object
Date     object
Float    object
Bool     object
Str      object
dtype: object

We can still recover them using df.convert_dtypes().

This also makes use of more modern extension dtypes where possible, including nullable integers and the dedicated string dtype, rather than plain object.

df_as_all_obj.convert_dtypes().dtypes
Int        Int64
Date      string
Float    Float64
Bool     boolean
Str       string
dtype: object

EDA

Context manager

If you are working with very long or wide dataframes, it makes sense to set your pandas options such as max_rows and max_columns so that the output from a cell is not miles long.
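
For example, something along these lines (the exact limits are a matter of taste):

pd.set_option('display.max_rows', 20)     # truncate the middle of long dataframes
pd.set_option('display.max_columns', 20)  # truncate the middle of wide dataframes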

If you need to temporarily override these settings, you can use the pd.option_context context manager to do so.

df = pd.DataFrame([np.random.randint(1, 100, 100)])
df

0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
0 38 61 38 85 79 11 8 93 55 62 ... 78 57 41 68 27 76 95 39 78 13

1 rows × 100 columns

with pd.option_context('display.max_rows', 999, 'max_colwidth', 100):
    display(df)

0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
0 38 61 38 85 79 11 8 93 55 62 ... 78 57 41 68 27 76 95 39 78 13

1 rows × 100 columns

Value counts

By default, srs.value_counts() sorts by frequency, descending. If you want the counts back in index order instead, chain on .sort_index().

srs = pd.Series(np.random.choice(list('ABCDEF'), size=100))

srs.value_counts()
E    20
F    19
B    17
A    15
C    15
D    14
dtype: int64
srs.value_counts().sort_index()
A    15
B    17
C    15
D    14
E    20
F    19
dtype: int64
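
Relatedly, passing normalize=True to value_counts() returns proportions rather than raw counts, which is sometimes more useful:

srs.value_counts(normalize=True).sort_index()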

Wrangling

Binning with pd.cut()

cities = ['New York', 'Toronto', 'Chicago', 'Berlin', 'Paris', 'Singapore']

df = pd.DataFrame({
    'age': np.random.randint(20, 80, size=100),
    'location': np.random.choice(cities, size=100)
}).rename_axis('user_id')

df
df.dtypes

age location
user_id
0 79 Paris
1 21 Singapore
2 48 Singapore
3 40 Singapore
4 74 Chicago
... ... ...
95 28 Singapore
96 76 Singapore
97 32 Berlin
98 20 Berlin
99 29 Singapore

100 rows × 2 columns

age          int64
location    object
dtype: object
df['age_range'] = pd.cut(df['age'], 3)

df
df.dtypes

age location age_range
user_id
0 79 Paris (59.333, 79.0]
1 21 Singapore (19.941, 39.667]
2 48 Singapore (39.667, 59.333]
3 40 Singapore (39.667, 59.333]
4 74 Chicago (59.333, 79.0]
... ... ... ...
95 28 Singapore (19.941, 39.667]
96 76 Singapore (59.333, 79.0]
97 32 Berlin (19.941, 39.667]
98 20 Berlin (19.941, 39.667]
99 29 Singapore (19.941, 39.667]

100 rows × 3 columns

age             int64
location       object
age_range    category
dtype: object
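
If you want control over the cut points, pd.cut() also accepts explicit bin edges and labels. A minimal sketch (the edges and labels here are arbitrary):

pd.cut(
    df['age'],
    bins=[19, 39, 59, 79],              # right-inclusive edges: (19, 39], (39, 59], (59, 79]
    labels=['20-39', '40-59', '60-79']
)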

CASE WHEN

📚 Further reading

def case_when(df: pd.DataFrame, conditions: dict, output_col: str, default=None) -> pd.DataFrame:
    """ Simulates SQL's CASE WHEN syntax for pandas.

    Produce an output column based on a dict of {label: condition} pairs, evaluated sequentially using np.select.
    """

    # evaluate each condition (a callable) against the df, producing a boolean mask per label
    bool_masks = {k: v(df) for k, v in conditions.items()}
    labels, masks = zip(*bool_masks.items())
    df[output_col] = np.select(masks, labels, default=default)

    return df

case_whens = {
    'Europe': lambda x: x['location'].isin(['Paris', 'Berlin']),
    'Asia': lambda x: x['location'].isin(['Singapore']),
    'North America': lambda x: x['location'].isin(['New York', 'Chicago'])
}

(
    df
    .pipe(case_when, case_whens, 'continent')
    .groupby(['continent', 'location'])
    ['age'].count()
    .reset_index()
)

continent location age
0 Asia Singapore 23
1 Europe Berlin 16
2 Europe Paris 16
3 North America Chicago 16
4 North America New York 11
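
The default argument of case_when covers rows that match none of the conditions, e.g. something like:

df.pipe(case_when, case_whens, 'continent', default='Other')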

Summarizing (2D)

Frequency

df = pd.DataFrame({
    'dim_A': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'dim_B': ['X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y'],
    'num_': 1
})

df

dim_A dim_B num_
0 A X 1
1 A Y 1
2 A Z 1
3 B X 1
4 B Y 1
5 B Z 1
6 C X 1
7 C Y 1

Using df.unstack

You can technically do a simple frequency count using either df.groupby() or df.set_index() in combination with df.unstack(), but this becomes cumbersome for all but the simplest of use cases.

(
    df
    .groupby(['dim_A', 'dim_B']) 
    ['num_'].count()
    .unstack()
    .fillna(0)
    .astype(int)
)

dim_B X Y Z
dim_A
A 1 1 1
B 1 1 1
C 1 1 0
(
    df
    .set_index(['dim_A', 'dim_B'])
    .iloc[:, 0]
    .unstack()
    .fillna(0)
    .astype(int)
)

dim_B X Y Z
dim_A
A 1 1 1
B 1 1 1
C 1 1 0

Using df.pivot

This is, I suppose, a couple of lines shorter than using df.unstack(), but it does not offer any of the additional options that pd.pivot_table() below does.

(
    df
    .pivot(index='dim_A', columns='dim_B', values='num_')
    .fillna(0)
    .astype(int)
)

dim_B X Y Z
dim_A
A 1 1 1
B 1 1 1
C 1 1 0

⭐ Using df.pivot_table

This achieves the same result as above, but with just a single line.

df.pivot_table(index='dim_A', columns='dim_B', values='num_', fill_value=0)

dim_B X Y Z
dim_A
A 1 1 1
B 1 1 1
C 1 1 0

And it also has a handful of parameters that come in useful for more complex use cases.

df.pivot_table(
    index='dim_A',
    columns='dim_B',
    values='num_',
    aggfunc=sum,
    fill_value=0,
    margins=True
).astype(int)

dim_B X Y Z All
dim_A
A 1 1 1 3
B 1 1 1 3
C 1 1 0 2
All 3 3 2 8

pd.crosstab

Structurally similar to df.pivot_table, except:

  • It expects Series to be passed as the index, columns, and values parameters, rather than column names of a df.
  • It has a normalize parameter, which is often useful.

Further reading

pd.crosstab(
    index=df.dim_A,
    columns=df.dim_B,
    values=df.num_,
    aggfunc='sum',
    margins=True,
    normalize=True
)

dim_B X Y Z All
dim_A
A 0.125 0.125 0.125 0.375
B 0.125 0.125 0.125 0.375
C 0.125 0.125 0.000 0.250
All 0.375 0.375 0.250 1.000
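
The normalize parameter also accepts 'index' or 'columns' if you want proportions within each row or column rather than over the whole table, e.g.:

pd.crosstab(
    index=df.dim_A,
    columns=df.dim_B,
    values=df.num_,
    aggfunc='sum',
    normalize='index'
)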

Pairwise percentages

Suppose you have two dimensions (A, B) and are interested in the % of some numeric value for which the dimensions match.

df = pd.DataFrame({
    'dim_A': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'dim_B': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'num_': np.arange(1, 10)
})

df

dim_A dim_B num_
0 A A 1
1 A B 2
2 A C 3
3 B A 4
4 B B 5
5 B C 6
6 C A 7
7 C B 8
8 C C 9

A good order to handle this is: compute → compare (and filter) → aggregate.

This has the benefit of making the intermediate outputs easy to inspect and debug.

(
    df
    .assign(pct_=lambda x: x['num_'] / x['num_'].sum())
    .loc[lambda x: x['dim_A'] == x['dim_B'], 'pct_']
    .sum()
)
0.33333333333333337

Alternatively, you could: compare → aggregate → filter.

(
    df
    .assign(is_eq=lambda x: x['dim_A'] == x['dim_B'])
    .groupby('is_eq')['num_'].sum()
    .pipe(lambda x: x[True] / x.sum())
)
0.3333333333333333

If you have a Series with the dimensions as a MultiIndex, you can combine the compare/filter/aggregate steps into the groupby call itself using a lambda function, which receives a tuple of the index values for each row.

srs = df.set_index(['dim_A', 'dim_B'])['num_']

(
    srs
    .groupby(lambda x: x[0] == x[1]).sum()
    .pipe(lambda x: x[True] / x.sum())
)
0.3333333333333333

Further reading

  • Data School’s top 25 pandas tricks [Jupyter Notebook] – Good overview of some practical pandas code. The df.add_prefix() and df.add_suffix() methods to rename columns are a neat trick that I just learned from this article (quick example below).
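
For reference, they are used like this (on whatever df you have at hand):

df.add_prefix('feat_')   # rename every column to feat_<original name>
df.add_suffix('_clean')  # rename every column to <original name>_clean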