Geoff Ruddock

Indexing (and MultiIndex)

import numpy as np
import pandas as pd
Package    Version
---------  ---------
python     3.8.8
numpy      1.20.1
pandas     1.3.3

Selecting

Select only datetime columns

dates = pd.date_range('2010-01-01', '2019-12-31')
dates

df = pd.DataFrame({
    'user_id': range(6),
    'name': ['Alice', 'Bob', 'Charles', 'Denise', 'Eric', 'Fiona'],
    'age': np.random.randint(20, 80, size=6),
    'location': ['New York', 'Toronto', 'Chicago', 'Berlin', 'Paris', 'Singapore'],
    'registration_date': np.random.choice(dates, size=6),
    'purchase_date': np.random.choice(dates, size=6),
    'cancel_date': np.random.choice(dates, size=6)
}).set_index('user_id')


df.select_dtypes(include=['datetime64[ns]'])

registration_datepurchase_datecancel_date
user_id
02019-06-102017-02-162014-06-16
12011-10-122012-08-132014-01-26
22016-08-262015-08-252012-12-17
32018-11-072010-03-092010-05-16
42019-04-192014-09-162018-04-19
52019-01-022013-02-052014-03-09

Select only columns with non-zero values

df = pd.DataFrame(np.random.randint(1, 100, size=(10, 5)))
df.iloc[:, [2, 4]] = 0
df

01234
05510770
11774010
289120190
382110810
465840590
516160830
68920540
723230470
848300830
949680520
df.loc[:, (df != 0).any(axis=0)]

013
055177
117741
2891219
3821181
4658459
5161683
689254
7232347
8483083
9496852

Select using combination of loc and iloc

df.iloc[5:, df.columns.get_indexer([0, 2])]

02
5160
680
7230
8480
9490

Modifying

Cast index as datetime

df_with_string_index = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
}, index=['2020-01-01', '2020-01-02', '2020-01-03'])

df_with_string_index

AB
2020-01-0114
2020-01-0225
2020-01-0336
df_with_string_index.index
Index(['2020-01-01', '2020-01-02', '2020-01-03'], dtype='object')
df_with_dt_index = df_with_string_index.pipe(lambda x: x.set_index(pd.to_datetime(x.index)))
df_with_dt_index

AB
2020-01-0114
2020-01-0225
2020-01-0336
df_with_dt_index.index
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03'], dtype='datetime64[ns]', freq=None)

MultiIndex

Select rows in which one level of the multiindex match a list of values

bool_filt = (df.groupby(level='user_id').size() > 1)
user_ids_to_include = bool_filt[bool_filt].index
ms_df = df[lambda x: x.index.get_level_values('user_id').isin(user_ids_to_include)]

Convert all levels of a hierarchical index to string

df.columns.set_levels([x.astype(str) for x in x.columns.levels])

E.g. including rows from users with more than one row

bool_filt = (df.groupby(level='user_id').size() > 1)
user_ids_to_include = bool_filt[bool_filt].index
multi_search_df = df[lambda x: x.index.get_level_values('user_id').isin(user_ids_to_include)]

Levels

df = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
display(df)

AB
a00
b11
c22
d33
e44

Add level above existing

insert_above = df.copy()

insert_above.columns = pd.MultiIndex.from_product([['Super'], df.columns])
insert_above

Super
AB
a00
b11
c22
d33
e44

Add level below existing

insert_below = df.copy()

insert_below.columns = pd.MultiIndex.from_product([df.columns, ['Sub']])
insert_below

AB
SubSub
a00
b11
c22
d33
e44

Swap existing levels

df = pd.DataFrame(index=list('abcde'), data={'sub_level_1': range(5), 'sub_level_2': range(5)})
df.columns = pd.MultiIndex.from_product([['super_level'], df.columns])
df

super_level
sub_level_1sub_level_2
a00
b11
c22
d33
e44
df.swaplevel(axis=1)

sub_level_1sub_level_2
super_levelsuper_level
a00
b11
c22
d33
e44

Modify a level

date_strs = ['2020-01-01', '2020-01-02', '2020-01-03']
dims = ['A', 'B', 'C']

idx = pd.MultiIndex.from_product([date_strs, dims])
df = pd.DataFrame(index=idx, data={'A': range(len(idx)), 'B': range(len(idx))})

display(df)

AB
2020-01-01A00
B11
C22
2020-01-02A33
B44
C55
2020-01-03A66
B77
C88
df.index.dtypes
level_0    object
level_1    object
dtype: object
df.index.set_levels(pd.to_datetime(df.index.levels[0]), level=0)
MultiIndex([('2020-01-01', 'A'),
            ('2020-01-01', 'B'),
            ('2020-01-01', 'C'),
            ('2020-01-02', 'A'),
            ('2020-01-02', 'B'),
            ('2020-01-02', 'C'),
            ('2020-01-03', 'A'),
            ('2020-01-03', 'B'),
            ('2020-01-03', 'C')],
           )

MultiIndex to DataFrame

Create a DataFrame out of the levels of a MultiIndex.

df.index.to_frame(index=False)

01
02020-01-01A
12020-01-01B
22020-01-01C
32020-01-02A
42020-01-02B
52020-01-02C
62020-01-03A
72020-01-03B
82020-01-03C