‹ Geoff Ruddock

Indexing (and MultiIndex)

Jan 07, 2020
import numpy as np
import pandas as pd

from platform import python_version

print(f'python: {python_version()}')
print(f'pandas: {pd.__version__}')
python: 3.8.10
pandas: 1.2.4

Selecting

Select only datetime columns

dates = pd.date_range('2010-01-01', '2019-12-31')
dates

df = pd.DataFrame({
    'user_id': range(6),
    'name': ['Alice', 'Bob', 'Charles', 'Denise', 'Eric', 'Fiona'],
    'age': np.random.randint(20, 80, size=6),
    'location': ['New York', 'Toronto', 'Chicago', 'Berlin', 'Paris', 'Singapore'],
    'registration_date': np.random.choice(dates, size=6),
    'purchase_date': np.random.choice(dates, size=6),
    'cancel_date': np.random.choice(dates, size=6)
}).set_index('user_id')


df.select_dtypes(include=['datetime64[ns]'])

registration_date purchase_date cancel_date
user_id
0 2016-12-06 2017-09-28 2019-06-28
1 2018-06-10 2014-01-28 2018-01-19
2 2012-02-22 2012-12-09 2014-11-15
3 2018-10-15 2011-08-23 2018-08-07
4 2016-12-20 2010-11-29 2019-09-22
5 2012-06-05 2011-02-20 2019-01-14

Select only columns with non-zero values

df = pd.DataFrame(np.random.randint(1, 100, size=(10, 5)))
df.iloc[:, [2, 4]] = 0
df

0 1 2 3 4
0 95 87 0 39 0
1 46 27 0 29 0
2 19 35 0 17 0
3 55 40 0 93 0
4 87 69 0 63 0
5 72 11 0 52 0
6 88 68 0 84 0
7 60 32 0 22 0
8 33 93 0 63 0
9 98 42 0 5 0
df.loc[:, (df != 0).any(axis=0)]

0 1 3
0 95 87 39
1 46 27 29
2 19 35 17
3 55 40 93
4 87 69 63
5 72 11 52
6 88 68 84
7 60 32 22
8 33 93 63
9 98 42 5

Select using combination of loc and iloc

df.iloc[5:, df.columns.get_indexer([0, 2])]

0 2
5 72 0
6 88 0
7 60 0
8 33 0
9 98 0

MultiIndex

Select rows in which one level of the multiindex match a list of values

bool_filt = (df.groupby(level='user_id').size() > 1)
user_ids_to_include = bool_filt[bool_filt].index
ms_df = df[lambda x: x.index.get_level_values('user_id').isin(user_ids_to_include)]

Convert all levels of a hierarchical index to string

df.columns.set_levels([x.astype(str) for x in x.columns.levels])

E.g. including rows from users with more than one row

bool_filt = (df.groupby(level='user_id').size() > 1)
user_ids_to_include = bool_filt[bool_filt].index
multi_search_df = df[lambda x: x.index.get_level_values('user_id').isin(user_ids_to_include)]

Manipulate existing index

df = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
display(df)

A B
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4

Add level above existing

insert_above = df.copy()

insert_above.columns = pd.MultiIndex.from_product([['Super'], df.columns])
insert_above

Super
A B
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4

Add level below existing

insert_below = df.copy()

insert_below.columns = pd.MultiIndex.from_product([df.columns, ['Sub']])
insert_below

A B
Sub Sub
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4

Swap existing levels

df = pd.DataFrame(index=list('abcde'), data={'sub_level_1': range(5), 'sub_level_2': range(5)})
df.columns = pd.MultiIndex.from_product([['super_level'], df.columns])
df

super_level
sub_level_1 sub_level_2
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
df.swaplevel(axis=1)

sub_level_1 sub_level_2
super_level super_level
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4

Misc

Create a DataFrame out of the levels of a MultiIndex

df.index.to_frame(index=False)

0
0 a
1 b
2 c
3 d
4 e
comments powered by Disqus