import os
import numpy as np
import pandas as pd
from platform import python_version
print(f'python: {python_version()}')
print(f'pandas: {pd.__version__}')
python: 3.8.10
pandas: 1.2.4
Loading data
Pythonify column names
Rename dataframe columns to valid Python identifiers so that you can use dot notation to access them.
df = pd.DataFrame(np.random.random(size=(5, 3)), columns=['Quantity A', 'Quantity B\t', 'Quantity C\n'])
df
|   | Quantity A | Quantity B\t | Quantity C\n |
|---|------------|--------------|--------------|
| 0 | 0.421532 | 0.838176 | 0.867388 |
| 1 | 0.171445 | 0.201028 | 0.418358 |
| 2 | 0.095384 | 0.949771 | 0.604631 |
| 3 | 0.473025 | 0.623443 | 0.093247 |
| 4 | 0.032081 | 0.500601 | 0.277768 |
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
df
|   | quantity_a | quantity_b | quantity_c |
|---|------------|------------|------------|
| 0 | 0.421532 | 0.838176 | 0.867388 |
| 1 | 0.171445 | 0.201028 | 0.418358 |
| 2 | 0.095384 | 0.949771 | 0.604631 |
| 3 | 0.473025 | 0.623443 | 0.093247 |
| 4 | 0.032081 | 0.500601 | 0.277768 |
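Now that every column name is a valid Python identifier, dot notation works as advertised. A quick check, reusing the renamed df above (the .mean() call is just an illustrative aggregation):
# attribute access is now equivalent to bracket access
df.quantity_a.mean()  # same as df['quantity_a'].mean()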
Remove duplicate index values
- Suppose you want to discard rows with a duplicate index value. `df.drop_duplicates()` works on values, but not on the index itself.
- We can use `df.index.duplicated()` to isolate the duplicated index values, and then select everything except those indices from our dataframe.
- We use the `keep=False` flag because we don't want to deduplicate the index (leaving one of each) but to remove every entry whose index is duplicated at all. This typically arises when duplication signals unreliable upstream data and we have no way to tell which row is the correct one.
df = pd.DataFrame({
'user_id': [0, 1, 1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charles', 'Denise', 'Eric', 'Fiona'],
'age': np.random.randint(20, 80, size=6),
'location': ['New York', 'Toronto', 'Chicago', 'Berlin', 'Paris', 'Singapore']
}).set_index('user_id')
df
| user_id | name | age | location |
|---------|------|-----|----------|
| 0 | Alice | 67 | New York |
| 1 | Bob | 52 | Toronto |
| 1 | Charles | 39 | Chicago |
| 2 | Denise | 50 | Berlin |
| 3 | Eric | 53 | Paris |
| 4 | Fiona | 60 | Singapore |
df[~df.index.duplicated(keep=False)]
| user_id | name | age | location |
|---------|------|-----|----------|
| 0 | Alice | 67 | New York |
| 2 | Denise | 50 | Berlin |
| 3 | Eric | 53 | Paris |
| 4 | Fiona | 60 | Singapore |
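For contrast, the default keep='first' marks only the second and later occurrences of an index value as duplicates, so one row per duplicated user_id would survive; a minimal sketch on the same dataframe:
# with keep='first' (the default), the first occurrence of each index value is kept,
# so Bob (the first user_id 1) survives and only Charles is dropped
df[~df.index.duplicated(keep='first')]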
Saving/loading
df['location'] = df['location'].astype('category')
display(df)
display(df.dtypes)
| user_id | name | age | location |
|---------|------|-----|----------|
| 0 | Alice | 67 | New York |
| 1 | Bob | 52 | Toronto |
| 1 | Charles | 39 | Chicago |
| 2 | Denise | 50 | Berlin |
| 3 | Eric | 53 | Paris |
| 4 | Fiona | 60 | Singapore |
name object
age int64
location category
dtype: object
Pickle
fp = 'file.pickle'
# write
df.to_pickle(fp)
# read
from_pickle = pd.read_pickle(fp)
display(from_pickle)
display(from_pickle.dtypes)
# clean up example file
os.remove(fp)
| user_id | name | age | location |
|---------|------|-----|----------|
| 0 | Alice | 67 | New York |
| 1 | Bob | 52 | Toronto |
| 1 | Charles | 39 | Chicago |
| 2 | Denise | 50 | Berlin |
| 3 | Eric | 53 | Paris |
| 4 | Fiona | 60 | Singapore |
name object
age int64
location category
dtype: object
HDF
- Supports partial reads and incremental writes (a sketch of both follows this list)
- Strange memory-leak bug on `pd.read_hdf` in a Jupyter notebook on a VM
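A sketch of the partial-read / incremental-write side, assuming the df from above; the file name, key, and the age filter are illustrative, and where= queries require the queried column to be declared via data_columns:
fp = 'file_incremental.hdf5'
# incremental writes: create a table-format store, then append more rows to it
df.iloc[:3].to_hdf(fp, key='df', format='table', data_columns=['age'])
df.iloc[3:].to_hdf(fp, key='df', format='table', append=True)
# partial read: only rows matching the query are loaded
older = pd.read_hdf(fp, key='df', where='age > 50')
display(older)
# clean up example file
os.remove(fp)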
fp = 'file.hdf5'
# write
df.to_hdf(fp, key='df', format='table', complevel=5, complib='zlib')
# read back
from_hdf = pd.read_hdf(fp)
display(from_hdf)
display(from_hdf.dtypes)
# clean up example file
os.remove(fp)
| user_id | name | age | location |
|---------|------|-----|----------|
| 0 | Alice | 67 | New York |
| 1 | Bob | 52 | Toronto |
| 1 | Charles | 39 | Chicago |
| 2 | Denise | 50 | Berlin |
| 3 | Eric | 53 | Paris |
| 4 | Fiona | 60 | Singapore |
name object
age int64
location category
dtype: object
Feather
Does not support non-default indices; if your dataframe has one, you'll get this error:
ValueError: feather does not support serializing a non-default index for the index; you can .reset_index() to make the index into column(s)
!pip install -Uqq pyarrow
fp = 'file.feather'
# write
df.reset_index().to_feather(fp)
# read back
from_feather = pd.read_feather(fp).set_index('user_id')
display(from_feather)
display(from_feather.dtypes)
# clean up example file
os.remove(fp)
| user_id | name | age | location |
|---------|------|-----|----------|
| 0 | Alice | 67 | New York |
| 1 | Bob | 52 | Toronto |
| 1 | Charles | 39 | Chicago |
| 2 | Denise | 50 | Berlin |
| 3 | Eric | 53 | Paris |
| 4 | Fiona | 60 | Singapore |
name object
age int64
location category
dtype: object
Parquet
- Parallelizes reads across multiple cores
- Still in active development and can be finicky
- Does not support hierarchical indices
- Can run into errors with very large string columns
- If you see `AttributeError: module 'snappy' has no attribute 'compress'`, it is picking up the wrong package; open a terminal and run `conda install python-snappy`.
- `RuntimeError: Different dictionaries encountered while building categorical` is caused by a multi-index (a workaround is sketched below).
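The multi-index error can be sidestepped the same way as Feather's index restriction, by flattening the index before writing and restoring it after reading. This workaround is my assumption rather than something the fastparquet docs promise, and the two-level index below is purely illustrative:
fp_multi = 'file_multi.parquet'
# hypothetical two-level index: (user_id, name)
multi = df.set_index('name', append=True)
# flatten the index into ordinary columns before writing
multi.reset_index().to_parquet(fp_multi, engine='fastparquet', compression='gzip')
# restore the index after reading
back = pd.read_parquet(fp_multi, engine='fastparquet').set_index(['user_id', 'name'])
display(back)
# clean up example file
os.remove(fp_multi)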
!pip install -Uqq fastparquet
fp = 'file.parquet'
# write
df.to_parquet(fp, compression='gzip')
# read back
from_parquet = pd.read_parquet(fp, engine='fastparquet')
display(from_parquet)
display(from_parquet.dtypes)
# clean up example file
os.remove(fp)
| user_id | name | age | location |
|---------|------|-----|----------|
| 0 | Alice | 67 | New York |
| 1 | Bob | 52 | Toronto |
| 1 | Charles | 39 | Chicago |
| 2 | Denise | 50 | Berlin |
| 3 | Eric | 53 | Paris |
| 4 | Fiona | 60 | Singapore |
name object
age int64
location category
dtype: object
Further reading