
I/O with pandas

Jul 13, 2019
import os
import numpy as np
import pandas as pd

from platform import python_version

print(f'python: {python_version()}')
print(f'pandas: {pd.__version__}')
python: 3.8.10
pandas: 1.2.4

Loading data

Pythonify column names

Name dataframe columns as valid Python identifiers so that you can use dot notation to access them.

df = pd.DataFrame(np.random.random(size=(5, 3)), columns=['Quantity A', 'Quantity B\t', 'Quantity C\n'])
df

Quantity A Quantity B\t Quantity C\n
0 0.421532 0.838176 0.867388
1 0.171445 0.201028 0.418358
2 0.095384 0.949771 0.604631
3 0.473025 0.623443 0.093247
4 0.032081 0.500601 0.277768
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
df

quantity_a quantity_b quantity_c
0 0.421532 0.838176 0.867388
1 0.171445 0.201028 0.418358
2 0.095384 0.949771 0.604631
3 0.473025 0.623443 0.093247
4 0.032081 0.500601 0.277768
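
Now that the column names are valid identifiers, dot notation works alongside bracket access:

# attribute and bracket access now refer to the same column
assert df.quantity_a.equals(df['quantity_a'])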

Remove duplicate index values

  • Suppose you want to discard rows with a duplicate index value. df.drop_duplicates() works on values, but not on the index itself.
  • We can use df.index.duplicated() to flag the duplicated index values, then select everything except those rows from the dataframe.
  • We pass keep=False because we don’t want to deduplicate the index (keeping one row per value), but rather to drop every row whose index appears more than once. This typically arises when duplication signals unreliable upstream data and we have no way of knowing which row is the correct one. For the contrast with the default behaviour, see the comparison after the example below.
df = pd.DataFrame({
    'user_id': [0, 1, 1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charles', 'Denise', 'Eric', 'Fiona'],
    'age': np.random.randint(20, 80, size=6),
    'location': ['New York', 'Toronto', 'Chicago', 'Berlin', 'Paris', 'Singapore']
}).set_index('user_id')

df

name age location
user_id
0 Alice 67 New York
1 Bob 52 Toronto
1 Charles 39 Chicago
2 Denise 50 Berlin
3 Eric 53 Paris
4 Fiona 60 Singapore
df[~df.index.duplicated(keep=False)]

name age location
user_id
0 Alice 67 New York
2 Denise 50 Berlin
3 Eric 53 Paris
4 Fiona 60 Singapore
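
For comparison, the default keep='first' would retain one row per duplicated index value rather than dropping them all:

# keeps the first occurrence of each duplicated index value, so Bob’s row
# (the first user_id 1) survives while Charles’s is dropped
df[~df.index.duplicated(keep='first')]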

Saving/loading

df['location'] = df['location'].astype('category')

display(df)
display(df.dtypes)

name age location
user_id
0 Alice 67 New York
1 Bob 52 Toronto
1 Charles 39 Chicago
2 Denise 50 Berlin
3 Eric 53 Paris
4 Fiona 60 Singapore
name          object
age            int64
location    category
dtype: object

Pickle

fp = 'file.pickle'

# write
df.to_pickle(fp)

# read
from_pickle = pd.read_pickle(fp)
display(from_pickle)
display(from_pickle.dtypes)

# clean up example file
os.remove(fp)

name age location
user_id
0 Alice 67 New York
1 Bob 52 Toronto
1 Charles 39 Chicago
2 Denise 50 Berlin
3 Eric 53 Paris
4 Fiona 60 Singapore
name          object
age            int64
location    category
dtype: object

HDF

  • Supports partial reads and incremental writes (demonstrated in the sketch after the example below)
  • There is a strange memory-leak bug with pd.read_hdf when run from a Jupyter notebook on a VM
!pip install -Uqq tables
fp = 'file.hdf5'

# write
df.to_hdf(fp, key='df', format='table', complevel=5, complib='zlib')

# read back
from_hdf = pd.read_hdf(fp)
display(from_hdf)
display(from_hdf.dtypes)

# clean up example file
os.remove(fp)

name age location
user_id
0 Alice 67 New York
1 Bob 52 Toronto
1 Charles 39 Chicago
2 Denise 50 Berlin
3 Eric 53 Paris
4 Fiona 60 Singapore
name          object
age            int64
location    category
dtype: object
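
A minimal sketch of those two features, reusing the same df. The where= query assumes the queried column was registered as a data column at write time:

fp = 'file.hdf5'

# write with data_columns=True so that columns are queryable on read
df.to_hdf(fp, key='df', format='table', data_columns=True)

# incremental write: append more rows to the same key
df.to_hdf(fp, key='df', format='table', append=True)

# partial read: pull only the rows matching a condition
over_50 = pd.read_hdf(fp, key='df', where='age > 50')

# clean up example file
os.remove(fp)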

Feather

Feather does not support non-default indices; if you try to write one, you’ll get this error:

ValueError: feather does not support serializing a non-default index for the index; you can .reset_index() to make the index into column(s)
!pip install -Uqq pyarrow
fp = 'file.feather'

# write
df.reset_index().to_feather(fp)

# read back
from_feather = pd.read_feather(fp).set_index('user_id')
display(from_feather)
display(from_feather.dtypes)

# clean up example file
os.remove(fp)

name age location
user_id
0 Alice 67 New York
1 Bob 52 Toronto
1 Charles 39 Chicago
2 Denise 50 Berlin
3 Eric 53 Paris
4 Fiona 60 Singapore
name          object
age            int64
location    category
dtype: object

Parquet

  • Parallelizes reads across multiple cores
  • Still under active development, and can be finicky
  • Does not support hierarchical (multi-) indices; see the sketch at the end of this section for a workaround
  • We have also run into errors with very large string columns
  • If you see AttributeError: module 'snappy' has no attribute 'compress', pandas is trying to use the wrong snappy package. Open a terminal and run conda install python-snappy.
  • The error RuntimeError: Different dictionaries encountered while building categorical is caused by a multi-index.
!pip install -Uqq fastparquet
fp = 'file.parquet'

# write
df.to_parquet(fp, compression='gzip')

# read back
from_parquet = pd.read_parquet(fp, engine='fastparquet')
display(from_parquet)
display(from_parquet.dtypes)

# clean up example file
os.remove(fp)

name age location
user_id
0 Alice 67 New York
1 Bob 52 Toronto
1 Charles 39 Chicago
2 Denise 50 Berlin
3 Eric 53 Paris
4 Fiona 60 Singapore
name          object
age            int64
location    category
dtype: object
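
As a workaround for the multi-index limitation, one option (a sketch, not a tested fix for the error above) is to flatten the index into ordinary columns before writing, then rebuild it after reading. The two-level (user_id, location) index below is purely illustrative:

fp = 'file_multi.parquet'

# build an illustrative multi-index, then flatten it into columns
mi_df = df.set_index('location', append=True)
mi_df.reset_index().to_parquet(fp, compression='gzip')

# restore the multi-index after reading
from_parquet_mi = pd.read_parquet(fp, engine='fastparquet').set_index(['user_id', 'location'])

# clean up example file
os.remove(fp)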

