Geoff Ruddock

I/O with pandas

import os
import numpy as np
import pandas as pd

from platform import python_version

print(f'python: {python_version()}')
print(f'pandas: {pd.__version__}')
python: 3.8.10
pandas: 1.2.4

Loading data

Pythonify column names

Name dataframe columns as valid python identifiers so that you can use dot notation to access them.

df = pd.DataFrame(np.random.random(size=(5, 3)), columns=['Quantity A', 'Quantity B\t', 'Quantity C\n'])
df

Quantity A Quantity B\t Quantity C\n
0 0.421532 0.838176 0.867388
1 0.171445 0.201028 0.418358
2 0.095384 0.949771 0.604631
3 0.473025 0.623443 0.093247
4 0.032081 0.500601 0.277768
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
df

quantity_a quantity_b quantity_c
0 0.421532 0.838176 0.867388
1 0.171445 0.201028 0.418358
2 0.095384 0.949771 0.604631
3 0.473025 0.623443 0.093247
4 0.032081 0.500601 0.277768

Remove duplicate index values

df = pd.DataFrame({
    'user_id': [0, 1, 1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charles', 'Denise', 'Eric', 'Fiona'],
    'age': np.random.randint(20, 80, size=6),
    'location': ['New York', 'Toronto', 'Chicago', 'Berlin', 'Paris', 'Singapore']
}).set_index('user_id')

df

name age location
user_id
0 Alice 67 New York
1 Bob 52 Toronto
1 Charles 39 Chicago
2 Denise 50 Berlin
3 Eric 53 Paris
4 Fiona 60 Singapore
df[~df.index.duplicated(keep=False)]

name age location
user_id
0 Alice 67 New York
2 Denise 50 Berlin
3 Eric 53 Paris
4 Fiona 60 Singapore

Saving/loading

df['location'] = df['location'].astype('category')

display(df)
display(df.dtypes)

name age location
user_id
0 Alice 67 New York
1 Bob 52 Toronto
1 Charles 39 Chicago
2 Denise 50 Berlin
3 Eric 53 Paris
4 Fiona 60 Singapore
name          object
age            int64
location    category
dtype: object

Pickle

fp = 'file.pickle'

# write
df.to_pickle(fp)

# read
from_pickle = pd.read_pickle(fp)
display(from_pickle)
display(from_pickle.dtypes)

# clean up example file
os.remove(fp)

name age location
user_id
0 Alice 67 New York
1 Bob 52 Toronto
1 Charles 39 Chicago
2 Denise 50 Berlin
3 Eric 53 Paris
4 Fiona 60 Singapore
name          object
age            int64
location    category
dtype: object

HDF

fp = 'file.hdf5'

# write
df.to_hdf(fp, key='df', format='table', complevel=5, complib='zlib')

# read back
from_hdf = pd.read_hdf(fp)
display(from_hdf)
display(from_hdf.dtypes)

# clean up example file
os.remove(fp)

name age location
user_id
0 Alice 67 New York
1 Bob 52 Toronto
1 Charles 39 Chicago
2 Denise 50 Berlin
3 Eric 53 Paris
4 Fiona 60 Singapore
name          object
age            int64
location    category
dtype: object

Feather

Does not support non-default indices, otherwise you’ll get this error:

ValueError: feather does not support serializing a non-default index for the index; you can .reset_index() to make the index into column(s)
!pip install -Uqq pyarrow
fp = 'file.feather'

# write
df.reset_index().to_feather(fp)

# read back
from_feather = pd.read_feather(fp).set_index('user_id')
display(from_feather)
display(from_feather.dtypes)

# clean up example file
os.remove(fp)

name age location
user_id
0 Alice 67 New York
1 Bob 52 Toronto
1 Charles 39 Chicago
2 Denise 50 Berlin
3 Eric 53 Paris
4 Fiona 60 Singapore
name          object
age            int64
location    category
dtype: object

Parquet

!pip install -Uqq fastparquet
fp = 'file.parquet'

# write
df.to_parquet(fp, compression='gzip')

# read back
from_parquet = pd.read_parquet(fp, engine='fastparquet')
display(from_parquet)
display(from_parquet.dtypes)

# clean up example file
os.remove(fp)

name age location
user_id
0 Alice 67 New York
1 Bob 52 Toronto
1 Charles 39 Chicago
2 Denise 50 Berlin
3 Eric 53 Paris
4 Fiona 60 Singapore
name          object
age            int64
location    category
dtype: object

Further reading


comments powered by Disqus