
I/O with pandas

Jul 13, 2019
import os
import numpy as np
import pandas as pd

from platform import python_version

print(f'python: {python_version()}')
print(f'pandas: {pd.__version__}')
python: 3.8.10
pandas: 1.2.4

Loading data

Pythonify column names

Name dataframe columns as valid Python identifiers so that you can use dot notation to access them.

df = pd.DataFrame(np.random.random(size=(5, 3)), columns=['Quantity A', 'Quantity B\t', 'Quantity C\n'])
df

Quantity A Quantity B\t Quantity C\n
0 0.421532 0.838176 0.867388
1 0.171445 0.201028 0.418358
2 0.095384 0.949771 0.604631
3 0.473025 0.623443 0.093247
4 0.032081 0.500601 0.277768
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
df

quantity_a quantity_b quantity_c
0 0.421532 0.838176 0.867388
1 0.171445 0.201028 0.418358
2 0.095384 0.949771 0.604631
3 0.473025 0.623443 0.093247
4 0.032081 0.500601 0.277768
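
Now that the column names are valid identifiers, dot notation works alongside bracket access:

# attribute and bracket access now refer to the same column
assert df.quantity_a.equals(df['quantity_a'])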

Remove duplicate index values

  • Suppose you want to discard rows with a duplicate index value. df.drop_duplicates() works on values, but not on the index itself.
  • We can use df.index.duplicated() to flag the duplicated index values, then select everything except those rows from the dataframe.
  • We pass keep=False because we don’t want to deduplicate the index (keeping one row per value), but rather to drop every row whose index appears more than once. This typically arises when duplication signals unreliable upstream data and we have no way of knowing which row is the correct one. For the contrast with the default behaviour, see the comparison after the example below.
df = pd.DataFrame({
    'user_id': [0, 1, 1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charles', 'Denise', 'Eric', 'Fiona'],
    'age': np.random.randint(20, 80, size=6),
    'location': ['New York', 'Toronto', 'Chicago', 'Berlin', 'Paris', 'Singapore']
}).set_index('user_id')

df

name age location
user_id
0 Alice 67 New York
1 Bob 52 Toronto
1 Charles 39 Chicago
2 Denise 50 Berlin
3 Eric 53 Paris
4 Fiona 60 Singapore
df[~df.index.duplicated(keep=False)]

name age location
user_id
0 Alice 67 New York
2 Denise 50 Berlin
3 Eric 53 Paris
4 Fiona 60 Singapore
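
For comparison, the default keep='first' would retain one row per duplicated index value rather than dropping them all:

# keeps the first occurrence of each duplicated index value, so Bob’s row
# (the first user_id 1) survives while Charles’s is dropped
df[~df.index.duplicated(keep='first')]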

Saving/loading

df['location'] = df['location'].astype('category')

display(df)
display(df.dtypes)

name age location
user_id
0 Alice 67 New York
1 Bob 52 Toronto
1 Charles 39 Chicago
2 Denise 50 Berlin
3 Eric 53 Paris
4 Fiona 60 Singapore
name          object
age            int64
location    category
dtype: object

Pickle

fp = 'file.pickle'

# write
df.to_pickle(fp)

# read
from_pickle = pd.read_pickle(fp)
display(from_pickle)
display(from_pickle.dtypes)

# clean up example file
os.remove(fp)

name age location
user_id
0 Alice 67 New York
1 Bob 52 Toronto
1 Charles 39 Chicago
2 Denise 50 Berlin
3 Eric 53 Paris
4 Fiona 60 Singapore
name          object
age            int64
location    category
dtype: object

HDF

  • Supports partial reads and incremental writes (demonstrated in the sketch after the example below)
  • There is a strange memory-leak bug with pd.read_hdf when run from a Jupyter notebook on a VM
!pip install -Uqq tables
fp = 'file.hdf5'

# write
df.to_hdf(fp, key='df', format='table', complevel=5, complib='zlib')

# read back
from_hdf = pd.read_hdf(fp)
display(from_hdf)
display(from_hdf.dtypes)

# clean up example file
os.remove(fp)

name age location
user_id
0 Alice 67 New York
1 Bob 52 Toronto
1 Charles 39 Chicago
2 Denise 50 Berlin
3 Eric 53 Paris
4 Fiona 60 Singapore
name          object
age            int64
location    category
dtype: object
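
A minimal sketch of those two features, reusing the same df. The where= query assumes the queried column was registered as a data column at write time:

fp = 'file.hdf5'

# write with data_columns=True so that columns are queryable on read
df.to_hdf(fp, key='df', format='table', data_columns=True)

# incremental write: append more rows to the same key
df.to_hdf(fp, key='df', format='table', append=True)

# partial read: pull only the rows matching a condition
over_50 = pd.read_hdf(fp, key='df', where='age > 50')

# clean up example file
os.remove(fp)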

Feather

Feather does not support non-default indices; if you try to write one, you’ll get this error:

ValueError: feather does not support serializing a non-default index for the index; you can .reset_index() to make the index into column(s)
!pip install -Uqq pyarrow
fp = 'file.feather'

# write
df.reset_index().to_feather(fp)

# read back
from_feather = pd.read_feather(fp).set_index('user_id')
display(from_feather)
display(from_feather.dtypes)

# clean up example file
os.remove(fp)

name age location
user_id
0 Alice 67 New York
1 Bob 52 Toronto
1 Charles 39 Chicago
2 Denise 50 Berlin
3 Eric 53 Paris
4 Fiona 60 Singapore
name          object
age            int64
location    category
dtype: object

Parquet

  • Parallelizes reads across multiple cores
  • Still under active development, and can be finicky
  • Does not support hierarchical (multi-) indices; see the sketch at the end of this section for a workaround
  • We have also run into errors with very large string columns
  • If you see AttributeError: module 'snappy' has no attribute 'compress', pandas is trying to use the wrong snappy package. Open a terminal and run conda install python-snappy.
  • The error RuntimeError: Different dictionaries encountered while building categorical is caused by a multi-index.
!pip install -Uqq fastparquet
fp = 'file.parquet'

# write
df.to_parquet(fp, compression='gzip')

# read back
from_parquet = pd.read_parquet(fp, engine='fastparquet')
display(from_parquet)
display(from_parquet.dtypes)

# clean up example file
os.remove(fp)

name age location
user_id
0 Alice 67 New York
1 Bob 52 Toronto
1 Charles 39 Chicago
2 Denise 50 Berlin
3 Eric 53 Paris
4 Fiona 60 Singapore
name          object
age            int64
location    category
dtype: object
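
As a workaround for the multi-index limitation, one option (a sketch, not a tested fix for the error above) is to flatten the index into ordinary columns before writing, then rebuild it after reading. The two-level (user_id, location) index below is purely illustrative:

fp = 'file_multi.parquet'

# build an illustrative multi-index, then flatten it into columns
mi_df = df.set_index('location', append=True)
mi_df.reset_index().to_parquet(fp, compression='gzip')

# restore the multi-index after reading
from_parquet_mi = pd.read_parquet(fp, engine='fastparquet').set_index(['user_id', 'location'])

# clean up example file
os.remove(fp)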

