# Pandas Downcast

Shrink pandas DataFrames with precision-safe schema inference. `pandas-downcast` finds the minimum viable type for each column, ensuring that the resulting values are within tolerance of the original values.
## Installation
```bash
pip install pandas-downcast
```
## Dependencies

- `pandas`
- `numpy`
## License

MIT
## Usage
```python
import pdcast as pdc
import numpy as np
import pandas as pd

data = {
    "integers": np.linspace(1, 100, 100),
    "floats": np.linspace(1, 1000, 100).round(2),
    "booleans": np.random.choice([1, 0], 100),
    "categories": np.random.choice(["foo", "bar", "baz"], 100),
}
df = pd.DataFrame(data)

# Downcast the DataFrame to its minimum viable schema.
df_downcast = pdc.downcast(df)

# Alternatively, infer the minimum schema without applying it...
schema = pdc.infer_schema(df)

# ...and coerce the DataFrame to that schema later.
df_new = pdc.coerce_df(df, schema)
```
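The inferred schema can be inspected directly. The round-trip check below is a sketch, assuming that coercing with an inferred schema reproduces the dtypes `downcast` chooses:

```python
# Inspect the inferred schema (its exact representation may vary
# between pandas-downcast versions).
print(schema)

# Sanity check (assumption): coercing with the inferred schema
# should reproduce the dtypes that downcast() produced.
print(df_new.dtypes.equals(df_downcast.dtypes))
```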
Smaller data types $\Rightarrow$ smaller memory footprint.
```python
df.info()
df_downcast.info()
```
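To quantify the difference without reading the full `info()` output, the totals from `DataFrame.memory_usage` can be compared directly (a small sketch, not part of the original example):

```python
# Total in-memory size before and after downcasting.
mem_pre = int(df.memory_usage(deep=True).sum())
mem_post = int(df_downcast.memory_usage(deep=True).sum())
print(f"{mem_pre:,} bytes -> {mem_post:,} bytes")
```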
Numerical data types will be downcast if the resulting values are within tolerance of the original values. For details on the tolerances used for numeric comparison, see the notes on `np.allclose`.
```python
print(df.head())
print(df_downcast.head())

print(pdc.options.ATOL)
print(pdc.options.RTOL)
```
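As a minimal illustration of the check `np.allclose` performs, with made-up values (`1000.55` is just a float in the example's range):

```python
import numpy as np

original = np.float64(1000.55)
roundtrip = np.float64(np.float32(original))  # value after a hypothetical float32 downcast

# np.isclose / np.allclose test: |a - b| <= atol + rtol * |b|
print(np.isclose(original, roundtrip, rtol=1e-5, atol=1e-8))    # True: within NumPy's default tolerances
print(np.isclose(original, roundtrip, rtol=1e-10, atol=1e-10))  # False: fails a tighter tolerance
```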
Tolerances can be set at the module level or passed as function arguments.
```python
pdc.options.ATOL = 1e-10
pdc.options.RTOL = 1e-10

df_downcast_new = pdc.downcast(df)
```
Or:
```python
infer_dtype_kws = {
    "ATOL": 1e-10,
    "RTOL": 1e-10,
}
df_downcast_new = pdc.downcast(df, infer_dtype_kws=infer_dtype_kws)
```
The `floats` column is now kept as `float64` to meet the tolerance requirement. Values in the `integers` column are still safely cast to `uint8`.
```python
df_downcast_new.info()
```
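The same expectation as an assertion-style sketch (the dtypes follow from the text above, not from output shown here):

```python
# The tighter tolerance keeps floats at float64, while integers
# are still representable exactly as uint8.
assert df_downcast_new["integers"].dtype == "uint8"
assert df_downcast_new["floats"].dtype == "float64"
```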
Inferred schemas can be restricted to NumPy data types only.
```python
df_downcast = pdc.downcast(df, numpy_dtypes_only=True)
schema = pdc.infer_schema(df, numpy_dtypes_only=True)
```
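A sketch of one practical consequence: with `numpy_dtypes_only=True`, pandas extension dtypes such as `category` should not appear, so the `categories` column is expected to remain `object`:

```python
# Compare inferred dtypes with and without the NumPy-only restriction.
print(pdc.downcast(df).dtypes)
print(pdc.downcast(df, numpy_dtypes_only=True).dtypes)
```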
## Example
The following example shows how downcasting data often leads to size reductions of greater than 70%, depending on the original types.
```python
import pdcast as pdc
import pandas as pd
import seaborn as sns

# Load every example dataset that ships with seaborn.
df_dict = {name: sns.load_dataset(name) for name in sns.get_dataset_names()}

results = []
for name, df in df_dict.items():
    size_pre = df.memory_usage(deep=True).sum()
    df_post = pdc.downcast(df)
    size_post = df_post.memory_usage(deep=True).sum()
    shrinkage = int((1 - (size_post / size_pre)) * 100)
    results.append(
        {"dataset": name, "size_pre": size_pre, "size_post": size_post, "shrink_pct": shrinkage}
    )

results_df = pd.DataFrame(results).sort_values("shrink_pct", ascending=False).reset_index(drop=True)
print(results_df)
```
```
           dataset  size_pre  size_post  shrink_pct
0             fmri    213232      14776          93
1          titanic    321240      28162          91
2        attention      5888        696          88
3         penguins     75711       9131          87
4             dots    122240      17488          85
5           geyser     21172       3051          85
6           gammas    500128     108386          78
7         anagrams      2048        456          77
8          planets    112663      30168          73
9         anscombe      3428        964          71
10            iris     14728       5354          63
11        exercise      3302       1412          57
12         flights      3616       1888          47
13             mpg     75756      43842          42
14            tips      7969       6261          21
15        diamonds   3184588    2860948          10
16  brain_networks   4330642    4330642           0
17     car_crashes      5993       5993           0
```