Pandas .memory_usage() underreports usage for string-heavy DFs?
I am using Pandas in a script that loads some data, processes it, and saves it back out. At the start I log the size of the data in memory, since it can be fairly large (gigabytes). To calculate that I use something like `df.memory_usage().sum() / 1024 / 1024`. The script has been reporting a total of ~1 GiB to store the two dataframes, split roughly 1:3.
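For context, the logging looks roughly like this (the function and names are just placeholders for what my script does):

```python
import pandas as pd

def log_df_size(name: str, df: pd.DataFrame) -> None:
    # Sum the per-column usage reported by pandas and convert bytes -> MiB.
    size_mib = df.memory_usage().sum() / 1024 / 1024
    print(f"{name}: {size_mib:.1f} MiB")
```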
It has also been mysteriously running out of memory. Shortly before it runs out, it's using ~10 GiB with no obvious culprit for the extra 9 GiB. I looked into this more closely with `psutil`.
It turns out that the log message was underreporting the dataframes' memory footprint. I tried running with a 1/10 subset of one side of the data, and when I checked memory usage with `psutil`, the process was using ~3 GiB to hold the subset of dataset A plus the full dataset B. Presumably it uses even more than that with the full-size dataset A.
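The `psutil` check was something along these lines, measuring the resident set size of the whole process rather than asking pandas:

```python
import os
import psutil

process = psutil.Process(os.getpid())
# RSS is the physical memory actually held by the process, independent of
# whatever pandas reports for individual dataframes.
rss_gib = process.memory_info().rss / 1024**3
print(f"Process RSS: {rss_gib:.2f} GiB")
```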
I suspect this is because `memory_usage()` by default doesn't compute the true size of object-dtype data: it only counts the array of pointers, not the Python string objects they reference. You have to pass `deep=True` to get accurate numbers for object columns. Both datasets consist of lots of string columns, some of which have missing values (`pd.NA`), and those columns get typed as `object` if I recall correctly. Passing `deep=True` gives numbers that are much more in line with what `psutil` reports.
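A minimal way to see the difference, assuming a pandas version where plain Python strings land in an `object` column:

```python
import pandas as pd

df = pd.DataFrame({"s": ["some reasonably long string"] * 1_000_000})
df.loc[::100, "s"] = pd.NA  # sprinkle in some missing values

shallow = df.memory_usage().sum()        # counts ~8 bytes per element (just the pointers)
deep = df.memory_usage(deep=True).sum()  # also counts the string objects themselves
print(f"shallow: {shallow / 1024**2:.1f} MiB, deep: {deep / 1024**2:.1f} MiB")
```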