Aug 29, 2023 1 min read programming

Pandas .memory_usage() underreports usage for string-heavy DFs?

I am using Pandas for a script. It loads some data, processes it, then saves it back out. At the start I log the size of the data in memory, since it could be somewhat large (gigabytes). To calculate the size in MiB I use something like df.memory_usage().sum() / 1024 / 1024. The script has been reporting a total of ~1 GiB to store the two dataframes, split in a 1:3 ratio.
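For reference, here is a minimal sketch of that kind of size logging. The helper name shallow_mib, the logger name, and the stand-in dataframe are made up for illustration; the real script's data and logging setup are different.

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("sizes")  # hypothetical logger name


def shallow_mib(df: pd.DataFrame) -> float:
    """Size in MiB as reported by memory_usage() with its default arguments."""
    return df.memory_usage().sum() / 1024 / 1024


# Stand-in data; the real script loads two much larger dataframes.
df = pd.DataFrame({"id": range(1_000), "value": [0.5] * 1_000})
log.info("df: %.3f MiB", shallow_mib(df))
```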

It has also been mysteriously running out of memory. Shortly before it runs out, it's using ~10 GiB, with no obvious culprit for the extra 9 GiB. I looked into this more closely with psutil.

It turns out that the log message was underreporting the dataframes' memory footprint. I tried running with a 1/10 subset of dataset A, and when I checked the memory usage with psutil, I found the process was using ~3 GiB to hold both the subset of dataset A and the full dataset B. Presumably it uses even more than that with the full dataset A, which is 10x larger.
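A psutil check along these lines might look like the sketch below. It assumes the number of interest is the resident set size (RSS) of the current process; the rss_gib helper and the stand-in workload are hypothetical. RSS covers the whole process (interpreter, imported libraries, temporaries), so it is an upper bound on what the dataframes themselves take up.

```python
import os

import psutil  # third-party: pip install psutil


def rss_gib() -> float:
    """Resident set size of the current process, in GiB."""
    rss_bytes = psutil.Process(os.getpid()).memory_info().rss
    return rss_bytes / 1024**3


before = rss_gib()
data = [str(i) * 10 for i in range(200_000)]  # stand-in for loading a dataframe
after = rss_gib()
print(f"RSS before: {before:.3f} GiB, after: {after:.3f} GiB")
```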

I suspect that this is because memory_usage() by default doesn't account for the full size of object-type data: for an object column it counts only the pointers stored in the column, not the Python objects they point to. You have to pass deep=True to have those objects included. Both datasets consist of lots of string columns, some of which have missing values (pd.NA). These columns get typed as object, if I recall correctly. Passing deep=True gives numbers that are much more in line with what psutil reports.
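The gap is easy to reproduce on a small scale. This is a sketch with made-up data: a single column of short strings with some pd.NA values, forced to object dtype so it behaves the same way regardless of the pandas version's default string handling. The exact numbers depend on the Python and pandas versions, so none are shown here.

```python
import pandas as pd

# Stand-in data: one object column of strings with some missing values.
n = 100_000
values = [f"customer-{i:08d}" if i % 10 else pd.NA for i in range(n)]
df = pd.DataFrame({"name": pd.Series(values, dtype=object)})

shallow = df.memory_usage().sum()        # pointers only for object columns
deep = df.memory_usage(deep=True).sum()  # also sizes the objects themselves

print(df.dtypes["name"])
print(f"shallow: {shallow / 1024 / 1024:.2f} MiB")
print(f"deep:    {deep / 1024 / 1024:.2f} MiB")
```

Logging the deep=True figure costs more time, since pandas has to inspect every object in the column, but for a one-off log line at startup that seems like a reasonable trade.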
