I am using pandas in a script. It loads some data, does some processing on it, then saves it back out. At the start I log the in-memory size of the data, since it can be somewhat large (gigabytes). To calculate that I use something like
df.memory_usage().sum() / 1024 / 1024. The script has been reporting a total of ~1 GiB to store the two dataframes, split in a 1:3 ratio.
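For reference, the sizing code looks roughly like this (the dataframe here is a small stand-in, not my actual data):

```python
import pandas as pd

# Stand-in dataframe; the real script loads gigabytes of data.
df = pd.DataFrame({"name": ["example string"] * 100_000,
                   "value": range(100_000)})

# Default memory_usage(): per-column byte counts, summed and converted to MiB.
size_mib = df.memory_usage().sum() / 1024 / 1024
print(f"reported size: {size_mib:.2f} MiB")
```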
It has also been mysteriously running out of memory. Shortly before it dies, it's using ~10 GiB, with no obvious culprit for the extra 9 GiB. I looked into this more closely, and it turns out the log message was underreporting the dataframes' memory footprint. I tried running with a 1/10 subset of one side of the data, and when I checked the memory usage with
psutil, I found the process was using ~3 GiB to hold subset dataset A and full dataset B. Presumably it uses even more than that with the full 10× dataset A.
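The psutil check was essentially reading the resident set size (RSS) of the process, something like:

```python
import os

import psutil

# RSS: the physical memory the OS has actually allocated to this process.
# It covers everything Python holds, not just the dataframes.
rss_gib = psutil.Process(os.getpid()).memory_info().rss / 1024**3
print(f"process RSS: {rss_gib:.2f} GiB")
```

RSS measures the whole process from the outside, so it catches memory that pandas-level accounting misses.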
I suspect that this is because
memory_usage() by default doesn't calculate an accurate size for object-dtype data; you have to pass
deep=True to get correct numbers for object columns. Both datasets consist of lots of string columns, some of which have missing values (
pd.NA). These columns get typed as
object, if I recall correctly. Passing
deep=True gives numbers much more in line with what psutil reports.
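A minimal demonstration of the discrepancy, using a made-up object-dtype string column:

```python
import pandas as pd

# Force object dtype, as happens with string columns holding pd.NA.
df = pd.DataFrame({"s": ["a reasonably long string"] * 100_000}, dtype=object)

shallow = df.memory_usage().sum()        # counts only the 8-byte pointers
deep = df.memory_usage(deep=True).sum()  # also counts the string objects

print(shallow, deep)
```

The shallow number only accounts for the array of pointers; deep=True walks the Python objects themselves, which is where the real cost of object-dtype string columns lives.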