Reproducible Excel files
(with hate)
I sometimes write Python scripts that save output to Excel files using openpyxl
. It's convenient because unlike with CSV files, there's no long import wizard when opening the data in Excel for extra impromptu analysis or inspection. It Just Works.
One problem I've run into is that the results are not reproducible by default. This is annoying if you want to commit them to a repo for version control's sake (yes, even though it's a binary file) or even if you just want the hash to be reproducible for some reason — maybe for testing.
It is possible to fix this. There are two things to fix:
- First, the Excel file embeds a creation time and a modification time separate from the filesystem's values for those things. To make the file reproducible, I needed to set
workbook.properties.created
andworkbook.properties.modified
to some fixeddatetime
value. I also needed to use theExcelWriter
class directly to save the workbook rather than callingworkbook.save()
, becauseworkbook.save()
tries to update the workbook's modification time todatetime.utcnow()
– exactly what I don't want. UsingExcelWriter
I was able to bypass this logic. - Second, the Excel file as a zip file has its own internal modification times for each file in the zip. To fix this, I had to (1) save the workbook to another file
${myfile}.fake
, then (2) use a variation of theReproducibleZipInfo
workaround to rewrite the workbook zip to the target location of${myfile}
with fixed modification times. I'm only writing to this file, not reading from it, so I don't have to worry when writing to the target location about overwriting the data I'm reading.
This was a pain to figure out, but it's good to know it's possible. A nice side benefit is that the code is relatively straightforward now that I know how.