1 min read

Reproducible Excel files

(with hate)

I sometimes write Python scripts that save output to Excel files using openpyxl. It's convenient because unlike with CSV files, there's no long import wizard when opening the data in Excel for extra impromptu analysis or inspection. It Just Works.

One problem I've run into is that the results are not reproducible by default. This is annoying if you want to commit them to a repo for version control's sake (yes, even though it's a binary file) or even if you just want the hash to be reproducible for some reason — maybe for testing.

It is possible to fix this. There are two things to fix:

  1. First, the Excel file embeds a creation time and a modification time separate from the filesystem's values for those things. To make the file reproducible, I needed to set workbook.properties.created and workbook.properties.modified to some fixed datetime value. I also needed to use the ExcelWriter class directly to save the workbook rather than calling workbook.save(), because workbook.save() tries to update the workbook's modification time to datetime.utcnow() – exactly what I don't want. Using ExcelWriter I was able to bypass this logic.
  2. Second, the Excel file as a zip file has its own internal modification times for each file in the zip. To fix this, I had to (1) save the workbook to another file ${myfile}.fake, then (2) use a variation of the ReproducibleZipInfo workaround to rewrite the workbook zip to the target location of ${myfile} with fixed modification times. I'm only writing to this file, not reading from it, so I don't have to worry when writing to the target location about overwriting the data I'm reading.

This was a pain to figure out, but it's good to know it's possible. A nice side benefit is that the code is relatively straightforward now that I know how.