
The (de)serialization problem

Suppose you have a batch-processing pipeline with multiple steps. It's often nice to split this up a little bit – usually along time-cost lines, around the parts that take a long time to run. Long-running steps get separated out from the short ones to save time; the short ones get bundled together. You end up with your "main" scripts, the expensive ones, with preprocessing and postprocessing wrapped around them.

But why not go further? Why not split your pieces up into tens or hundreds of tiny scripts?

In short, because it's a pain. One reason is that splitting a step means you now have N+1 things to keep track of instead of N – an earlier step and a later step where there used to be one. But another reason is what I've been thinking of as the serialization problem, or the boundary problem.

Every time you split a script X into parts A and B, you have to explicitly transfer data between them. When the parts lived together this was easy – they ran in the same executable or interpreter, so you could just pass the data around directly. Explicit transfer is a lot harder than implicit transfer. You have plenty of options for the mechanism: if you have a shared filesystem, for example, it's easy for A to write its results there and for B to read them back in.
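As a minimal sketch of what that looks like (the file names, path, and functions here are made up for illustration): the hand-off that used to be an in-memory value inside X becomes an explicit write in A and a matching read in B.

```python
# part_a.py: hypothetical first half of what used to be one script X
import json

def expensive_step(raw):
    # Stand-in for the long-running work that justified the split.
    return {"records": [item.upper() for item in raw]}

if __name__ == "__main__":
    result = expensive_step(["a", "b", "c"])
    # The boundary: serialize the intermediate result so part B can pick it up.
    with open("/shared/intermediate.json", "w") as f:
        json.dump(result, f)
```

```python
# part_b.py: hypothetical second half of X
import json

if __name__ == "__main__":
    # Deserialize whatever part A wrote; both scripts have to agree on the format.
    with open("/shared/intermediate.json") as f:
        data = json.load(f)
    for record in data["records"]:
        print(record)
```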

However you do it, you're going to run into a problem: you now have to take care of serialization and deserialization somehow. This is extra code you have to write and maintain above and beyond what you needed for X alone, and the only reason it exists is so that A and B can coordinate and transfer the appropriate data. That means extra time for you to write it up front, and extra time in the future to make sure the two scripts stay on the same page about the format, what's included in the data, and so on. It also means extra time to make sure you load and handle the data properly.
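To make that concrete, here's the flavour of code that only exists because of the boundary (the expected field name is an assumption carried over from the sketch above): part B can no longer trust an in-memory value, so it grows loading and validation logic that X never needed.

```python
import json

# Keys part B expects part A to have written; purely illustrative.
EXPECTED_KEYS = {"records"}

def load_intermediate(path):
    with open(path) as f:
        data = json.load(f)
    missing = EXPECTED_KEYS - data.keys()
    if missing:
        # If part A's output format drifts, fail loudly here rather than
        # passing malformed data further down the pipeline.
        raise ValueError(f"intermediate file is missing keys: {missing}")
    return data
```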

This seems like an important limiting factor on how finely it's worth splitting your scripts.