Add checkpointing to the data build

We need some sort of checkpointing system for the data build, so that the process can pick up where it stopped if it terminates prematurely. It doesn't necessarily need to resume exactly where it left off, but it seems reasonable to track which files have and have not been read, which file we're currently on, how many samples there are so far, etc.
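
To make the proposal concrete, here is a rough sketch of what the tracked state and the resume logic could look like. Everything in it is an assumption for discussion, not a description of the existing build: the names `BuildCheckpoint`, `run_build`, and `process_file` are hypothetical, the checkpoint is assumed to be a small JSON file, and resuming is assumed to happen at file granularity (a file that was in progress when the build stopped is reprocessed from the start).

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Callable, Iterable, List, Optional


@dataclass
class BuildCheckpoint:
    """Hypothetical checkpoint state: the fields mirror the items listed above."""

    completed_files: List[str] = field(default_factory=list)  # files fully read
    current_file: Optional[str] = None  # file in progress when last saved
    num_samples: int = 0  # samples counted from completed files

    def save(self, path) -> None:
        # Write to a temp file in the same directory, then rename over the
        # target, so an interrupted save never leaves a half-written checkpoint.
        path = Path(path)
        fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
        with os.fdopen(fd, "w") as f:
            json.dump(asdict(self), f)
        os.replace(tmp_name, path)

    @classmethod
    def load(cls, path) -> "BuildCheckpoint":
        # A missing checkpoint file means a fresh build.
        path = Path(path)
        if not path.exists():
            return cls()
        with path.open() as f:
            return cls(**json.load(f))


def run_build(
    files: Iterable[str],
    checkpoint_path,
    process_file: Callable[[str], int],
) -> BuildCheckpoint:
    """Process each file once, skipping files a previous run already finished.

    `process_file` is a placeholder for the real per-file work; it is assumed
    to return the number of samples it produced.
    """
    ckpt = BuildCheckpoint.load(checkpoint_path)
    done = set(ckpt.completed_files)
    for name in files:
        if name in done:
            continue
        # Record which file we are on before starting it.
        ckpt.current_file = name
        ckpt.save(checkpoint_path)
        ckpt.num_samples += process_file(name)
        # Only mark the file as read once it has been fully processed.
        ckpt.completed_files.append(name)
        ckpt.current_file = None
        ckpt.save(checkpoint_path)
    return ckpt
```

With this shape, rerunning `run_build` with the same file list and checkpoint path after a crash skips everything in `completed_files` and restarts at the file that was in progress. Whether the real build's outputs for a partially processed file can be safely regenerated is a separate question this sketch does not address.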