sparsediff: a diff tool for Linux sparse files (great for simulating embedded hardware!)

Working close to hardware is cool and all, but you frequently need to simulate or test code intended for hardware while offline, without the hardware at hand, and so on. A common way to do this is to rig the code with interfaces, abstractions, or mocks that stand in for the hardware at the last possible step. But no matter how low you insert your test classes, there is always the chance that a problem shows up on real hardware and never during testing.

One thing I found that helps mitigate this problem is sparse files. If you talk to the hardware through memory-mapped I/O from userspace by calling mmap on /dev/mem, then instead of mapping /dev/mem you can map an ordinary file and treat it as normal memory. You can even have another program constantly reading that file and pretending to be the hardware. The only problem was that I was trying to simulate 64-bit hardware, and some of the addresses I needed to access would require files of over 60 GB, which is not very practical.
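To make the idea concrete, here is a minimal sketch of mapping a regular file in place of /dev/mem. The helper names (`write_register`, `read_register`), the 32-bit little-endian register layout, and the `file_size` parameter are all assumptions made for illustration; they do not come from the real project.

```python
import mmap
import os
import struct


def write_register(path, address, value, file_size):
    """Write a 32-bit little-endian value at `address` in a file that
    stands in for /dev/mem (hypothetical helper, for illustration)."""
    page = mmap.ALLOCATIONGRANULARITY
    base = address - (address % page)  # mmap offsets must be page-aligned
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Grow the backing file (sparsely, if the filesystem allows it).
        if os.fstat(fd).st_size < file_size:
            os.ftruncate(fd, file_size)
        # Map only the single page that contains the register.
        with mmap.mmap(fd, page, flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE,
                       offset=base) as window:
            struct.pack_into("<I", window, address - base, value)
    finally:
        os.close(fd)


def read_register(path, address):
    """Read the 32-bit value back, the way a fake 'hardware' process might."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return struct.unpack("<I", os.pread(fd, 4, address))[0]
    finally:
        os.close(fd)
```

Because the mapping is MAP_SHARED, a second process that has the same file open sees the write, which is what lets it play the role of the hardware. The sketch assumes the register does not straddle a page boundary.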

I set out looking for solutions (FUSE filesystems and whatnot) and built numerous proofs of concept, until I finally realized that the feature I was looking for had been available in Linux ext4/tmpfs all along.

$ truncate -s 500G /tmp/memory.bin
$ du -sh /tmp/memory.bin
    0       /tmp/memory.bin
$ du -b /tmp/memory.bin
    536870912000       /tmp/memory.bin

It turns out that, if the filesystem supports it, you can seek to any offset in a file and write there without the file consuming all the space needed to reach that offset. Linux allocates storage only for the region that is actually written (one 4 KiB page in my testing). That is ideal for this kind of testing, because you normally only write a couple of kilobytes, scattered very sparsely across the address space.
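The same observation can be checked programmatically. The sketch below creates a sparse file, writes a few bytes far into it, and compares the apparent size with the space actually allocated. The function names are made up for this example, and the 512-byte unit for `st_blocks` is the Linux convention, so treat it as an assumption on other systems.

```python
import os


def make_sparse(path, size, offset, payload):
    """Create a file of `size` bytes and write `payload` at `offset`.
    Everything else stays a hole if the filesystem supports sparse files."""
    with open(path, "wb") as f:
        f.truncate(size)
        f.seek(offset)
        f.write(payload)


def sparse_usage(path):
    """Return (apparent_size, allocated_bytes) for `path`.
    st_blocks is counted in 512-byte units on Linux."""
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512
```

On a filesystem with sparse-file support, the allocated figure should come out far smaller than the apparent size, matching what `du -sh` and `du -b` report.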

Now to the main topic of this article. This approach has a minor inconvenience.
Suppose you want to see what changes in memory after a specific code change. You run your code and, once it finishes, you copy the sparse file that represents memory. Then you start from a fresh file, apply the change to the code, run everything again, and copy the new memory file.

If you then try to compare the two copies with diff, you get an unpleasant surprise: it is not aware of sparse files and simply reads through the entire file, holes included. And it is slooooow. So I set out to make this faster, using lseek's SEEK_HOLE and SEEK_DATA.
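For readers who have not used these flags, here is a rough sketch of the technique. It is not the sparsediff code itself, and the function names, the chunk size, and the requirement that both files have the same apparent size are assumptions made for this example. SEEK_DATA jumps to the next region that holds data, SEEK_HOLE jumps to the end of that region, and only those regions are read and compared.

```python
import errno
import os


def data_extents(fd):
    """Yield (start, end) byte ranges of `fd` that contain data."""
    size = os.fstat(fd).st_size
    pos = 0
    while pos < size:
        try:
            start = os.lseek(fd, pos, os.SEEK_DATA)
        except OSError as exc:
            if exc.errno == errno.ENXIO:  # only a hole remains after pos
                return
            raise
        end = os.lseek(fd, start, os.SEEK_HOLE)
        yield start, end
        pos = end


def merge(extents):
    """Merge overlapping or touching (start, end) ranges."""
    merged = []
    for start, end in sorted(extents):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def sparse_diff(path_a, path_b, chunk=64 * 1024):
    """Return a list of (offset, length) chunks where the files differ.
    Regions that are holes in both files are never read."""
    fd_a = os.open(path_a, os.O_RDONLY)
    fd_b = os.open(path_b, os.O_RDONLY)
    try:
        if os.fstat(fd_a).st_size != os.fstat(fd_b).st_size:
            raise ValueError("this sketch assumes equal apparent sizes")
        ranges = merge(list(data_extents(fd_a)) + list(data_extents(fd_b)))
        diffs = []
        for start, end in ranges:
            pos = start
            while pos < end:
                length = min(chunk, end - pos)
                # A hole in one file reads back as zeros, so the comparison
                # stays correct when only the other file has data here.
                if os.pread(fd_a, length, pos) != os.pread(fd_b, length, pos):
                    diffs.append((pos, length))
                pos += length
        return diffs
    finally:
        os.close(fd_a)
        os.close(fd_b)
```

Calling `sparse_diff("before.bin", "after.bin")` on two memory snapshots (hypothetical file names) returns the chunks worth inspecting. On filesystems that do not track holes, the kernel reports the whole file as a single data region, so the result is still correct, just no faster than a plain comparison.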

I didn't realize this could help other people until much later, so the code turned out very messy, and I haven't had time to rewrite it, so don't judge me too much. Here it is, on my GitHub.
