Yesterday I needed to digitise a lot of paper-based documents. I started out by scanning them (using Ubuntu’s simple-scan because, as far as I can tell, there is no way to batch scan double-sided documents in the correct order directly on macOS). I ended up with one big PDF that contained all the documents. The task now was to split them up.
I googled a little and realised that I could use the poppler-utils (called just poppler on Homebrew) to split and re-unite the PDFs. I wrote the following small bash script to do so:
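#!/bin/bash
for i in $(seq 1 2 10)
do
  # Extract pages i and i+1 as single-page PDFs
  pdfseparate -f "$i" -l "$((i + 1))" Scan.pdf "Part-$i-%03d.pdf"
  # Merge them into one two-page document
  pdfunite Part-"$i"-*.pdf "Part-$i.pdf"
  # Remove the single-page intermediates
  rm Part-"$i"-*.pdf
done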
What we are doing here is the following: our PDF (called Scan.pdf) is 10 pages long and we want to separate it into five documents of two pages each. So we create a loop that starts at 1 and counts up to 10 in increments of two, which gives us 1, 3, 5, 7 and 9.
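To check which page ranges the loop will visit before touching any files, a dry run along these lines can help. The function name `print_ranges` is made up for this illustration; it only prints and never calls poppler.

```shell
#!/bin/bash
# Dry run: print the page range handled in each loop iteration.
# print_ranges is a hypothetical helper, not part of poppler.
print_ranges() {
  local last_page=$1
  for i in $(seq 1 2 "$last_page"); do
    echo "pages $i-$((i + 1))"
  done
}

print_ranges 10
```

For the 10-page scan this prints five ranges, from `pages 1-2` to `pages 9-10`.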
pdfseparate is passed the start page with -f and the end page with -l. So in the first iteration it will start at page 1 and end at page 2. What pdfseparate does is create one new PDF for each page in that range. Each new file gets the page number wherever you put the %d in the last argument (we use %03d, which pads the number to three digits). So in our case, after the first iteration, we will have two documents: Part-1-001.pdf and Part-1-002.pdf.
Since we only have one-page documents now, we need to pair them up again. To do that we use the fact that the parts that belong together start with the same string (namely Part-<n>-). So we call pdfunite and tell it to merge all documents that match the pattern Part-<n>-*.pdf and combine them into a new document that is just called Part-<n>.pdf.
Finally we delete the intermediate PDFs. Note that this would also delete any other PDFs in your current working directory that match the pattern, so make sure that you do not have any to begin with.
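One way to sidestep that risk is to keep the single-page files in a scratch directory, so the cleanup never touches PDFs you already had. This is only a sketch of that idea, not the script used above: `split_pair` is a name invented here, and it assumes `mktemp -d` is available (it is on Ubuntu and macOS).

```shell
#!/bin/bash
# split_pair is a hypothetical helper: it extracts pages first..first+1
# from a PDF and merges them into Part-<first>.pdf.
split_pair() {
  local input=$1 first=$2
  local tmp
  tmp=$(mktemp -d) || return 1

  # Single-page files go into the scratch directory only.
  pdfseparate -f "$first" -l "$((first + 1))" "$input" "$tmp/page-%03d.pdf" &&
    pdfunite "$tmp"/page-*.pdf "Part-$first.pdf"
  local status=$?

  # Removes only the scratch directory, never files in the working directory.
  rm -rf "$tmp"
  return "$status"
}

# Usage: for i in $(seq 1 2 10); do split_pair Scan.pdf "$i"; done
```

The trade-off is one extra directory per iteration, which is cheap compared to losing a PDF that happened to match the glob.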
In our example we end up with five new documents of two pages each: Part-1.pdf, Part-3.pdf, Part-5.pdf, Part-7.pdf and Part-9.pdf. We could now go on to rename those so that they aren’t only odd-numbered, but I did not really care this time. One could also use pdfinfo to get the page count of the original document so that it does not need to be hardcoded in the seq call, but this worked well enough for the task at hand. 🚀
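Both follow-ups could look roughly like the sketch below. It assumes that `pdfinfo` prints a `Pages:` line, which is what gets parsed here. The function names, the `Doc-` prefix and the handling of a trailing single page are choices made up for this example, not something the script above did.

```shell
#!/bin/bash
# Read pdfinfo output on stdin and print the page count.
parse_page_count() {
  awk '/^Pages:/ { print $2 }'
}

# Split a PDF into two-page documents named Doc-1.pdf, Doc-2.pdf, ...
# (split_all and the Doc- prefix are hypothetical choices).
split_all() {
  local input=$1
  local pages doc=1 first=1 last
  pages=$(pdfinfo "$input" | parse_page_count)
  [ -n "$pages" ] || return 1

  while [ "$first" -le "$pages" ]; do
    # Clamp the end page so an odd page count still works.
    last=$((first + 1))
    if [ "$last" -gt "$pages" ]; then last=$pages; fi

    pdfseparate -f "$first" -l "$last" "$input" "Doc-$doc-%03d.pdf" || return 1
    if [ "$first" -eq "$last" ]; then
      # A trailing single page needs no merging, so just rename it.
      mv "$(printf 'Doc-%d-%03d.pdf' "$doc" "$first")" "Doc-$doc.pdf"
    else
      pdfunite Doc-"$doc"-*.pdf "Doc-$doc.pdf"
      rm Doc-"$doc"-*.pdf
    fi

    doc=$((doc + 1))
    first=$((first + 2))
  done
}

# Usage: split_all Scan.pdf
```

Counting documents separately from pages is what gives consecutive names, and the same warning about the `rm` glob applies here as in the original script.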