Yesterday I needed to digitise a lot of paper-based documents. I started out by scanning them (using Ubuntu’s simple-scan because, as far as I can tell, there is no way to batch-scan double-sided documents on macOS in the correct order directly). I ended up with one big PDF that contained all the documents. The task now was to split them up.

I googled a little and realised that I could use poppler-utils (called just poppler on Homebrew) to split and re-unite the PDFs. I wrote the following small Bash script to do so:

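#!/bin/bash
# Split the 10-page Scan.pdf into five two-page documents.
for i in $(seq 1 2 10)
do
  # Extract pages i and i+1 as single-page files, e.g. Part-1-001.pdf
  pdfseparate -f "$i" -l "$((i + 1))" Scan.pdf "Part-$i-%03d.pdf"
  # Merge the single pages into one two-page document
  pdfunite "Part-$i"-*.pdf "Part-$i.pdf"
  # Remove the single-page intermediates
  rm "Part-$i"-*.pdf
done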

Here is what is going on: our PDF (called Scan.pdf) is 10 pages long and we want to separate it into five documents of two pages each. So we create a loop that starts at 1 and goes up to 10 in increments of two, which means i takes the values 1, 3, 5, 7 and 9.
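To check which page ranges such a loop produces before touching any files, a dry run can help. This is only a sketch: page_ranges and its two parameters (total page count and pages per document) are names I made up for illustration, not part of poppler.

```shell
#!/bin/bash
# Print the first and last page of each chunk, one chunk per line.
# page_ranges is a hypothetical helper, not a poppler command.
page_ranges() {
  local total=$1 size=$2 first last
  for first in $(seq 1 "$size" "$total"); do
    last=$((first + size - 1))
    # Cap the last chunk if the page count is not a multiple of the size
    if [ "$last" -gt "$total" ]; then last=$total; fi
    echo "$first $last"
  done
}

page_ranges 10 2
```

For the 10-page scan this prints the five ranges 1 2, 3 4, 5 6, 7 8 and 9 10, one per line.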

pdfseparate is passed the start page with -f and the end page with -l. So in the first iteration it will start at page 1 and end at page 2. pdfseparate creates one new PDF for each page in that range and names each file by putting the page number wherever the %d placeholder appears in the last argument (we use %03d, so the number is zero-padded to three digits). So in our case after the first iteration we will have two documents: Part-1-001.pdf and Part-1-002.pdf.
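The naming can be previewed without running pdfseparate at all. The sketch below assumes that pdfseparate expands %03d the same way printf does, which matches the file names mentioned above; expected_name is a made-up helper.

```shell
#!/bin/bash
# Predict the file name written for a given loop value and page number.
# Assumption: %03d in the pdfseparate pattern behaves like printf's %03d.
expected_name() {
  local start=$1 page=$2
  printf "Part-%d-%03d.pdf\n" "$start" "$page"
}

expected_name 1 1   # Part-1-001.pdf
expected_name 1 2   # Part-1-002.pdf
expected_name 9 10  # Part-9-010.pdf
```

The zero-padding matters later on: it keeps the shell glob in page order even when page numbers have a different number of digits.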

Since we only have one-page documents now, we need to pair them up again. To do that we use the fact that the parts that belong together start with the same string (Part-<n>-). So we call pdfunite and tell it to merge all documents that match the pattern Part-<n>-*.pdf into a new document that is just called Part-<n>.pdf.

Finally we delete the intermediate PDFs. Note that this would also delete any other PDFs in your current working directory that match the pattern, so make sure that you did not have any to begin with.
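One way to sidestep that risk is to keep the single-page files in a temporary directory, so the cleanup never touches the working directory. This is a sketch of that idea rather than what I actually ran; extract_range and its parameters are hypothetical names.

```shell
#!/bin/bash
# Extract pages first..last of a PDF into their own document, keeping the
# single-page intermediates in a throwaway directory.
# extract_range is a hypothetical helper name.
extract_range() {
  local input=$1 first=$2 last=$3 output=$4
  local tmpdir status
  tmpdir=$(mktemp -d) || return 1
  # Only merge if the split succeeded
  pdfseparate -f "$first" -l "$last" "$input" "$tmpdir/page-%03d.pdf" &&
    pdfunite "$tmpdir"/page-*.pdf "$output"
  status=$?
  # Only the temporary directory is removed, nothing in the working directory
  rm -rf "$tmpdir"
  return "$status"
}
```

With this in place, a call such as extract_range Scan.pdf 1 2 Part-1.pdf would stand in for the three commands inside the loop.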

In our example we end up with 5 new documents of two pages each: Part-1.pdf, Part-3.pdf, Part-5.pdf, Part-7.pdf and Part-9.pdf. We could now go on to rename those so that they aren’t only odd-numbered, but I did not really care this time. One could also use pdfinfo to get the page count of the original document so that it does not need to be hardcoded in the seq call, but this worked well enough for the task at hand. 🚀
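For the curious, both follow-ups could look roughly like this. It is an untested-on-real-scans sketch that assumes an even page count and relies on pdfinfo printing a "Pages:" line; the function names pages_from_info, doc_number and split_in_pairs are my own inventions.

```shell
#!/bin/bash
# Read pdfinfo output on stdin and print the value of its "Pages:" line.
pages_from_info() {
  awk '/^Pages:/ { print $2 }'
}

# Map a starting page (1, 3, 5, ...) to a running document number (1, 2, 3, ...).
doc_number() {
  local first=$1 size=$2
  echo $(( (first - 1) / size + 1 ))
}

# Split a PDF into two-page documents named Part-1.pdf, Part-2.pdf, ...
# Assumption: the page count is even, as with double-sided single sheets.
split_in_pairs() {
  local input=$1 total i
  total=$(pdfinfo "$input" | pages_from_info)
  for i in $(seq 1 2 "$total"); do
    pdfseparate -f "$i" -l "$((i + 1))" "$input" "Part-$i-%03d.pdf"
    pdfunite "Part-$i"-*.pdf "Part-$(doc_number "$i" 2).pdf"
    rm "Part-$i"-*.pdf
  done
}
```

Calling split_in_pairs Scan.pdf would then produce Part-1.pdf through Part-5.pdf for the 10-page scan, with no page count hardcoded.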