Crop image on whitespace: Preserving an endangered Pacific Island language

johnbent · Post by **johnbent** » 2014-12-16T10:48:35-07:00

Hello all. I'm a hobbyist trying to create a digitized dictionary for an obscure Pacific Island language, Palauan. I do have copyright holder permission. I have 350+ pages that look like this:

I do not need the accents transcribed. Unfortunately OCR thus far doesn't work well on this. I believe that OCR uses contextual language information to guess and it doesn't have contextual language information for Palauan. So my plan is to use human workers on Amazon's Mechanical Turk to help transcribe. Sadly I know very little about using Mechanical Turk nor about using ImageMagick but I do have decent computer skills, can program a bunch of languages like python, C, bash, etc., and prefer command line tools.

I've heard that Mechanical Turk workers prefer small tasks so I've thought about breaking up the pages in a few of the following ways:
1. A page at a time
2. A column at a time
3. A word entry at a time
4. A line from a column at a time

I was able to use 'convert' on the command line to split the columns but not perfectly; some pages have a slice of the other column in them. In another thread, user snibgo suggested option 4 above and provided the following steps:

I would tackle it like this:

1. Deskew each page.

2. Chop off head and tail of each page.

3. Divide each page into two columns, both trimmed left and right.

4. Divide each column into lines (but with no further trimming).

Now, you have one image per line in the dictionary. Each image that has a character at the far left is the start of a definition. Each image with white space at the left is a continuation.

So you then join up all the lines for each definition, and send that to the OCR.

That seems like a great suggestion. Can anyone kindly help out an imagemagick newbie with some potential command line parameters to achieve the above. All I've done is use convert with fixed pixel offsets for column splitting but that doesn't work great due to the different skewing on each of the pages. I did try multicrop with just the default parameters to see what would happen but all that produced was nine images of one letter each.

Or other ideas for how to accomplish my overarching goal are very much appreciated! Thanks all!