In my last post, I talked about the need for machine translation in Indian languages, and how I was looking for use-cases. I think I’ve found a viable use case and a viable market.
Now that that’s done, I’m looking to do Kannada OCR, followed by language translation. And I’ll document whatever I read, whatever I find, on this blog, for accountability, visibility and discussion.
I start with Kannada OCR. OCR is pretty much the first step to translation, when you are dealing with scanned documents. I found there’s lots and lots of software that deals with this. It occurred to me that it’s not a hard problem at all.
A little more googling gave me Tesseract. It seems to pretty much be the gold standard for OCR. I noticed that ABBYY Finereader doesn’t have Kannada as one of its options… I must admit its API is pretty topnotch. Tesseract is a C++ library. The good thing is, there’s a whole bunch of other language wrappers around it. I can’t seem to find a Python 3 wrapper around Tesseract that also works on Windows, so I suppose I’ll get started on it using Java.
I found a few nice papers on Kannada OCR too. This one is a good introductory, though old, paper. This one is about segmenting old Kannada documents. As someone who doesn’t have much knowledge of what OCR will entail, especially segmentation, I found these useful in my context. I assume there are better, more descriptive papers on OCR as such, and I should read some more comprehensive survey papers on the subject.
These two papers provide more information on Tesseract as such, and while trying to get it working in Java, I also ought to read them in order to get a more intuitive understanding of the system I’m working with.