Monthly Archives: January 2015
Someone who works for the Govt of India told me about the Indian Gazette, which publishes a summary of all the activities of the government in English and Hindi. And there are state gazettes as well, which I assumed did the same. I found that the central government puts out the gazette with the same content in both English and Hindi. As perfect a sentence-aligned corpus as you can hope for.
Unfortunately, the Karnataka government doesn’t seem to do that: it publishes everything only in Kannada. The Kerala government publishes only in English. And the Tamil Nadu government publishes some bullet points in English and some in Tamil.
I’d not checked on this earlier, unfortunately. Now I’m back to square one, looking for a dataset for Kannada machine translation. Know of any?
I started using Tesseract to do OCR in Kannada.
For starters, I used Tess4J, the Java wrapper around Tesseract. Getting started is reasonably easy; I just followed the instructions here.
And then the problem: I kept getting dependency issues with the DLLs. I used Dependency Walker to diagnose which dependencies weren’t being satisfied. Turned out msvcp110.dll and msvcr110.dll weren’t installed on my system. I installed them from here.
Then I downloaded the Kannada training data from the Parichit project listed on the Tesseract plugins page. I later found a larger file and thought it might be better. Apparently not: it resulted in a bunch of errors. There’s something wrong with that training file; it’s detailed here.
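Putting the pieces together — Tess4J on the classpath, its native DLLs resolvable, and kan.traineddata in a tessdata directory — the call looks roughly like this. A minimal sketch: the tessdata path and the image filename are assumptions about my setup, not anything prescribed by Tess4J.

```java
import java.io.File;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class KannadaOcr {
    public static void main(String[] args) throws TesseractException {
        ITesseract engine = new Tesseract();

        // Directory containing kan.traineddata (from the Parichit
        // project) -- this path is an assumption about my machine.
        engine.setDatapath("C:/tessdata");
        engine.setLanguage("kan"); // ISO 639-2 code for Kannada

        // The scanned page to recognise -- filename is an assumption.
        String text = engine.doOCR(new File("scan.png"));
        System.out.println(text);
    }
}
```

The whole getting-started dance really is just those three calls: point the engine at the training data, name the language, hand it an image.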
I used the earlier training file on an image input. The OCR output was:
ಷ್ಠೆಕ್ವಿಗಳ.), ಪಾದ) ಸೇಂತೈರ), ತೈಶ್ವ ಜಸ್ಯ್ಹ್ದನಿಗಳು, ದೇ(‘ಶೆ ಭಸಿಕ್ತ್ಡರ), ಕೆವಿಗೋ
ಪಾತಿಲತಿಗಳು, ಗಣಿತೈ, ವೀಜಸ್ಯನ, ಜೋ್ಜ್ಹ (‘ತಿಷ್ಠೆ, ಅಯೇಕ್ಕಿವುೋದ ಮುಂತಿ.್ಕ
ಅವೆೋಕೆ ವಿಬಾರಗಳಲ್ತಿ ಪೇಂದಿತೈಠು, ವಿ(‘ರಠು, ಶಿ್ವ)್ಸ, ಸೇಂಗೀೋತ’, ವಾಟೃ, ವಾಷಿ
ಕುಶೇಲು ಕೆಲೆಗಳಲ್ತಿ ಪರೀಣಿತೈಠು ಹುಟ್ನೃ ವೆಮೆತಿ ದಡದ ಹ್ಯಾಗೋ ಮಾ;
Terrible. Pathetic. It reads every na as va, and the ra becomes ttha. This needs a ton more training data. Ugh.
In my last post, I talked about the need for machine translation in Indian languages, and how I was looking for use-cases. I think I’ve found a viable use case and a viable market.
Now that that’s done, I’m looking to do Kannada OCR, followed by language translation. And I’ll document whatever I read, whatever I find, on this blog, for accountability, visibility and discussion.
I start with Kannada OCR. OCR is pretty much the first step to translation when you are dealing with scanned documents. I found there’s lots and lots of software that deals with this, which suggests it’s not a hard problem at all.
A little more googling gave me Tesseract, which seems to pretty much be the gold standard for OCR. I noticed that ABBYY FineReader doesn’t have Kannada as one of its options, though I must admit its API is pretty top-notch. Tesseract is a C++ library; the good thing is, there’s a whole bunch of wrappers for it in other languages. I can’t seem to find a Python 3 wrapper around Tesseract that also works on Windows, so I suppose I’ll get started on it using Java.
I found a few nice papers on Kannada OCR too. This one is a good introductory, though old, paper. This one is about segmenting old Kannada documents. As someone who doesn’t have much knowledge of what OCR entails, especially segmentation, I found these useful in my context. I assume there are better, more descriptive papers on OCR as such, and I should read some more comprehensive survey papers on the subject.
These two papers provide more detail on Tesseract itself; while getting it working in Java, I ought to read them too, to build a more intuitive understanding of the system I’m working with.