Kannada OCR using Tesseract.

I started using Tesseract to do OCR in Kannada.

I began with Tess4J, a Java wrapper around Tesseract. Getting started is reasonably easy; I just followed the instructions here.

Then came the problem: I kept getting dependency errors from the native DLLs. I used Dependency Walker to find out which dependencies weren’t being satisfied. It turned out that msvcp110.dll and msvcr110.dll, the Visual C++ 2012 runtime libraries, weren’t installed on my system. I installed them from here.
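If you don't have Dependency Walker handy, a rough first check is to see whether the two runtime DLLs exist at all. The sketch below is only that: the class name `RuntimeDllCheck` and the helper `missingFiles` are made up for illustration, and it assumes the DLLs would sit in the System32 folder under `%SystemRoot%`. It doesn't replace Dependency Walker, because it can't tell you about bitness mismatches or about DLLs that are found elsewhere on the search path.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RuntimeDllCheck {

    /** Returns the names from 'required' that are not regular files inside 'dir'. */
    static List<String> missingFiles(Path dir, List<String> required) {
        List<String> missing = new ArrayList<>();
        for (String name : required) {
            if (!Files.isRegularFile(dir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumption: the runtime DLLs are expected in %SystemRoot%\System32.
        // A 32-bit JVM on 64-bit Windows is redirected to a different folder,
        // so treat the result as a hint, not a verdict.
        String systemRoot = System.getenv("SystemRoot");
        Path dir = Paths.get(systemRoot != null ? systemRoot : "C:\\Windows", "System32");

        List<String> required = Arrays.asList("msvcp110.dll", "msvcr110.dll");
        List<String> missing = missingFiles(dir, required);

        if (missing.isEmpty()) {
            System.out.println("Both runtime DLLs were found in " + dir);
        } else {
            System.out.println("Not found in " + dir + ": " + missing);
        }
    }
}
```

If either name shows up as missing, installing the Visual C++ 2012 redistributable that matches your JVM (32-bit or 64-bit) is the likely fix.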

Then I downloaded the Kannada training data from the Parichit project listed on the Tesseract plugins page. Later I found a larger training file and thought it might work better. Apparently not: it produced a bunch of errors, so something is wrong with that file. It’s detailed here.
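One way to tell whether errors come from a training file or from the Java wrapper is to run the same image through the plain `tesseract` command line. The sketch below does that from Java with `ProcessBuilder` instead of going through Tess4J. It assumes `tesseract` is on the PATH and that the training file is installed as `kan.traineddata` in the tessdata folder; the class name `TesseractCli` and the file names `page.png` and `page-out` are placeholders I made up.

```java
import java.util.ArrayList;
import java.util.List;

final class TesseractCli {

    /** Builds the command line: tesseract <image> <outputBase> -l <lang> */
    static List<String> buildCommand(String image, String outputBase, String lang) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(image);
        cmd.add(outputBase); // tesseract appends ".txt" to this name
        cmd.add("-l");
        cmd.add(lang);
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical file names; replace with your own image and output base.
        List<String> cmd = buildCommand("page.png", "page-out", "kan");

        // inheritIO() lets tesseract's own error messages show up in the console,
        // which is where a broken training file tends to complain.
        Process process = new ProcessBuilder(cmd).inheritIO().start();
        int exitCode = process.waitFor();
        System.out.println("tesseract exited with code " + exitCode);
    }
}
```

If the command line fails with the same training file, the file itself is the likely culprit and not the wrapper.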

I ran the OCR on an input image using the earlier training file.


The OCR output was

ಷ್ಠೆಕ್ವಿಗಳ.), ಪಾದ) ಸೇಂತೈರ), ತೈಶ್ವ ಜಸ್ಯ್ಹ್ದನಿಗಳು, ದೇ(‘ಶೆ ಭಸಿಕ್ತ್ಡರ), ಕೆವಿಗೋ
ಪಾತಿಲತಿಗಳು, ಗಣಿತೈ, ವೀಜಸ್ಯನ, ಜೋ್ಜ್ಹ (‘ತಿಷ್ಠೆ, ಅಯೇಕ್ಕಿವುೋದ ಮುಂತಿ.್ಕ
ಅವೆೋಕೆ ವಿಬಾರಗಳಲ್ತಿ ಪೇಂದಿತೈಠು, ವಿ(‘ರಠು, ಶಿ್ವ)್ಸ, ಸೇಂಗೀೋತ’, ವಾಟೃ, ವಾಷಿ
ಕುಶೇಲು ಕೆಲೆಗಳಲ್ತಿ ಪರೀಣಿತೈಠು ಹುಟ್ನೃ ವೆಮೆತಿ ದಡದ ಹ್ಯಾಗೋ ಮಾ;

Terrible. Pathetic. It reads every na as va, and ra becomes ttha. It needs a ton more training data. Ugh.
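To put a number on "terrible", one common approach is a character error rate: the edit distance between the OCR output and a hand-typed reference, divided by the length of the reference. Here is a minimal sketch, assuming Java 8 or later. The class name `CharErrorRate` is made up, and the sample word is just an illustration of the na/va and ra/ttha confusions, not text from the image above. It counts Unicode code points, so a consonant and its vowel sign count as two characters.

```java
final class CharErrorRate {

    /** Levenshtein edit distance between two strings, counted in code points. */
    static int editDistance(String a, String b) {
        int[] x = a.codePoints().toArray();
        int[] y = b.codePoints().toArray();
        int[] prev = new int[y.length + 1];
        int[] curr = new int[y.length + 1];

        // Distance from the empty prefix of 'a' to each prefix of 'b'.
        for (int j = 0; j <= y.length; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= x.length; i++) {
            curr[0] = i;
            for (int j = 1; j <= y.length; j++) {
                int substitution = prev[j - 1] + (x[i - 1] == y[j - 1] ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[y.length];
    }

    /** Edit distance divided by the number of code points in the reference. */
    static double errorRate(String reference, String ocrOutput) {
        int referenceLength = reference.codePointCount(0, reference.length());
        if (referenceLength == 0) {
            throw new IllegalArgumentException("reference text must not be empty");
        }
        return (double) editDistance(reference, ocrOutput) / referenceLength;
    }

    public static void main(String[] args) {
        // Made-up example: "nagara" with na read as va and ra read as ttha.
        String reference = "ನಗರ";
        String ocrOutput = "ವಗಠ";
        System.out.println("edit distance: " + editDistance(reference, ocrOutput));
        System.out.println("error rate: " + errorRate(reference, ocrOutput));
    }
}
```

Tracking this number across training files would make it easier to tell whether a new file is actually an improvement.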


About wanderlust

just your average books-and-music person who wants to change the world.

Posted on January 24, 2015, in Uncategorized.
