Kannada OCR using Tesseract.
I started using Tesseract to do OCR in Kannada.
I began with Tess4J, a Java wrapper around Tesseract. Getting started was reasonably easy: I just followed the instructions here.
The first problem was that I kept hitting dependency errors with the DLLs. I used Dependency Walker to find out which dependencies weren’t being satisfied. It turned out that msvcp110.dll and msvcr110.dll (the Visual C++ 2012 runtime libraries) weren’t installed on my system. I installed them from here.
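A rough way to check for those two DLLs up front, before Tess4J fails at load time, is to look for them in the directories on PATH. This is only a sketch: `RuntimeDllCheck`, `splitSearchPath` and `findMissing` are hypothetical names, not part of Tess4J, and the real Windows DLL search order also covers locations beyond PATH (the application directory, for one), so a "missing" result here is a hint rather than proof. It also cannot tell a 32-bit DLL from a 64-bit one.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tess4J: reports which DLLs cannot be
// found in any of the given directories.
final class RuntimeDllCheck {

    // Split a PATH-style string into directories, skipping empty or invalid entries.
    static List<Path> splitSearchPath(String searchPath, String separator) {
        List<Path> dirs = new ArrayList<>();
        if (searchPath == null) {
            return dirs;
        }
        for (String entry : searchPath.split(Pattern.quote(separator))) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            try {
                dirs.add(Paths.get(trimmed));
            } catch (InvalidPathException e) {
                // Ignore entries that are not valid paths on this platform.
            }
        }
        return dirs;
    }

    // Return the names that do not exist as regular files in any directory.
    static List<String> findMissing(List<String> dllNames, List<Path> dirs) {
        List<String> missing = new ArrayList<>();
        for (String name : dllNames) {
            boolean found = false;
            for (Path dir : dirs) {
                if (Files.isRegularFile(dir.resolve(name))) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<Path> dirs = splitSearchPath(System.getenv("PATH"), File.pathSeparator);
        List<String> missing =
                findMissing(Arrays.asList("msvcp110.dll", "msvcr110.dll"), dirs);
        System.out.println(missing.isEmpty()
                ? "Both runtime DLLs were found on PATH"
                : "Not found on PATH: " + missing);
    }
}
```

Dependency Walker remains the more thorough tool, since it follows the whole dependency tree rather than checking two known names.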
Then I downloaded the Kannada training data from the Parichit project listed on the Tesseract plugins page. Later I found a larger training file and thought it might give better results. Apparently not: it produced a bunch of errors. Something is wrong with that training file. It’s detailed here.
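One way to tell whether errors like these come from the training file or from the Java wrapper is to run the same image through the tesseract command-line tool, bypassing Tess4J. The sketch below does that from Java with ProcessBuilder. It assumes the `tesseract` executable is on PATH and that the Kannada file is installed as `kan.traineddata` in the tessdata directory Tesseract searches. `TesseractCli` and its methods are hypothetical names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that shells out to the tesseract command-line tool.
// This is a cross-check that bypasses Tess4J, not a replacement for it.
final class TesseractCli {

    // Builds the command: tesseract <image> <outputBase> -l <lang>
    static List<String> buildCommand(String image, String outputBase, String lang) {
        return Arrays.asList("tesseract", image, outputBase, "-l", lang);
    }

    // Runs tesseract and returns the text it wrote to <outputBase>.txt.
    static String run(String image, String outputBase, String lang)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(image, outputBase, lang))
                .inheritIO() // show tesseract's own error messages on the console
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        byte[] bytes = Files.readAllBytes(Paths.get(outputBase + ".txt"));
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: TesseractCli <image> <outputBase>");
            return;
        }
        System.out.println(run(args[0], args[1], "kan"));
    }
}
```

If the command-line tool reports the same errors with a given training file, the file itself is the likely culprit rather than the wrapper or the DLL setup.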
I went back to the earlier training file and ran it on a test image.
The OCR output was:
ಷ್ಠೆಕ್ವಿಗಳ.), ಪಾದ) ಸೇಂತೈರ), ತೈಶ್ವ ಜಸ್ಯ್ಹ್ದನಿಗಳು, ದೇ(‘ಶೆ ಭಸಿಕ್ತ್ಡರ), ಕೆವಿಗೋ
ಪಾತಿಲತಿಗಳು, ಗಣಿತೈ, ವೀಜಸ್ಯನ, ಜೋ್ಜ್ಹ (‘ತಿಷ್ಠೆ, ಅಯೇಕ್ಕಿವುೋದ ಮುಂತಿ.್ಕ
ಅವೆೋಕೆ ವಿಬಾರಗಳಲ್ತಿ ಪೇಂದಿತೈಠು, ವಿ(‘ರಠು, ಶಿ್ವ)್ಸ, ಸೇಂಗೀೋತ’, ವಾಟೃ, ವಾಷಿ
ಕುಶೇಲು ಕೆಲೆಗಳಲ್ತಿ ಪರೀಣಿತೈಠು ಹುಟ್ನೃ ವೆಮೆತಿ ದಡದ ಹ್ಯಾಗೋ ಮಾ;
Terrible. Pathetic. It reads all the na (ನ) as va (ವ), and the ra (ರ) becomes ttha (ಠ). It needs a ton more training data. Ugh.
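To put a number on "terrible", one common measure is the character error rate: the edit distance between the OCR output and a hand-typed ground truth, divided by the length of the ground truth. The sketch below is a plain Levenshtein distance over Unicode code points. `CharErrorRate` is a hypothetical name, and the two-letter strings in `main` are made-up illustrations of the na/va confusion, not text from the image above. Because it counts code points, a wrong vowel sign or virama counts as one error on its own.

```java
// Hypothetical helper: character error rate via Levenshtein distance.
final class CharErrorRate {

    // Minimum number of insertions, deletions and substitutions
    // (counted in Unicode code points) needed to turn a into b.
    static int editDistance(String a, String b) {
        int[] x = a.codePoints().toArray();
        int[] y = b.codePoints().toArray();
        int[] prev = new int[y.length + 1];
        int[] curr = new int[y.length + 1];
        for (int j = 0; j <= y.length; j++) {
            prev[j] = j; // turning an empty prefix into j characters takes j inserts
        }
        for (int i = 1; i <= x.length; i++) {
            curr[0] = i; // turning i characters into an empty prefix takes i deletes
            for (int j = 1; j <= y.length; j++) {
                int substitution = prev[j - 1] + (x[i - 1] == y[j - 1] ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[y.length];
    }

    // Edit distance divided by the number of code points in the ground truth.
    static double cer(String groundTruth, String ocrOutput) {
        int length = groundTruth.codePointCount(0, groundTruth.length());
        if (length == 0) {
            throw new IllegalArgumentException("ground truth must not be empty");
        }
        return (double) editDistance(groundTruth, ocrOutput) / length;
    }

    public static void main(String[] args) {
        // Illustrative strings only: one of two "na" letters misread as "va".
        System.out.println(cer("ನನ", "ವನ"));
    }
}
```

Tracking this number across training files would show whether extra training data is actually helping.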