In my last post, I talked about the need for machine translation in Indian languages, and how I was looking for use-cases. I think I’ve found a viable use case and a viable market.
Now that that’s done, I’m looking to do Kannada OCR, followed by language translation. And I’ll document whatever I read, whatever I find, on this blog, for accountability, visibility and discussion.
I start with Kannada OCR. OCR is pretty much the first step to translation, when you are dealing with scanned documents. I found there’s lots and lots of software that deals with this. It occurred to me that it’s not a hard problem at all.
A little more googling gave me Tesseract. It seems to pretty much be the gold standard for OCR. I noticed that ABBYY FineReader doesn’t have Kannada as one of its options, though I must admit its API is pretty top-notch. Tesseract is a C++ library. The good thing is, there’s a whole bunch of wrappers around it for other languages. I can’t seem to find a Python 3 wrapper around Tesseract that also works on Windows, so I suppose I’ll get started on it using Java.
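Before picking a wrapper, one low-commitment way to try this from Java is to shell out to the `tesseract` command-line tool. The following is only a rough sketch under a few assumptions: that the `tesseract` executable is installed and on the PATH, and that the Kannada language data (`kan`) is installed alongside it. The class name `KannadaOcr` and its helper methods are hypothetical names made up for this illustration; they are not part of Tesseract or of any wrapper library.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical helper class for illustration; not part of Tesseract itself.
class KannadaOcr {

    // Builds the argument list for: tesseract <image> <outputBase> -l kan
    // "kan" is the Tesseract language code for Kannada; its trained data
    // is assumed to be installed.
    static List<String> buildCommand(String imagePath, String outputBase) {
        return List.of("tesseract", imagePath, outputBase, "-l", "kan");
    }

    // Runs the command and returns the process exit code.
    // Tesseract writes the recognized text to <outputBase>.txt.
    static int run(String imagePath, String outputBase)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(imagePath, outputBase))
                .inheritIO() // show Tesseract's own messages on the console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java KannadaOcr <image> <outputBase>");
            return;
        }
        int exitCode = run(args[0], args[1]);
        System.out.println("tesseract exited with code " + exitCode);
    }
}
```

Running it with a scanned page and an output base name, say `page.png` and `page`, should leave the recognized text in `page.txt` if the assumptions above hold. A proper wrapper would avoid spawning a process per page, but this is enough to check whether Kannada recognition works at all on a given machine.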
I found a few nice papers on Kannada OCR too. This one is a good, though old, introductory paper. This one is about segmenting old Kannada documents. As someone who doesn’t know much about what OCR entails, especially segmentation, I found these useful. I assume there are better, more descriptive papers on OCR as such, and I should read some more comprehensive survey papers on the subject.
These two papers provide more information on Tesseract as such, and while trying to get it working in Java, I also ought to read them in order to get a more intuitive understanding of the system I’m working with.
If I’m completely in the groove, with a firm topic in mind, I find it relatively easy to read papers. However, when I’m attempting to get started on something, or am reading a paper which, say, I have to summarize for a course, I lose my footing. I procrastinate, I become reluctant to start.
I decided I wanted out of this shite, and hence googled for ‘How To Read A Paper’. I found this paper by someone from the University of Waterloo, and I suspect this will help out greatly.
Let me summarize it for you.
Essentially, given a research paper, you go over it in three passes.
First Pass (5-10 minutes):
- Read the Title, Abstract and Introduction.
- Read the section/subsection headings and ignore all else
- Read the conclusions
- Glance over the references and tick off those you’ve already read.
- By the end of this pass, you should be able to answer 5 C’s about the paper:
  - Category (What type of paper is this?)
  - Context (What papers are related? What theoretical bases are used to analyze the problem?)
  - Correctness (Are the assumptions valid?)
  - Contributions of the paper
  - Clarity (Is the paper well-written?)
Second Pass (1 hour):
- Read the paper more carefully, while ignoring details like proofs
- Jot down points, make comments in the margins
- Look carefully at all figures, especially graphs
- Mark unread references for further reading (for background information).
- Summarize main themes of the paper to someone else.
- You mightn’t understand the paper completely. Jot down the points you don’t understand, and why.
- Now, either:
  - Decide not to read the paper
  - Return later to the paper after reading background material
  - Or persevere on to the third pass
Third Pass (4-5 hours):
- You need to virtually reimplement the paper: recreate its work and its reasoning, making the same assumptions as the authors.
- Compare your recreation with the original
- Think of how you would present the ideas, and compare with how the ideas are presented.
- Here, you also jot down your ideas for future work
- Reconstruct the entire structure of the paper from memory.
- Now you should be able to identify the strong and weak points of the paper, its implicit assumptions, any issues with its experimental or analytical techniques, and any missing citations.
Additionally, I think as a form of accountability (which I so need at the moment), I will blog every single paper I read, in accordance with the above structure.