In my last post, I talked about the need for machine translation in Indian languages, and how I was looking for use cases. I think I’ve found a viable use case and a viable market.
Now that that’s done, I’m looking to do Kannada OCR, followed by language translation. And I’ll document whatever I read, whatever I find, on this blog, for accountability, visibility and discussion.
I’m starting with Kannada OCR. OCR is pretty much the first step to translation when you’re dealing with scanned documents. I found there’s a lot of software that deals with this, which suggests it’s not an intractable problem at all.
A little more googling led me to Tesseract, which seems to be pretty much the gold standard for OCR. I noticed that ABBYY FineReader doesn’t have Kannada as one of its options, though I must admit its API is top-notch. Tesseract is a C++ library, and the good thing is that there’s a whole bunch of wrappers around it in other languages. I can’t seem to find a Python 3 wrapper around Tesseract that also works on Windows, so I suppose I’ll get started using Java.
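For reference, a minimal sketch of driving the Tesseract command line from Python (the file names here are placeholders, and both the tesseract binary and the ‘kan’ Kannada traineddata are assumed to be installed, which isn’t trivial on every platform):

```python
import subprocess

def tesseract_cmd(image_path, out_base, lang="kan"):
    # Tesseract's CLI shape: tesseract <image> <outputbase> -l <lang>
    # 'kan' is the Kannada traineddata file, installed separately.
    return ["tesseract", image_path, out_base, "-l", lang]

def ocr_image(image_path, out_base, lang="kan"):
    # Runs OCR and reads back the text file Tesseract writes
    # (out_base + ".txt"). Raises if the binary isn't on PATH.
    subprocess.run(tesseract_cmd(image_path, out_base, lang), check=True)
    with open(out_base + ".txt", encoding="utf-8") as f:
        return f.read()
```

This just shells out to the CLI; the proper C++ API (or a Java wrapper around it) gives much finer control over segmentation and confidence scores.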
I found a few nice papers on Kannada OCR too. This one is a good, if old, introductory paper. This one is about segmenting old Kannada documents. As someone who doesn’t know much about what OCR entails, especially segmentation, I found these useful in my context. I assume there are better, more descriptive papers on OCR as such, and I should read some more comprehensive surveys of the subject.
These two papers provide more information on Tesseract itself, and while getting it working in Java, I ought to read them to build a more intuitive understanding of the system I’m working with.
I use Twitter quite a bit. A lot of the people I follow share links. When I browse Twitter on my phone in the morning, I can’t check out all of them, so I usually ‘Favorite’ the ones that seem interesting and browse them later. I’d actually prefer a better interface for this: one that lets me tag these links privately so I can find them again later.
I found one such web app, whose name I now forget. The problem was that it had a clunky interface and didn’t let me preview the links properly. Then there’s also Tweetree, which offers previews of shared links. I also like the Google Reader/Gmail sort of interface that keeps track of new and already-read links. And when multiple people share the same link, I’d like to see it all collapsed into one entry with “X, Y and Z shared this” next to it. Or something like that.
So this is one thing I’d like to build using Google App Engine.
The steps to do so would be as follows:
- Find a nice Twitter API interface for Python which can preferably be integrated with Google App Engine.
- Write code to get tweets from your Twitter timeline.
  - Learn how to use Twitter OAuth.
- Detect tweets that contain links. When one does, extract the unshortened link.
- By now, you have a set of links, and can choose to display them as you wish.
- Use the App Engine datastore to store previously viewed links. Possible attributes to store along with each link: the users who shared it, the timestamps of the tweets that shared it, a viewed-or-not flag (set to ‘No’ when the link is first dropped into the database after extraction), and the title of the linked page. Also store the time of last login.
- Workflow: on login, extract links from the timeline and drop them into the database until the timestamp of the tweet you’re reading is earlier than the time of last login. Then display the links whose viewed-or-not flag is ‘No’ as unread items and the rest as read items. On clicking a link, mark it as read. Also provide checkboxes to mark items as read in bulk.
- Basic interface: something like Gmail’s basic HTML view. Previews and such can come later.
- Link extraction and database insertion. This in turn includes the link extractor, the unshortener, the title-getter and the database interface.
- Database queries to view links and mark them as read/unread.
- User interface.
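The link-extraction and read/unread pieces of the steps above can be sketched in plain Python. Everything here is a stand-in: the tweet dicts mimic whatever the Twitter API wrapper returns, the rows mimic the datastore entities, and unshortening (following HTTP redirects) is left out:

```python
import re

# Naive URL matcher; shortened t.co links would still need unshortening
# (e.g. an HTTP HEAD request that follows redirects), omitted here.
URL_RE = re.compile(r"https?://\S+")

def extract_links(tweets, last_login):
    """Pull links from tweets newer than the last login.

    Assumes the timeline is newest-first and each tweet is a dict
    with 'text', 'user' and 'created_at' keys.
    """
    rows = []
    for tweet in tweets:
        if tweet["created_at"] <= last_login:
            break  # everything older was handled on a previous login
        for url in URL_RE.findall(tweet["text"]):
            rows.append({
                "link": url,
                "shared_by": [tweet["user"]],
                "tweeted_at": tweet["created_at"],
                "viewed": "No",  # unread on insertion
            })
    return rows

def mark_read(rows, urls):
    """Flip the viewed flag for the given links (the bulk action)."""
    for row in rows:
        if row["link"] in urls:
            row["viewed"] = "Yes"
    return rows
```

Collapsing duplicate links into one entry with all their sharers would be a small grouping step on top of this.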
I hope to maintain a log of the project I’m working on for my Data Mining course this quarter. I find blogging makes me feel more accountable on a day-to-day basis, and I could really use any help that comes my way on this.
So now to the problem:
Identifying which terms in a Wikipedia article need to be linked to other articles.
I have a dataset to work with. It has information about labels on the data and the words present in each document. I’m now trying to extract which words are linked.
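If the documents turn out to be raw wiki markup, linked terms look like [[Target]] or [[Target|anchor text]], and a first preprocessing pass might look like this (the markup format is an assumption on my part and may not match the dataset):

```python
import re

# Internal wiki links: [[Target]] or [[Target|anchor text]].
LINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

def linked_terms(markup):
    """Return (target, anchor) pairs for every internal link."""
    pairs = []
    for target, anchor in LINK_RE.findall(markup):
        # When there is no |anchor part, the target itself is the anchor.
        pairs.append((target.strip(), (anchor or target).strip()))
    return pairs
```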
So, yeah, still stuck in preprocessing.
I’ll post the Python script after I’m done with it, which should happen in the next few hours. Till then, I’m offline 🙂