
Kannada OCR using Tesseract.

I started using Tesseract to do OCR in Kannada.

For starters, I picked Tess4J, the Java wrapper around Tesseract. Getting started is reasonably easy; I just followed the instructions here.

The problem then was that I kept running into dependency issues with the DLLs. I used Dependency Walker to diagnose which dependencies weren't being satisfied. It turned out msvcp110.dll and msvcr110.dll weren't installed on my system, so I installed them from here.

Then I downloaded the Kannada training data from the Parichit project listed on the Tesseract plugins page. And then I found a larger file and thought maybe that would be better. Apparently not. It resulted in a bunch of errors. There’s something wrong with this training file. It’s detailed here.
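(I went the Tess4J route, but for reference, here's roughly what the equivalent looks like if you drive the tesseract command line from Python instead. This is only a sketch: the paths and file names are placeholders, and depending on the Tesseract version, TESSDATA_PREFIX points either to the folder containing tessdata or to tessdata itself.)

```python
import os
import subprocess

# Assumes tesseract is on PATH and kan.traineddata (the Kannada training file)
# has been copied into the tessdata folder. Note: Tesseract 3.x expects
# TESSDATA_PREFIX to point to the folder *containing* tessdata.
env = dict(os.environ, TESSDATA_PREFIX=r"C:\Program Files\Tesseract-OCR")

# tesseract <image> <output base> -l <lang>  ->  writes kannada_out.txt
subprocess.run(["tesseract", "kannada_qtext.png", "kannada_out", "-l", "kan"],
               env=env, check=True)

with open("kannada_out.txt", encoding="utf-8") as f:
    print(f.read())
```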

I used the earlier training file on an image input

[Image: kannada_qtext]

The OCR output was

ಷ್ಠೆಕ್ವಿಗಳ.), ಪಾದ) ಸೇಂತೈರ), ತೈಶ್ವ ಜಸ್ಯ್ಹ್ದನಿಗಳು, ದೇ(‘ಶೆ ಭಸಿಕ್ತ್ಡರ), ಕೆವಿಗೋ
ಪಾತಿಲತಿಗಳು, ಗಣಿತೈ, ವೀಜಸ್ಯನ, ಜೋ್ಜ್ಹ (‘ತಿಷ್ಠೆ, ಅಯೇಕ್ಕಿವುೋದ ಮುಂತಿ.್ಕ
ಅವೆೋಕೆ ವಿಬಾರಗಳಲ್ತಿ ಪೇಂದಿತೈಠು, ವಿ(‘ರಠು, ಶಿ್ವ)್ಸ, ಸೇಂಗೀೋತ’, ವಾಟೃ, ವಾಷಿ
ಕುಶೇಲು ಕೆಲೆಗಳಲ್ತಿ ಪರೀಣಿತೈಠು ಹುಟ್ನೃ ವೆಮೆತಿ ದಡದ ಹ್ಯಾಗೋ ಮಾ;

Terrible. Pathetic. Reads all the na as va. And the ra becomes ttha. Needs a ton more training data. Ugh.


Kannada to English.

In my last post, I talked about the need for machine translation in Indian languages, and how I was looking for use-cases. I think I’ve found a viable use case and a viable market.

Now that that’s done, I’m looking to do Kannada OCR, followed by language translation. And I’ll document whatever I read, whatever I find, on this blog, for accountability, visibility and discussion.

I start with Kannada OCR. OCR is pretty much the first step to translation when you're dealing with scanned documents. I found there's lots and lots of software that deals with this, which made it seem like it's not a hard problem at all.

A little more googling gave me Tesseract. It seems to be pretty much the gold standard for OCR. I noticed that ABBYY FineReader doesn't have Kannada as one of its options, though I must admit its API is pretty top-notch. Tesseract is a C++ library, but the good thing is there's a whole bunch of wrappers for it in other languages. I can't seem to find a Python 3 wrapper around Tesseract that also works on Windows, so I suppose I'll get started on it using Java.

I found a few nice papers on Kannada OCR too. This one is a good, though old, introductory paper. This one is about segmenting old Kannada documents. As someone who doesn't know much about what OCR entails, especially segmentation, I found these useful for my context. I assume there are better, more descriptive papers on OCR as such, and I should read some more comprehensive survey papers on the subject.

These two papers provide more information on Tesseract itself, and while trying to get it working in Java, I also ought to read them to get a more intuitive understanding of the system I'm working with.

Backups

I found today that Amazon S3 has a really cool one-click backup, where you can set things to back up regularly to Amazon Glacier.
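Under the hood, the S3-to-Glacier bit is a lifecycle rule. Here's a minimal sketch of setting one up with boto3; the bucket name and prefix are placeholders, and 30 days is an arbitrary choice.

```python
import boto3

# Transition everything under backups/ in a (hypothetical) bucket to Glacier
# 30 days after creation.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-backup-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-to-glacier",
                "Status": "Enabled",
                "Filter": {"Prefix": "backups/"},
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```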

And Amazon DynamoDB also has this thing where you can set it to automatically back up to a table in another region.

You can also set DynamoDB to back up to S3.

Glacier is apparently like a substitute for magnetic tape, without the inconveniences of tape. Takes a while to restore, as well. Pretty cool idea. I wonder what competition exists in this space, currently. A cursory search suggests none.

Glad this is there. It’s pretty essential.

Back to being back here

I don’t remember the last time I posted here. I don’t even think anyone remembers this place exists.

Irrespective.

I've grown a lot career-wise. This blog was supposed to help me along that journey, but somehow fell by the wayside. There's also this overarching guilt of not doing enough to post here. My big plans still remain, but every time someone asks me about them, I chuckle sadly.

So what's been going on with me? I graduated from UC Irvine under Dr. Ihler in 2011. After that, I spent two years doing NLP for the finance industry. It's quite an interesting field, I must say. I had one class that covered insider trading, EBITDA, and mergers and acquisitions, and I found all of it enormously interesting. Unfortunately, I didn't keep up with my financial knowledge, though; I didn't really need it in what I did on a day-to-day basis.

And what did I do? I worked on a whole bunch of interesting things. You have a ginormous quantity of documents coming in, in so many different forms, and you need to parse them all and extract data from them. So you end up doing all these extremely basic things. You use OCR to convert image PDFs to text. You parse PDF in all its ugliness and convert it to a simpler format, while taking care to preserve some of the PDF-ey things about PDFs. And then it turns out there are 90 languages and your clients speak English. So you translate 90 languages into English. Some of it's easy, especially the European languages. A lot of it is painful. But we aren't looking for high-quality translations, just enough for the numbers in the financial documents to make sense. But then you run into a lot of unique problems. You don't want to translate yuan to dollars. You find that most off-the-shelf translators are built for general language, not finance-specific language, so the translations often come out off.

And then you do other interesting stuff with all the stuff you’ve processed so far. You try Named Entity Recognition. You try recommending similar documents. You try identifying series in document streams. You try creating summaries.

All of it was mighty interesting. On a given day, I'd code in C, C#, Perl, Java and Python and it'd all be no big deal. I learnt what MVC and MVVM meant. I began taking a real interest in software design. I learnt how to write large, maintainable code, and the benefits of version control.

And then it was time to move.

I now work for a large online retailer's Search & Discovery division. And that's all I can say about it. Maybe someday I'll reminisce fondly on the text mining challenges I face here, the scale of what I work on, and other things that will have by then become old hat. But not now.

I've had other interesting experiences with data in the meantime. Facebook NY hosted a Data Science round table. The invitee list read like Chief Scientist, Head of Data Science, Asst. Prof… you get the drift… and then me, with less than two years of work experience. It was insanely interesting to meet such people and have them treat me like they had a lot to learn from me. I learnt so much that day that though I've forgotten all their names, the discussions are still etched in my mind. It isn't every day that you have MCMC sampling explained to you over beer and fries someone else is paying for.

And then I tried a hack I’m not allowed to talk about, and I learnt there’s a feature in POS Taggers called the Gazetteer, where all you do is give it a set of phrases and the POS they belong to, and bam, any occurrence of those phrases (exact matches) is tagged thus. It’s insanely useful when you have your own new part of speech, like say, Celeb Names or Book Titles or some such.
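Roughly, the idea boils down to exact phrase matching against a list you supply. Here's a toy sketch of that idea in Python, not any particular tagger's gazetteer API; the phrases and tag names are made up.

```python
# A toy gazetteer tagger: exact phrase matches get the custom tag,
# everything else is left untagged. Phrases and tag names are invented.
GAZETTEER = {
    ("harry", "potter", "and", "the", "goblet", "of", "fire"): "BOOK_TITLE",
    ("taylor", "swift"): "CELEB_NAME",
}

def tag(tokens):
    tokens_lower = [t.lower() for t in tokens]
    tags = [None] * len(tokens)
    for phrase, label in GAZETTEER.items():
        n = len(phrase)
        for i in range(len(tokens_lower) - n + 1):
            if tuple(tokens_lower[i:i + n]) == phrase:
                for j in range(i, i + n):
                    tags[j] = label
    return list(zip(tokens, tags))

print(tag("I just saw Taylor Swift at the airport".split()))
```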

So that's what I've been up to. Let's hope I keep up this pace of blogging.

 

Getting back to machine learning

I got done with my Master of Science in Computer Science. I graduated with a thesis titled 'Graphical Models for Entity Coreference Resolution'.

Since then, it’s been a long break from all things hardcore machine learning and data mining and natural language processing. I have a nice day job which pays for my essentials and still leaves me with enough time and money to do a lot of other stuff. My team does a lot of ML, but that does not include what I’m working on at the moment. It might involve me writing some code which learns stuff from data and predicts on some other data, but I don’t know yet.

It's been a good break. I needed this. I'm a much more confident person now. I have more confidence in my abilities to write and maintain large bits of code. I think it's about time for me to get back to learning all about machine learning and graphical models with no stress of deadlines, enough opportunity to explore, and most importantly, no feeling intimidated. Also, going through the material a second time would be a good way of absorbing all that I missed the first time.

I've been cleaning out my hard disk to make conditions ideal for this. A messy filesystem is really hard to work with, especially with no version control or anything. Things get messy, and at crunch time it only gets worse, when you can't find what you want because you haven't labelled anything right.

I cleared all my backups off my external hard drive. Then I moved all my pre-NYC-move photographs to the external HD. Going over which individual images to keep and which ones to delete was very cringeworthy. I had been quite camera-happy before 2009, and had clicked a lot of pictures. They say your first 10,000 pictures are your worst. Believe me, mine were. So utterly cringeworthy. More so since back then I didn't even use to pay attention to how I dressed or how I did my hair or how I maintained my skin. Those issues don't exist anymore, so the cringing isn't coupled with embarrassment and helplessness in my head like before.

I then uninstalled a lot of unnecessary software: multiple builds of Python, each with a crazy set of plugins; outdated versions of Eclipse; and oh, so many datasets. Deleted what I could, shifted the rest to my external HD. Tried organizing all my music, tagging the files appropriately and attempting to put them into the right folders. It wasn't so easy, so I gave up midway. But I discovered that Mp3Tag seems to be a good app for this.

I then organized my huge collection of ebooks using Calibre. I seem to have a lot of crap I downloaded from Project Gutenberg back in my young-and-foolish days, in the infancy of the Google-powered Internet. Somehow, I just can't delete classic books, no matter that I've never read them. So they stay for now.

Turns out I have tons of movies stored as well, which I'd downloaded off Putlocker back when I couldn't even afford Netflix. Organized them well. I also seem to have a small collection of stuff downloaded off YouTube: clever and rare Indian ads, rare music videos from indie Indian pop/rock/movies. I need to upload them back to YouTube someday, since the originals I downloaded them from seem to have been deleted off the face of YouTube.

I even found all the original Stanford Machine Learning Class videos with Prof. Ng. Heh, with Coursera, Udacity, and Khan Academy now, you don't need any of those like I did back in 2008-09. It was a different time back then, really.

I installed Python 3.2 after that, and Eclipse Juno, followed by PyDev and the Google App Engine plugin for Eclipse. A Windows installer for SciPy exists that is compatible with Python 3.2. However, MatPlotLib's official Windows installer releases don't yet support Python 3.2; thankfully, unofficial ones exist here (oh yay, look, it's from UCI). I could of course build everything from source, but I want to keep this as hassle-free as possible.

I also need to get started with version control on Google Code or some such, so that I keep all my code somewhere I can access from everywhere.

Next on the agenda is to go through a machine learning textbook, or an online course, and slowly build my own machine learning libraries from scratch. Maybe I'll try building a Weka replica: a uniform interface for training and testing each algorithm.
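To make that concrete, here's a sketch of what I mean by a uniform interface. The class and method names here are just my own choices, not Weka's.

```python
from abc import ABC, abstractmethod

class Classifier(ABC):
    """Uniform interface every algorithm in the library would implement."""

    @abstractmethod
    def train(self, X, y):
        ...

    @abstractmethod
    def predict(self, X):
        ...

    def evaluate(self, X, y):
        # Shared accuracy computation, so every algorithm is tested the same way.
        predictions = self.predict(X)
        return sum(p == t for p, t in zip(predictions, y)) / len(y)

class MajorityClass(Classifier):
    """Simplest possible baseline: always predict the most common label."""

    def train(self, X, y):
        self.label = max(set(y), key=list(y).count)

    def predict(self, X):
        return [self.label] * len(X)
```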

After that is to work on probabilistic graphical models and build those from scratch as well.

And in the midst of all this, I want to publish the work I’ve done in my thesis, which will mean trying to replicate those results, in a new and improved way, taking into consideration all the ideas I didn’t have time for, and those which I could have implemented better.

Let’s see how it goes 🙂 I hope to keep updating this place with all the stuff I do 🙂

Update: I found an ML textbook best suited to my needs finally! Machine Learning: An Algorithmic Perspective. I’ll start tomorrow, will see how much I learn.

Convex Optimization.

Course I’m taking. Need to brush up on basics before diving in. And I’ve got less than a day to do that.

Anyone know a good crash course in linear algebra? Will be grateful. Thanks.

 

Transfer Learning etc

I think this'd work best if I just updated my daily progress here rather than trying to give comprehensive views of what I'm doing.

So you have data coming in that needs to be classified. Apparently the accuracy of most classifiers is abysmally low. We need to build a better classifier.

I took a month's worth of data, applied all the classifiers I could to it, and cross-validated. Accuracy was roughly in the 85-90% range. While that's not excellent, it's not bad, given the small amount of training data.
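For concreteness, this is roughly the shape of that experiment, sketched with scikit-learn (not necessarily the tooling I actually used); the documents below are a made-up stand-in for a month's worth of labelled data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Tiny placeholder dataset; the real data would be thousands of documents.
documents = [
    "quarterly revenue rose on strong sales", "net income fell amid weak demand",
    "the board declared a dividend", "shares dropped after the earnings miss",
    "operating margin improved this quarter", "the striker scored twice in the final",
    "the team won the championship match", "coach praised the goalkeeper's saves",
    "a late goal sealed the victory", "fans celebrated the league title",
]
labels = ["finance"] * 5 + ["sports"] * 5

X = TfidfVectorizer().fit_transform(documents)

# Cross-validate a handful of standard classifiers on the same features.
for clf in (MultinomialNB(), LinearSVC(), LogisticRegression(max_iter=1000)):
    scores = cross_val_score(clf, X, labels, cv=5)
    print(type(clf).__name__, round(scores.mean(), 2))
```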

So what’s this low accuracy everyone’s talking of?

Turns out the new data coming in is very different from the data you train on. You'll train on June's data, but July's data is going to be quite different. The same words end up reappearing under different class labels. Hence the low accuracy.

Also, you have not just one classifier, but many. It turns out that when you train many classifiers on subsets of the data, they perform better than training one classifier on the entire data.

Learning will have to keep evolving. I first thought of Active Learning in this context, where you’ll expect the user to label the stuff you are not sure about. But then, what if you confidently label stuff that is patently wrong?

The many-classifiers part of the problem helps us visualize the training data in a different way: each category (class label) has many sub-categories. Now, each classifier is trained on a month's worth of data, and it turns out that each month can be likened to a sub-category. You train on one sub-category, test on another sub-category, and expect it to return the same class label. That's like training a classifier on data that contains hockey-related terms for the Sports label, and then expecting it to recognize data that contains cricket-related terms as Sports too.

Sound familiar?

This would be transfer learning/domain adaptation – you learn on one distribution, and test on a different distribution. The class labels and features however, remain the same.

This would more specifically be Transductive Transfer Learning – you have a training set, from distribution D1, and a test set, from distribution D2. You have this unlabelled test data available during training, and you somehow use this to tweak the model you’ll learn from the training data.

Many ways exist to do this. You can apply Expectation-Maximization to a Naive Bayes classifier trained on the training data, to maximize the likelihood of the test data while still doing well on the training data. You can train an SVM, assign pseudo-labels to the test data, and add those to the next iteration of training, until you get reasonable confidence measures on the test data, while still doing well on the training data.
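The second approach looks roughly like this self-training loop, sketched with scikit-learn; this is my own rough take, not a reference implementation, and the margin threshold and the dense-matrix assumption are mine.

```python
import numpy as np
from sklearn.svm import LinearSVC

def self_train(X_train, y_train, X_test, rounds=5, threshold=1.0):
    """Iteratively pseudo-label confident test points and fold them into training.

    Assumes dense numpy feature matrices; the margin threshold is a guess you'd tune.
    """
    X, y = np.asarray(X_train), np.asarray(y_train)
    X_test = np.asarray(X_test)
    remaining = np.arange(len(X_test))   # test points not yet pseudo-labelled
    clf = LinearSVC()
    for _ in range(rounds):
        clf.fit(X, y)
        if len(remaining) == 0:
            break
        margins = clf.decision_function(X_test[remaining])
        # Binary problems give a 1-D margin; multiclass gives one margin per class.
        conf = np.abs(margins) if margins.ndim == 1 else margins.max(axis=1)
        confident = conf >= threshold
        if not confident.any():
            break
        picked = remaining[confident]
        X = np.vstack([X, X_test[picked]])
        y = np.concatenate([y, clf.predict(X_test[picked])])
        remaining = remaining[~confident]
    return clf
```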

All these approaches to Transductive Transfer Learning are fine, but they assume that you have the test data available at training time.

We have a slight twist on that. It might be too expensive to store the training data. Or you might have privacy concerns and want to hide your data, exposing only a classifier you've built on top of it. So, essentially, all you have is a classifier, and you need to tweak it when test data becomes available.

Let’s complicate it further. You have a set of classifiers. You can pick and choose classifiers you want to combine based on some criteria on the test data, create a superclassifier, and then try tweaking that based on the test data.

For starters, check out this paper by Dai. There, you have access to the training data. What if you don't? Can you then tweak the classifier without knowing the data underlying it?

Let’s assume it’s possible.

Then, based on some criteria you pick, you choose a few classifiers from the many that you have. You merge them. And then you tweak that superclassifier. For example, say your test data contains data related to hockey and Indian films [labels are Sport and Film]. You have one classifier trained on cricket and Indian films (C1), one on hockey and Persian films (C2), and another on football and Spanish films (C3). So C1 and C2 are the classifiers closest to your data. You combine C1 and C2 such that you get a classifier equivalent to one trained on hockey, Persian films, cricket and Indian films. An optimal classifier. And then you tweak it.
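As a very rough sketch of that step: pick the k classifiers that look "closest" to the test data and average their predicted class probabilities. The closeness criterion used here (mean predictive confidence on the test set) is just one possible choice, not a settled one, and it assumes classifiers that expose predict_proba and were trained on the same label set.

```python
import numpy as np

def mean_confidence(clf, X_test):
    """How confident a classifier is, on average, on the test set."""
    return clf.predict_proba(X_test).max(axis=1).mean()

def merge_closest(classifiers, X_test, k=2):
    """Pick the k classifiers most confident on the test data and average them."""
    ranked = sorted(classifiers, key=lambda c: mean_confidence(c, X_test), reverse=True)
    chosen = ranked[:k]

    def predict(X):
        # Assumes all chosen classifiers share the same classes_ ordering.
        avg = np.mean([c.predict_proba(X) for c in chosen], axis=0)
        return chosen[0].classes_[avg.argmax(axis=1)]

    return predict
```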

That’s the architecture.

The questions we seek to answer: how to choose which classifiers to merge, how to tweak the merged classifier given test data, and whether we'd need any extra data.

And a more fundamental question: given that the test data is going to come from a distribution none of your ensemble will have seen before, is it worth the while to merge classifiers? We could vary the KL-divergence between the training and test distributions and see how much having an ensemble helps.
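One way to measure that divergence is the smoothed KL divergence between the unigram word distributions of the training and test sets. A small sketch; the add-one smoothing is an arbitrary choice.

```python
import math
from collections import Counter

def unigram_dist(tokens, vocab, alpha=1.0):
    # Add-alpha smoothed unigram distribution over a shared vocabulary.
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence(train_tokens, test_tokens):
    vocab = set(train_tokens) | set(test_tokens)
    p = unigram_dist(train_tokens, vocab)  # training distribution
    q = unigram_dist(test_tokens, vocab)   # test distribution
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocab)
```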

Nascent ideas so far.

Comeback. Again.

So this post will slightly deviate from the general tone of this blog. It is a tad more personal.

I've just come out of a phase of unstructured time where I really, really wanted to set short-term goals for myself, and failed miserably. At the end of it all, I watched Julie And Julia, where the lead protagonist uses her blog to set short-term goals for herself and to check off each goal as she achieves it. I want to emulate that.

I am now interning at a research lab in the industry. I am deciding on a problem statement that will mostly involve some form of transductive transfer learning. I have a great work environment, and an awesome mentor who helps with the short-term goal setting.

In such a setting, I feel I should probably maintain a daily log of how things are progressing, so that I can refer back to these notes later when I want to know how to set goals and progress with research. I have a controlled environment now, and it’d be interesting as well as helpful to document my time here such that I can replicate it elsewhere.

Most of my work will involve previous work and data that's in the public domain, so I don't think it'll be a breach of any contract or NDA to talk about it in a public forum. Though I might choose to make the text unsearchable and hence make the posts hidden, while keeping the password public. Not many know this blog exists, and this, I guess, would be a sane decision. I'll have to check with my superiors anyway #TODO.

Alrighty. Next post possibly coming in another hour.

List of things to blog about

I come across a lot of interesting stuff. Every Day. And I haven't blogged about ANY of it!

The Distinguished Speaker Series from the Center for Machine Learning at UCI has been really interesting so far. And so have the weekly seminars by the same Center.

I quite enjoyed Prof. Peter Stone's "Learning and Multiagent Reasoning for Autonomous Agents". Material related to the talk is available here. I haven't given robotics serious thought, though if I get more seriously into Machine Learning, I'd want to try out games, strategies, and all that.

Then there was Prof. Doug Oard, who talked about extracting identities from a set of emails. So if you refer to a Judy in your email, or even to a "cutie pie", the software should be able to pick out who you're talking about, just from a set of previous correspondences. I was QUITE interested. It didn't seem that challenging learning-wise or in terms of applying concepts, but I guess the challenge is figuring out how to extract features. And it turns out simple methods are the ones that work best.

There was also Judy Olsen from PARC, who talked about identity resolution on the Net, and how that could be used to detect biases in Amazon.com reviews. So say X reviews a book by Y, and you find that X and Y are strongly related. You can take that into account when weighting reviews. Here, they didn't use anything more than plain keyword matching on Google search: just googling for X and Y and determining how much overlap existed between the results. And the results turned out surprisingly good. I'm wondering about also using sentiment analysis here to determine whether the bias would be positive or negative.
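The overlap measure, as I understood it, could be as simple as the sketch below. The search function is a hypothetical stub you'd replace with a real search API, and Jaccard overlap is my guess at a reasonable score, not necessarily what the speakers used.

```python
def search_results(query, n=50):
    """Hypothetical stand-in for a web search API: returns a set of result URLs."""
    raise NotImplementedError("plug in a real search API here")

def relatedness(reviewer, author):
    a = search_results(reviewer)
    b = search_results(author)
    # Jaccard overlap of the two result sets as a crude relatedness score.
    return len(a & b) / len(a | b) if (a or b) else 0.0
```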

The most recent one was Prof. Padhraic Smyth, one of the judges of the Netflix contest, talking about the nature of the contest, how it went, the details of the data, some assumptions… I came in half an hour late, but still managed to enjoy the rest of the talk.

And the reason I missed half of it? I’m currently a Graduate Student Researcher with Prof. Bill Tomlinson. I’m working on visualizations using Google Earth. My most recent working code will always be posted here. I should aggregate all my knowledge about Google Earth’s API and KML here. Sometime Soon. It’s rather fun.

I'm also taking an AI course this quarter, and my class project has to do with cryptogram solving. The approach I'll be using is a word-based genetic algorithm; you can check out the paper here [pdf]. I'll upload the code to Google App Engine once I'm done with it, or at least once I've gotten started.
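To give a flavour of what a word-based genetic algorithm for substitution cryptograms might look like (this is my own rough sketch, not the algorithm from the paper): fitness is the fraction of decrypted tokens that are dictionary words, crossover splices two substitution keys, and mutation swaps two letters. The dictionary and GA parameters are placeholders.

```python
import random
import string

ALPHABET = string.ascii_lowercase
DICTIONARY = {"the", "and", "attack", "at", "dawn"}  # placeholder word list

def decrypt(ciphertext, key):
    # key maps ciphertext letters to plaintext letters, position by position
    return ciphertext.lower().translate(str.maketrans(ALPHABET, key))

def fitness(ciphertext, key):
    words = decrypt(ciphertext, key).split()
    return sum(w in DICTIONARY for w in words) / max(len(words), 1)

def mutate(key):
    # swap two letters in the substitution key
    i, j = random.sample(range(26), 2)
    k = list(key)
    k[i], k[j] = k[j], k[i]
    return "".join(k)

def crossover(a, b):
    # keep a prefix of parent a, fill the rest in parent b's order (stays a permutation)
    cut = random.randrange(26)
    head = list(a[:cut])
    tail = [c for c in b if c not in head]
    return "".join(head + tail)

def solve(ciphertext, generations=200, pop_size=50):
    population = ["".join(random.sample(ALPHABET, 26)) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda k: fitness(ciphertext, k), reverse=True)
        parents = population[: pop_size // 2]
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=lambda k: fitness(ciphertext, k))

# Usage: best_key = solve(ciphertext); print(decrypt(ciphertext, best_key))
```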

And then there’s Probability Models, where I’m supposed to read up on how Markov Chains and the like are used in solving problems. I want to check out how it’s done in text mining. More details forthcoming.

That’s it for now. More coming up, hopefully. I should make it a regular habit to blog here every single day so that I don’t lose what I learn.

Hopeful increase in posting frequency

I know no one is subscribed to this space. That's a major demotivator for posting, especially because my other blogs clock at least a hundred hits a day. And I'm not a good geek yet; I just take in information.

But now, I’m a graduate student at the Bren School at UCI. Hopefully I’ll regurgitate all that I learn there on this space. It might possibly help me later on.

But the workload doesn't seem like a cakewalk. Oh well, I'll find time somehow or the other. All that matters is the interest. And who knows, it might turn out to be productive.

I have lined up a course in Probability Models, one in Human-Computer Interaction, and yet another in Artificial Intelligence. AI looks like it's going to be an awesome course; our project is cryptogram solving. A literature survey needs to be done, ASAP. And I guess that's going to be work in progress for the rest of the quarter, with implementations on the side.

I’ll also be a Graduate Student Researcher this quarter. There’ll probably be lots of learning on that job. I’ll blog about it as regularly as I can.

So, hopefully, there'll be an uptick in posts on this blog. And maybe there'll be more followers, more networking, and more of all that comes with it.
