Monthly Archives: July 2012

Getting back to machine learning

I finished my Master of Science in Computer Science. I graduated with a thesis titled ‘Graphical Models for Entity Coreference Resolution’.

Since then, it’s been a long break from all things hardcore machine learning and data mining and natural language processing. I have a nice day job which pays for my essentials and still leaves me with enough time and money to do a lot of other stuff. My team does a lot of ML, but that does not include what I’m working on at the moment. It might involve me writing some code which learns stuff from data and predicts on some other data, but I don’t know yet.

It’s been a good break. I needed this. I’m a much more confident person now. I have more confidence in my abilities to write and maintain large bits of code. I think it’s about time for me to get back to learning all about machine learning and graphical models with no stress of deadlines and enough opportunity to explore, and most importantly, no feeling intimidated. Also, going through material the second time over would be a good way of absorbing all that I missed the first time.

I’ve been cleaning out my hard disk in order to make conditions ideal for me to do this. A messy filesystem is really hard to work with, especially with no version control or anything. Things get messy, and at crunch time it only gets worse: you can’t find what you want because you haven’t labelled anything right.

I cleared all the backups off my external hard drive. Then I moved all my pre-NYC-move photographs to the external HD. Going over which individual images to keep and which ones to delete was very cringeworthy – I had been quite camera-happy before 2009, and had clicked a lot of pictures. They say your first 10,000 pictures are your worst. Believe me, mine were. So overtly cringeworthy. More so since back then I didn’t pay attention to how I dressed or how I did my hair or how I maintained my skin. Now those issues don’t exist anymore, so the cringing isn’t coupled with embarrassment and helplessness in my head like before.

I then uninstalled a lot of unnecessary software. Multiple builds of Python, with crazy sets of plugins on each build. Outdated versions of Eclipse. And oh, so many datasets. I deleted what I could and shifted the rest to my external HD. I tried organizing all my music, tagging the files appropriately and attempting to put them into the right folders. It wasn’t so easy, so I gave up midway. But I discovered that Mp3Tag seems to be a good app for this.

I then organized my huge collection of ebooks using Calibre. I seem to have a lot of crap I downloaded from Project Gutenberg back in my young-and-foolish days in the infancy of the Google-powered Internet. Somehow, I just can’t delete classic books, even though I’ve never read them. So they stay for now.

Turns out, I have tons of movies stored as well, which I’d downloaded off of Putlocker back when I couldn’t even afford Netflix. I organized them well. I also seem to have a small collection of stuff downloaded off YouTube – clever and rare Indian ads, rare music videos of indie Indian pop/rock/movies. I need to upload them back to YouTube someday, since the originals I downloaded seem to have pretty much been deleted off the face of YouTube.

I even found all the original Stanford Machine Learning Class videos with Prof. Ng. Heh, with Coursera and Udacity, and Khan Academy now, you don’t need any of those like I did back in 2008-09. It was a different time back then, really.

I installed Python 3.2 after that. And Eclipse Juno. Followed by PyDev and the Google App Engine plugin for Eclipse. A Windows installer for SciPy exists which is compatible with Python 3.2. However, Matplotlib’s official Windows installer releases don’t yet support Python 3.2. Thankfully, unofficial ones exist here (oh yay, look, it’s from UCI). I can of course build everything from source, but I want to keep this as hassle-free as possible.

I also need to get started with version control on Google Code or some such, so that I keep all my code somewhere I can access from everywhere.

Now next on the agenda is to go through a machine learning textbook, or an online course and slowly build my own libraries for machine learning from scratch. Maybe I’ll try building a Weka replica – uniform interface for training and testing each algorithm.
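
Here is a minimal sketch of what that uniform train-and-test interface could look like in Python. Everything in it is an assumption made up for illustration: the names `Classifier`, `train`, `predict`, `accuracy`, the two toy algorithms, and the tiny dataset. None of it is taken from Weka's actual API; it only shows the idea that every algorithm plugs into the same evaluation code.

```python
from abc import ABC, abstractmethod
from collections import Counter


class Classifier(ABC):
    """Hypothetical common interface: every algorithm has train() and predict()."""

    @abstractmethod
    def train(self, rows, labels):
        """Learn from feature rows and their labels; return self."""

    @abstractmethod
    def predict(self, row):
        """Return the predicted label for a single feature row."""


class MajorityClass(Classifier):
    """Baseline: always predicts the most frequent training label."""

    def train(self, rows, labels):
        self.label = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, row):
        return self.label


class OneNearestNeighbour(Classifier):
    """Predicts the label of the closest training row (squared Euclidean distance)."""

    def train(self, rows, labels):
        self.examples = list(zip(rows, labels))
        return self

    def predict(self, row):
        def sq_dist(other):
            return sum((a - b) ** 2 for a, b in zip(row, other))

        _, nearest_label = min(self.examples, key=lambda ex: sq_dist(ex[0]))
        return nearest_label


def accuracy(model, rows, labels):
    """Fraction of rows the trained model labels correctly."""
    correct = sum(1 for row, label in zip(rows, labels) if model.predict(row) == label)
    return correct / len(labels)


# Toy data, invented purely for this example.
train_rows = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
train_labels = ["a", "a", "b", "b", "b"]
test_rows = [(0.05, 0.1), (1.1, 0.9)]
test_labels = ["a", "b"]

# The same loop trains and tests every algorithm, which is the whole point.
results = {}
for model in (MajorityClass(), OneNearestNeighbour()):
    model.train(train_rows, train_labels)
    results[type(model).__name__] = accuracy(model, test_rows, test_labels)

print(results)
```

Adding a new algorithm then only means writing another subclass with `train` and `predict`; the evaluation loop stays untouched.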

After that is to work on probabilistic graphical models and build those from scratch as well.

And in the midst of all this, I want to publish the work I’ve done in my thesis, which will mean trying to replicate those results, in a new and improved way, taking into consideration all the ideas I didn’t have time for, and those which I could have implemented better.

Let’s see how it goes 🙂 I hope to keep updating this place with all the stuff I do 🙂

Update: I finally found an ML textbook suited to my needs: Machine Learning: An Algorithmic Perspective. I’ll start tomorrow and see how much I learn.
