
Navigating the Machine Learning job market

Over the past couple of months, I have been trying to navigate the machine learning job market. It has been a bewildering, confusing, and yet immensely satisfying and informative time. Talking with friends in similar situations, I find a lot of common threads, and I find surprisingly little clarity online regarding this.

So I’ve just decided to put together the sum total of my experiences. Your mileage may vary. After you’re done being a fresher, your situation and what you’re looking for gets a little more unique, so take whatever I say with a pinch of salt.

I’ve been passionate about machine learning for six years or more now. Though I didn’t realize it at that time, a lot of project choices, career choices and course choices I made were with the thought of ‘does this help me get closer to a research-oriented job that involves text mining in some form?’.  I went to grad school at a university that was very research oriented and worked on a master’s thesis on an NLP problem, as well as a ton of projects in courses. My first job after that involved NLP in the finance industry. My second job also involved text processing. The jobs I got offers from after this period also involve NLP strongly. I’ve literally never worked on anything else. So you can understand where I’m coming from.

So. Machine learning jobs. Where are they, usually?

Literally everywhere, it turns out. Every company seems to have a research division that involves something to do with data, and data mining. The nature of these positions can vary.

There are positions where you need to have some knowledge of machine learning, and it kind of informs your job, which might or might not involve having to use ML-based solutions. Usually these positions are at large companies. As an example, you might be in a team whose output is, say, an email client. There’s some ML used in some features of the product, and it is important for you to be able to grasp and work around those algorithms, or be able to analyze data, but on a day to day basis you’re working on writing code that doesn’t involve any ML.

There are other similar positions where you deal with a higher volume of data, and they have simple solutions to get meaning out of them. Maybe they use Vowpal Wabbit on a Hadoop cluster on occasion. Or Mahout. But they’ve got the ML bit nailed down, and more of the work involves just doing big data kind of work. These positions are more ubiquitous. If you have some ML on your resume, as well as Hadoop or HBase, these doors open up to you. Most of the places that require this kind of a skillset are mid-sized companies kind of out of the startup phase.

Then you have the Data Scientist positions. This phrase is pretty catchall, and you find a wide variety of positions if you look for this title. Often at big firms, it means that you have knowledge of statistics, and can deal with tools like R, Excel, SQL databases, and maybe Python in order to find insights that help with business decisions. The volume of data you deal with isn’t usually large.

At startups though, this title means a lot more. You are usually interviewing to be the go-to person for all the ML needs in the company. The skills required include all the ones I mentioned above, plus a thorough knowledge of tools like scikit-learn and Weka, and experience working on ML projects. Some big data experience is usually a plus. Often, you’re finding insights in the data and prototyping things that an engineering team will put in production. Or maybe you’re also doing that if ML is not central to the startup’s core business.

Most people are looking for the Research Engineer job. You aren’t usually coming up with new algorithms. But you’re implementing some. On the upper end of the scale, you’re going through research papers and implementing the algorithms in them and making them work. You need a fair idea of how to put code into production, and you deviate from the research by adding layers that make your system work in a more deterministic, debuggable fashion. An example would be several jobs at LinkedIn where a lot of the features on the site need you to use collaborative filtering or classification. Increasingly, these jobs work on large data, but often that is not the case, and people manage fine using parallel processing instead of graph databases and MapReduce.

In a mature team, this position might not require you to use your ML skills on a day to day basis. In a new team, this position would need you to work on end to end systems that happen to use ML that you will be implementing.

In larger firms, you probably just need to have worked on ML in grad school and in your past jobs. The kind of data you’ve worked on doesn’t matter. In startups though, they start looking for more specific skills. Like they’d want someone who’s specifically worked on topic modelling. Or machine translation. The complexity of their system doesn’t usually call for a PhD. They would grab an off-the-shelf solution if they could. But they would ideally want someone who has an idea of these things to own this component, manage it completely, and be able to hit the ground running, which is why they want someone who’s worked on the same or similar things previously.

Which brings me to another point. Not all ML jobs are interviewed for in the same way.

Several large as well as mid-sized tech firms hire you for the company, not for a specific team or role. Usually, the recruiter finds you based on buzzwords in your resume, and sets up interviews with you. The folks interviewing you probably work in teams that have nothing to do with your skills. It is possible you go through interviews without answering even one ML question. Later when you get hired, they try to match you to a team, and they try to take into account your ML background to place you in a relevant team. If you’re interviewing for a specific kind of job, this makes it harder, as you don’t know what kind of work you’ll be doing until you’re done with the whole process.

Like I said before, at startups probably, you’ll know exactly what kinds of problems you’ll be working on. But more often, you’re hired into a group of sister teams. They all require similar skills. Maybe they work on different components of the same product, all of which use ML in different ways. So you have a fair idea of what you’ll be working on, but not necessarily a clear picture. You might end up working at the heart of the ML algorithm, or maybe you’re preprocessing text. The interviews will go over your ML background and previous projects as well as ML-related problem-solving.

Then there’s the Applied Researcher role. It usually requires a demonstrated capability of working on reasonably complex ML problems. You are occasionally putting things in production and need good coding skills. Often, you’re prototyping things after researching different approaches. When you do put things in production, it is usually tools that get used by other teams that have ML in their solutions. Language is no bar, but usually there’s an agreed-upon suite of tools and languages that the team uses.

The Researcher role usually requires a PhD. Your team is probably the idea factory of the company, or that particular line of business of that company. Intellectual property generation is part of the job. I’m not highly insightful about this line of work, because I haven’t known very many people opting for these positions, and it feels increasingly like PhDs take up the Applied Researcher/Research Engineer role in a team, and do the prototyping and analyses while others help with that as well as put these prototypes into production.

There’s a lot of overlap in all these different types of positions I’ve mentioned, and it isn’t a watertight classification. It’s a rough guide to the different kinds of positions there are.

So where do you find these jobs?

LinkedIn is a great resource. You can use ‘machine learning’, ‘data mining’, ‘image processing’ or ‘data science’ or ‘text mining’ or ‘natural language processing’ as search keywords. I’ve also found Twitter to be a great place to search for jobs using these same keywords.

There are tons of job boards that also enable you to search using these keywords. Apart from them, I find a lot of ML-specific job fora. There’s KDNuggets Jobs, NLPPeople, LinguistList which are browsable job boards. Apart from them, there are also mailing lists like ML-News and SIG-IRList. I’ve also found /r/MachineLearning on Reddit to be a good resource on occasion for jobs.

Now that you’ve found a position and sent them off your resume and they got back to you, what do you expect in the interview? Wait for my next post to find out!


RIP, Reader

Yeah, this is yet another one of the funeral dirges for Google Reader. And I post it here instead of on my personal blog because I need to get into the habit of writing about technology here. Google Reader is hardly ‘technology’ as I intend it to be… I want to use this place for research updates and paper summaries.  But the anxiety about ‘not being good enough’ when it comes to all that is so much that I don’t want to write anything even remotely geeky. I need to snap out of that. And it’s NaNoWriMo, it’s about quantity more than quality. So here we go 🙂

So basically there are two main arguments against Google Reader’s integration with Google Plus. First is about how the user interface is sucky. And the second is about how the removal of sharing has killed the whole spirit of Reader. A third, if I may add, is that the platform/API is so bad, and everything is so messed up at first look that I can’t seem to wrap my mind around how to write a wrapper that makes things better.  Oh wait, there’s a fourth as well – the ‘stream’ format, as opposed to the folders-and-tags format, is the very antithesis of what Reader is supposed to be.

Let’s start with the appearance. Yes, white space is good. It makes things look ‘clean’. But that’s only when you have very specific things you want your user to see on your page. It works great for the homepage, for instance… all you want is a search bar. But when it’s a feed reader, it doesn’t work at all. When I log in, I don’t want to see half my screen space taken up by needless headers and whatnot. The bar with ‘Refresh’, ‘Mark as read’ and ‘Feed settings’ is needlessly large and prominent instead of being smaller and not taking up much space. Those options aren’t used often enough to justify their large font size. The focus here shouldn’t be on the options, but on the thing I’m reading. Fail.

Then everything’s gray, including links. If something’s not blue or purple, my mind doesn’t consider it a link. Sorry, but those are unwritten conventions on the Web. There’s no reason to change that now, and gray is a horrible color to show that something’s different from the rest of the black text. And the only spots of color on the page are a tiny dab of red to show the feed you’re currently reading, and a large button on the top left that says ‘Subscribe’. Dab of red, seriously? I much preferred the entire line showing the current feed being highlighted instead of that little red bar. And I don’t add new feeds every day, so I don’t need a large ‘Subscribe’ button. And when I do add feeds, I don’t add them using… I’m on the website I want to add, usually, and add feeds by clicking on the RSS icon, and then adding to Reader.

Then the UI for sharing. It’s a lot more clicks to share something now. And yeah, the gripe is that whatever I share will be shared only on G+, but we’ll get to that in a moment. My problem with having to pick what circles I share with each time I share a feed is that it’s too much decision making too often. At least give me a set of check boxes of my circles so that all I need to do is two clicks instead of having to start typing my circle names.

It turned out, if you wanted to share something without publicly +1-ing something, you’d have to go to the top-right corner and click on ‘Share’. Well, how is that intuitive? And why would anyone design it that way, especially when the previous way to do that was by clicking on ‘share’ right below the post? Surely, it could have just had the Circles thing appear when you clicked the ‘share’ button, and +1-ing it could be a different button? And keep the top-right Share button if you like?

Now about sharing. I can share something with folks from Google Reader, yes, but they can only read it from Google Plus. Someone said that’s like retweeting something on Twitter from your client, like say Tweetdeck, but those who follow you can see your RT’s using only How retarded is that? I want a one-stop shop where I can do all my reading instead of having it spread over a zillion other places.

Because of this, one of the things I wanted to do was build a wrapper website that integrated links shared on your G+ stream with your Reader feeds. I can’t seem to wrap my mind around how exactly it would work, but that’s one thing I certainly want to do.

The ‘stream’ format sucks for reading shared links. I have this problem with Twitter too, but on Twitter, you can ‘Favorite’ tweets which contain links and then read them one by one later. In fact, I was wondering about a platform that takes links on your Twitter timeline and puts them together for easy reading, feed reader style. Google Plus however has no such feature which you can use to tuck away stuff for later. If you’re too busy, you skip over a shared link and it’s lost. I much preferred the model where your feeds would all accumulate and if it got too much to handle, you could always mark all as read. Even better when your feeds would be properly organized.

And then Google Plus does a bad job of displaying shared links. It shows a small preview, but that’s more often than not insufficient. Buzz was better in this respect… at least your images could be expanded, and posts could be expanded so that you could read them right there. Ha, one positive of this would however be that people would get a lot more hits on their websites. And it is not immediately apparent how inconvenient this sort of visual format is, because people don’t share so much on Google Plus yet, and they don’t yet use it as a primary reader, or so extensively that it gets on their nerves.

And finally about the thing that has had the largest impact. Sharing.

Previously, in 2007, when Reader didn’t yet have sharing, we’d all come across nice links we’d want to share with our friends, and then either ping them on IM with it, or mail them the link. Needless to say, it was irksome. For both us and our friends. But somehow, when you shared it on Reader, the intrusiveness of sending links went away. It was just there, and if you liked it, you said so on the comments or by resharing it or referencing it in conversation. It stopped feeling like you were shoving it down someone’s throat, or someone shoved it down yours.

Sharing was also a nice way to filter content. For example, I loved reading Mental Floss’s feeds, but couldn’t stand the feed-puke that were feeds like TechCrunch and Reddit, whereas it was the other way for some of my friends. So we just followed each other, and I read the TechCrunch and Reddit content they deemed good enough to share, while I shared the interesting tidbits from Mental Floss.

Google Reader, I remember feeling, was a nice incubator for observing social network dynamics and introducing social features. It was my first first-hand exposure to recommender systems, before I moved to the USA and could actually shop on Amazon or watch movies on Netflix. It was interesting seeing how the recommendations incorporated stuff from your GTalk chats, your searches, stuff you ‘liked’… I remember freaking out about how after chatting often with a friend in LA my recommended feeds included a lot of LA-related blogs. And there was a search engine based treasure hunt at my undergrad college, and a friend and I remember saying “Oh man, googling stuff for this contest is so going to affect our Reader recommendations”.

It was also where I was recommended tons of blogs on ML and NLP and IR, due to which I went to grad school where I did, and did my thesis in what I did.

Also fun was the ‘Share as a note in reader’ bookmarklet. That way, I could share stuff from anywhere on the Internet with people who I knew would appreciate it.

Now it seems as if the Plus team wants to go and prove right that ex-Amazon Googler who said Google can’t do platforms well. Instead of providing services which can be used in a variety of ways to provide ‘just right’ experiences for a variety of people, Plus is trying to do it right all by itself. And failing miserably at that. The reason for Twitter’s success is the sheer variety of ways you can tweet – from your browser, from your smartphone, from your not-so-smart phone using Snaptu, from your dumb phone via text, your tablet, your desktop…. and I just don’t see that happening with Plus yet.

Maybe I wouldn’t be so mad if all the folks I share with on Reader were on Plus, but actually, hardly anyone is. And I don’t check my Plus feed on a regular basis either. I wouldn’t mind going on Plus to just read what everyone’s sharing, but the user experience is so bad I wouldn’t want that.

Google should have learnt that the Reader format doesn’t go well with the stream format back when it integrated Reader with Buzz: a lot of people found that irksome and simply silenced others’ Reader shares from their Buzz feed.

There’s so much quite obviously broken with the product that you wonder if the folks who design and code this up actually use it as extensively as you do. Dogfooding is super-important in products like Google’s where there are a wide variety of users and user surveys can’t capture every single aspect.

But given that doing this to Google Reader seems just like when they cancelled Arrested Development, you begin to think they are probably aware of everything, and just don’t care about you the user and your needs anymore.

PS: Can anyone help me get the Google Plus Python API up and running on Google App Engine? I want to play with it, see what it does, and am not able to get it up and running.

PPS: Does a Greasemonkey script to make G+ more presentable sound like a good idea?

PPPS: Check out the folks at HiveMined. They are building a replacement for Google Reader 🙂

Recommender Systems Wiki

Use and contribute and link to:

Now should have one such for ML methods in NLP and my life will be great

How I read PDFs now

I find I have a lot to read these days. Mostly electronic stuff. And I can’t/don’t always print things out.

I would love an Amazon Kindle, but almost everything I read is in .pdf format. And I like writing notes on paper, or if I’m reading on my laptop, I use OneNote or Tomboy Notes.

Of late, I tried out speed-reading techniques, and found that I read best when I’m able to look at the entire page at one shot. The problem with trying to do that on Adobe Reader is that if you view the entire page at one shot, the font size is too small to read without squinting.

So. What I do now is to flip the screen by ninety degrees. Ctrl-Alt-leftArrow. I get the same feel as reading a book, my finger running down the page, as is recommended for speed reading. I’m just surprised it took me so long to arrive at this. The downside, or rather, the upside of this is that I can’t do anything other than read, because typing, moving the mouse pointer via the touchpad and everything else becomes more demanding.

It’s definitely not a substitute for paper, but it’s better than the other options.

What other options, you ask? I found this tool called Readability, which makes reading HTML pages so much easier. I tried converting pdf to HTML, but when you have a combination of text and images and tables, like in many published papers, the conversion is not perfect, and there are quite a few hiccups in reading seamlessly.

I also tried converting .pdf into the Sony Mobipocket format, but again, seamless conversion is a lofty aim, it turns out. The results are hard to read.

So… for now, it’s Ctrl-Alt-leftArrow.

Do you have a better way to read PDF files? Please do share.

The Knapsack Problem and its possible applications on

I’ve never been much of an online shopper. I avoided shopping online as much as I could. Until recently.

I discovered Amazon’s Mechanical Turk just before I left India. I created an ID with my India address and forgot all about it.

Until a few weeks back. I made some money on MTurk by participating in surveys, transcription, stuff like that. And then I wanted to transfer it to my US account.

Nope, nada. You can’t. You can only get it delivered to your India address. Which will take six weeks. And $4. And you can’t change the address unless you delete your account (and lose your money) and create a new one.

So the only other option I have is to convert it to balance. Which I did.

And after buying stuff for friends and family on the website, I now have a modest amount left to spend on myself. Let’s say I have $15.

I want to know what I can get for $15 in books. What sort of combinations. Woody Allen + Milan Kundera? Artemis Fowl + Le Petit Nicolas? What am I missing out on? What can I buy which is better?

So… this can be considered a knapsack problem. It occurred to me after my situation reminded me of this xkcd cartoon:
XKCD travelling salesman + knapsack problem

So what is the Knapsack problem? I’ll explain the simpler, more common 0/1 Knapsack Problem. So you have a set of items. Each item has a weight w and a value v. The knapsack can hold at most a total weight of W. The problem is to choose which items to put in the knapsack (each item is either in or out, hence 0/1) so as to maximize the total value V in the knapsack, while keeping the total weight at or under W. You can learn more here.

My problem can be considered a knapsack problem with the weight w of each item as its price, and the value v as how much Amazon thinks I’ll like the item. That of course is possible using collaborative filtering (yes! I know a term!) and other techniques. They can also use their ‘frequently bought together’ feature here.
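Here is a rough sketch of that mapping in Python, using the standard dynamic-programming approach to 0/1 knapsack. Everything specific in it is an assumption of mine: the function name best_bundle, the placeholder book titles, the prices (in cents, so the weights stay integers) and the ‘how much I’d like it’ scores are all made up for illustration, and this is not how Amazon actually does anything.

```python
def best_bundle(items, budget_cents):
    """0/1 knapsack over (name, price_cents, score) tuples.

    Returns (best total score, list of chosen names) for the given budget.
    """
    # best[b] holds the best (score, names) achievable spending at most b cents.
    best = [(0.0, [])] * (budget_cents + 1)
    for name, price, score in items:
        # Walk budgets downward so each item is picked at most once.
        for b in range(budget_cents, price - 1, -1):
            candidate = best[b - price][0] + score
            if candidate > best[b][0]:
                best[b] = (candidate, best[b - price][1] + [name])
    return best[budget_cents]


# Hypothetical catalogue: (title, price in cents, made-up preference score).
catalogue = [
    ("Book A", 799, 4.0),
    ("Book B", 699, 3.5),
    ("Book C", 899, 4.6),
    ("Book D", 599, 3.0),
]

total_score, chosen = best_bundle(catalogue, 1500)  # a $15 budget
print(total_score, chosen)
```

The table has one entry per cent of budget, so the running time grows with (number of items) × (budget in cents). That is fine for a $15 basket, less so for a whole catalogue, which is where the ‘real time’ question comes in.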

Would others find it useful? Yes, if they are on a fixed budget like I am. Or if they want to buy just enough to be eligible for Free Super Saver Shipping.
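The free-shipping case is a slightly different question: what is the least I can spend while still reaching a threshold? A minimal sketch, assuming a hypothetical function name, made-up prices in cents, and a threshold I picked for the example (not the actual Super Saver figure):

```python
def cheapest_total_reaching(prices_cents, threshold_cents):
    """Smallest total of any subset of prices that is >= threshold_cents.

    Returns None if even buying everything falls short.
    """
    reachable = {0}  # totals that some subset of the items seen so far adds up to
    for price in prices_cents:
        # Each existing total either stays as is or has this item added once.
        reachable |= {total + price for total in reachable}
    qualifying = [total for total in reachable if total >= threshold_cents]
    return min(qualifying) if qualifying else None


# Hypothetical prices in cents and a hypothetical $20 threshold.
print(cheapest_total_reaching([799, 699, 899, 599], 2000))
```

This only reports the total, not which books make it up, and the set of reachable totals can grow quickly with many items, so treat it as a toy.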

Would it be possible in real time? I’m sure something can be worked out there.

And…. crowdsourcing this… any recommendations for nice books worth buying under $15?

On an aside, it’d be nice if had suggestions for most-frequent tags, and asked if we want these tags to be converted into categories.

And… Happy Thanksgiving! Wish you the best of Black Friday deals!

Creating iPhone Mockups using Adobe Fireworks

I’ve taken an introductory course in Human-Computer Interaction, and as part of it, I need to create paper prototypes of an iPhone app. We folks considered actually doing it on paper, as our instructor suggested, and decided it’d be way too much of a pain. We found these rather useful links which told us how to use Adobe Fireworks for creating iPhone mockups.

First up is the link to download a trial version of Adobe Fireworks, or buy it, if you so wish. Here.

Then this toolkit by folks at Blogspark, where every element you need has been redrawn as a vector so that you can edit it to your heart’s content, copy-paste, drag-drop… here.

And finally, this video explaining how to go about making iPhone mockups using the toolkit. Here.

It’s really really easy. Even a total noob like me, who has no idea of what looks good and has no experience of designing good-looking things on the Net, could come up with rather slick-looking iPhone screens. It’s great that there’s a tool like Fireworks which is designed explicitly for web prototyping. Fifteen minutes into the video, and you should be able to figure out most things on your own.

Damn awesome. I’m using Fireworks much more often now.


When I was looking up stuff for Blog Gender Analysis, I came across Great site. I guess it can be used for rapid prototyping and things. Just to see if a particular approach might work or not. Or something like that.

What is it used for, basically? Please do tell me… I’d like to know.
