Category Archives: text mining

Navigating the Machine Learning job market

Over the past couple of months, I have been trying to navigate the machine learning job market. It has been a bewildering, confusing, and yet immensely satisfying and informative time. Talking with friends in similar situations, I find a lot of common threads, and I find surprisingly little clarity online regarding this.

So I’ve just decided to put together the sum total of my experiences. Your mileage may vary. After you’re done being a fresher, your situation and what you’re looking for gets a little more unique, so take whatever I say with a pinch of salt.

I’ve been passionate about machine learning for six years or more now. Though I didn’t realize it at the time, a lot of the project, career and course choices I made were driven by the question ‘does this help me get closer to a research-oriented job that involves text mining in some form?’. I went to grad school at a very research-oriented university and worked on a master’s thesis on an NLP problem, as well as a ton of course projects. My first job after that involved NLP in the finance industry. My second job also involved text processing. The jobs I got offers for after this period also involve NLP heavily. I’ve literally never worked on anything else. So you can understand where I’m coming from.

So. Machine learning jobs. Where are they, usually?

Literally everywhere, it turns out. Every company seems to have a research division that involves something to do with data, and data mining. The nature of these positions can vary.

There are positions where you need to have some knowledge of machine learning, and it kind of informs your job, which might or might not involve having to use ML-based solutions. Usually these positions are at large companies. As an example, you might be in a team whose output is, say, an email client. There’s some ML used in some features of the product, and it is important for you to be able to grasp and work around those algorithms, or be able to analyze data, but on a day to day basis you’re working on writing code that doesn’t involve any ML.

There are other similar positions where you deal with a higher volume of data, and the team has simple solutions to get meaning out of it. Maybe they use Vowpal Wabbit on a Hadoop cluster on occasion. Or Mahout. But they’ve got the ML bit nailed down, and most of the work is big data engineering. These positions are more common. If you have some ML on your resume, as well as Hadoop or HBase, these doors open up to you. Most of the places that require this kind of skillset are mid-sized companies that are just out of the startup phase.

Then you have the Data Scientist positions. This phrase is pretty catchall, and you find a wide variety of positions if you look for this title. Often at big firms, it means that you have knowledge of statistics, and can deal with tools like R, Excel, SQL databases, and maybe Python in order to find insights that help with business decisions. The volume of data you deal with isn’t usually large.

At startups though, this title means a lot more. You are usually interviewing to be the go-to person for all the ML needs in the company. The skills required include all the ones I mentioned above, plus a thorough knowledge of tools like scikit-learn and Weka, and experience working on ML projects. Some big data experience is usually a plus. Often, you’re finding insights in the data and prototyping things that an engineering team will put in production. Or maybe you’re putting them in production yourself, if ML is not central to the startup’s core business.

Most people are looking for the Research Engineer job. You aren’t usually coming up with new algorithms. But you’re implementing some. On the upper end of the scale, you’re going through research papers, implementing the algorithms in them and making them work. You need a fair idea of how to put code into production, and you deviate from the research by adding layers that make your system work in a more deterministic, debuggable fashion. An example would be several jobs at LinkedIn, where a lot of the features on the site need you to use collaborative filtering or classification. Increasingly, these jobs work on large data, but often that is not the case, and people manage fine using parallel processing instead of graph databases and MapReduce.

In a mature team, this position might not require you to use your ML skills on a day to day basis. In a new team, this position would need you to work on end to end systems that happen to use ML that you will be implementing.

In larger firms, you probably just need to have worked on ML in grad school and in your past jobs. The kind of data you’ve worked on doesn’t matter. In startups though, they start looking for more specific skills. They’d want someone who has specifically worked on topic modelling, say. Or machine translation. The complexity of their system doesn’t usually call for a PhD. They would grab an off-the-shelf solution if they could. But ideally they want someone who understands these things to own this component, manage it completely, and hit the ground running, which is why they want someone who has worked on the same or similar things previously.

Which brings me to another point. Not all ML jobs are interviewed for in the same way.

Several large as well as mid-sized tech firms hire you for the company, not for a specific team or role. Usually, the recruiter finds you based on buzzwords in your resume, and sets up interviews with you. The folks interviewing you probably work in teams that have nothing to do with your skills. It is possible to go through the interviews without answering even one ML question. Later, when you get hired, they try to match you to a team, taking your ML background into account to place you somewhere relevant. If you’re looking for a specific kind of job, this makes things harder, as you don’t know what kind of work you’ll be doing until you’re done with the whole process.

Like I said before, at startups you’ll probably know exactly what kinds of problems you’ll be working on. But more often, you’re hired into a group of sister teams. They all require similar skills. Maybe they work on different components of the same product, all of which use ML in different ways. So you have a fair idea of what you’ll be working on, but not necessarily a clear picture. You might end up working at the heart of the ML algorithm, or maybe you’re preprocessing text. The interviews will go over your ML background and previous projects as well as ML-related problem-solving.

Then there’s the Applied Researcher role. You usually need a demonstrated capability of working on reasonably complex ML problems. You occasionally put things in production and need good coding skills. Often, you’re prototyping things after researching different approaches. When you do put things in production, they are usually tools used by other teams that have ML in their solutions. Language is no bar, but usually there’s an agreed-upon suite of tools and languages that the team uses.

The Researcher role usually requires a PhD. Your team is probably the idea factory of the company, or of that particular line of business. Intellectual property generation is part of the job. I don’t have much insight into this line of work, because I haven’t known very many people opting for these positions. It feels increasingly like PhDs take up the Applied Researcher/Research Engineer role in a team and do the prototyping and analyses, while others help with that as well as put these prototypes into production.

There’s a lot of overlap in all these different types of positions I’ve mentioned, and it isn’t a watertight classification. It’s a rough guide to the different kinds of positions there are.

So where do you find these jobs?

LinkedIn is a great resource. You can use ‘machine learning’, ‘data mining’, ‘image processing’, ‘data science’, ‘text mining’ or ‘natural language processing’ as search keywords. I’ve also found Twitter to be a great place to search for jobs using these same keywords.

There are tons of job boards that also let you search using these keywords. Beyond those, I find a lot of ML-specific job fora. KDNuggets Jobs, NLPPeople and LinguistList are browsable job boards. There are also mailing lists like ML-News and SIG-IRList. I’ve also found /r/MachineLearning on Reddit to be a good resource on occasion for jobs.

Now that you’ve found a position and sent them off your resume and they got back to you, what do you expect in the interview? Wait for my next post to find out!

Learning to Link With Wikipedia – II

I’m done with most of pre-processing. Feel free to tell me how crappy my code is. Just be polite, otherwise I’ll probably cry. This takes ages to write to disk. That’s the bottleneck. It’s a sort of hackjob, though I must say I used to write worse code.

And you can use this code if you like.

import re
import sys
import xml.dom.minidom


class xmlMine:
    # headings of regions in wiki pages where looking for links is meaningless
    skippedSections = ["References", "See Also", "External links", "Sources",
                       "Notes", "Notes and references", "Gallery"]

    def __init__(self):
        self.stopWordDict = {'': 1}  # dictionary of stopwords
        self.titleArticleDict = {}  # hashmap of titles mapped to articles.
        print("instantiated")

    def getStopwords(self, stopWordFile):
        # loads stopwords from file to memory, one stopword per line
        with open(stopWordFile) as stopWordObj:
            for stopWord in stopWordObj:
                stopWord = stopWord.replace("\n", "")
                self.stopWordDict[stopWord] = 1

    def cleanTitle(self, title):
        # removes non-ascii characters from title
        return "".join([x for x in title if ord(x) < 128])

    def extractLinksFromText(self, textContent):
        textContent = "]] " + textContent
        textContent = textContent.replace("\n", " ")  # remove linebreaks
        textContent = textContent.replace("'", "")  # remove quotes. they mess up the regexes.

        # drop everything from each skipped heading to the end of the page.
        # matched case-insensitively so that "See also" is caught as well.
        for section in self.skippedSections:
            refs = re.compile(r"==\s*" + section + r"\s*==.+", re.IGNORECASE)
            textContent = refs.sub(" ", textContent)

        # drop wikitables
        refs = re.compile(r'\{\|\s*class="wikitable".+?\|\}')
        textContent = refs.sub(" ", textContent)

        textContent = textContent + "[["

        # remove stuff that's not enclosed in [[]]
        brackets = re.compile(r"\]\].*?\[\[")
        textContent = brackets.sub("]] [[", textContent)
        wordList = textContent.split("]] [[")  # and store only the list of words sans the brackets

        altText = re.compile(r".*?\|")  # part before |
        numbr = re.compile(r"\d")  # number
        punct = re.compile(r"\W")  # punctuation
        trailingS = re.compile(r"^(.*[bcdfghjklmnpqrtvwxyz])(s)$")  # s after consonant

        newWordList = []

        for word in wordList:
            word = word.lower()  # convert to lowercase
            word = altText.sub("", word)  # remove part before |
            # replace number, punctuation by space
            word = numbr.sub(" ", word)
            word = punct.sub(" ", word)

            # if space added, split by space. replace by two/more words
            for newWord in word.split(" "):
                # remove trailing s after consonant
                if trailingS.match(newWord) is not None:
                    newWord = newWord[:-1]
                if newWord not in self.stopWordDict:  # remove stopwords
                    if len(newWord) > 2:  # no point of too-short words.
                        newWordList.append(newWord)
        return newWordList

    def extractTextFromXml(self, xmlFileName):
        # extracts the <title> and <text> fields from the xml files
        # processes both.
        xmlFile = xml.dom.minidom.parse(xmlFileName)
        root = xmlFile.getElementsByTagName("mediawiki")
        for mediaWiki in root:
            pageList = mediaWiki.getElementsByTagName("page")
            for page in pageList:
                titleWords = ""
                text = []
                textNodes = page.getElementsByTagName("text")
                for textNode in textNodes:
                    # skip empty <text> elements, which have no child nodes
                    if textNode.childNodes and textNode.childNodes[0].nodeType == textNode.TEXT_NODE:
                        text = self.extractLinksFromText(textNode.childNodes[0].data)
                titleNodes = page.getElementsByTagName("title")
                for titleNode in titleNodes:
                    if titleNode.childNodes and titleNode.childNodes[0].nodeType == titleNode.TEXT_NODE:
                        titleWords = self.cleanTitle(titleNode.childNodes[0].data)
                self.titleArticleDict[titleWords] = text


def main():
    # The listing as posted never loaded its inputs. Taking the stopword file
    # and the XML dump from the command line is an assumption made here.
    stopWordFile, xmlFileName = sys.argv[1], sys.argv[2]
    a = xmlMine()
    a.getStopwords(stopWordFile)
    a.extractTextFromXml(xmlFileName)

    # write one "title:link,link,..." line per article, straight to the file
    # rather than growing one huge string in memory
    with open("links.txt", "w", encoding="utf-8") as opFile:
        for article in a.titleArticleDict.keys():
            linkList = a.titleArticleDict[article]
            opFile.write(str(article) + ":" + ",".join(str(link) for link in linkList) + "\n")


if __name__ == "__main__":
    main()
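
For comparison, the core step of pulling out whatever sits inside [[...]] can be sketched more compactly with re.findall. This is only a rough sketch and not the script above: the name extract_link_texts is made up for illustration, and it keeps the text after a | the way the script does, but skips the stopword, number and plural handling entirely.

```python
import re

# Hypothetical helper for illustration; not part of the script above.
# Note that "." does not match newlines, so links split across lines are missed.
LINK_PATTERN = re.compile(r"\[\[(.*?)\]\]")


def extract_link_texts(wikitext):
    """Return the visible text of every [[...]] link, lowercased."""
    texts = []
    for inner in LINK_PATTERN.findall(wikitext):
        # [[target|shown text]] -> keep only the part after the last pipe
        shown = inner.split("|")[-1]
        texts.append(shown.strip().lower())
    return texts


sample = "[[Data mining]] is related to [[machine learning|ML]] and statistics."
print(extract_link_texts(sample))
```

Matching the brackets directly avoids the trick of padding the text with "]] " and "[[" before splitting, at the cost of not handling nested brackets.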

Learning to Link with Wikipedia – I

I hope to maintain a log of the project I’m working on for my Data Mining course this quarter. I find blogging makes me feel more accountable on a day-to-day basis, and I could really use any help that comes my way on this.

So now to the problem:

Identifying which terms in a Wikipedia article need to be linked to other articles.

I have a dataset to work with. It has information about labels on the data and the words present in each document. I’m now trying to extract which words are linked.
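
To make the target concrete: once the linked words are extracted, every word in a document can be tagged as linked or not linked, which is the label a classifier would then learn from. A rough sketch, assuming the document words and the linked terms are already available as plain Python lists (the function name and the sample data are made up for illustration, and are not from my dataset):

```python
def label_terms(document_words, linked_terms):
    """Pair every word with 1 if it is linked in the article, else 0."""
    # compare case-insensitively; this is an assumption, not a given of the task
    linked = {term.lower() for term in linked_terms}
    return [(word, 1 if word.lower() in linked else 0) for word in document_words]


# made-up example data
words = ["Clustering", "is", "a", "data", "mining", "task"]
links = ["clustering", "mining"]
print(label_terms(words, links))
```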

So, yeah, still stuck in preprocessing.

I’ll post the Python script after I’m done with it. Which should happen in the next few hours. Till then, I’m offline 🙂

Gender Analysis.

I didn’t do as much literature survey on this as I’d’ve wanted, but I came across this paper [pdf]. Word frequencies are different among men and women, apparently. That’s the basis of disambiguation. Women use more pronouns than men do, and the frequency compares with that of fiction, while that of men compares with nonfiction.
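
The pronoun claim is easy to turn into a number. Here is a minimal sketch of a pronoun-rate feature, assuming a small hand-picked pronoun list and a crude regex tokenizer. Both are my own assumptions for illustration and not the feature set used in the paper.

```python
import re

# Assumed, deliberately small pronoun list; not taken from the paper.
PRONOUNS = {"i", "me", "my", "you", "your", "he", "him", "his", "she", "her",
            "it", "its", "we", "us", "our", "they", "them", "their"}


def pronoun_rate(text):
    """Fraction of word tokens in text that are pronouns (0.0 for empty text)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token in PRONOUNS) / len(tokens)


print(pronoun_rate("She told me that her results were good."))
```

Comparing this rate across a fiction sample and a nonfiction sample would be a quick sanity check on the genre part of the claim before trying it on gender.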

So I guess it should work like this: identify genre of the piece, and then identify gender.

What say?

Blog Gender Analysis

Recently, on my main blog, I found people commenting to say that they were debating my gender going solely by my writing. That brought back an old set of ideas I had.

There’s no dearth of web apps that determine the gender of the writer given a sample piece of writing. But these were mostly erroneous when they started off – Jane Austen was classified as a male writer by one of these, I remember.

Now however, GenderAnalyzer seems to have improved. Guess it’s due to learning, increasing of the sample space, etc etc. Not at all… they have just gone on from randomly tagging things as Male to tagging things Female.

I thought this was strictly for entertainment purposes, until I saw this as one of the possible tasks on the TREC Blog Track. That set me thinking.

The first application of such a technology that came to mind was spawned by Agatha Christie’s novels – determining whether the writer of threatening notes was a man or a woman. It helps narrow down the suspects, look out for possible accomplices… yeah, it can be put to various uses.

So over the next couple of days, I should try reading more on this, and try analyzing the rationale (if any) behind this task. I’m skeptical, as I feel something as inherently biological as gender does not map perfectly to socially and culturally influenced things like writing style, and hence any such task is an exercise in futility.

But let’s see.

Watch this space.
