Back to being back here
I don’t remember the last time I posted here. I don’t even think anyone remembers this place exists.
I’ve grown a lot career-wise. This blog was supposed to help me along that journey, but somehow it fell by the wayside. There’s also this overarching guilt of not doing enough to post here. My big plans still remain. But every time someone asks me about them, I chuckle sadly.
So what’s been going on with me? I graduated from UC Irvine under Dr. Ihler in 2011. After that, I did NLP for the finance industry for two years. It’s quite an interesting field, I must say. I had one class that covered insider trading, EBITDA, and Mergers and Acquisitions, and I found all of it enormously interesting. Unfortunately, I didn’t keep up with my financial knowledge. I didn’t really need it in what I did on a day-to-day basis.
And what did I do? I worked on a whole bunch of interesting things. You have a ginormous quantity of documents coming in, in so many different forms, and you need to parse them all and extract data from them. So you end up doing all these extremely basic things. You use OCR to convert image PDFs to text. You parse PDF in all its ugliness and convert it to a simpler format, while taking care to preserve some of the PDF-ey things about PDFs. And then it turns out the documents come in 90 languages and your clients speak English. So you translate 90 languages into English. Some of it’s easy, especially European languages. A lot of it is painful. But we aren’t looking for high-quality translations... just enough for the numbers in the financial documents to make sense. But then you run into a lot of unique problems. You don’t want to translate Yuan to Dollars. You find that most off-the-shelf translators are built for general language, not finance-specific language, so the translations come out different from what a finance reader expects.
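To make the Yuan-to-Dollars problem concrete, here is a minimal sketch of one way to keep a translator's hands off currency amounts: swap each amount for a placeholder, translate, then put the original back. This is an illustration, not what we actually ran. The regex, the placeholder format, and the `translate` callable are all assumptions made up for the example, and a real system would need a far richer pattern.

```python
import re
from typing import Callable, List, Tuple

# Assumed pattern: a number followed by a currency marker. Illustrative only.
CURRENCY_PATTERN = re.compile(
    r"\d[\d,]*(?:\.\d+)?\s*(?:yuan|RMB|CNY|元)", re.IGNORECASE
)


def protect_amounts(text: str) -> Tuple[str, List[str]]:
    """Replace each currency amount with a placeholder like __AMT0__."""
    protected: List[str] = []

    def _swap(match: "re.Match[str]") -> str:
        protected.append(match.group(0))
        return f"__AMT{len(protected) - 1}__"

    return CURRENCY_PATTERN.sub(_swap, text), protected


def restore_amounts(text: str, protected: List[str]) -> str:
    """Put the original amounts back in place of their placeholders."""
    for index, original in enumerate(protected):
        text = text.replace(f"__AMT{index}__", original)
    return text


def translate_preserving_amounts(
    text: str, translate: Callable[[str], str]
) -> str:
    """Run a (hypothetical) translator without letting it touch amounts."""
    masked, protected = protect_amounts(text)
    return restore_amounts(translate(masked), protected)
```

The sketch assumes the translator passes the placeholder through unchanged, which is worth checking for whichever translator you plug in as `translate`.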
And then you do other interesting stuff with all the stuff you’ve processed so far. You try Named Entity Recognition. You try recommending similar documents. You try identifying series in document streams. You try creating summaries.
All of it was mighty interesting. On a given day, I’d code in C, C#, Perl, Java and Python and it’d all be no big deal. I learnt what MVC and MVVM meant. I began taking a real interest in software design. I learnt how to write large, maintainable code. And I learnt the benefits of version control.
And then it was time to move.
I now work for a large online retailer’s Search & Discovery division. And that’s all I can say about it. Maybe some day I’ll reminisce fondly about the text mining challenges I face here, the scale of what I work on, and other things that will by then have become old hat. But not now.
I’ve had other interesting experiences with data in the meantime. Facebook NY organized a Data Science round table. The invitee list looked like Chief Scientist; Head, Data Science; Asst Prof... you get the drift... and then me, with less than two years of work experience. It was insanely interesting to meet such people and have them treat me like they had a lot to learn from me. I learnt so much that day that though I’ve forgotten all their names, the discussions are still etched in my mind. It isn’t every day that you have MCMC sampling explained to you over beer and fries someone else is paying for.
And then I tried a hack I’m not allowed to talk about, and I learnt there’s a feature in POS Taggers called the Gazetteer, where all you do is give it a set of phrases and the POS they belong to, and bam, any occurrence of those phrases (exact matches) is tagged thus. It’s insanely useful when you have your own new part of speech, like say, Celeb Names or Book Titles or some such.
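For anyone curious what a gazetteer boils down to, here is a minimal sketch of the idea in plain Python: a dictionary of phrases and tags, plus a longest-match lookup over the tokens. It is not the implementation inside any particular POS tagger; the function names, the `O` default tag, and the example tag names are all made up for illustration.

```python
from typing import Dict, List, Tuple


def build_gazetteer(entries: Dict[str, str]) -> Dict[Tuple[str, ...], str]:
    """Turn {"phrase": "TAG"} into a lookup keyed by token tuples."""
    return {tuple(phrase.split()): tag for phrase, tag in entries.items()}


def tag_with_gazetteer(
    tokens: List[str],
    gazetteer: Dict[Tuple[str, ...], str],
    default_tag: str = "O",
) -> List[Tuple[str, str]]:
    """Tag every exact occurrence of a gazetteer phrase, longest match first."""
    max_len = max((len(key) for key in gazetteer), default=0)
    tagged: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        matched = 0
        tag = default_tag
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in gazetteer:
                matched, tag = n, gazetteer[candidate]
                break
        if matched:
            tagged.extend((tok, tag) for tok in tokens[i:i + matched])
            i += matched
        else:
            tagged.append((tokens[i], default_tag))
            i += 1
    return tagged


if __name__ == "__main__":
    gaz = build_gazetteer({"War and Peace": "BOOK_TITLE"})
    print(tag_with_gazetteer("I read War and Peace yesterday".split(), gaz))
```

In a real tagger the gazetteer tags would be merged with the model's own predictions rather than replacing them wholesale; this sketch only shows the exact-match lookup.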
So that’s what I’ve been up to. Let’s hope I keep up this pace of blogging.