I’m working on clustering Web People-Search results. So I have the results and snippets for one query in an XML file, and my program reads that and writes the clusters into another XML file.
I have thirty such XML files to process, so in the main program I called the clustering function once per file.
What I noticed was that the sizes of the output files and the processing times were growing linearly (4 KB, 8 KB, 12 KB…), and I wondered what had gone wrong with my code. On opening the output files, I saw an insane number of clusters, not at all what I expected. Oh, and the number of clusters was growing linearly too.
I wondered what the matter was… and went through my code multiple times.
Then it hit me.
I had a global dictionary and a set of vectors that the clustering function operated on. Bad programming practice, I know… I swear to god I'll never do it again.
Now when I called the clustering function multiple times, the global data persisted between calls, so input1 was processed fine, but input2 was processed together with input1's leftover data, input3 with the data of input1, input2, and input3… oh hell!
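Here's a minimal sketch of the bug (with made-up names, not my actual clustering code): a module-level dictionary survives across calls, so each call ends up clustering the accumulated data of every previous input as well.

```python
vectors = {}  # global state, shared across ALL calls

def cluster(input_data):
    # BUG: each call ADDS to the global dict instead of starting fresh.
    vectors.update(input_data)
    # Trivial stand-in for clustering: one "cluster" per vector.
    return list(vectors.keys())

print(len(cluster({"a": 1})))  # 1 -- input1 looks fine
print(len(cluster({"b": 2})))  # 2 -- input1's data is still there
print(len(cluster({"c": 3})))  # 3 -- output grows linearly, like my files did
```

The fix is as boring as the bug: make the state local, e.g. create the dictionary inside the function (or pass it in) so every call starts empty.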
And why this post? For one thing, so that I'll never repeat this error. And so that other noobs like me who read this will keep it in mind too. With Python, you can easily switch between OOP and scripting… and it's too easy to screw up on that.
If you have global data and want to run the code multiple times, don't be too lazy to write a shell script that calls the program once per input.
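Something like this (the filenames and `cluster.py` are hypothetical, standing in for your own program): each input gets its own process, so the globals start empty every time.

```shell
#!/bin/sh
# Run the clustering program once per input file; a fresh process
# means fresh global state, whatever the Python code does internally.
for f in input_*.xml; do
    [ -e "$f" ] || continue             # skip if the glob matched nothing
    python cluster.py "$f" "clusters_${f#input_}"   # input_01.xml -> clusters_01.xml
done
```

It's a workaround, not a cure, but it guarantees one run can't contaminate the next.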