
Application: The Badges Problem

Like any good data miner, I started by watching The Office. Unfortunately, the episode “The List” did not help me figure out the winners and losers in our dataset, since Dwight, Oscar, Andy, Phyllis, Angela, Gabe, Kelly, Pam, and Meredith were all misclassified. I guess this was going to be a little harder than watching TV. Below is my story of how I solved Project 6: “What's in a Name?”. My process began with summary statistics, digressed into a few hours of chasing irrelevant features and wrestling with data mining tools, and finished when I discovered the pattern using the simplest of tools: a text editor. My story concludes with the devastating discovery that I, Trevor Whitney, am a loser.

We were given a list of names, each labeled a winner or a loser. My first step was to extract the class distribution from the set. Maybe the distribution of classes could tell me something about the data? For example, if there were only 10 out of 298 names that were losers, that would be a pretty easy set to examine manually for patterns. Since the data was in a SQLite database, the first tool I used was a SQLite console. I ran a couple of queries, such as select count(id) from people where label == "+", and again where the label was “-”, to get some summary statistics about my class distribution. I found that 213 names, or 71% of my data, were winners, while 85, or 29%, were losers. While there was a considerable skew towards winners, 85 was still more records than I could go through manually, so I would need to extract features from these names and run them through a classifier to find my pattern.
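
For completeness, here is a minimal Python sketch of those summary-statistics queries. The database file name badges.db and the assumption that the labels are stored as plain "+" and "-" strings are mine; the people, id, and label names come from the queries above.

    import sqlite3

    # Open the project database (the file name "badges.db" is an assumption).
    conn = sqlite3.connect("badges.db")

    # Count each class to get the distribution of winners (+) and losers (-).
    counts = {}
    for label in ("+", "-"):
        row = conn.execute("SELECT COUNT(id) FROM people WHERE label = ?", (label,)).fetchone()
        counts[label] = row[0]

    total = sum(counts.values())
    for label, n in counts.items():
        print(f"{label}: {n} records ({100 * n / total:.0f}%)")

    conn.close()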

Next I brainstormed some features I could pull from these names. I limited my scope to features that could easily be obtained algorithmically, since we were instructed to write scripts to extract these features. So, for example, I did not try to figure out whether a name rhymed with anything, since this would be rather hard to teach a computer to do in a quick “script”. Instead I focused on things like counting the characters in the first and last name, extracting the first and last character from each name, and so on. The features I pulled out (sketched in code after the list below) were:

  • Character count of each name: first, last, and middle if available.
  • Character count for the entire name.
  • First and last character of each name: first, last, and middle if available.
  • Vowel count for each name: first, last, and middle if available.
  • Vowel count for the entire name.
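
Here is a rough Python sketch of the kind of extraction script this describes. The input format (a full name as a single whitespace-separated string) and the feature names are my assumptions; a middle name, when present, simply shows up as an extra name part.

    VOWELS = set("aeiouAEIOU")

    def name_features(full_name):
        """Surface features of a name: lengths, first/last characters, vowel counts."""
        parts = full_name.split()
        features = {}
        for i, part in enumerate(parts):
            features[f"len_{i}"] = len(part)                          # character count per name part
            features[f"first_{i}"] = part[0].lower()                  # first character
            features[f"last_{i}"] = part[-1].lower()                  # last character
            features[f"vowels_{i}"] = sum(c in VOWELS for c in part)  # vowel count per part
        joined = "".join(parts)
        features["total_len"] = len(joined)                           # character count, whole name
        features["total_vowels"] = sum(c in VOWELS for c in joined)   # vowel count, whole name
        return features

    print(name_features("Trevor Whitney"))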

I put all these values into WEKA and ran them through a J48 Decision Tree Classifier. My best result with this data was an 86% classification accuracy. However, my tree was very wide and very deep. Because of the shape of my tree, and because I knew the data could be correctly classified at a rate of about 98%, it was time to rethink my process.
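
WEKA's J48 classifier is an implementation of the C4.5 decision-tree algorithm. As a rough Python analogue of this step, one could feed the same features into scikit-learn's DecisionTreeClassifier (a CART-style tree, so the numbers will not match WEKA exactly). The sketch below reuses the name_features function shown after the feature list, along with the same assumed database schema:

    import sqlite3
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # The "name" and "label" column names are assumptions about the schema.
    rows = sqlite3.connect("badges.db").execute("SELECT name, label FROM people").fetchall()

    # name_features() is the extraction sketch shown after the feature list above.
    X = DictVectorizer(sparse=False).fit_transform([name_features(name) for name, _ in rows])
    y = [label for _, label in rows]

    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
    print(f"10-fold cross-validated accuracy: {scores.mean():.1%}")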

At this point I was baffled, and a little tired. I had no idea what the pattern could be. The problem with the attributes I had generated was that I had been thinking in terms of general attributes. In other words, I was extracting attributes you could pull from any word, such as length, first and last characters, and vowel count. I needed to focus on the particular words in our dataset, on my particular domain. To do this I went back to my SQLite console and ran two simple queries to give me all the winners and all the losers, which I put in two separate text files. I loaded these files side by side in my text editor and started to compare them. I had a few hypotheses that I tested on small sets, and most of them failed. Then, suddenly, I saw it! The names on the right had a vowel after the first letter of the first name. I scanned through the list, and while this wasn't true in every case, it was true in most of them. The same held on the left: while not true in every case, most of those names had a consonant after the first letter. Since I knew there was some noise purposefully put into the data set, I went about testing my hypothesis. I created a new column in the database called “secondLetterVowel” and populated it with a 1 if the first name had a vowel as its second letter, and a 0 if it did not. I plugged my results into WEKA, ran them through a J48 classification tree, and generated a new decision tree.
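
The column-population step described above might look roughly like the following in Python. Apart from the secondLetterVowel column name, the identifiers here (the badges.db file, the people table, and its id and name columns) are my assumptions, and I treat only a, e, i, o, and u as vowels.

    import sqlite3

    conn = sqlite3.connect("badges.db")

    # Add the new attribute (assumes the column does not already exist).
    conn.execute("ALTER TABLE people ADD COLUMN secondLetterVowel INTEGER")

    # Flag names whose first name has a vowel as its second letter.
    for person_id, name in conn.execute("SELECT id, name FROM people").fetchall():
        first = name.split()[0].lower()
        flag = 1 if len(first) > 1 and first[1] in "aeiou" else 0
        conn.execute("UPDATE people SET secondLetterVowel = ? WHERE id = ?", (flag, person_id))

    conn.commit()
    conn.close()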

Voilà! I had found the solution! Sure, there were still a couple of instances of noise that prevented this from being a perfect classification, but the resulting tree correctly classified 98% of the data. From this process, I learned a few things about data mining. First of all, it's really important to know your data and its domain. My first attempt failed because I was just creating general attributes that could apply to any word; I needed to focus on the words in this particular dataset. Furthermore, I learned that, at least in this case, the best tools for exploring my data were SQL queries and a text editor. WEKA was useful for proving my hypothesis and, more importantly, for showing me the unimportance of all the features I extracted in my first attempt. However, a good visualization of my data, which in this case was a couple of text files showing the different classes of text data I had, was my most valuable tool. So in conclusion, exploring and understanding your data is often the most important part of data mining. That, and I'm a loser.
