Return to data mining index

Project 9 - KDD Cup


My first step in building an information platform to mine this data was to get everything into a relational database. This would be the foundation of my platform, providing easy access to scripts and other tools I would use to look for patterns. A relational database made the data easier to browse, as well as making it easier to generate summary statistics and write scripts to both extract and create features if necessary. Due to the size of the data, however, this step was more difficult, or at least time consuming, than it should have been. It was more time consuming because I was sloppy, and did not develop in a test-driven manner. I should have extracted a few records from each of the data files that had relations to one another, and tested my database populating algorithm with that. By making sure I had records representing all variables I would encounter, such as albums or tracks without artists, tracks without albums, and albums and tracks both with and without genres, I could verify my algorithm using a much smaller dataset. Since I did not do this, and instead ran each iteration on the entire dataset, every error took exponentially longer to discover. For example, I had a bug that nested the command inside the conditional block that only ran if the element had genres associated with it. Thus, all albums without genres were not getting saved. This was hard to detect, however, since most of the records were still being saved, so the script was processing a lot of albums, making it look like everything was working fine. I did not detect the problem until the algorithm moved on to adding tracks. It got about 15 tracks in until it spit up an error that the specified album ID for the current track could not be found. It was not until this point that I took a count of the database records in the Albums table and discovered I was missing a bunch of records. Also, because there still were tens of thousands or records in there, it took a while to pin-point the missing record I was looking for, so that I could see which records were saved before and after it in order to figure out my problem. I've included this anecdote because it was important realization about building an information platform. When working with large datasets, it is essential to test import and processing scripts on a subset of the data. It takes a long time to work through large datasets, and this procedure is often done in a batch job, overnight, on another machine, or in parallel. Therefore, errors may not be caught until hours, or even days later, at which point the entire process may have to be restarted. Testing scripts on smaller subsets of the data gives immediate feedback, so that all problems can be fixed before the rest of the data is processed. This was a lesson I learned the hard way, and will keep in mind for future mining endeavors. The code I used to populate the database was stored in a Populate class, which can be seen below.

Once I finally got all the information into a database, I took some quick summary statistics, just to see how many records I was working with, and if all the records were imported properly. I found the following statistics:

Table # of Records
Users 44
Tracks 507172
Artists 27888
Albums 88909
Genres 19354

Next I began thinking about how I would compare these records. I was thinking of comparing similar artists, similar genres, artists of similar genres, albums of similar artists, etc. Therefore, if you liked a track from a certain genre, you might also like an artist from a similar genre, or if you liked a track from one artist, you might also like a track from a similar genre. My idea for this was to cluster genres based on similar characteristics. Unfortunately, we didn't have any real data about genres, albums, or artists, all we had was numbers. Therefore, the only way we could group similar artists, albums, genres, and tracks was by comparing how users liked them. In other words, if two users highly rated the same genres, then maybe those genres are similar. Unfortunately, that doesn't really say anything about the similarity of the genres, just the similarity of the users' tastes. Since the only way to group entities was through the users' tastes, I decided to just compare the users' tastes. To do this, I created a new table called "similarities" in which I would compare each user against every other user for each type. The schema for this table contained four columns, user1_id, user2_id, score, and type . Each record represented a relationship between two users where type was either album, artist, genre, or track, and the score was the euclidean distance between their ratings for all elements within that type that they both rated. The code to create populate this table was stored in a class called Compare , and can be seen below.

Unfortunately, all my the similarity scores I found were either 1 or 0. I used the exact same euclidean distance algorithm I use in project 4, which was tested and proved to work. Therefore, there must be something wrong with my code somewhere. It's either in my Compare class, or my Populate class. While it seems the records went into the database right, maybe they got screwed up so everyone either has no shared scores, or all the same scores for the same items. Or, maybe the way I'm building the data structures I feed to the euclidean distance function are wrong. The unfortunate fact is, wherever my problem lies, I have run out of time to fix it. I started this project on Monday. I had the bug in my populate method that I mentioned above, which I didn't discover and fix until Tuesday night. On Wednesday I started my seemingly successful import, and it took over 24 hours to complete. I would appreciate feedback on whether this was due to poorly written code or just the sheer size of the data. Needless to say, I was not until Thursday night I could really start working with the data in the database. This is an important lesson about working with big data, everything just takes longer. I need to fully rethink my approach, and do the whole project, start to finish, on a subset of the data. That way I can better verify my algorithms and the data in the database by writing test suites. With this size of the data, I'm not really sure where to begin with test suites, or even how to test my code. Once I thought I got the populate class to work, I figured I could proceed with the full data set as that would be the most error prone part. Apparently, I was wrong, and given the chance to start over, I would approach this entirely. That said, it was a valuable lesson about building an information platform and working with big data sets.

If I had successfully populated the similarities table, my plan was to use an artificial neural network. For each relationship in the "similarities" table, I was going to rank each users likeness to each other by sorting the scores. So, the users with the highest similarity to each other for albums would be ranked 0 for albums, and the two users with the worst similarity would be ranked N-1, where N is the number of users. Therefore, every user would have four relationships, one for each type, to every other user. Each of these relationships would be ranked 0 to N-1 for that type. For each piece of data in the training set, I was first going to find all others users who rated that element. I was going to weight their rating for that item by their rank to the respective user I am trying to predict a rating for in that category. The weight would be equal to $1/(1 + \frac{rank}{N-1})$. So, for example, imagine we are trying to predict whether or not Bob will like the album "Smash" by The Offspring. I would find all other users who rated "Smash", weight their rating based on their rank, and then find the average of all those ratings. This way, if Steve is the best match for Bob in terms of liking the same albums, his rating gets a full weight of 1, while all the others will be weighted less than 1. I would figure out this average rating value for track, album, and artist for each item. Therefore, if I'm predicting how much Bob will like "Smash", I will find the average weighted rating of other users who rated "Smash", as well as their combined average rating for any tracks on "Smash", and their combined average weighting for the band The Offspring. These 3 values would be fed into an ANN, and the output of the ANN would be Bob's predicted rating for that album. Since genres were just tags, different users might have labeled the same elements as different genres. Therefore, I wasn't sure how to use genres in my above approach, so my plan was to omit them.


I was not successful in completing the project. The data set was too big for the time I had to mine it. I did, however, learn some important things for mining big data in the future. First, always test on a sub-set. This doesn't just mean the parsing/populating step. Do an entire test of your information platform, from start to finish, on a subset of the data. While the results might be off because you did not have enough data to train the model you created, at least you can check to make sure the data and code is accurate at each step. Second, do not use SQLite as a backend. I could not read from SQLite in Weka, Knime, Rapid Miner, or Matlab. It's seems MySQL or Postgres would be better options, though I did not get a chance to test either of these. I have added my SQLite database file to my repository so you can see my data, the problems I'm talking about with the similarity table, and maybe provide some insight on what I did wrong. I terminated the script before it calculated all similarities since it was clearly not working.

Return to data mining index