crunch, crunch, crunch (data)
[Oct. 18th, 2006|09:35 am]
So the Netflix Prize just changed a rule - now instead of only being able to submit once a week (see my earlier screwup), starting tomorrow you can submit once a day! This is good in general, but given the progress that people are making, it leads me to believe that the contest might be over in January (the earliest it can be over under the rules). Neat that people are doing so well, but I feel a bit outclassed. At least they added more people to the leaderboard so my next submission has a shot of making it on there!
But for now, I'm continuing to crunch data. In my WoW-addled state last night I started computing the correlations between all pairs of users and storing all that data, before realizing that since there are around 500,000 users this would be 250 billion lines of data (500,000 squared), which I don't have the hard drive space for and which, given how slow file I/O seems to be, would take forever to write. Lo and behold, this morning just under 2,500 users were done, which at that rate would mean around 9.5 weeks to finish. So now I'm only storing the top 100 results for each user - we'll see how big a win that is when I get home. I might have to go to a probabilistic approach if even that is going to take too long (right now it still calculates all pairs of correlations, it just doesn't write them all to disk). Had a conversation with djedi this morning to solidify some ideas about how to manage that data...
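Here is a rough sketch of the "keep only the top 100 per user" idea, not the actual code running at home. It assumes the ratings fit in memory as a dict of `{user: {movie: rating}}`, and it assumes Pearson correlation over the movies two users have both rated, since "correlation" could mean other things. The names `pearson` and `top_neighbors` are made up for illustration.

```python
import heapq
from math import sqrt


def pearson(a, b):
    """Pearson correlation over movies rated by both users.

    a and b are {movie_id: rating} dicts. Returns 0.0 when the
    correlation is undefined (fewer than 2 shared movies, or one
    user gave every shared movie the same rating).
    """
    common = a.keys() & b.keys()
    n = len(common)
    if n < 2:
        return 0.0
    xs = [a[m] for m in common]
    ys = [b[m] for m in common]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def top_neighbors(ratings, k=100):
    """For each user, keep only the k most correlated other users.

    Every pair is still computed; only the storage shrinks, from
    one entry per pair to at most k entries per user.
    """
    result = {}
    for user, user_ratings in ratings.items():
        scored = (
            (pearson(user_ratings, other_ratings), other)
            for other, other_ratings in ratings.items()
            if other != user
        )
        # nlargest holds at most k candidates at a time instead of
        # sorting the full list of correlations for this user.
        result[user] = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    return result


if __name__ == "__main__":
    toy = {
        "a": {1: 5, 2: 3, 3: 1},
        "b": {1: 4, 2: 3, 3: 2},
        "c": {1: 1, 2: 3, 3: 5},
    }
    print(top_neighbors(toy, k=1))
```

With 500,000 users and k = 100 that is at most 50 million stored entries instead of 250 billion, but the loop still touches every pair, so it only fixes the disk problem and not the running time.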
After work we're planning on going to LL Bean and getting a real jacket, although it's nice weather out today (high of 77!).