Greg - I love data
Greg

I love data [Apr. 25th, 2007|04:05 pm]
[Current Mood | giddy]

Thanks to Peter Norvig's article about writing a spelling corrector (he wrote my AI textbook! and is director of research at Google!!), I was reminded that Google released a giiiiant list of n-grams found on the web. Unfortunately, it's for noncommercial use only unless you join and pay thousands of dollars (noncommercial is OK for me), and it costs $180(!) to buy and ship. On the other hand, it's 6 DVDs of compressed data (24 GB of gzipped files). This is soooo tempting.

Comments:
[User Picture]From: wonderjess
2007-04-25 10:19 pm (UTC)

too bad it's past your birthday. :)
[User Picture]From: quijax
2007-04-26 02:36 am (UTC)

I have that book! I didn't realize you dabble in NLP. Just out of curiosity, do you have any projects in mind?
[User Picture]From: gregstoll
2007-04-26 03:24 am (UTC)

I dabble in random AI stuff (although not in a while), and NLP seems neat!

I'm making a list of projects I would do if I got the DVDs. If these still seem convincing enough in a week, then I'll probably get them. (The wait is to avoid the "this sounds super super cool" effect where I then forget all about it, which happens a lot with me.) Maybe we could share them if you're interested :-)

Anyway:
- showing simple letter frequency
- some sort of solver of cryptograms (knowing the prior probabilities of words would be helpful, and it would be a super huge dictionary)
- simple interface to see which words are more common. You could do it with people's names for extra motivation!
- Google Suggest kinda thing where you type in a few letters and you get the most common words starting with those letters.
- one of those Markov chain things that spit out English text based on 1-grams, 2-grams, etc.

This is just off the top of my head - I'm open to other cool ideas!
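
A rough sketch of how three of these (letter frequency, the suggest-style lookup, and the Markov chain) might look in Python. It assumes the n-gram data has already been loaded into plain dicts of counts; the `counts` dict, its numbers, and all the function names are made up for illustration and have nothing to do with the real data set or its file format.

```python
import random
from collections import Counter, defaultdict

# Toy stand-in for the unigram counts; these numbers are invented.
counts = {"the": 100, "then": 20, "there": 30, "data": 15, "dog": 5}


def letter_frequency(word_counts):
    """Relative frequency of each letter, weighting each word by its count."""
    totals = Counter()
    for word, n in word_counts.items():
        for ch in word:
            if ch.isalpha():
                totals[ch.lower()] += n
    grand_total = sum(totals.values())
    return {ch: n / grand_total for ch, n in totals.items()}


def suggest(word_counts, prefix, k=3):
    """The k most common words starting with prefix (ties broken alphabetically)."""
    matches = [w for w in word_counts if w.startswith(prefix)]
    return sorted(matches, key=lambda w: (-word_counts[w], w))[:k]


def markov_text(bigram_counts, start, length=5, rng=random):
    """Build a word list by sampling each next word in proportion to bigram counts."""
    followers = defaultdict(dict)
    for (first, second), n in bigram_counts.items():
        followers[first][second] = n
    words = [start]
    # Stop early if the last word was never seen as the first half of a bigram.
    while len(words) < length and words[-1] in followers:
        options = followers[words[-1]]
        next_word = rng.choices(list(options), weights=list(options.values()))[0]
        words.append(next_word)
    return words
```

With the toy data, `suggest(counts, "the")` gives `['the', 'there', 'then']`. A linear scan like this would be far too slow for the full list; a sorted list or a trie would be the obvious next step, but that is beyond this sketch.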
[User Picture]From: quijax
2007-04-26 10:38 pm (UTC)

Wow, cryptograms. That one sounds hard enough to be interesting.