So it turns out that to get the Google n-grams, you have to sign a user license agreement that forbids the more interesting things I'd like to do. Bummer. So I sent them an email:

Firstly, I'd like to thank Google for making this gigantic corpus available to the public. When I first read about it I was excited and immediately thought of a number of neat applications I could build using the data (see my post at http://gregstoll.livejournal.com/114129.html). Unfortunately, it looks like I won't be able to use the data as provided.

The first issue was the price. I would just be doing these projects in my free time, for noncommercial use, but the page at the Linguistic Data Consortium (http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13) lists the price for nonmembers as $180 (including a $30 shipping fee). A fee for processing and handling is certainly reasonable, but $180 is a bit steep for 6 DVDs. Admittedly, this price may be set by the LDC itself, as most corpora are at least this expensive, but I had to carefully consider whether it was worth the price for a side project.

Eventually I decided it would be worth it to get my hands on such a great resource, and I even sent in the order to LDC. Unfortunately, I sent in the wrong license agreement, and I was dismayed when I then found the agreement specific to this corpus (http://www.ldc.upenn.edu/Catalog/mem_agree/Web_1T_5gram_V1_User_Agreement.html). Since I wouldn't be allowed to "publish, retransmit, display, redistribute, reproduce or commercially exploit the Data in any form", except for "limited excerpts from the Data in articles, reports and other documents describing the results of User’s linguistic education and research", the more interactive ideas I had (a cryptogram solver, an algorithm to calculate the probability of a given sentence, a generator of English-like text) wouldn't be allowed.

So I'm forced to rethink my plans and try to gather my own corpus from the web, which will be undeniably smaller and less accurate. Of course I understand that Google was under no obligation to provide this data in the first place, but it is a little frustrating to have it so tantalizingly close and yet be unable to use it.

Thanks anyway!


(crossposted to http://gregstoll.livejournal.com/115202.html)

Anyway, I guess I'll try to make/find my own corpus. That could be fun in itself :-)
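
For what it's worth, here's a rough sketch of how counting n-grams from my own text might start out. Everything in it is made up for illustration: the `count_ngrams` name, the naive lowercase-and-regex tokenizer, and the sample sentence are all assumptions, and a real corpus would need much more careful cleanup.

```python
import re
from collections import Counter

def count_ngrams(text, n):
    """Count word n-grams in text (hypothetical helper, naive tokenizer)."""
    # Assumption: lowercase everything and treat runs of letters and
    # apostrophes as words. Punctuation and digits are simply dropped.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Slide a window of n tokens across the list and count each tuple.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Made-up sample text standing in for a scraped corpus.
sample = "the cat sat on the mat and the cat slept"
bigrams = count_ngrams(sample, 2)
print(bigrams.most_common(3))
```

Counts like these would be the raw material for the sentence-probability and text-generator ideas, just on a much smaller scale than the Google data.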

