Originally published to posts.postlight.com on Feb 17, 2016.
Corpora time! Here’s an article about processing 3.5 million books at once on Google’s cloud. Here’s a fun concept-driven thesaurus automatically derived from all the comments on Reddit (cough, cough, choke). That was created using spacy.io, a nifty, performant Python natural-language processing framework that is starting to fill in for the venerable NLTK.
Of course you might want to use something other than Reddit for your big data explorations. You may start with this Quora question, “Where can I find large datasets open to the public?” Or this site for Data Portals. Or this Data Hub site, or the “data sets” Subreddit, or the Stanford Large Network Dataset Collection (two million beer reviews!). Or the MusicBrainz catalog of albums, artists, and songs, or every object in the Cooper Hewitt, Smithsonian Design Museum on GitHub. The CIA World Factbook or Project Gutenberg. One of my favorites is Common Crawl, which is an open archive of the web (billions of pages!), and of course there are old standbys like the U.S. Census.
Sigh. There’s so much data. It looks amazing when you go shopping for it. So much, so free! In so many formats. You can download all of Wikipedia, for example. But have you? It’s a mess! It’s perfectly optimized for Wikipedia, but actually parsing and using that data is pretty hard. The Wikidata people are working on that, but they have their own data formats that you must use.
I didn’t even get to government data, and there’s so much of it!
There are a huge number of conversations about the right way to format data (not documents) for re-use. The question I have is: What is the point of all that data? A large data set is a product like any other. It must be maintained and updated, given attention. What are we to make of it?
In my opinion there are two good ways to format data: (1) Text files encoded as UTF-8, with one record per infinite-length line; and — no one does this! — (2) as SQLite files.
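To make those two formats concrete, here is a minimal sketch in Python that writes the same records both ways. The sample records, the `books` table, and the choice of JSON for each line are my own assumptions for illustration; any one-record-per-line encoding would do.

```python
import json
import sqlite3

# Hypothetical sample records; a real data set would replace these.
records = [
    {"id": 1, "title": "Moby-Dick", "author": "Herman Melville"},
    {"id": 2, "title": "Les Misérables", "author": "Victor Hugo"},
]


def write_text(path, rows):
    """Write one JSON-encoded record per line, as UTF-8."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for row in rows:
            # json.dumps escapes embedded newlines, so a record never spans lines.
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def write_sqlite(path, rows):
    """Write the same records into a single-table SQLite file."""
    conn = sqlite3.connect(path)
    with conn:  # commits on success
        conn.execute(
            "CREATE TABLE IF NOT EXISTS books "
            "(id INTEGER PRIMARY KEY, title TEXT, author TEXT)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO books VALUES (:id, :title, :author)", rows
        )
    conn.close()
```

Calling `write_text("books.txt", records)` and `write_sqlite("books.sqlite", records)` would produce the two files a distributor might publish side by side.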
Hopefully I don’t have to sell you on UTF-8-formatted text. But I think data people under-value SQLite as a distribution format. Sure, it’s a big binary blob, but even ASCII data is binary when you look at it under a microscope. (I can hear your nerd head exploding, but well-documented completely open binary formats with clear text output strategies are not de-facto evil.)
SQLite is incredibly well-documented. It’s also instantly usable as a database from the command line with no pre-processing at all, even for very large files, and there are immediately usable SQLite APIs for every programming language. Plus it’s incredibly easy to turn SQLite data into plain text, it has freely available extensions for geo, full-text, and hierarchical data, and it’s tiny and public-domain.
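As a rough illustration of how little it takes to get plain text back out, here is a sketch using Python's built-in `sqlite3` module. The `places` table and its rows are made up for the example; `query_to_tsv` is a hypothetical helper, not part of any library.

```python
import csv
import io
import sqlite3


def query_to_tsv(conn, sql):
    """Run a query and return the result as tab-separated text with a header row."""
    cur = conn.execute(sql)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([col[0] for col in cur.description])  # column names
    writer.writerows(cur)
    return out.getvalue()


# Made-up sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO places VALUES (?, ?)",
    [("Springfield", 30000), ("Shelbyville", 25000)],
)

# One table (or any query) as tab-separated text.
tsv = query_to_tsv(conn, "SELECT name, population FROM places ORDER BY name")

# The whole database as SQL text, via the standard library's iterdump().
sql_dump = "\n".join(conn.iterdump())
```

The first form suits grep and friends; the second is a complete textual copy that can rebuild the database elsewhere.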
In my dream universe, there would be a massive searchable torrent site filled with open, explorable data sets, in SQLite format, some with full text search indexes already in place — because what’s a few extra gigabytes to download in 2016 if I can immediately start to explore large corpora and build fun interfaces and experiments on top of them?
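Here is a small sketch of what a full-text index shipped inside the file could look like. It assumes the SQLite build behind Python's `sqlite3` module includes the FTS5 extension (most current builds do, but it is not guaranteed), and the `docs` table and its contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# An FTS5 virtual table is the full-text index; this fails if FTS5 is not compiled in.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany(
    "INSERT INTO docs (title, body) VALUES (?, ?)",
    [
        ("Moby-Dick", "A sea captain hunts a white whale."),
        ("Walden", "A man lives alone in a cabin by a pond."),
    ],
)


def search(conn, query):
    """Return titles of documents matching an FTS5 query, best match first."""
    rows = conn.execute(
        "SELECT title FROM docs WHERE docs MATCH ? ORDER BY rank", (query,)
    )
    return [row[0] for row in rows]
```

With the index already in the downloaded file, something like `search(conn, "whale")` works the moment the download finishes, with no indexing step on the reader's side.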
Imagine this: I could download one government gazetteer, and one large set of Wikipedia data, and merge them into a new SQLite data set of indeterminate utility — and then share that as a torrent for other people to use. Or I could do what I usually do, which is download some CSV and grep for a while until I can back it into PostgreSQL or what have you and my fingers are bleeding.
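The merge I'm imagining could be sketched like this, using SQLite's `ATTACH DATABASE` from Python. Everything specific here is an assumption for illustration: the file paths, a `places` table in the gazetteer, an `articles` table in the Wikipedia extract, and a join on matching names.

```python
import sqlite3


def merge(gazetteer_path, wiki_path, out_path):
    """Join two SQLite files into a new one (table and column names are hypothetical)."""
    conn = sqlite3.connect(out_path)
    # Attach both source files so a single query can see all three databases.
    conn.execute("ATTACH DATABASE ? AS gaz", (gazetteer_path,))
    conn.execute("ATTACH DATABASE ? AS wiki", (wiki_path,))
    with conn:  # commits on success
        conn.execute(
            """
            CREATE TABLE place_articles AS
            SELECT g.name, g.lat, g.lon, w.summary
            FROM gaz.places AS g
            JOIN wiki.articles AS w ON w.title = g.name
            """
        )
    conn.close()
```

The result is one more self-contained SQLite file, which is exactly the thing that could go back out as a torrent.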
Anyway, just throwing that out there. If I ever have a large data set to distribute, look for me to distribute it in text and SQLite. If you ever distribute a large data set and release it in SQLite, let me know. I’ll thank you for it. If you think I’m wrong, please leave a comment.
You can watch pedestrians move in real time around Union Square in New York City. “Google Ideas Becomes Jigsaw” is a Medium post by Eric Schmidt. I haven’t read the article (who could?), but the headline is like something Gelett Burgess would write.
I’ve decided to commit Postlight to publishing a newsletter, so it’s time to commit this newsletter to Postlight. POSTLIGHT is a growing web agency in New York City. We build great things for the web and mobile. We help giant media companies fix their publishing problems. We help music companies stream millions of songs. We help financial firms create modern, beautifully designed web and mobile apps. We help new publishing ventures create amazing reading experiences. We work with everyone and we love what we do. Thanks.