http://arstechnica.com/science/news/2009/02/aaas-60tb-of-behavioral-data-the-everquest-2-server-logs.ars
This story is going around all kinds of sites; I came across the article by way of Slashdot, Damion Schubert came across it on Massively, and so on.
How do you process 60TB of data? I'd argue that the right approach is to spread the data out across many machines, and split the processing into a number of tasks to map it to the questions you're trying to answer. Then you'd probably need to reduce the results from the cluster into a single set of data, storing intermediate results in a giant key/value store. Which is to say, use Hadoop.
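To make the map/shuffle/reduce flow concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop's API and not anything Sony is known to use. The log format (`timestamp player action`), the sample lines, and all function names are assumptions made up for illustration. On a real cluster each chunk would live on a different machine and the shuffle would go over the network.

```python
from collections import defaultdict
from itertools import chain

def map_chunk(lines):
    """Map step: emit (action, 1) for every well-formed log line in one chunk."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 3:          # hypothetical format: timestamp player action
            yield fields[2], 1

def shuffle(pairs):
    """Group all emitted values by key (the intermediate key/value store)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_groups(groups):
    """Reduce step: collapse each key's values into a single total."""
    return {key: sum(values) for key, values in groups.items()}

def run_job(chunks):
    """Run map over every chunk, then shuffle and reduce the combined output."""
    mapped = chain.from_iterable(map_chunk(chunk) for chunk in chunks)
    return reduce_groups(shuffle(mapped))

# Two chunks standing in for log files spread across two machines.
chunks = [
    ["1001 alice kill", "1002 bob trade"],
    ["1003 alice trade", "1004 carol kill", "1005 bob kill"],
]
counts = run_job(chunks)
```

The point of the split is that `map_chunk` never needs to see more than its own chunk, which is what lets the work spread across machines.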
I wonder what Sony is using? I wonder what the guys in academia will be using?


Or, instead of designing the data model to be easily human parseable or useable for real-time gameplay, rewrite it to be easily mineable.
What kind of massaging are you talking about that can't be farmed out to multiple machines as a first pass? Remember, MapReduce is not an approach you take once on a dataset and then you're done. Google, which probably has some of the best people coming up with MapReduce algorithms, estimates something like 10-15 MapReduce iterations to produce their search index.
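As a rough illustration of chaining passes, the sketch below feeds the output of one MapReduce pass into a second one. It is a toy in-process version, not Google's pipeline. The `map_reduce` helper, the log format, and the sample data are all hypothetical.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """One pass: map every record, group values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [(key, reducer(key, values)) for key, values in groups.items()]

def sum_values(key, values):
    """Reducer shared by both passes: add up the values for a key."""
    return sum(values)

def events_by_player(line):
    """Pass 1 mapper: one count per logged event, keyed by player."""
    timestamp, player, action = line.split()   # assumed log format
    yield player, 1

def players_by_activity(pair):
    """Pass 2 mapper: one count per player, keyed by that player's event total."""
    player, event_count = pair
    yield event_count, 1

log = [
    "1001 alice kill",
    "1002 bob trade",
    "1003 alice trade",
    "1004 carol kill",
    "1005 bob kill",
]

# Pass 1 turns raw log lines into (player, event_count) pairs.
per_player = map_reduce(log, events_by_player, sum_values)
# Pass 2 consumes pass 1's output to build an activity histogram.
histogram = dict(map_reduce(per_player, players_by_activity, sum_values))
```

Here the "massaging" is itself just another pass: the first job reshapes the raw logs, and the second job asks the actual question of the reshaped data.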
And if you are making it easily mineable... what does that mean? Does that make it economical and fast to process on a single machine (with a single machine's I/O bandwidth)? Does it significantly reduce the cluster size?
I don't really disagree with anything you just said; those just seem to be small steps that are ultimately part of a good MapReduce/BigTable cluster for data analytics.
Sony didn't discard it as soon as CS and legal didn't want it, despite the technical challenges of keeping 60TB of data. They kept it. They realized that in the future they may need to answer questions they didn't originally think about. Too many people in this industry seem to think that the data mining problem is "easy" because they can just write real-time filtering tools, or tune the data collection to just a few critical parameters. When you're troubleshooting an issue or exploring a new area, you don't (usually by definition) have any idea what data will be helpful. Log everything and keep those logs.
With consumer-grade SATA disks selling for $100/TB and drawing as little as 5 watts of power per disk, the cost of "cold" storage has fallen to as little as $.80/GB/year. (That figure is unburdened; there are significant overhead costs for datacenter space, server equipment, power, and cooling. James Hamilton had a lot to say about this.)
They take the sh out of IT!
I'm open to opposing arguments, though. :-)