Papers: history of databases, logging at Twitter

I’m sitting in on 6.885 From ASCII to Answers: Advanced Topics in Data Processing this term. While I’m not taking the course for a grade, I’m keeping up with the readings, so I may as well share my impressions.

What Goes Around Comes Around (Michael Stonebraker, Joey Hellerstein)

gratis PDF

This paper describes the various “eras” of database system design from both academic and commercial viewpoints. The authors specifically contend that XML databases are doomed to failure because they are far more complex than relational databases, and further that their failure can be readily predicted from history. The paper also spends some time discussing the tradeoffs (and market size) of the schema-first, schema-later, and schema-evolution styles of database design, arguing that few domains actually need schema-later and that schemata are important for reliability in business use.
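
The schema-first/schema-later distinction is easier to see in a concrete sketch. The example below is mine, not the paper’s; it uses SQLite and JSON purely for illustration, and the table and field names are made up.

```python
import sqlite3
import json

# Schema first: the structure is declared before any data arrives, and the
# database rejects rows that don't conform.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
db.execute("INSERT INTO users VALUES (?, ?)", (1, "Alice"))
try:
    db.execute("INSERT INTO users VALUES (?, ?)", (2, None))  # violates NOT NULL
except sqlite3.IntegrityError as e:
    print("rejected at write time:", e)

# Schema later: records are self-describing, so nothing is rejected up front;
# inconsistencies surface only when someone tries to query across records.
records = [
    json.loads('{"id": 1, "name": "Alice"}'),
    json.loads('{"id": 2, "user_name": "Bob"}'),  # field silently renamed
    json.loads('{"id": "3"}'),                    # missing name, type drift
]
names = [r.get("name", "<missing>") for r in records]
print(names)  # the reader, not the writer, pays for the missing schema
```

The point of the sketch is who pays the cost: schema-first pushes errors to write time, while schema-later defers them to whoever reads the data, which is exactly the tradeoff the paper argues matters for business use.
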

This paper is somewhat informal and summarizes the “lessons learned” at the end of each era, so it pretty much summarizes itself.

The Unified Logging Infrastructure for Data Analytics at Twitter (George Lee, Jimmy Lin, Chuang Liu, Andrew Lorek, Dmitriy Ryaboy)

ACM DL, gratis PDF

This paper documents Twitter’s logging infrastructure with a focus on how it enables data analytics. Twitter shifted from ad hoc, application-specific logging to a structured log format with common fields and application-defined key-value pairs in order to simplify data analysis, both computationally and ergonomically. Twitter’s early logs were a mix of well-formed Thrift messages, plain-text logs from format strings, and deeply-nested JSON, which made analysis across multiple applications difficult. Furthermore, their aggregation system made data discovery difficult for analysts. Moving to a structured log format enabled many kinds of common queries, including reconstructing user sessions (for which the system builds special data structures for efficiency), while the key-value pairs preserved extensibility. Both data analysts and other employees can explore the data via a “data catalog” automatically generated from the structured format, making the data useful to more users. Finally, the paper describes some applications enabled by the session sequence data structure, including funnel analytics and natural-language-processing-inspired user modeling.
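
To make the “common fields plus key-value pairs” idea concrete, here is a small sketch of what such events and a naive session/funnel query might look like. The field names, the hierarchical event names, and the session-gap heuristic are my own illustration, not the paper’s Thrift schema or Twitter’s actual pipeline (which runs over Hadoop at a very different scale).

```python
from collections import defaultdict

# Illustrative structured log events: a few common fields shared by every
# application plus a free-form key-value map for app-specific details.
events = [
    {"user": "u1", "ts": 100, "event": "web:home:timeline::impression", "kv": {}},
    {"user": "u1", "ts": 130, "event": "web:search:results::impression", "kv": {"q": "cats"}},
    {"user": "u1", "ts": 160, "event": "web:tweet:compose::send",        "kv": {}},
    {"user": "u2", "ts": 105, "event": "web:home:timeline::impression", "kv": {}},
    {"user": "u2", "ts": 900, "event": "web:search:results::impression", "kv": {"q": "dogs"}},
]

SESSION_GAP = 600  # seconds of inactivity that ends a session (arbitrary choice)

def sessions(events):
    """Group each user's events into sessions separated by long idle gaps."""
    by_user = defaultdict(list)
    for e in sorted(events, key=lambda e: (e["user"], e["ts"])):
        by_user[e["user"]].append(e)
    for user, evs in by_user.items():
        session = [evs[0]]
        for prev, cur in zip(evs, evs[1:]):
            if cur["ts"] - prev["ts"] > SESSION_GAP:
                yield user, session
                session = []
            session.append(cur)
        yield user, session

# A toy "funnel": of the sessions that saw search results, how many ended in
# a tweet send?  Because every event shares the same common fields, this query
# doesn't care which application emitted each event.
searched = sent = 0
for user, sess in sessions(events):
    names = [e["event"] for e in sess]
    if any(n.startswith("web:search") for n in names):
        searched += 1
        if any(n.endswith(":send") for n in names):
            sent += 1
print(f"{sent}/{searched} search sessions led to a tweet send")
```

The common fields are what make the session grouping and the funnel query generic; only code that cares about a particular application ever needs to look inside the key-value map.
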

Moral of the story: schemata, even partial ones, are very helpful in making sense of the data.

Even if the paper can be summed up in a sentence, as an academic I find these sorts of “war stories” from industry very interesting; it’s a shame they aren’t published more often (because they don’t meet some program committee’s interpretation of “research”, among other reasons).