Jeffrey Bosboom's Blog


Papers: Wrangler

I was surprised to realize two lectures per week meant two readings per week; I’m already drowning in papers I’d like to read someday.

Wrangler: Interactive Visual Specification of Data Transformation Scripts (Sean Kandel, Andreas Paepcke, Joseph Hellerstein, Jeffrey Heer)

ACM DL, gratis PDF

Before data can be analyzed, it usually needs to be “wrangled” – structured and regularized. Wrangler is a graphical tool for generating data wrangling scripts by example. An inference engine builds transformation suggestions from user interaction with the data; for example, if the user selects fields in two rows, one suggested transformation will extract a column with those fields based on inferred delimiters. Rather than just producing the transformed output, Wrangler provides a program implementing the transformation, allowing it to be applied to other data and documenting the data’s provenance.

Wrangler sounds like an awesome tool, so I decided to try it out on some autotuner performance logs. My data looks like this:

    ----------------------------------------------
    1 - 73673.140359 - 73673.140359
    ----------------------------------------------
    2 - 12961.885859 - 12961.885859
    ----------------------------------------------
    3 - 12961.885859 - 45333.924963
    ----------------------------------------------
    4 - 12961.885859 - 91607.27447
    ----------------------------------------------
    5 - 12961.885859 - 80043.298174
    ----------------------------------------------
    6 - 12961.885859 - 81076.074906
    ----------------------------------------------
    7 - 12482.54653 - 12482.54653
    ----------------------------------------------
    8 - 12212.512286 - 12212.512286
    ----------------------------------------------
    9 - 12212.512286 - 12671.435092
    ----------------------------------------------
    10 - 12212.512286 - 12775.603971
    ----------------------------------------------

…and so on for about 5000 rows of values. My goal is to extract the two floating point values, something easily done with a regular expression.
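For comparison, here is a minimal sketch of the regular-expression approach in Python. The pattern is an assumption based only on the sample rows above (an integer index followed by two plain decimal numbers, separated by " - "); the names `ROW` and `extract_floats` are made up for illustration, and the real logs may need a looser pattern (negative values, exponent notation, and so on).

```python
import re

# One data row: an integer index, then two decimal numbers, separated by " - ".
# Assumption: values look like the sample above (no sign, no exponent).
ROW = re.compile(r"^\s*(\d+)\s+-\s+(\d+(?:\.\d+)?)\s+-\s+(\d+(?:\.\d+)?)\s*$")


def extract_floats(lines):
    """Return a list of (first, second) float pairs, one per data row."""
    pairs = []
    for line in lines:
        match = ROW.match(line)
        if match:  # rows of dashes and blank lines simply fail to match
            pairs.append((float(match.group(2)), float(match.group(3))))
    return pairs


sample = """\
----------------------------------------------
1 - 73673.140359 - 73673.140359
----------------------------------------------
7 - 12482.54653 - 12482.54653
----------------------------------------------
10 - 12212.512286 - 12775.603971
----------------------------------------------
"""

pairs = extract_floats(sample.splitlines())
for first, second in pairs:
    print(first, second)
```

Because the separator rows never match the pattern, no separate "delete the rows of dashes" step is needed; for a real log file, passing an open file object to `extract_floats` should work the same way.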

I began by selecting one of the rows of dashes. One of Wrangler’s suggestions was to delete all the rows of dashes; I chose it and Wrangler applied it. Next I selected the first row’s second field (73673.140359). Wrangler offered to extract it into its own column, but couldn’t match some of the rows because the values don’t all have the same number of digits. After I gave Wrangler another example on one of those rows (row 7 in this example), it was able to extract the column.

However, Wrangler would not extract the second float field via the analogous procedure: it didn’t remember my first example when I tried to give it a second one. At this point I tried selecting an example of each field within the same row, hoping Wrangler would pick up on the “ - ” delimiter. Instead, my attempt to select another field in the same row as my existing selection was interpreted as an attempt to delete the entire table. I gave up after trying a few different selection orders.

Wrangler does allow manually selecting operations from a toolbar at the top of its window, but at that point I may as well just write the regular expression and be done with it.

I think Wrangler’s inference from example manipulations is a great idea, especially for non-programmers, but its current implementation falls short of proving the concept. (To be fair, the paper includes a controlled user study comparing Wrangler with Excel, and Wrangler compared favorably, so it must be intuitive for Excel experts.)