I checked in at the registration desk and picked up my badge and some swag: a minimalist backpack, a combination pen-laser-pointer-USB-drive (for once, an actual “pen drive”), and a combination pen-LED-flashlight. The USB drive contained the conference (and associated workshop) proceedings; I don’t know how much the devices cost but they’re definitely better value than paper proceedings. I used the pens to take notes during each talk, but I kept fidgeting with them, either using the laser or flashlight or disassembling and reassembling them. (I managed to resist the urge to laser-point at a speaker’s slides.)
On Sunday, before PLDI proper began, there were six tutorials, grouped into two three-track sessions. Each tutorial was roughly divided into three parts separated by breaks.
The first tutorial covered an implementation of Lightweight Modular Staging
(LMS) in Scala and the domain-specific languages (DSLs) LMS can be used to
build. At least as presented in the slides, LMS is basically expression
templates that can be evaluated (compiled) at runtime. In LMS, any value of type
T can be replaced by an expression template of type
Rep[T], and operations
on the expression template continue to yield
Rep[T] values. This makes
converting eagerly-compiled code into staged code easy: just change a few types.
The examples demonstrated partially evaluating a matrix multiplication routine
on a constant matrix and a regular expression interpreter on a constant regular
expression, generating Scala code (not JVM bytecode) as output.
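To make the staged-types idea concrete, here is a minimal Python transliteration of the Rep[T] concept (LMS itself is Scala, and all names below are hypothetical): a Rep value records operations as an AST instead of evaluating them, so the same function body works either eagerly on plain values or staged on Rep values.

```python
# Minimal sketch of the Rep[T] staging idea, transliterated to Python.
# (LMS itself is Scala; this is illustrative only, not the LMS API.)

class Rep:
    """A staged value: records operations as an AST instead of running them."""
    def __init__(self, expr):
        self.expr = expr                      # nested-tuple expression AST

    def __add__(self, other):
        return Rep(("add", self.expr, lift(other).expr))

    def __radd__(self, other):
        return lift(other) + self

    def __mul__(self, other):
        return Rep(("mul", self.expr, lift(other).expr))

    def __rmul__(self, other):
        return lift(other) * self

def lift(v):
    """Wrap a plain value as a constant expression if it isn't staged yet."""
    return v if isinstance(v, Rep) else Rep(("const", v))

def codegen(e):
    """Emit Python source for an expression tree."""
    tag = e[0]
    if tag == "const":
        return repr(e[1])
    if tag == "var":
        return e[1]
    op = {"add": "+", "mul": "*"}[tag]
    return f"({codegen(e[1])} {op} {codegen(e[2])})"

# The same function works on plain ints (evaluated eagerly) or on Rep
# values (staged): only the argument type changes.
def poly(x):
    return x * x + 3 * x + 1

staged = poly(Rep(("var", "x")))
src = codegen(staged.expr)                    # "(((x * x) + (3 * x)) + 1)"
compiled = eval(f"lambda x: {src}")           # stand-in for runtime compilation
```

Calling `poly(5)` evaluates eagerly, while `poly(Rep(...))` builds an AST that can be compiled later, which mirrors the "just change a few types" conversion the tutorial demonstrated.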
To go from partial specialization over Scala code to a DSL, LMS provides two
options: “smart constructors”, which are
Rep[T] => Rep[U] functions that
pattern-match over the expression template AST to generate code, and the graph
IR, which wasn’t explained until later in the presentation. By their AST-based
nature, smart constructors are limited in the scope of their transformations,
but control-flow dependencies are encoded as a list of pseudo-values, so
statements can be reordered as far as the DSL semantics permit.
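The smart-constructor idea can be sketched in a few lines. This is a hedged Python stand-in (LMS smart constructors are Rep[T] => Rep[U] Scala functions; the AST encoding here is hypothetical): instead of blindly building an AST node, the constructor pattern-matches on its operands and rewrites locally.

```python
# Hedged sketch of a "smart constructor": a constructor that inspects
# its operand ASTs and applies local algebraic rewrites on the spot.
# (Not the LMS API; a Python illustration of the pattern.)

def const(v):  return ("const", v)
def var(name): return ("var", name)

def smart_mul(a, b):
    if a == ("const", 0) or b == ("const", 0):
        return ("const", 0)                   # x * 0 == 0
    if a == ("const", 1):
        return b                              # 1 * x == x
    if b == ("const", 1):
        return a                              # x * 1 == x
    if a[0] == "const" and b[0] == "const":
        return ("const", a[1] * b[1])         # constant folding
    return ("mul", a, b)                      # no rewrite applies
```

The "limited scope" criticism is visible here: each rewrite only ever sees the immediate operands of one node, so transformations that need a whole-program view have to go through the graph IR instead.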
At this point in the presentation I was only mildly impressed. LMS is a very clean implementation of expression templates (compare Boost.Proto or Boost.Phoenix), but it isn’t breaking any new ground. (I asked the speaker to compare LMS to Boost.Proto and received a vague answer that LMS allows more global transformations through its graph IR than Proto can.) LMS is a good engineering effort, but not very interesting as a research project.
The next portion of the tutorial described Delite, a framework for writing DSLs with a special focus on parallelism. Delite is structured like most compilers, with front ends for different languages (DSLs), a middle end performing common optimizations, and back ends for different target architectures. Delite language implementations apply some domain-specific optimizations, and programs are then mapped onto Delite IR operations such as map, reduce, filter, group-by, sort, and join. The set of primitive operations is extensible, though Delite’s authors suggest the built-in operations are sufficient for most languages. Delite applies generic optimizations over its IR, then compiles to Scala, C++, OpenCL, or CUDA. Each code generator can perform target-specific optimizations. All optimizations are driven by heuristics, and the code generators sometimes undo the generic optimizations if they are pessimizations for that particular target. Finally, Delite performs data partitioning via stencil analysis, as well as lifetime analysis, the latter being important for the native-code back ends.
Five examples were presented: OptiML, a language for statistical machine learning, with k-means as the specific example; OptiQL, a LINQ-like language; OptiGraph, for graph computation, based on Green-Marl; OptiMesh, for 3D meshes; and OptiWrangler, for data cleaning. While these languages used different Delite IR operations, they all benefited from the generic code motion and loop fusion optimizations.
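To illustrate the kind of loop fusion a parallel-ops IR enables, here is a hedged Python sketch (the op names and IR shape are hypothetical, not Delite's actual representation): two adjacent map operations compose into one, so the fused pipeline makes a single pass over the data instead of materializing an intermediate collection.

```python
# Hedged sketch of map-map loop fusion over a tiny parallel-ops IR.
# (Delite's real IR and fusion rules are richer; this shows the idea.)

def fuse(pipeline):
    """Collapse adjacent ("map", f) ops into a single map."""
    fused = []
    for op, fn in pipeline:
        if fused and fused[-1][0] == "map" and op == "map":
            prev = fused.pop()[1]
            # Compose: apply the earlier map, then this one, in one pass.
            fused.append(("map", lambda x, f=fn, g=prev: f(g(x))))
        else:
            fused.append((op, fn))
    return fused

def run(pipeline, data):
    """Naive interpreter: one traversal per op."""
    for op, fn in pipeline:
        if op == "map":
            data = [fn(x) for x in data]
        elif op == "filter":
            data = [x for x in data if fn(x)]
    return data

pipeline = [("map", lambda x: x + 1),
            ("map", lambda x: x * 2),
            ("filter", lambda x: x > 4)]
fused = fuse(pipeline)   # one map and one filter; same results, fewer passes
```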
Overall, I was impressed by this part of the tutorial; Delite seems like a pretty solid compiler system. I’m most concerned about the use of heuristics during optimization; the optimal set of optimizations varies widely between processors even within the same architectural family, so heuristic-driven approaches require a lot of tuning effort to get good performance everywhere. I asked about factoring out the transformations from the middle and back ends into one large list, then using an autotuner to learn what transforms to apply for each program and architecture, and the presenter agreed it was an interesting idea for future work.
I’m also interested in compiling (subsets of) Delite’s ops to other representations, such as StreamIt or Halide (which I’ll discuss later in this trip report series), to exploit “structure knowledge” (analogous to domain knowledge but for computation structure), and vice versa, to take advantage of Delite’s many back ends.
The final part of the tutorial presented the Spiral code generator, built on LMS and used, for example, to generate Fourier transform (and similar signal-processing) implementations for Intel’s Integrated Performance Primitives. Spiral uses two levels of IR: “declarative, mathematical, point-free” SPL (signal processing language), which represents the mathematical structure of the algorithm, and a C-AST-like IR named CIR, which corresponds much more closely to the C code to be emitted. SPL optimizations focus on divide-and-conquer subdivision rules, while CIR optimizations (at least as presented) are about constant propagation and arithmetic simplifications. These optimizations are search-directed to find the fastest configuration. Spiral generates FFTs that match the performance of the well-known FFTW library.
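The most familiar of the divide-and-conquer subdivision rules Spiral searches over is the textbook radix-2 Cooley-Tukey split, which factors a DFT of size n into two DFTs of size n/2. A minimal Python rendition (not Spiral's actual output; Spiral generates and tunes many such factorizations):

```python
# Radix-2 Cooley-Tukey FFT: the classic divide-and-conquer subdivision
# rule for the DFT. (Spiral searches over many such factorizations at
# the SPL level; this is just the most familiar one.)
import cmath

def fft(x):
    n = len(x)                    # n must be a power of two
    if n == 1:
        return x[:]
    even = fft(x[0::2])           # DFT of the even-indexed half
    odd  = fft(x[1::2])           # DFT of the odd-indexed half
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]   # twiddle factor
        out[k]          = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

The search Spiral performs is over choices like this one: at each level it can pick a different factorization of n, and it times the candidates to find the fastest configuration for the target machine.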
Unfortunately, this presentation didn’t really sell me on why this code generator is different. Maybe LMS is a particularly clean way to do mathematical code generation, but neither the divide-and-conquer optimizations nor the arithmetic simplifications were remarkable.
WALA is a framework for static analysis on Java bytecode (and other languages via translation to Java bytecode or custom synthetic bytecodes). WALA focuses on interprocedural analysis, so its SSA-based IR is structured around a context-sensitive call graph constructed via pointer analysis. Classes can be excluded from the call graph to improve scalability. Besides general utilities for graph algorithms and iterative dataflow, WALA includes tabulation-based interprocedural dataflow analysis, pointer and call graph analysis with user-specified contexts, and a context-sensitive slicer. WALA IR is immutable, but WALA includes a bytecode instrumentation library named Shrike. Shrike passes produce patches, which Shrike then applies to the original bytecode all at once, so that newly added instrumentation is never itself instrumented. The Dila library applies Shrike instrumentation passes at load time.
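The patch-then-apply-once model can be sketched abstractly. This is a hedged Python stand-in, not Shrike's API (Shrike works on JVM bytecode and its interfaces differ): passes record patches against positions in the original instruction list, and applying them all in one batch means newly inserted instructions are never themselves visited and re-instrumented.

```python
# Hedged sketch of batch patch application, Shrike-style.
# Passes only ever see the ORIGINAL instruction list, so added
# instrumentation is never itself instrumented.

def instrument(instructions, passes):
    patches = []                          # (original_index, new_instruction)
    for p in passes:
        for i, ins in enumerate(instructions):     # always the original list
            patches.extend((i, added) for added in p(i, ins))
    # Apply all patches at once, back to front, so indices stay valid.
    out = list(instructions)
    for i, added in sorted(patches, key=lambda t: t[0], reverse=True):
        out.insert(i, added)
    return out

# A pass (hypothetical) that inserts a "log" before every "call":
log_calls = lambda i, ins: ["log"] if ins == "call" else []
```

If the pass instead rescanned its own output, the inserted "log" instructions would be candidates for further patching; batching against the original indices sidesteps that entirely.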