Jeffrey Bosboom's Blog

[blog] [projects] [about]

PLDI 2013 trip report, part 2

Keynote

Look Up: Your Future is in the Cloud (Jim Larus, Microsoft Research)

slides, my notes page 1, page 2

The first day’s keynote, delivered by Jim Larus of Microsoft Research, was on cloud computing. The first two-thirds of the talk was devoted to explaining what cloud computing is, why people are using it, and what challenges it poses. I’d expect anyone attending PLDI to already know about the cloud (it’s hard to not hear about it these days), but if by some chance you hadn’t, this was a good introduction. To summarize:

If any of that is new to you, I’d recommend reading James’ slides.

The remainder of the talk focused on the Orleans framework for cloud computing. Orleans is an actor-based framework with asynchronous messaging, lightweight transactions (ACI but not D) and automatic load balancing. Explicit asynchrony and lack of shared memory allows automatic replication, though the user must explicitly merge state when un-replicating. The transaction model provides isolation, atomicity and consistency by managing which requests can go to which actor replicas, providing a simple programming model and avoiding the challenges of full distributed transactions at the cost of non-serializability. Actors are location-independent, allowing automatic load balancing by creating, merging or relocating replicas. Overall, Orleans was designed to be “the Visual Basic of cloud computing”, though so far it’s been used mostly by experts.

I would have liked to hear more about what makes Orleans different from other actor-based systems (I’ve been looking at Akka lately), but with so much of the talk dedicated to explaining cloud computing, there wasn’t enough time for details.

Low-Level Issues session

Fast Condensation of the Program Dependence Graph

ACM DL, my notes

The program dependence graph (PDG) is an intermediate representation that explicitly models both control and data dependencies between program statements. Many loop- and parallelism-related optimizations depend on knowledge of dependence cycles, which appear as strongly-connected components in the PDG. This talk described an algorithm for computing a DAG whose nodes are the strongly-connected components in the PDG without first building the full PDG itself. While the new algorithm has the same worst-case performance, it is empirically faster in common cases.

It’s an algorithm. Not much to say about this one.

Scalable Variable and Data Type Detection in a Binary Rewriter

ACM DL, gratis PDF, my notes page 1, page 2

This talk presented SecondWrite, a system for decompiling binaries to an enhanced LLVM IR, with particular focus on inferring variable locations, data type definitions and function prototypes. Such a system might be useful for reverse engineering, analysis of mixed-language programs where source-level analyses cannot be combined, binary optimization, or maintenance or porting of legacy binaries when the source has been lost. SecondWrite is conservative enough to produce functional IR in the face of external calls or data type analysis failures, but is precise enough that binaries round-tripped through the system suffer only ~10% overhead. Some poorly-optimized binaries actually exhibit speedups after LLVM optimization.

I was surprised by the accuracy of the analysis, particularly in the presence of external code. While it does rely on some assembly-level argument movement at call sites, if equipped with a suitable model of APIs, it might make binary translation an alternative to virtualization for supporting legacy code compiled against old APIs (e.g., translate suitably well-behaved Win16 programs to Win32). The ability to apply optimizations might allow programs to transparently use larger SIMD registers than were available when they were compiled.

Of course it would be easier if every program came with source, but…

Fast RMWs for TSO: Semantics and Implementation

ACM DL, my notes

Current implementations of read-modify-write atomic instructions (RMWs) on total store order (TSO) machines like the x86 impose a full memory barrier (store buffer drain), because they are defined to be atomic with respect to writes to any address. This talk defined two weaker RMW models: one where RMWs are atomic only with respect to reads and writes of the same address as the RMW, and another where RMWs are atomic only with respect to only writes of the same address. These weaker models permit more efficient microarchitectural implementations while still being strong enough to implement the sequential consistency guarantees in the C++11 memory model.

The implementation presented requires additional coherence traffic to maintain a Bloom filter of addresses with pending RMW operations, and coherence protocols are both easy to get wrong and performance-critical, so I wouldn’t expect to see this implemented in the near term. There’s also a backwards-compatibility problem: existing binaries rely on the current strong semantics, so new instructions will need to be defined. Still, if they eventually happen, these kinds of micro-scale improvements add up.

High Performance Computing session

Terra: A Multi-Stage Language for High-Performance Computing

ACM DL, gratis PDF, Terralang.org, my notes

This talk described Terra, a statically-typed, low-level, C-compatible language embedded in Lua for multi-stage programming. Terra code generation shares the lexical environment of the surrounding Lua code execution, but the generated Terra code is entirely separate and can be run outside the Lua runtime. The system supports the usual quotation and escape operators, which can be used to splice a Terra expression or the evaluation of a Lua expression into another Terra expression, respectively. The paper provides a formal semantics for a subset of the Terra/Lua system called Terra Core. A matrix multiplication kernel generator implemented in 200 lines of Terra achieves performance comparable to the state-of-the-art ATLAS system. Orion, a DSL for 2D stencil computation on images written in Terra performs 2.3 times faster than hand-tuned C code. (Compare Halide, also presented at PLDI 2013.)

Although the authors claim this is the first use of multi-stage programming for portable high-performance code, this talk and paper left me feeling the authors had merely built a good implementation of something already known (similar to the Scala LMS tutorial from the day before). Of course, there’s something to be said for building a system so intuitive it seems obvious.

SMAT: An Input Adaptive Auto-Tuner for Sparse Matrix-Vector Multiplication

ACM DL, gratis PDF, my notes

Existing sparse matrix libraries have two interfaces: a fully general high-level interface handling matrices in arbitrary format, and a low-level interface to specific kernels that can operate only on certain matrices. Programmers are left with a choice between coding to the high-level interface for productivity or the low-level interface for best performance. SMAT is an autotuning system that uses a predictor (developed through machine learning) to find the best format and algorithm to produce good performance from a general interface.

Unfortunately, due to a language barrier, I was unable to determine if parameters other than the format and algorithm class were being tuned, even after asking the speaker afterward.

I’m skeptical of using a predictor trained on a fixed data set rather than empirical autotuning or a predictor trained on a particular application’s data. Obviously the fixed predictor doesn’t require a time investment for each application, but scientific computing codes usually run long enough to amortize one-time training costs.

When Polyhedral Transformations Meet SIMD Code Generation

ACM DL, gratis PDF, my notes

The paper’s title is perfectly descriptive: this paper is about combining polyhedral transformations and SIMD code generation. I admit I don’t understand polyhedral models, so I don’t have anything to say about this paper.

Fun Ideas and Thoughts (FIT) session

Hot Streak and Cold Streak Programming Tools

gratis PDF, slides, my notes

In sports, a player who accrues consecutive successful performances (e.g., a basketball player who makes ten shots in a row) is often said to be on a “hot streak”, while a player repeatedly failing might be on a “cold streak”. While the evidence suggests hot and cold streaks are merely human perception of chance effects, there may be reinforcing psychological effects of being on a hot or cold streak, and at the very least, sustaining a hot streak is more enjoyable than not. This talk applied these concepts to programming, looking for ways to sustain hot streaks and end cold streaks.

First, the talk discussed possible metrics for determining whether a programmer is on a hot or cold streak, both by looking at code (e.g., passing tests) and the programmer (amount of sleep, hours consecutively worked, even cups of coffee). Many tools already in wide use have the effect of sustaining hot streaks; for example, code completion avoids having to look up symbols in documentation. The talk did not include suggestions for new tools. Finally, two strategies for working out of cold streaks were suggested, both aimed at giving the programmer a better understanding of the code being worked on: writing documentation and refactoring existing code. The talk concluded with the question of what new tools or techniques could be developed to maintain or end streaks.

In the question and comment period after the talk, I said that the most important thing for maintaining flow is having an uninterrupted block of time to program with, and that’s a human problem (convincing others to leave you alone), not a technological one. The speaker suggested you could use metrics to develop an outside-the-door status display; when you’re doing well, the display would instruct others to keep out, but if you’re doing poorly, the display would ask others to help you out. I still think the best solution is to explain to others not to interrupt you, because no sign will stop someone who doesn’t understand.

Synthesis for Collaborative Editing

my notes

The speaker started this talk with the observation that students working together on a programming project usually end up sitting around one computer, with one person programming while the others are arguing about low-level details. He proposed to develop a synthesis engine that could recognize similarities between programs written separately by each student, stitching them together into a single program.

While I didn’t argue it during the session, I think this is a doomed technical solution to a people problem. I’m working with another student on my research, and we work together to agree on the overall project design and the interfaces between the components we are developing, then program separately, and this division works great. We have pair-programmed a couple of times when we’ve had to modify an interface after the fact; I usually “drive” while he explains to me how his code works, and long discussion is only about the new interface (particularly the “oops, that won’t work” moments). I think time would be much better spent teaching students about techniques for modularity rather than trying to develop a synthesis engine. (The current strategy for teaching modularity in a team context is to assign a team project and expect the students to figure it out themselves, which is like throwing a group of non-swimmers in a pool and expecting them to swim twenty laps.)

That’s not to say tools can’t enhance collaborative editing. The Collabode system merges Eclipse and EtherPad to provide asynchronous collaborative editing, where changes made by one author are buffered until the program is in a working state. Besides authoring code for team projects, Collabode is also great for discussing code with a group or class or for in-class examples, where the instructor or discussion leader can pull up one participant’s code on a projector. I haven’t used Collabode, so I don’t know how effective it is in practice, but it’s the kind of tool worth spending more effort on.

Soundiness

my notes

This talk began with the assertion that basically all whole program analyses are unsound, even those that claim to be sound, because such claims require the absence of troublesome but practically-relevant language features such as reflection, eval, and pointer arithmetic. While the analyses could be modified to be truly sound, doing so would introduce so much imprecision as to make them useless. The speaker argued that true soundness was not necessary, as IDEs already tolerate errors (due to incomplete or erroneous code), and runtime systems can use the analysis results as hints, guarded by dynamic checks (speculative optimizations with guards are common in virtual machines). The speaker termed this soundiness (by analogy with Colbert’s truthiness). Besides acknowledging unsoundness, the speaker called for papers to explain the implications of unsoundness, such as the possible runtime checks that must be inserted to achieve full soundness, and strategies for dynamic feedback to unsound analysis (similar to deoptimization in virtual machines).

I enjoyed this talk. While soundiness is catchy, more important is the suggestion that papers discuss how to manage unsoundness in practical programs, since most non-benchmark applications will use at least one unsound feature. A clear description of an efficient implementation for dynamic checks would make most analysis papers considerably more useful.