Jeffrey Bosboom's Blog

How many bytes could you save if Java had constructorless classes?

Not enough to be worthwhile.

All Java methods are members of some class, so it’s common to simulate free functions with “utility classes” containing only static methods. These classes are not designed to be instantiable, but the Java compiler will add a public default constructor if they do not declare one, so they typically include a private (and thus inaccessible) constructor to prevent instantiation. Usually the body is empty, but some instead throw an exception to prevent reflective invocation, like java.util.Objects:

private Objects() {
    throw new AssertionError("No java.util.Objects instances for you!");
}

These constructors serve a compile-time purpose only, but take up space in the compiled class files. java.util.Objects is a particularly egregious example, with an exception message string that will be displayed only to developers trying hard to do something they shouldn’t be doing.

While the Java programming language specifies that a default constructor will be generated for any class not declaring other constructors, the Java virtual machine specification does not require a class to contain any constructors at all. Classes lacking instance initialization methods (the JVM term for the methods named <init> that constructors are compiled to) cannot be instantiated by bytecode, and reflective attempts to instantiate such a class fail with NoSuchMethodException, but their static methods can still be called. Omitting constructors has the same effect as private constructors, at no space cost.

To investigate how much space we could save if Java natively supported constructorless classes (perhaps through a @Noninstantiable class annotation), I implemented an ASM-based class file transformer (available as a Gist) that removes unused private constructors. Private constructors can only be accessed by bytecode from within their class (e.g., static factory methods) or enclosing classes, so the transformer doesn't need to perform a global analysis. The transformer will also remove constructors whose only uses are reflective or from native code, so the result is an overestimate of the actual space savings.
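To give a flavor of the approach, here is a minimal sketch of such a transformer using ASM's visitor API. This is not the actual Gist: it assumes the private constructors have already been verified to be unreferenced, and it ignores constant pool concerns entirely.

import org.objectweb.asm.ClassVisitor;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

// Drops private <init> methods. Chain this between a ClassReader and a
// ClassWriter to rewrite one class file; a real transformer must first verify
// that no bytecode in this class or its enclosing/nested classes invokes the
// constructor being removed.
class PrivateConstructorRemover extends ClassVisitor {
    PrivateConstructorRemover(ClassVisitor downstream) {
        super(Opcodes.ASM5, downstream);
    }

    @Override
    public MethodVisitor visitMethod(int access, String name, String desc,
                                     String signature, String[] exceptions) {
        boolean privateInit = name.equals("<init>")
                && (access & Opcodes.ACC_PRIVATE) != 0;
        if (privateInit)
            return null; // returning null omits this method from the output class
        return super.visitMethod(access, name, desc, signature, exceptions);
    }
}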

The transformer removes 27804 bytes in total from the class files in Oracle JDK 8u40’s rt.jar, 170 of them coming out of java.util.Objects. A few bytes of these savings are due to better constant pool ordering; without managing the ordering, much of the gain is lost to unnecessary wide indices. rt.jar is 60.2 MiB in size, so saving 27.2 KiB is hardly worth complicating the compiler. (The annotation may still be worthwhile for documentation purposes.)

Papers: compressed virtual memory

The Case for Compressed Caching in Virtual Memory Systems (Wilson, Kaplan, Smaragdakis; gratis PDF)

Earlier papers in compressed virtual memory (CVM) found mixed performance benefits and penalties depending on the program, concluding CVM was mostly useful for diskless systems that had to swap over a network. This paper argues that different compression schemes and better adaptive CVM tuning, combined with faster processors, make CVM practical even for systems with fast disks.

This paper observes that general-purpose Lempel-Ziv compression algorithms are unable to exploit special properties of program data: word-aligned object fields, pointers that often point nearby or to small distant regions, small integer values, etc. The authors propose a compression scheme based on a dictionary of 16 recently-seen 32-bit words. Each input word is encoded with a 2-bit tag denoting an all-zero word (a special case), a word not in the dictionary (followed by the word), a dictionary entry matching in the top 22 bits (followed by the index and the remaining 10 bits), or a dictionary match (followed by the index). Partial matching exploits small integers (top bits zero) and pointers pointing to small regions (differing only in bottom 10 bits). On the programs used in this study, this scheme achieved comparable compression to LZO at a lower cost. (Note that current CVM implementations often use LZO, though processors are even faster today than they were in 1999.)
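To make the encoding concrete, here is a toy classifier for a single input word in the spirit of the scheme; the tag values, the direct-mapped dictionary, and the hash used to index it are my assumptions for illustration, not the paper's exact design.

// Classifies one 32-bit word: 16-entry dictionary of recently seen words,
// a 2-bit tag per word, and partial matches on the top 22 bits.
final class WKEncodeSketch {
    private final int[] dict = new int[16]; // recently seen words, direct-mapped

    String encode(int word) {
        if (word == 0)
            return "ZERO (2 bits)";
        int index = (word >>> 10) & 0xF; // index the dictionary by high bits
        int entry = dict[index];
        if (entry == word)
            return "EXACT match, index " + index + " (2 + 4 bits)";
        dict[index] = word; // remember this word for later inputs
        if ((entry >>> 10) == (word >>> 10))
            return "PARTIAL match (top 22 bits), index " + index
                    + ", low bits " + (word & 0x3FF) + " (2 + 4 + 10 bits)";
        return "MISS, literal word 0x" + Integer.toHexString(word) + " (2 + 32 bits)";
    }
}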

MemZip: Exploring Unconventional Benefits from Memory Compression (Shafiee, Taassori, Balasubramonian, Davis; gratis PDF)

Most compressed memory systems use compression to increase memory capacity (and avoid page faults). This paper instead targets reduced memory bandwidth, leading to decreased energy consumption. This is mostly a matter of not packing compressed data as tightly as possible (leaving the data at its original start address) and exploiting some DDR3 protocol features; the remaining space in each cache line can be used for free ECC or special power-saving coding.

Review and analysis: LYNE

LYNE is a route and cover game played on king’s graphs (the graphs on an integer lattice where each vertex is connected to every vertex a king’s move away), possibly containing holes. Each vertex is a triangle, square, diamond or hexagon. Triangles, squares and diamonds (“colored” nodes) each have two designated terminal nodes which must be connected by a path through all other nodes of that color and possibly some of the hexagon nodes. Hexagons do not restrict color, but must be included in paths a number of times equal to the number of pips on that hexagon (two, three or four). Each edge may only be part of one path and crossing (diagonal) edges cannot both be in paths.

LYNE includes quite a few puzzles: 26 sets of 25 puzzles each, plus between one and three sets of daily puzzles which increase in difficulty throughout the week in the crossword tradition. Typically two or three sets are available for play at any one time, but unlocking a new set requires fully completing a previous set — so you can still get stuck at 24 of 25 in your three current sets.

LYNE saves your progress after every puzzle, which is good because 25 puzzles turns out to be a tedious number to solve in a single sitting. While some inferences can be made using rules described below, puzzles are not solvable through inference alone. In practice, interface limitations prevent making effective use of inference. The interface requires tracing from one terminal to another, with no option to mark out middle parts of the path early. There’s also no provision for marking edges as unused or tracking which colors are possible on an edge. Due to these interface limitations, I solved most of my puzzles by trial and error, and once that process started taking too long, I moved on to automation.

Single-color LYNE is NP-complete

Even with a single color, LYNE is NP-complete. LYNE is played on a king’s graph, but the graph may contain holes. If the holes form a checkerboard pattern (so that no two remaining vertices are horizontally or vertically adjacent), the remaining graph is a grid graph rotated 45 degrees, and any further holes become holes in that grid graph. Hamiltonian path with designated start and end vertices on grid graphs with holes is NP-complete. LYNE contains this problem as the special case in which all nodes are of a single color and the terminal nodes are the designated start and end vertices (interchangeable in an undirected graph), so LYNE is also NP-complete.

Admittedly, this reduction does not seem to capture the challenge of actual LYNE puzzles intended for play by humans. In my experience with the game, the challenge derives mostly from the hexagons, suggesting a planar 3SAT reduction with two colors representing true and false, one of which must connect through hexagons in each clause gadget, where the whole gadget can be covered only if the true color makes it to the hexagons. My attempts at such a reduction have been stymied by the connectedness of king’s graphs, and even if one succeeds, the resulting graph will probably still be very sparse relative to actual LYNE puzzles.


lynebot

lynebot is a bot that automatically solves LYNE puzzles. lynebot tracks the possible colors on each edge (triangle, square, diamond or none), applying local inference rules to eliminate or force colors. When the local rules cannot make progress, the bot begins a backtracking search, choosing one of the edges with the fewest possible remaining colors and setting it to each possibility in turn. When all edges have been assigned colors, the bot enumerates the possible paths to check if they cover all nodes and connect the terminals.
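The search skeleton looks roughly like the following. This is a sketch of the structure just described, not lynebot's code; the LYNE-specific inference rules and the final path check are hidden behind a hypothetical Checker interface.

import java.util.*;

// Backtracking over edge colors, always branching on an edge with the fewest
// remaining possibilities. Edges are identified by Integer ids here.
final class EdgeColorSearch {
    enum Color { TRIANGLE, SQUARE, DIAMOND, NONE }

    interface Checker {
        /** Local consistency check for a partial assignment; may prune domains. */
        boolean consistent(Map<Integer, EnumSet<Color>> domains);
        /** Full check: do the induced paths cover all nodes and join the terminals? */
        boolean solved(Map<Integer, Color> assignment);
    }

    static Map<Integer, Color> solve(Map<Integer, EnumSet<Color>> domains, Checker checker) {
        if (!checker.consistent(domains))
            return null;
        Integer branch = null;
        for (Map.Entry<Integer, EnumSet<Color>> e : domains.entrySet()) {
            if (e.getValue().isEmpty())
                return null; // contradiction: no color possible on this edge
            if (e.getValue().size() > 1 &&
                    (branch == null || e.getValue().size() < domains.get(branch).size()))
                branch = e.getKey(); // fewest remaining colors
        }
        if (branch == null) {
            // Every edge has exactly one color left; check the global path conditions.
            Map<Integer, Color> assignment = new HashMap<>();
            for (Map.Entry<Integer, EnumSet<Color>> e : domains.entrySet())
                assignment.put(e.getKey(), e.getValue().iterator().next());
            return checker.solved(assignment) ? assignment : null;
        }
        for (Color c : EnumSet.copyOf(domains.get(branch))) {
            Map<Integer, EnumSet<Color>> copy = new HashMap<>();
            for (Map.Entry<Integer, EnumSet<Color>> e : domains.entrySet())
                copy.put(e.getKey(), EnumSet.copyOf(e.getValue()));
            copy.put(branch, EnumSet.of(c));
            Map<Integer, Color> result = solve(copy, checker);
            if (result != null)
                return result;
        }
        return null;
    }
}

A Checker implementation would encode the local rules sketched above (terminal degrees, hexagon pip counts, crossing edges) plus the final path enumeration.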

This is a backtracking search, so it takes exponential time in the worst case, but it solves nearly all puzzles in LYNE’s level sets in under a second, fast enough that the two-second animation of solved puzzles sliding off to the left to be replaced by a new puzzle on the right is the rate-determining step when solving a level set.

All of lynebot’s inference rules are local to a node or edge. In some puzzles, it’s possible to eliminate some edges by reasoning about paths in the graph. For example, considering the subgraph of all nodes of one color and hexagons, if a colored node is an articulation point in the graph and both terminals are in the same biconnected component, none of the edges in the other component can be in the path, as the path could not return to the second terminal. I have not bothered with these additions because lynebot is already working well, but they may provide insight relevant to a more natural hardness proof.

Papers: on the plane to SPLASH 2014

Another conference, another plane ride, another blog post.

Academic urban legends (Ole Bjørn Rekdal; full text)

Is spinach a good source of iron? Turns out it isn’t, despite widespread belief to the contrary (go ahead, tell your parents). This article describes how the academic explanation for this widespread belief evolved through a game of citation telephone. (I won’t spoil any more of the story here — you’ll have to read the article.)

Data Size Optimizations for Java Programs (Ananian, Rinard; PDF, gratis PDF)

From the contributions section:

Implementation: We have fully implemented all of the analyses and techniques presented in the paper. Our experience with this implementation enables us to discuss the pragmatic details necessary for an effective implementation of our techniques.

It’s not “just engineering”.

Computational Sprinting (Raghavan, Luo, Chandawalla, Papaefthymiou, Pipe, Wenisch, Martin; PDF, gratis PDF; Raghavan, Emurian, Shao, Papaefthymiou, Pipe, Wenisch, Martin; PDF, gratis PDF)

Intel’s Turbo Boost can increase voltage and clock speed of one core of a multicore processor when the other cores are idle, effectively reallocating the TDP of those cores. Computational sprinting is the obvious inverse: run normally with one core, but turn on more cores in short, thermally-unsustainable sprints. Often these “cores” are uncore parts or integrated accelerators. These papers explore the design space of computational sprinting in the context of smartphones, which are constrained by all of energy, thermal dissipation, volume and weight. We usually assume chips are packaged to maximize the rate of thermal dissipation, but computational sprinting additionally requires thermal capacitance to store the heat generated by a sprint for later dissipation during nominal operation.

Cherry-Picking: Exploiting Process Variations in Dark-Silicon Homogeneous Chip Multiprocessors (Raghunathan, Turakhia, Garg, Marculescu; PDF, gratis PDF)

When designing a chip that cannot power all its components simultaneously (i.e., it must leave some “dark silicon”), the obvious approach is to include a small number of general-purpose processors plus many different accelerators, aiming to use the parts of the chip most efficient for a particular workload. This paper argues we should instead build homogeneous chips with many general-purpose processors; due to variations in manufacturing, some processors will support higher clock rates or will use less power than the others, and we can “cherry-pick” the best ones. Under a model in which a parallel task’s runtime is dominated by the slowest processor, having more general-purpose processors to pick from results in a less-slow processor being the long pole, with a corresponding decrease in execution time. A similar analysis holds for power because power grows with the square of voltage, so the processor requiring the highest voltage for reliable operation has a disproportionate impact on the total power.
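For reference, the voltage dependence behind that last point is the standard CMOS dynamic-power approximation (my gloss, not a formula from the paper):

    P_dynamic ≈ α · C · V² · f

where α is the activity factor and C the switched capacitance. If the cores share a supply rail, V is set by the core that needs the most voltage to run reliably at the target frequency, and every core pays that squared penalty.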

Review and analysis: Strata

Strata is a constraint-satisfaction puzzle game played on small grids (up to 6x6), some squares of which are colored with one of up to six colors. The goal is to cover each row and column with a colored ribbon such that the top ribbon on each colored square matches the square’s color. Ribbons are placed one at a time, bottom-to-top, inducing a partial order; in other words, the ribbons cannot be interwoven.

Strata bills itself as “a challenging game of forethought and sequencing”, but the difficulty is fake. Below, I describe a simple algorithm for solving Strata puzzles. This algorithm orders the ribbons top-to-bottom, recursing on smaller puzzles until no constraints remain. If Strata’s user interface allowed placing the ribbons top-to-bottom, it would be clear which rows or columns remained with unsatisfied constraints, making the game trivial. Instead, Strata requires bottom-to-top ribbon placement, so already-placed ribbons can’t be used to define the subpuzzles. Strata quickly stopped being fun.

So as with most puzzle games, I started thinking about algorithms.

Analysis of Strata

The algorithm is based on the following two properties of Strata puzzles.

  • The topmost ribbon in a solved puzzle crosses only squares of its color or no color. This fact follows immediately from the rules of the game, as no other ribbon can override the topmost ribbon’s color. This also implies that puzzles lacking a row or column whose squares have at most one color have no solution.
  • Removing the row or column of the topmost ribbon in a solved puzzle produces an independent puzzle with the same solution except for the topmost ribbon. Removing that row or column removes any colored squares that were satisfied only by the topmost ribbon; as ribbons only satisfy squares in their row or column, the remaining ribbons must cover the remaining squares. Thus a solution to the reduced puzzle provides a solution to the larger puzzle if the removed row or column contains squares of at most one color (required by the previous property).

These properties suggest a simple algorithm for solving Strata puzzles. For brevity, we use row generically to refer to both rows and columns.

  • If the puzzle being solved is empty, return an empty solution.
  • Search for a row containing squares of at most one color.
    • If no such row is found, the puzzle has no solution (by the first property). Return null.
    • If such a row is found, remove that row and recurse on the reduced puzzle.
      • If the returned solution is null, return null.
      • Otherwise append to the returned solution an assignment of the removed row’s ribbon to the single color of the squares in that row (or to an arbitrary color if there are no colored squares in the row) and return this modified solution.

This algorithm returns null if the puzzle has no solution and an ordering and color assignment for the ribbons otherwise. Note that some ribbons may not be present in the solution; consider, e.g., a puzzle with all rows having at most one color, for which the returned solution may omit all the columns. These omitted ribbons are not used to satisfy any colored squares, so they may be given an arbitrary order below all present ribbons and an arbitrary color.

By the first property, if a puzzle has a solution, the algorithm will find a row or column to remove. By the second property, if the reduced puzzle has a solution, the original puzzle also has a solution. The base case is the empty puzzle, which has no constraints (no colored squares) and is vacuously solvable. Thus the algorithm is correct.

The algorithm recurses at most r+c times for a puzzle with r rows and c columns, at each step performing O(rc) work searching for a row or column with at most one color of squares, so it runs in O((r+c)rc) time.
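Here is a sketch of the algorithm in Java, under an assumed puzzle representation (not stratabot's actual code): colors[r][c] is the color (1..6) of the square at row r, column c, with 0 meaning uncolored. The returned list names the ribbons in the order a player would place them, bottom ribbon first; unconstrained ribbons are arbitrarily given color 1.

import java.util.*;

final class StrataSolver {

    /** Returns placements like "row 2 -> color 3", or null if unsolvable. */
    static List<String> solve(int[][] colors) {
        int rows = colors.length, cols = colors[0].length;
        boolean[] rowDone = new boolean[rows], colDone = new boolean[cols];
        List<String> topToBottom = new ArrayList<>();
        for (int placed = 0; placed < rows + cols; ++placed) {
            int bestRow = -1, bestCol = -1, color = -1;
            // Find a remaining row whose remaining squares have at most one color...
            for (int r = 0; r < rows && color < 0; ++r)
                if (!rowDone[r] && (color = lineColor(colors, r, -1, rowDone, colDone)) >= 0)
                    bestRow = r;
            // ...or, failing that, such a column.
            for (int c = 0; c < cols && color < 0; ++c)
                if (!colDone[c] && (color = lineColor(colors, -1, c, rowDone, colDone)) >= 0)
                    bestCol = c;
            if (color < 0)
                return null; // every remaining line has two or more colors: unsolvable
            if (bestRow >= 0) rowDone[bestRow] = true;
            else colDone[bestCol] = true;
            String line = bestRow >= 0 ? "row " + bestRow : "column " + bestCol;
            topToBottom.add(line + " -> color " + Math.max(color, 1));
        }
        Collections.reverse(topToBottom); // convert removal order to placement order
        return topToBottom;
    }

    // Single color of the not-yet-covered squares in row r (if c < 0) or
    // column c (if r < 0): 0 if no colored squares remain, -1 if two or more
    // distinct colors remain.
    private static int lineColor(int[][] colors, int r, int c,
                                 boolean[] rowDone, boolean[] colDone) {
        int color = 0;
        int rows = colors.length, cols = colors[0].length;
        int len = (c < 0) ? cols : rows;
        for (int i = 0; i < len; ++i) {
            int rr = (c < 0) ? r : i, cc = (c < 0) ? i : c;
            if (rowDone[rr] || colDone[cc] || colors[rr][cc] == 0) continue;
            if (color == 0) color = colors[rr][cc];
            else if (color != colors[rr][cc]) return -1;
        }
        return color;
    }
}

Any line with at most one remaining color can be chosen at each step; by the second property above, the greedy choice is never wrong, and the final reversal converts the top-down removal order into the bottom-up order Strata's interface demands.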


stratabot

I implemented the above algorithm in stratabot, a Java program that can play Strata by parsing its graphics and submitting mouse clicks. I spent more time on the automation than on the algorithm; Strata includes some nondeterministic “texture” (noise) in its graphics, making parsing difficult. In the end, I hardcoded pixel offsets to read colors and send clicks.

stratabot can play single puzzles or a “wave” (Strata’s word for a level set) at a time, provided a human tells it the puzzle size and number of colors. The wave and puzzle select screens could be automated with a bit more effort, but stratabot’s current state is enough to get all the cheevos, so I stopped there.

Papers: on the plane to PLDI 2014

I couldn’t sleep on the plane to PLDI 2014, so I took the opportunity to sort the paper PDFs cluttering up my desktop. Here are some of the more “frivolous” (irrelevant to my research) ones.

Checked Load: Architectural Support for JavaScript Type Checking on Mobile Processors (Anderson, Fortuna, Ceze, Eggers; PDF, gratis PDF)

This paper measures the overhead introduced by type guards in JIT-compiled JavaScript code, then proposes new load instructions (“checked load”) that compare the high byte or word of the loaded cache line against an immediate type tag, trapping on a (mis)match. The paper proposes trapping by jumping through a new architectural register, but it’s the same idea as a processor fault (already used for null checks) — avoid (static and dynamic) branch instructions in the common case. The authors point out that the type tag check can proceed in parallel with the normal cache tag check, so the processor’s critical path is not affected (in the first order). Existing branch prediction hardware can be used for predicting checked load success, achieving performance better than the naive never-trap static prediction (but note that VMs could also recompile often-failing checked loads to use an explicit branch instead).

Selective Core Boosting: The Return of the Turbo Button (Wamhoff, Diestelhorst, Fetzer, Marlier, Felber, Dice; gratis PDF)

This paper proposes giving applications direct control over processor frequency and voltage transitions to improve performance (rather than reduce power consumption). After describing both AMD and Intel’s turbo features and the performance and latency characteristics of various strategies for managing processor performance states, the authors present some applications, the most interesting of which is the FastLane software transactional memory system. In FastLane, a master thread runs with minimal instrumentation and never aborts transactions, while worker threads run with more instrumentation and must abort when conflicts are detected. Performance state control can be used to keep the master thread running at high frequencies at the expense of the worker threads, increasing overall throughput.

AMD Mantle whitepaper (gratis PDF)

I’ll leave the technical details to the paper; the real case or controversy with Mantle is whether it generalizes.

“These concepts are applicable to a range of modern GPU architectures, and are not specifically tied to GCN. Mantle is intended to provide a thin hardware abstraction layer that can be broadly compatible with both current and future architectures, while allowing architecture-specific and platform-specific features to be exposed through an extension mechanism.”

The first sentence is immediately, if not contradicted, at least placed in doubt by the first bullet in the next section, “Why a New API?”: “Success of the GCN architecture […] this makes it a great baseline on which to establish a new programming model.” The second sentence sounds a lot like the OpenGL approach — perhaps OpenGL could be lighter (I’m not an expert on 3D graphics), but it sounds like a difference of degree, not of paradigm.

Understanding the Robustness of SSDs under Power Fault (Mai Zheng, Joseph Tucek, Feng Qin, Mark Lillibridge; gratis PDF)

SSDs fail under power fault, including models marketed with power-loss protection and “enterprise” (expensive) drives.

Image Forgery Localization via Block-Grained Analysis of JPEG Artifacts (Bianchi, Piva; PDF)

This paper presents algorithms for finding edited regions of images by detecting the artifacts of JPEG generation loss after recompression. I don’t know JPEG well enough to understand how this paper is an improvement over the previous state of the art; I’m just happy that we can finally say “this looks shopped, I can tell by some of the pixels” in a quantifiable way.

The three-dimensional dynamics of the die throw (M. Kapitaniak, Strzalko, Grabski, T. Kapitaniak; PDF)

Using simulations derived from observing die throws with a high-speed (1500 fps) camera, the authors conclude that dice are not fair: they tend to land on the face that begins facing down. Further, they show that irregularly-shaped dice cannot be made fair “by continuity” (by the intermediate value theorem).

Papers: XJava

XJava is a Java language extension for stream programming (pipeline and task parallelism). The authors apparently think in terms of least publishable units, as the language extensions, compiler and runtime, and performance heuristics are each described in a separate paper.

High-Level Multicore Programming with XJava (Frank Otto, Victor Pankratius, Walter F. Tichy; PDF, gratis PDF (requires Google Scholar referrer))

Tasks are special methods with input and output ports, defined with syntax such as public T => U foo() {/* code */}. Periodic tasks have an inner work(T) method that is called for every input item; output is by push u; statements inside the work method. Periodic tasks are equivalent to StreamIt’s filters, except with a fixed input data rate of 1 (per work invocation) and no output rates.

Nonperiodic tasks don’t have work methods; instead they contain calls to other tasks connected by the ||| parallel operator and the => pipelining operator (thus they are the equivalent of StreamIt’s splitjoins and pipelines). The => operator distributes items to a parallel construct in round-robin order (StreamIt’s split roundrobin;), =>* duplicates items to each successor task (StreamIt’s split duplicate;), and =>? distributes items nondeterministically (StreamIt is statically scheduled, so has no equivalent).

XJava: Exploiting Parallelism with Object-Oriented Stream Programming (Frank Otto, Victor Pankratius, Walter F. Tichy; PDF, gratis PDF (requires Google Scholar referrer))

Tasks are desugared to Java classes; task calls instantiate those classes and pass them to a thread pool for execution. Periodic tasks are (in this paper) executed in a round-robin fashion as long as they have work; non-periodic tasks (which create new periodic tasks) are executed in parallel if threads are idle or serially otherwise.

A Language-Based Tuning Mechanism for Task and Pipeline Parallelism (Frank Otto, Victor Pankratius, Walter F. Tichy; PDF, gratis PDF)

Thread pool size, scheduling strategy, pipeline fusion, horizontal fusion (by not expanding periodic tasks), data-parallelization of replicable (stateless) tasks, and task granularity (items processed per execution) are exposed as tunable parameters. Fixed heuristics are presented, e.g., using a thread pool size of 1.5 threads per core, or using a fork-join scheduling strategy for parallel tasks and longest-queue-first for pipelined tasks. These heuristics are shown to be nearly as performant as exhaustive search on the (small) space.

My thoughts

XJava is basically StreamIt in Java but without static data rates or static scheduling. On one hand, XJava can express applications with large amounts of dynamism via dynamic load balancing. On the other hand, it must dynamically dispatch tasks using a thread pool, incurring switching and synchronization costs and inhibiting the underlying JVM JIT’s optimizations, whereas StreamIt schedules tasks completely statically. For example, XJava profits from oversubscribing cores by a factor of 1.5 (presumably because some threads are blocked synchronizing or bringing code or data back into cache), whereas StreamIt has no reason to oversubscribe.

The (syntactic) language extensions seem like a poor tradeoff: they require the use of a separate compiler for no significant gain.

Review: Thomas Was Alone

Thomas Was Alone is a 2D puzzle platformer in which the player maneuvers a set of rectangles through minimalistic levels to matching goal rectangles. Each rectangle has a different size and jump height, and they must be used together to complete the level.

The art style is minimalistic: the rectangles are just flat colors, and the levels are flat black platforms, with liquid taking the place of bottomless pits. The backgrounds do have some detail, including lights that cast dynamic shadows as the rectangles move around the stage, which is a neat effect, especially in a 2D game. Levels are all skewed slightly to the left or right, resulting in obvious jaggies.

I enjoyed the soundtrack, though I have no music vocabulary so I won’t attempt to describe it here. There’s also Bastion-style narration switching between the various rectangles’ points of view, though their personalities came across as shallow, and none of the lines were particularly humorous or deep.

Unfortunately, all the puzzles I saw in a half hour of play involved either using the best-jumping rectangle to jump to a switch, or stacking rectangles to allow the worst-jumping rectangle to reach a ledge, often multiple times in a row. The only complication was that sometimes I’d take one rectangle too far into the stage without carrying another along, forcing me to restart the stage from the beginning. I like the game’s concept, but the designers either couldn’t come up with enough puzzle types or introduced them too slowly, because my time with the game was more tedious than fun.

Papers: Knuth on profiling and optimizing the common case

An Empirical Study of FORTRAN Programs (Knuth, 1970; PDF, gratis PDF)

In this paper, Knuth uses static and dynamic analysis on a sample of Fortran programs to understand how the language is actually used and the performance of the resulting programs. To both compiler writers and programmers, Knuth emphasizes the importance of optimizing the common case.

The static analysis results showed that most program statements are simple assignments with no arithmetic or of the form x = x op y; most DO loops (Fortran’s for) are only a few statements long, use an increment of one, and are not deeply nested; most functions/arrays (syntactically identical) use few arguments/offsets; most programs have simple control flow; in general, “compilers spend most of their time doing surprisingly simple things”. In this section, Knuth focuses on speed of compilation (compilers should translate simple constructs quickly). In the paper’s introduction, he says he was once impressed that an algorithm was able to save two machine instructions from an 11-operand expression compared to his algorithm, but now knows an expression’s average operand count is only 2. The implication, of course, is that a small savings on programs users don’t actually write does not merit the complexity of the new algorithm, and in general optimization effort should be focused on constructs actually appearing in programs.

Dynamic analysis was used to gather “frequency counts”, about which Knuth says, “it is difficult to express in words just how tremendously ‘eye-opening’ they are” and “frequency counts are so important they deserve a special name”, thus coining the term profile. (Knuth notes that frequency counts go back to von Neumann and Goldstine 1948, and were previously used both as a debugging aid and for performance analysis, but he treats profile as an original coinage.) Programs in his sample spent most of their time in 4 percent of their code; by the nature of Fortran programs, these were usually small inner loops. Profiling the profiling instrumenter itself identified two loops as hot spots, and an hour of optimization effort doubled its speed. Knuth also gives an example of profiling prompting a change in data structure: a program was spending most of its time inserting into a sorted array, but not much time performing binary search on that array, so it was modified to use a hash table instead.

To determine how advanced compiler optimizations affect hot spots, Knuth hand-translated some of the inner loops at various (notional) levels of optimization, concluding that the then-known optimizations performed well enough on the programs in his sample. Knuth also notes that compilers can use profiling data to spend more time on the most frequently-executed portions of the program. (Today’s profile-guided optimizations go further than just working harder; for example, a function called from both hot and cold regions might be cloned so different sets of optimizations can be applied.)

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.

Review: Super Hexagon

Super Hexagon is a very simple game in which the player rotates a small triangle around a hexagon in the center of the screen to avoid incoming lines (parallel with the edges of the hexagon), accompanied by a chiptune soundtrack.

Despite the inherent division into six regions, the game uses continuous control, so while I could pick up on the lines just fine and decide to move three regions over, I often under- or over-rotated and lost. I could adapt, but I don’t find the game compelling enough to keep playing.

For 29 cents (on sale), I don’t regret trying it, but I don’t expect to touch it again.