Sunday, April 12, 2009

Multiple return/tuples and closures in Java

Learning from Functional Programming

I spent some time crafting a file scanner and developing byte-level file scanning. Since I've spent some time learning Erlang I had an eye on how to duplicate the same algorithms in a functional language.

Java's lack of transparent multiple-return values inevitably leads to some implementation patterns that result in sub-optimal (object hungry) or thread-hostile code. The JVM has been optimized to handle these, but key language features may produce a more expressive language and will certainly allow novices a more direct and intuitive path to better performing implementations of common algorithms.

At the very least all Java developers can be helped a great deal by programming in functional / side-effect free code. The trade-off for multithread safe code is the rapid creation/collection of small objects.

The Scanning Algorithm

The basic algorithm is present here intertwined with some code that marks the offsets of message boundaries in an array.

    // LISTING 1 - primordeal Scanner algorithm

int length = _prototypes.remaining();

int[] offsets = new int[length / 15 + 1];

int offsetIndex = 0;
int i = 0;
loop: while (true)
{
if (i >= length)
{ // EOF case
offsets = ensureCapacity(offsets, offsetIndex + 1);
offsets[offsetIndex++] = i;
break loop;
}
switch (_prototypes.get(i))
{
case 10: // unix newline \10
offsets = ensureCapacity(offsets, offsetIndex + 1);
offsets[offsetIndex++] = i;
break;
case 13: // dos or apple newline \13 \10
offsets = ensureCapacity(offsets, offsetIndex + 1);
offsets[offsetIndex++] = i;
if (_prototypes.get(i + 1) == 10)
i++; // skip 10
break;
case 0: // pad case
offsets = ensureCapacity(offsets, offsetIndex + 1);
offsets[offsetIndex++] = i;
break loop;
}
i++;
}
int[] copy = new int[offsetIndex];
System.arraycopy(offsets, 0, copy, 0, Math.min(offsets.length,
offsetIndex));
_prototypeOffsets = copy;


This served its purpose for reading a single chunk of text and splitting it. Different inputs and outputs or continuing past a single buffer dictate the need for a different operation to take place at each of the cases.

Object-Oriented Abstraction

Here I extracted the scanning algorithm from that code, producing an abstract and concrete implementation. If I were going to scan for a different message boundary or introduce additional end-of-file conditions they could be included in an alternate scanner implementation.
  // LISTING 2 - Scanner as an object
abstract class ByteScanner {
public enum ScanResult {
UNDERFLOW,
BOUNDARY_MATCH,
END_MATCH;
}

protected int position = 0;

public void position(int pos) {
position = pos;
}

public int position() {
return position;
}

public abstract ScanResult scan(ByteBuffer buff, int start, int limit);
}

class NewlineScanner extends ByteScanner {
public ScanResult scan(final ByteBuffer buff, final int start, final int limit) {
position = start;

while(true) {
if (position >= limit) { // read data
return ScanResult.UNDERFLOW;
}

switch (buff.get(position)) {
case 10: // unix newline \10
if (position - start <= 1)
continue; // dos newline detect
return ScanResult.BOUNDARY_MATCH;
case 13: // dos or apple newline \13 (\10)
return ScanResult.BOUNDARY_MATCH;
case 0: // pad case
return ScanResult.END_MATCH; // eof
}
position++;
}
}
}


While the first listing is capable of processing separate files with only additional overhead of its required input & output variables, the second listing also requires a new instance of the scanner because it contains the position state.

Toward Side-Effect Free Code

Many many Java libraries hold some state or object caches in their instances including the chief Java anti-patterns: java.util.Calendar and java.util.Date. Java's Date class is so badly broken that it simply should be avoided in a multithreaded application (try joda-time or JSR 310's time API). XML input and output factories and other encoding libraries may seem stateless, but enough of them hold a state (or the use-case to avoid side-effect is unclear) that it is usually safer to make use of ThreadLocal to work around any hidden state.

// LISTING 3 - Scanner as an object with faked multi-return
abstract class ByteScanner2 {
public enum ResultType {
UNDERFLOW,
BOUNDARY_MATCH,
END_MATCH;
}

public static class ScanResult {

public int position;
public ResultType type;
}

public ScanResult result(int position, ResultType type) {
ScanResult sr = new ScanResult();
sr.position = position;
sr.type = type;
return sr;
}

public abstract ScanResult scan(ByteBuffer buff, int start, int limit);
}

class NewlineScanner2 extends ByteScanner2 {
public ScanResult scan(final ByteBuffer buff, final int start, final int limit) {
int position = start;

while(true) {
if (position >= limit) { // read data
return result(position,ResultType.UNDERFLOW);
}

switch (buff.get(position)) {
case 10: // unix newline \10
if (position - start <= 1)
continue; // dos newline detect
return result(position,ResultType.BOUNDARY_MATCH);
case 13: // dos or apple newline \13 (\10)
return result(position,ResultType.BOUNDARY_MATCH);
case 0: // pad case
return result(position,ResultType.END_MATCH); // eof
}
position++;
}
}
}


Listing 3 presents a side-effect free scanner implementation. There is a fair amount of boilerplate code needed to encapsulate the multiple parts of the state. This is the pattern on which I continue to base my state-free parsers.

Reducing Boilerplate and Encouraging Functional Programming in Novices

I'm still studying the Java closures and tuples proposals, but here is an approximation of the implementation with those features included. The motivation is not only to make listing 3 look better, but to make it more expressive and elegant thereby encouraging implementations which
are free of side-effects without the need for analysis. This means that novices can regularly produce correct (multithread safe or side-effect free) code the first time they compile it.

Here let "Function<int,ResultType>" represent a closure with a 2 parameter return statement. This isn't in the current closures spec and the tuples spec doesn't address the closure syntax issue.

// LISTING 4 - Scanner as a closure with true multi-return
public enum ResultType {
UNDERFLOW,
BOUNDARY_MATCH,
END_MATCH;
}

Function<int,ResultType> scan = { ByteBuffer buff, int start, int limit =>
int position = start;

while(true) {
if (position >= limit) { // read data
return position,ResultType.UNDERFLOW;
}

switch (buff.get(position)) {
case 10: // unix newline \10
if (position - start <= 1)
continue; // dos newline detect
return position,ResultType.BOUNDARY_MATCH;
case 13: // dos or apple newline \13 (\10)
return position,ResultType.BOUNDARY_MATCH;
case 0: // pad case
return position,ResultType.END_MATCH; // eof
}
position++;
}
}
}

1 comment:

  1. I fixed my layout to stretch and show all the preformatted sections properly. Reload if it's chopped off.

    ReplyDelete