Haskell vs. Java: a comparison on a real task (satellites, ICFP Contest)
An article about Nemerle and ICFP 2009 appeared on Habr today, and I would like to share my own recent research on the topic. My task was to write an ideal compiler for the VM from the problem statement, do it in Haskell, and, most importantly, compare the speed of the resulting code in Java and Haskell. The full solution to the ICFP problem is not given here: it is a brute-force task, and the VM is its innermost loop, the part the performance of the whole brute-force solution depends on, which is what makes it interesting.
Introduction to the task: the organizers give you a program in the form of bytecode, together with its specification. The bytecode contains no jump instructions, only instructions of the form m1 <- m2 + m3 and the like, including an instruction for conditionally selecting one of two cells. There are three addressable memory areas, each an array of doubles: input-parameter memory (read-only), working memory (read/write), and output-parameter memory (write-only). One pass of the program yields the coordinates of the celestial bodies at the next point in time; the memory must be preserved between passes, because it holds the state of the world. Using this "world simulator", you feed control inputs to a certain satellite, which must fly around this world and accomplish something. All coordinates are computed by well-known formulas close to the real ones, and everything comes out very beautifully. These formulas are what is encoded in the VM. Using this VM you must, ultimately, control the satellite optimally, and those control sequences are what they will (or maybe won't) hand out a prize for.
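To make the shape of this bytecode concrete, here is a minimal sketch of how its instruction set could be modeled; the constructor names and the exact shape of the conditional-select instruction are illustrative guesses, not the contest's encoding:

data Addr = Input Int | Work Int | Output Int  -- the three memory areas
  deriving Show

data Instr
  = Add Addr Addr Addr       -- m1 <- m2 + m3
  | Sub Addr Addr Addr
  | Mul Addr Addr Addr
  | Div Addr Addr Addr       -- division, with x / 0 defined as 0
  | Sel Addr Addr Addr Addr  -- conditional selection of one of two cells
  deriving Show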
Before comparing Haskell and Java, I would like to clear up the comparison of OCaml and C/C++ speeds that was discussed in the comments there. In the original article an interpreter written in C++ was compared with a JIT compiler in OCaml, hence OCaml's apparent superiority in speed. Here we will instead compare identical, optimally generated virtual machines, compiled ahead of time.
What can I say about speed? On this task Haskell is more than twice as fast as Java. And it is well known that if you generate C code instead, performance rises by another factor of 5-10: we tried this with non-ideal code during the contest.
So, the benchmark: task 4001 as a whole, that is, all objects: 12 passive satellites + the Moon + our satellite. We measure iterations per second.
Hardware: Intel Q6600, not overclocked; Arch Linux 64-bit; GHC 6.10.3; JDK 1.6u14; everything single-threaded, since we are measuring Haskell's quality as a compiler. The Java numbers were taken after HotSpot had warmed up, which happened within the first seconds, after which the result did not change.
Results (iterations per second; one iteration = the whole world computed in tasks 4001..4004):
* Java version (as submitted to the contest, naive compilation): 22K iter/sec
* Java version (ideal compilation): 44K iter/sec
* Java version (ideal compilation, 32-bit JDK): 32K iter/sec
* Haskell version (lazy fields in VMState): 1.73K iter/sec, up to 8.1K iter/sec with partial strictness => not competitive
* Haskell version (strict fields in VMState): 96K iter/sec
An ideal Java VM looks like this (a representative snippet follows). What do we have here? Everything temporary within an iteration lives in local variables, persistent state lives in fields, and expressions are nested with parentheses (except reused subexpressions, which get their own local variables). Constants are emitted as literal constants. The ports sit in an array; they could have been given individual names, but there are few of them, so this does not affect performance. One call to the function moves the object to its next state (tick + 1). A nice property of the assignment order is that there are no forward references to temporary variables.
public class Vm {
    double m2039;        // persistent working-memory cell (one field per cell)
    { m2039 = 0.0; }     // initial state

    // one call = one tick: i[] are the input ports, o[] the output ports
    public void fw(double[] i, double[] o) {
        double t60 = ((6.67428e-11) * (t59 + t41));
        double t61 = ((((t53 * t53) * t53) != 0) ? t60 / ((t53 * t53) * t53) : 0);
        m2039 = (t12 + 1.0);
        o[0] = ((((Math.sqrt(((t1989 * t1989) + (t1987 * t1987)))) - 6357000.0) < 0) ? ...
    }
}
In contrast, an imperfectly compiled VM looks like this: fields are used instead of local temporary variables, and instead of nested bracketed expressions the result of every single step is written to a field. That is, almost one-to-one with the specification issued in the PDF for the task:
d57 = d2042;
d59 = (d13 == 0.0 ? d56 : d57);
d60 = d34 * d59;
d61 = (d55 != 0.0 ? d60 / d55 : 0.0);
d62 = d45 * d61;
d63 = d62 + d41;
d64 = d63 * d4;
d65 = d2114;
d66 = d9 * d9;
d67 = d11 * d11;

Now let's look at how the same thing is done in Haskell. Mutating an object is not customary here, and there are no simple means for it. Instead, we declare the object (an algebraic data type), a function that creates the initial instance, and a function that builds, from one instance, another instance one tick later. The code is similar to the Java version, but there are no temporary variables: let-bindings take their place. The test consisted of printing the entire state to the screen (to force all the lazy chains) after 45,000 iterations.
data VMState = VMState { m2039,...,o101::Double }  -- lazy "structure fields"
initState = VMState {m2039=0.0....}                -- build the zeroth instance
nextState x (i16000,i3,i2) =                       -- compute the next instance
  let ....                                         -- temporary variables
      t18 = (if t13 == 0 then 3.84399e8 else m2051 x) :: Double
  in  -- build the new instance; syntactically "a copy of x, but with some new values"
      x { m2039 = .... , o39 = (if (t1699 - (t12 + t14*2)) == 0 then 0.0 else 1.0) }
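For reference, the test driver could look like the sketch below, assuming VMState derives Show and guessing (from the parameter (i16000,i3,i2)) that the input tuple carries the scenario number plus a zero control impulse:

main :: IO ()
main = print (go (45000 :: Int) initState)  -- printing the whole state forces all lazy chains
  where
    go 0 s = s
    go n s = go (n - 1) (nextState s (4001.0, 0.0, 0.0))  -- illustrative constant inputs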
In the VMState type the constructor parameters (structure fields) are lazy. If a VMState instance was produced from the previous one via nextState, it is full of lazy placeholders (thunks) sitting on the heap. And if we need to compute 45,000 ticks, we have to walk through 45,000 instances, each with a good hundred fields, all of them lazy. In short, this code gives us 1.73K iterations per second, and that is terrible. But if you simply make the fields strict, everything starts to run briskly (96K iter/sec). The profiler shows practically no extra memory allocation (only the VMState instances themselves), and the load on the GC is tiny...
data VMState = VMState { m2039,...,o101 :: !Double }  -- note the exclamation mark (strictness annotation) before Double
In principle this research could already be enough, but perhaps the intermediate variants will give us some hope that laziness can still deliver acceptable speed?
To find out, let us allow ourselves a subtask: most of the time, tasks 4001-4004 only need our own spatial position in response to our impulses, and the coordinates of the other satellites can be computed once and for all (in advance). Let's demand from the VM only the position, not the whole structure with the planets' coordinates and so on. Our position corresponds to the fields o2 and o3 (output ports 0x2 and 0x3 in the specification). Since Haskell is a lazy language, we can use the same function (nextState, lazy version) for this narrow purpose. In this mode the speed is 4.5K iter/sec, more than twice as fast, because the other values are simply never computed!
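A sketch of what "demanding only the position" means in code (same illustrative inputs as above): mapping the pair (o2, o3) over the lazy state chain means that fields outside the dependency cone of o2 and o3 are never evaluated.

ourTrajectory :: [(Double, Double)]
ourTrajectory =
  map (\s -> (o2 s, o3 s))  -- demand only our own coordinates
      (iterate (\s -> nextState s (4001.0, 0.0, 0.0)) initState)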
Now let's make just these two fields strict and leave all the other fields lazy. The speed is 8.1K iter/sec! This is probably because the garbage collector has noticeably less work to do: once a strict field has been assigned, the chain of lazy computations behind it can be released immediately, rather than only after the result is printed at the end of all the iterations.
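The mixed declaration is a small change (a sketch; the real field list is, of course, a hundred entries long):

data VMState = VMState
  { o2, o3 :: !Double  -- strict: our coordinates, forced (and their thunks freed) every tick
  , m2039  :: Double   -- lazy, and likewise for the remaining fields
  -- , ...
  }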
Why is Haskell faster than Java on our task? Probably because it compiles to a native executable, even allowing for the fact that the assembly GHC generates is poor compared to what C/C++ produces. And probably the compiler does a better job here because it can be sure that identical subexpressions (for example, in conditions like `m12345 x == 0`) need to be computed only once. That is a luxury Java cannot afford: in Java we would have to write extra code spelling out these details by hand, while in Haskell the compiler is smart enough on its own.
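For illustration, in the generated let-bindings that sharing is explicit, and purity guarantees it is safe; a self-contained fragment in the style of the snippets above (safeDiv is my name, not the generated code's):

safeDiv :: Double -> Double -> Double
safeDiv t60 t53 =
  let c = (t53 * t53) * t53          -- the cube is computed once...
  in  if c /= 0 then t60 / c else 0  -- ...and reused in both the test and the division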
A few notes on the compiler code (which lives here):
1. It generates the fully lazy variant (the strictness annotations were added by hand on top of the generated code, for research purposes).
2. Constants are emitted in a human-readable format, not bit-exact.
3. There is code inside that generates Java as well; it is disabled, but it is there.
4. The compiler, including comments, blank lines, and the Java branch, is 238 lines.
5. The compiler is fast enough on bin4.obf (task 4); its code is not optimized (for the sake of readability).
Improvements and comments on the compiler are welcome. If there is interest, I can write a similar article about the compiler itself, namely about how easy it is to translate flat expressions into bracketed ones, and so on.
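As a teaser, the core of that translation is inlining temporaries that are used exactly once; a toy sketch (the Expr type and the maps are illustrative, not the actual compiler's code):

import qualified Data.Map as M

data Expr = Var String | Lit Double | Bin String Expr Expr
  deriving Show

inline :: M.Map String Expr  -- definitions: name -> right-hand side
       -> M.Map String Int   -- how many times each name is used
       -> Expr -> Expr
inline defs uses e = case e of
  Var v | Just d <- M.lookup v defs
        , M.lookup v uses == Just 1 -> inline defs uses d  -- single use: substitute the definition
  Bin op a b -> Bin op (inline defs uses a) (inline defs uses b)
  _          -> e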
Thanks for your attention.