Why is C faster than Java (from the perspective of a Java developer)
- Translation
A discussion on the Git mailing list, which grew out of a thread about JGit, turned to how a high-level programming language reduces application performance. It is especially interesting because the participants were expert programmers in both C and Java. One of them is Shawn O. Pearce, a well-known Java programmer, an active committer at Eclipse, a co-author of Git, and the author of JGit, a Java implementation of Git. In his message he described the real limitations that a highly skilled developer faces when trying to write Java code whose performance is comparable to heavily optimized C code. Although the letter dates back to April 2009, some of Shawn's arguments are still relevant.
P.S. Shawn Pearce's post was written in 2009, so it does not take into account changes made in Java 1.7. For example, Java now uses escape analysis to avoid allocating memory on the heap where it can.
List: git
Subject: Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git,
From: "Shawn O. Pearce" <spearce () spearce ! org>
As mentioned earlier, we have made a lot of small optimizations in the C Git code to achieve really high performance. 5% here, 10% there, and suddenly you are 60% faster than before. Nico [Pitre], Linus [Torvalds], and Junio [Hamano] have all spent time over the last three to four years optimizing individual pieces of Git, solely to make it run as fast as possible.
High-level programming languages hide the machine to some extent, so we cannot carry out all of these optimizations.
For example, JGit suffers from the absence of mmap(), and even when we use a Java NIO MappedByteBuffer, we still need to make a copy into a temporary byte[] array before we can actually process the data. C Git has no such copy. Of course, other high-level languages may offer a more convenient mmap, but they also tend to have garbage collection, and most of them try to tie mmap management to the garbage collector "for safety and simplicity."
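The extra copy described above can be sketched as follows. This is only an illustration under stated assumptions, not JGit's actual code: the class name MappedCopyDemo and the method readRegion are hypothetical, while FileChannel.map and ByteBuffer.get(byte[]) are standard java.nio calls.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Hypothetical helper, not part of JGit.
final class MappedCopyDemo {
    /** Maps a region of a file, then copies it into a heap array for processing. */
    static byte[] readRegion(Path file, long position, int length) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, position, length);
            // Code that works on byte[] cannot read the mapping directly,
            // so the mapped bytes are copied into a temporary array first.
            byte[] tmp = new byte[length];
            map.get(tmp);
            return tmp;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("mapped-demo", ".bin");
        file.toFile().deleteOnExit();
        Files.write(file, new byte[] {10, 20, 30, 40, 50});
        System.out.println(Arrays.toString(readRegion(file, 1, 3))); // [20, 30, 40]
    }
}
```

In C, the pointer returned by mmap() could be handed straight to the parsing code, so the temporary array and the copy into it would not be needed.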
JGit also suffers from the lack of unsigned data types in Java. There are many places in JGit where we really need unsigned int32_t, unsigned long (the largest available machine word), or unsigned char, but these types simply do not exist in Java. Converting a byte to an int just to treat it as unsigned requires an extra & 0xFF operation to remove the sign extension.
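A minimal sketch of the masking described above. The class and method names are hypothetical and chosen only for illustration.

```java
// Hypothetical helper, not part of JGit.
final class UnsignedDemo {
    /** Reads a signed Java byte as an unsigned value in the range 0..255. */
    static int toUnsigned(byte b) {
        // Widening byte -> int sign-extends; the mask clears the upper 24 bits.
        return b & 0xFF;
    }

    /** Decodes 4 big-endian bytes as an unsigned 32-bit value held in a long. */
    static long decodeUInt32(byte[] buf, int offset) {
        return ((long) (buf[offset] & 0xFF) << 24)
             | ((buf[offset + 1] & 0xFF) << 16)
             | ((buf[offset + 2] & 0xFF) << 8)
             | (buf[offset + 3] & 0xFF);
    }

    public static void main(String[] args) {
        byte b = (byte) 0xF0;
        System.out.println((int) b);       // -16 (sign-extended)
        System.out.println(toUnsigned(b)); // 240
    }
}
```

Because Java has no unsigned 32-bit type, the decoded value has to be widened to a long to keep its full range, which is part of the cost the letter describes.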
JGit suffers from the lack of an efficient way to represent a SHA-1. In C you can simply write unsigned char[20] and have it stored inline in the memory of the containing structure. In Java, a byte[20] costs an additional 16 bytes of memory, and access to it is slower because the bytes themselves live in a different area of memory from the containing object. We try to work around this by converting the byte[20] into five ints, but that costs extra machine instructions.
C Git takes for granted that memcpy(a, b, 20) is extremely cheap when copying from an inflated tree into a struct object. JGit has to pay a heavy penalty to copy those 20 bytes into five ints, because later on those five ints are cheaper to work with.
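The five-int workaround can be sketched like this. Sha1Id, fromRaw, and decodeInt32 are hypothetical names used for illustration; this is not JGit's actual ObjectId implementation.

```java
// Hypothetical class, not JGit's ObjectId.
final class Sha1Id {
    // Five primitive fields live inline in the object, unlike a byte[20],
    // which would be a separate array object reached through a reference.
    final int w1, w2, w3, w4, w5;

    Sha1Id(int w1, int w2, int w3, int w4, int w5) {
        this.w1 = w1; this.w2 = w2; this.w3 = w3; this.w4 = w4; this.w5 = w5;
    }

    /** Converts 20 raw bytes starting at offset into five big-endian ints. */
    static Sha1Id fromRaw(byte[] raw, int offset) {
        return new Sha1Id(
            decodeInt32(raw, offset),
            decodeInt32(raw, offset + 4),
            decodeInt32(raw, offset + 8),
            decodeInt32(raw, offset + 12),
            decodeInt32(raw, offset + 16));
    }

    // Several shifts, masks, and ORs per word; C would do one 20-byte memcpy.
    private static int decodeInt32(byte[] b, int p) {
        return (b[p] << 24)
             | ((b[p + 1] & 0xFF) << 16)
             | ((b[p + 2] & 0xFF) << 8)
             | (b[p + 3] & 0xFF);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Sha1Id)) return false;
        Sha1Id other = (Sha1Id) o;
        // Five int comparisons, no array dereference or loop.
        return w1 == other.w1 && w2 == other.w2 && w3 == other.w3
            && w4 == other.w4 && w5 == other.w5;
    }

    @Override
    public int hashCode() {
        return w1; // SHA-1 bits are well distributed, so one word is enough here
    }

    public static void main(String[] args) {
        byte[] raw = new byte[20];
        for (int i = 0; i < raw.length; i++) raw[i] = (byte) i;
        System.out.println(String.format("%08x", fromRaw(raw, 0).w1)); // 00010203
    }
}
```

The conversion is paid once per object read, while comparisons and hash lookups afterwards touch only the five inline fields.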
Other high-level programming languages also lack the ability to mark a type as unsigned, or they pay similar penalties for storing a 20-byte binary region.
The native Java collection types have been a real trap for us in JGit. We have used java.util.* types where they seemed handy and appeared to already solve the data structure problem at hand, but as a rule they perform much worse than a specialized data structure written for the purpose.
For example, we have ObjectIdSubclassMap for what should have been a Map<ObjectId,Object>. The only catch is that the Object type you use as the "value" must extend ObjectId, because the same instance serves as both key and value. But it is dramatically faster than HashMap<ObjectId,Object>. (For those who do not know, ObjectId is JGit's unsigned char[20] for a SHA-1.)
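The idea of one instance acting as both key and value can be sketched as below. This is a simplified illustration, not JGit's ObjectIdSubclassMap: Id, Commit, and IdSubclassMap are hypothetical names, and a single long stands in for the 20-byte SHA-1 to keep the sketch short.

```java
// Hypothetical key base class; a long stands in for the full SHA-1.
class Id {
    final long bits;
    Id(long bits) { this.bits = bits; }
}

// A value type that extends the key type, so it carries its own key.
final class Commit extends Id {
    final String message;
    Commit(long bits, String message) { super(bits); this.message = message; }
}

// Open-addressing table that stores the values directly: no entry objects,
// no separate key objects. Duplicate keys are not handled in this sketch.
final class IdSubclassMap<V extends Id> {
    private Id[] table = new Id[16]; // length is always a power of two
    private int size;

    void add(V value) {
        if ((size + 1) * 2 > table.length) grow(); // keep load factor <= 0.5
        insert(table, value);
        size++;
    }

    @SuppressWarnings("unchecked")
    V get(Id key) {
        int mask = table.length - 1;
        for (int i = slot(key, mask); table[i] != null; i = (i + 1) & mask) {
            if (table[i].bits == key.bits) return (V) table[i];
        }
        return null;
    }

    private static int slot(Id key, int mask) {
        return (int) (key.bits ^ (key.bits >>> 32)) & mask;
    }

    private static void insert(Id[] tbl, Id value) {
        int mask = tbl.length - 1;
        int i = slot(value, mask);
        while (tbl[i] != null) i = (i + 1) & mask; // linear probing
        tbl[i] = value;
    }

    private void grow() {
        Id[] old = table;
        table = new Id[old.length * 2];
        for (Id e : old) if (e != null) insert(table, e);
    }

    public static void main(String[] args) {
        IdSubclassMap<Commit> map = new IdSubclassMap<>();
        map.add(new Commit(42L, "initial commit"));
        System.out.println(map.get(new Id(42L)).message); // initial commit
    }
}
```

Compared with a HashMap, each stored object costs one array slot instead of a separate entry node plus a key reference, which is one plausible reason a structure like this can be faster.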
Just a couple of days ago I wrote LongMap, a faster alternative to HashMap<Long,Object>, for hashing objects by their offsets in a pack file. The same thing applies here: the cost of boxing in Java, converting a long (the largest integer type) into an object that the standard HashMap would accept, was quite high.
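A map keyed by primitive long values, avoiding the boxing described above, might look roughly like this. LongMapSketch is a hypothetical name and a simplified design; it is not the LongMap that was written for JGit.

```java
// Hypothetical sketch of a long-keyed map; not JGit's LongMap.
final class LongMapSketch<V> {
    private long[] keys = new long[16];       // primitive keys, never boxed
    private Object[] values = new Object[16]; // a null value marks an empty slot
    private int size;

    void put(long key, V value) {
        if (value == null) throw new IllegalArgumentException("null values are not supported");
        if ((size + 1) * 2 > keys.length) grow(); // keep load factor <= 0.5
        if (insert(keys, values, key, value)) size++;
    }

    @SuppressWarnings("unchecked")
    V get(long key) {
        int mask = keys.length - 1;
        for (int i = slot(key, mask); values[i] != null; i = (i + 1) & mask) {
            if (keys[i] == key) return (V) values[i];
        }
        return null;
    }

    int size() { return size; }

    private static int slot(long key, int mask) {
        return (int) (key ^ (key >>> 32)) & mask;
    }

    /** Returns true if a new key was added, false if an existing one was updated. */
    private static boolean insert(long[] ks, Object[] vs, long key, Object value) {
        int mask = ks.length - 1;
        int i = slot(key, mask);
        while (vs[i] != null) {
            if (ks[i] == key) { vs[i] = value; return false; }
            i = (i + 1) & mask; // linear probing
        }
        ks[i] = key;
        vs[i] = value;
        return true;
    }

    private void grow() {
        long[] oldKeys = keys;
        Object[] oldValues = values;
        keys = new long[oldKeys.length * 2];
        values = new Object[oldValues.length * 2];
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldValues[i] != null) insert(keys, values, oldKeys[i], oldValues[i]);
        }
    }

    public static void main(String[] args) {
        LongMapSketch<String> byOffset = new LongMapSketch<>();
        byOffset.put(4096L, "object at offset 4096");
        System.out.println(byOffset.get(4096L)); // object at offset 4096
    }
}
```

With HashMap<Long,Object>, every put and get that takes a primitive long has to go through Long.valueOf, which may allocate a new object; here the key stays in a long[] for its whole lifetime.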
Even now, JGit still pays dearly when it comes to taking apart a commit or tree object to follow the object links, or when calling inflate(). We spend a lot more time on this kind of work than C Git does, even though we try to stay as close to the machine as we can by using byte[] wherever possible, avoiding copying wherever possible, and avoiding memory allocation wherever possible.
Notably, rev-list --objects --all takes about twice as long in JGit as it does in C Git on a project like the Linux kernel, and index-pack on a pack file of about 270 MB also takes about twice as long.
Both parts of JGit are about as good as I know how to make them, but we are really at the mercy of the JIT, and any change in the JIT can make our performance worse (or better). That is unlike C Git, where Linus Torvalds can study assembler dumps of sections of code and try different approaches.
So yes, it is practical to build Git in a high-level language, but you simply will not get the same performance or the same tight memory usage that C Git gets. That is what the abstractions of a high-level language cost you. Still, JGit performs reasonably well; well enough that we use it as a Git server inside Google.