Why is C faster than Java (from the perspective of a Java developer)

Original author: Shawn O. Pearce
  • Translation
A discussion on the Git mailing list, which grew out of a thread about JGit, turned to how a high-level programming language reduces application performance. The discussion is especially interesting because its participants were top-level experts in both C and Java. One of them is Shawn O. Pearce, a well-known Java programmer, an active Eclipse committer, a co-author of Git, and the author of JGit, a Java implementation of Git. In his message he described the real limitations that a highly skilled developer faces when trying to write Java code whose performance is comparable to heavily optimized C code. Although the letter dates back to April 2009, some of Shawn's arguments are still relevant.

List: git
Subject: Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git,
From: "Shawn O. Pearce" <spearce () spearce ! org>


As mentioned earlier, we have made a lot of small optimizations in the C Git code to achieve really high performance. 5% here, 10% there, and suddenly you are 60% faster than before. Nico [Pitre], Linus [Torvalds] and Junio [Hamano] have all spent time over the last three to four years optimizing individual parts of Git, solely to make it run as fast as possible.

High-level programming languages hide the machine to some extent, so we cannot make all of these optimizations.

For example, JGit suffers from the absence of mmap(), and even when we use Java NIO's MappedByteBuffer, we still have to copy the data into a temporary byte[] array to be able to actually process it. C Git makes no such copy. Of course, other high-level languages may offer a more convenient mmap facility, but they also tend to be garbage collected, and most of them try to tie mmap management to the garbage collector "for safety and simplicity."
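The extra copy described above can be illustrated with a minimal sketch. It is not JGit code: the class `MappedCopyDemo` and its methods `readWindow` and `demo` are hypothetical names invented for this example, and the temporary file stands in for a pack file. It only uses standard `java.nio` calls to map a file and then copy a window of it into a heap `byte[]`, which is what code that works on arrays needs.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class, not part of JGit.
final class MappedCopyDemo {
    // Maps the whole file read-only, then copies `length` bytes starting at
    // `offset` into a freshly allocated heap array.
    static byte[] readWindow(Path file, int offset, int length) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            byte[] tmp = new byte[length]; // extra allocation that C avoids
            map.position(offset);
            map.get(tmp, 0, length);       // extra copy that C avoids
            return tmp;
        }
    }

    // Writes a small temporary file and reads four bytes back through the mapping.
    static String demo() {
        try {
            Path file = Files.createTempFile("mapped-copy", ".bin");
            file.toFile().deleteOnExit();
            Files.write(file, "tree 1234".getBytes(StandardCharsets.US_ASCII));
            return new String(readWindow(file, 5, 4), StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In C, the pointer returned by mmap() can be handed directly to any function that takes a byte pointer, so neither the allocation nor the copy is needed.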

JGit also suffers from the lack of unsigned data types in Java. There are many places in JGit where we really need unsigned int32_t, unsigned long (the largest available machine word), or unsigned char, but these types simply do not exist in Java. Converting a byte to an int just to treat it as unsigned requires an extra & 0xFF operation to remove the sign extension.
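A short sketch of the masking mentioned above, assuming nothing beyond the Java language itself. The class `UnsignedDemo` and its helper names are made up for illustration and are not taken from JGit.

```java
// Hypothetical helper class showing how unsigned values are emulated in Java.
final class UnsignedDemo {
    // Treats a signed Java byte as an unsigned value in the range 0..255.
    static int u8(byte b) {
        return b & 0xFF; // the mask strips the sign extension added by widening
    }

    // Treats a signed Java int as an unsigned 32-bit value, widened to long.
    static long u32(int i) {
        return i & 0xFFFFFFFFL;
    }

    // Decodes a big-endian unsigned 32-bit integer from four bytes.
    static long decodeUInt32(byte[] buf, int off) {
        return ((long) u8(buf[off]) << 24)
                | (u8(buf[off + 1]) << 16)
                | (u8(buf[off + 2]) << 8)
                | u8(buf[off + 3]);
    }

    public static void main(String[] args) {
        byte b = (byte) 0xFF;
        System.out.println(b);      // the signed view of the byte
        System.out.println(u8(b));  // the unsigned view of the same bits
    }
}
```

Every read of a byte that is meant to be unsigned pays for one such mask, where C would simply declare the variable as unsigned char.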

JGit suffers from the lack of an efficient way to represent a SHA-1. In C you can simply write unsigned char[20] and have it laid out inline in the container's memory. In Java, a byte[20] costs an additional 16 bytes of memory and is slower to access, because the bytes themselves live in a different area of memory from the container object. We try to work around this by converting the byte[20] into five ints, but that costs extra machine instructions.

C Git takes for granted that memcpy(a, b, 20) is extremely cheap when copying data from an inflated tree into a struct. JGit has to pay a heavy penalty to copy those 20 bytes into five ints, because later on those five ints are cheaper to work with.
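The conversion of 20 raw bytes into five ints might look roughly like the sketch below. The class name `Sha1Id`, its field names, and its methods are assumptions made for this example rather than JGit's actual code. It shows where the extra shifts and masks come from, and why later comparisons are cheap.

```java
// Hypothetical holder for a SHA-1 stored as five int fields instead of a byte[20].
final class Sha1Id {
    final int w1, w2, w3, w4, w5;

    Sha1Id(int w1, int w2, int w3, int w4, int w5) {
        this.w1 = w1; this.w2 = w2; this.w3 = w3; this.w4 = w4; this.w5 = w5;
    }

    // Packs four big-endian bytes into one int; this is the per-word cost
    // that a plain memcpy in C does not pay.
    static int be32(byte[] raw, int off) {
        return (raw[off] << 24)
                | ((raw[off + 1] & 0xFF) << 16)
                | ((raw[off + 2] & 0xFF) << 8)
                | (raw[off + 3] & 0xFF);
    }

    // Reads 20 bytes starting at `off` and stores them inline as five ints.
    static Sha1Id fromRaw(byte[] raw, int off) {
        return new Sha1Id(be32(raw, off), be32(raw, off + 4), be32(raw, off + 8),
                be32(raw, off + 12), be32(raw, off + 16));
    }

    // Comparing two ids is five int comparisons, with no array dereference.
    boolean sameAs(Sha1Id o) {
        return w1 == o.w1 && w2 == o.w2 && w3 == o.w3 && w4 == o.w4 && w5 == o.w5;
    }

    // Formats the id as 40 lowercase hex digits.
    String toHex() {
        return String.format("%08x%08x%08x%08x%08x", w1, w2, w3, w4, w5);
    }

    public static void main(String[] args) {
        byte[] raw = new byte[20];
        for (int i = 0; i < raw.length; i++) raw[i] = (byte) i;
        System.out.println(fromRaw(raw, 0).toHex());
    }
}
```

The design trade is the one the letter describes: the fields live inside the object itself, so there is no separate array header and no second memory access, at the price of the packing work done once when the id is read.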

Other high-level programming languages also lack the ability to mark a type as unsigned, or pay similar penalties for storing a 20-byte binary region.

Java's native collection types have been a real trap for us in JGit. We used java.util.* types where they seemed handy and already solved the data structure problem at hand, but as a rule they performed much worse than a specialized data structure written for the purpose.

For example, we have ObjectIdSubclassMap for what should have been Map<ObjectId,Object>. The only catch is that it requires the type you use as the "value" to extend ObjectId, because the same instance serves as both key and value. But it is dramatically faster than HashMap<ObjectId,Object>. (For those who do not know, ObjectId is JGit's equivalent of unsigned char[20] for a SHA-1.)
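The idea of a map whose values extend the key type can be sketched as follows. This is an illustration only, not the real ObjectIdSubclassMap: the classes `ShortId`, `CommitNote`, and `ShortIdMap` are hypothetical, the key is shortened to two ints for brevity, and the table uses simple linear probing with no removal support.

```java
// Hypothetical shortened key: two ints stand in for the five ints of a SHA-1.
class ShortId {
    final int w1, w2;

    ShortId(int w1, int w2) { this.w1 = w1; this.w2 = w2; }

    final boolean sameId(ShortId o) { return w1 == o.w1 && w2 == o.w2; }
}

// A value type that is also its own key because it extends ShortId.
final class CommitNote extends ShortId {
    final String message;

    CommitNote(int w1, int w2, String message) { super(w1, w2); this.message = message; }
}

// Open-addressing table: one array slot per entry, no wrapper entry objects.
final class ShortIdMap<V extends ShortId> {
    private ShortId[] slots = new ShortId[16]; // length is always a power of two
    private int size;

    @SuppressWarnings("unchecked")
    V get(ShortId key) {
        int mask = slots.length - 1;
        for (int i = key.w1 & mask; slots[i] != null; i = (i + 1) & mask) {
            if (slots[i].sameId(key)) return (V) slots[i];
        }
        return null; // reached an empty slot, so the key is absent
    }

    // Adds an entry; assumes no entry with the same id is already present.
    void add(V entry) {
        if ((size + 1) * 2 > slots.length) grow(); // keep the table at most half full
        insert(slots, entry);
        size++;
    }

    int size() { return size; }

    private static void insert(ShortId[] table, ShortId e) {
        int mask = table.length - 1;
        int i = e.w1 & mask;
        while (table[i] != null) i = (i + 1) & mask;
        table[i] = e;
    }

    private void grow() {
        ShortId[] bigger = new ShortId[slots.length * 2];
        for (ShortId e : slots) if (e != null) insert(bigger, e);
        slots = bigger;
    }
}
```

Compared with a HashMap, each entry here is a single object: there is no separate key object, no entry node, and no call through hashCode() or equals(), which is a plausible source of the speed difference the letter reports.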

Just a couple of days ago I wrote LongMap, a faster alternative to HashMap<Long,Object>, for hashing objects by their indexes in a pack file. It is the same story: the cost of boxing in Java, converting a long (the largest integer type) into an object suitable for the standard HashMap, was quite high.
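A map keyed by primitive long values, in the spirit of what the letter describes, could be sketched like this. The class `LongKeyMap` is a hypothetical illustration and not JGit's LongMap; it uses separate chaining and the hash mixing shown is just one reasonable choice.

```java
// Hypothetical hash map with primitive long keys, so no Long objects are created.
final class LongKeyMap<V> {
    private static final class Node<V> {
        final long key;
        V value;
        Node<V> next;

        Node(long key, V value, Node<V> next) {
            this.key = key; this.value = value; this.next = next;
        }
    }

    private Node<V>[] table;
    private int size;

    @SuppressWarnings("unchecked")
    LongKeyMap() {
        table = (Node<V>[]) new Node[16]; // length is always a power of two
    }

    // Folds the 64-bit key into a bucket index without boxing it.
    private static int index(long key, int length) {
        int h = (int) (key ^ (key >>> 32));
        h ^= h >>> 16;
        return h & (length - 1);
    }

    V get(long key) {
        for (Node<V> n = table[index(key, table.length)]; n != null; n = n.next) {
            if (n.key == key) return n.value;
        }
        return null;
    }

    // Stores the value and returns the previous value for the key, or null.
    V put(long key, V value) {
        int i = index(key, table.length);
        for (Node<V> n = table[i]; n != null; n = n.next) {
            if (n.key == key) {
                V old = n.value;
                n.value = value;
                return old;
            }
        }
        table[i] = new Node<>(key, value, table[i]);
        if (++size > table.length * 3 / 4) grow();
        return null;
    }

    int size() { return size; }

    // Doubles the table and relinks every node into its new bucket.
    @SuppressWarnings("unchecked")
    private void grow() {
        Node<V>[] bigger = (Node<V>[]) new Node[table.length * 2];
        for (Node<V> head : table) {
            while (head != null) {
                Node<V> next = head.next;
                int i = index(head.key, bigger.length);
                head.next = bigger[i];
                bigger[i] = head;
                head = next;
            }
        }
        table = bigger;
    }
}
```

With HashMap<Long,Object>, every put and most get calls on a large key allocate or look up a Long wrapper; here the key stays a primitive field inside the node.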

Even now, JGit is still slower when it comes to taking apart a commit or tree object to follow its object links, or when calling inflate(). We spend a lot more time on this work than C Git does, even though we try to stay as close to the machine as we can by using byte[] wherever possible, avoiding copying, and avoiding memory allocation whenever possible.

Notably, JGit takes about twice as long as C Git to run rev-list --objects --all on a project like the Linux kernel, and index-pack on a pack file of about 270 MB also takes about twice as long.

Both parts of JGit are about as good as I know how to make them, but we are really at the mercy of the JIT, and any change in the JIT can make our performance worse (or better). This is unlike C Git, where Linus Torvalds can inspect the generated assembler and fine-tune the code until the compiler produces what he wants.

So yes, it is practical to build Git in a high-level language, but you simply cannot get the same performance or the same tight memory usage as C Git. That is what the abstractions of a high-level language cost you. Still, JGit performs reasonably well: well enough that we use it as a Git server inside Google.

P.S. Shawn Pearce's post was written in 2009, so it does not take into account changes made in Java 1.7. For example, Java now uses escape analysis to avoid allocating memory on the heap where possible.
