Changes to the internal representation of strings in Java

Original author: Mikhail Vorontsov
  • Transfer

Introduction


This is my first article translation. I decided to devote it to Java, oddly enough. This article talks about changes to the string data type that have occurred in Java.

IMPORTANT: In the 17 update, Java 7 still hasn't changed anything about the article.

Split char array of characters []

In the current implementation of the String class, there are 4 instance fields: an char [] value array of characters containing string characters, int offset is the index of the first character used in the value array, int count is the number of characters used, and int hash is the cached value of the calculated hash code for this string . As you can see, in most cases the string will have the values ​​offset = 0 and count = value.length. The only exception to this rule is possible when a string is created by calling the viaString.substring method or any method that uses this method (for example, split).

String.substring creates a string that shares the char [] internal array of characters with the original string, which allows:
  1. Save memory using it together;
  2. Make the execution time of the String.substring method constant (O (1)).

At the same time, with such an implementation, a memory leak is possible: if you extract a small substring from a large string and you no longer need a large string, you will still have a link to a large array of characters of the original string. The only way to avoid this is to call the constructor new String (String), which will force you to create a copy of the array of characters, that is, disconnect your small string from the large "parent".

In Java 1.7.0_06, the offset and count fields were removed from the String class. This means that we can no longer share an array of characters. Now we can forget about the memory leaks described above and never again use the new String (String) constructor. As a flaw, you should always remember that the String.substring method now has linear complexity as opposed to the constant that used to be.

Hash Logic Changes

Another change introduced to the String class in the same update: a new hash algorithm. Oracle says the new algorithm provides a better distribution of hash codes, which should improve the performance of some hash-based collections: HashMap, Hashtable, HashSet, LinkedHashSet, WeakHashMap, and ConcurentHashMap. Unlike the changes described at the beginning of the article, these changes are experimental and are turned off by default.

As you probably noticed, these changes are true only for the case when the collection keys are strings. If you want to enable these changes, you must set the value of the system property jdk.map.althashing.threshold to a non-negative value (by default it is -1). This value is the threshold for the size of the collection, after which the new hashing algorithm will be used. A small correction: the hash algorithm will be changed only if there is not enough memory. If the collection was redistributed with size = 160 and jdk.map.althashing.threshold = 200, then the hashing method will be changed only when the collection size grows to 320.

The String class now has a hash32 () method, the result of which is cached in the int hash32 field. The result of the hash32 () method for identical strings can be different for different JVM starts (in fact, it will be almost always different, because it uses System.currentTimeMillis () and two calls to System.nanoTime for initialization). As a result, the iteration over the same collection may be different for different program launches. In fact, I was a little surprised by this method. Why do we need it if the real implementation of the hashCode method works well enough? I decided to check how many collisions will be obtained using the hash32 () method. The String.hash32 () method is not public, but I looked at the implementation of the HashMap class to determine how it calls it. Answer: sun.misc.Hashing.stringHash32 (String). On the test containing one million lines, the method returned 304 collisions, compared to more than one collision using the String.hashCode method. I think we should wait for further improvements.

New hash logic can seriously affect multi-threaded code

Oracle made a mistake while implementing hashing in the following classes: HashMap, Hashtable, HashSet, LinkedHashSet, WeakHashMap. Only ConcurentHashMap does not contain this error. The problem is that all non-parallel classes now have a field:

/**
 * Случайное значение связанное с этим экземпляром для того чтобы осложнить поиск коллизий.
 */transientfinalint hashSeed = sun.misc.Hashing.randomHashSeed(this);

This means that each time you create the above collections, the sun.misc.Hashing.randomHashSeed method will be called. In turn, the randomHashSeed method calls the java.util.Random.nextInt method. As you know, the Random class is not very friendly with multithreading: it contains the private final AtomicLong seed field. Atomics work well under average load, but poorly under heavy load.

As a result, many highly loaded web applications that process HTTP / JSON / XML can be affected by this error, because almost all known parsers (parsers) use at least one of the affected classes. All these parsers can create nested map collections, which will increase the number of map collections created per second.

How to solve this problem?

1. ConcurrentHashMap path: call the randomHashSeed method only when the system property jdk.map.althashing.threshold is defined. However, this is only available to JDK kernel developers.

/**
 * Случайное значение связанное с этим экземпляром для того чтобы осложнить поиск коллизий.
 */privatetransientfinalint hashSeed = randomHashSeed(this);
privatestaticintrandomHashSeed(ConcurrentHashMap instance){
    if (sun.misc.VM.isBooted() && Holder.ALTERNATIVE_HASHING) {
        return sun.misc.Hashing.randomHashSeed(instance);
    } 
    return0;
}

2. Hacker way: change the class sun.misc.Hashing. Highly discouraged. If you still decide to do this, then you can do this: the java.util.Random class is not final. You can expand it and override its nextInt method by returning, for example, a constant value. Then you need to change the field sun.misc.Hashing.Holder.SEED_MAKER by setting it to your class inherited from Random. Do not worry that this field is private, static and final - reflection will help you.

Summary

  • Starting with Java 1.7.0_06, the String.substring method always creates a new array of char [] characters for each string it creates. This means that now this method has linear complexity compared to constant as in previous versions. As a result of this change, the string takes 4 bytes less than before, and it is guaranteed that memory leaks caused by the String.substring method are eliminated.
  • Starting with the same update, a second method for hashing, called hash32, has appeared in the String class. This method is not public and can be accessed without reflection only by calling the sun.misc.Hashing.stringHash32 (String) method. This method is used in JDK 7 by hash-based collections if their size exceeds the value of the system property jdk.map.althashing.threshold. This is an experimental method, and I do not recommend using it in my code.
  • Starting with Java 1.7.0_06, all standard non-multithreaded collections have an undesirable side effect caused by the new hashing algorithm. This error only manifests itself in multithreaded applications that create many map collections per second.

Push

The fact that the strings in Java shared an array of characters suggests that they were persistent. Persistent structures are known to be called structures that retain all their previous versions upon change. This was done explicitly, as an optimization to reduce the amount of memory used + to make the running time of the substring method constant. However, the benefits of this optimization are not so great, which is why this idea was abandoned in version 7 of Java.

In .NET, strings were initially not persistent. Here is a link to an interesting article by Eric Lippert, in which he explained in detail why this is so.

Thanks for reading.
I hope this article has been helpful.

Also popular now: