Klotos January 22, 2013 at 21:01

Efficient string concatenation in .NET

Translation

For .NET programmers, one of the first tips for improving the performance of their programs is “Use StringBuilder to concatenate strings”. Like “Using exceptions is expensive”, the advice about concatenation is often misunderstood and turns into dogma. Fortunately, it is not as destructive as the myth about exception performance, but it is much more widespread.

It would help to read my previous article about strings in .NET before reading this one. And, for the sake of readability, from here on I will refer to .NET strings simply as strings, rather than “string” or “System.String”.

I included this article in the list of articles about the .NET Framework in general, rather than in the list of C#-specific articles, because I believe that all languages on the .NET platform use the same string concatenation mechanism under the hood.

The problem that they are trying to solve

The problem of concatenating a large number of strings, where the resulting string grows very large very quickly, is quite real, and in that case the advice to use StringBuilder is entirely correct. Here is an example:

using System;
public class Test
{
    static void Main()
    {
        DateTime start = DateTime.Now;
        string x = "";
        for (int i=0; i < 100000; i++)
        {
            x += "!";
        }
        DateTime end = DateTime.Now;
        Console.WriteLine ("Time taken: {0}", end-start);
    }
}

On my relatively fast laptop, this program took about 10 seconds to complete. If you double the number of iterations, the execution time increases to about a minute. On .NET 2.0 beta 2 the results are slightly better, but not by much. The reason for the poor performance is that strings are immutable, so when the "+=" operator is used, the new string is not simply added to the end of the existing one. In fact, the expression x += "!"; is exactly equivalent to the expression x = x + "!";. Here, concatenation creates a completely new string: the required amount of memory is allocated, the contents of the existing value of x are copied into it, and then the contents of the appended string ("!") are copied in. As the resulting string grows, so does the amount of data that gets copied back and forth each time, which is why the time more than doubled when I doubled the number of iterations.

This concatenation algorithm is definitely inefficient. After all, if someone asks you to add an item to a shopping list, you do not copy out the entire list before adding it, right? That is where StringBuilder comes in.
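To make the growth concrete, here is a minimal sketch that only counts how many characters get copied when a string is built one character at a time with +=. It assumes each concatenation copies the whole existing string plus the new character, as described above; the class and method names (CopyCost, CharsCopiedByNaiveConcat) are made up for illustration and it measures nothing about the real runtime.

```csharp
using System;

public static class CopyCost
{
    // Counts the characters copied when a string is built by appending
    // one character per iteration with +=.
    public static long CharsCopiedByNaiveConcat(int iterations)
    {
        long copied = 0;
        int length = 0; // current length of the string being built
        for (int i = 0; i < iterations; i++)
        {
            // x = x + "!" copies the existing 'length' characters plus the new one.
            copied += length + 1;
            length++;
        }
        return copied;
    }

    public static void Main()
    {
        Console.WriteLine(CharsCopiedByNaiveConcat(100000)); // 5000050000
        Console.WriteLine(CharsCopiedByNaiveConcat(200000)); // 20000100000
    }
}
```

Doubling the number of iterations roughly quadruples the number of characters copied, which matches the more-than-doubled running time.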

Using StringBuilder

And here is the equivalent of the above program (equivalent in the sense that the final value of x is identical), which is much, much faster:

using System;
using System.Text;
public class Test
{
    static void Main()
    {
        DateTime start = DateTime.Now;
        StringBuilder builder = new StringBuilder();
        for (int i=0; i < 100000; i++)
        {
            builder.Append("!");
        }
        string x = builder.ToString();
        DateTime end = DateTime.Now;
        Console.WriteLine ("Time taken: {0}", end-start);
    }
}

On my laptop, this code runs so fast that the timing mechanism I use is too coarse to give meaningful results. With the number of iterations raised to one million (10 times more than the count at which the first version took 10 seconds), the execution time rises to 30-40 milliseconds. Moreover, the execution time grows roughly linearly with the number of iterations (double the iterations and the time doubles as well).

This performance jump comes from eliminating unnecessary copying: only the data being appended is copied. StringBuilder maintains an internal buffer, and when a string is appended, its contents are copied into that buffer. When the appended strings no longer fit, the buffer is copied, with all its contents, into a new and larger one. In essence, the internal buffer of a StringBuilder is an ordinary string; strings are immutable only as far as their public interface is concerned, but they are mutable from inside the mscorlib assembly.

The code could be made even faster by passing the final size (length) of the string to the StringBuilder constructor, so that the internal buffer is created at exactly the right size and never has to grow by copying. In this situation the length of the result can be calculated before concatenation begins, but even when it cannot, it matters little: when the buffer fills up and is copied, StringBuilder doubles the size of the new copy, so the buffer is not refilled and copied very often.
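As a sketch of the pre-sizing idea, the helper below passes the expected final length to the StringBuilder(int capacity) constructor. The class and method names (PresizedBuilder, Repeat) are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Text;

public static class PresizedBuilder
{
    // Builds a string consisting of 'piece' repeated 'count' times.
    public static string Repeat(string piece, int count)
    {
        // The final length is known up front, so request it as the
        // initial capacity and avoid growing the buffer along the way.
        StringBuilder builder = new StringBuilder(piece.Length * count);
        for (int i = 0; i < count; i++)
        {
            builder.Append(piece);
        }
        return builder.ToString();
    }

    public static void Main()
    {
        string x = Repeat("!", 1000000);
        Console.WriteLine(x.Length); // 1000000
    }
}
```

Whether the saving is noticeable depends on the sizes involved, so it is worth measuring before relying on it.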

So when concatenating should I always use StringBuilder?

In short, no. Everything above explains why the statement “Use StringBuilder to concatenate strings” can be correct in some situations. However, some people take this statement as dogma without understanding the basics, and as a result they start rewriting code like this:

string name = firstName + " " + lastName;
Person person = new Person (name);

like this:

// Bad code! Do not use!
StringBuilder builder = new StringBuilder();
builder.Append (firstName);
builder.Append (" ");
builder.Append (lastName);
string name = builder.ToString();
Person person = new Person (name);

And all this in the name of performance. Looking at the problem in general, even if the second version were faster than the first, it obviously would not be much faster, because there are only a few concatenations. The second version could only make sense if this piece of code were called very, very many times. Making the code less readable (and I think you will all agree that the second version is much less readable than the first) for the sake of a microscopic performance gain is a very bad idea.

Moreover, the second version, with StringBuilder, is in fact less efficient than the first, although not by much. If the second version were easier to read than the first, then, following the argument in the previous paragraph, I would say: use it. But when the StringBuilder version is both less readable and slower, using it is simply nonsense.

If we assume that firstName and lastName are “real” variables and not constants (more on that below), then the first version will be compiled into a String.Concat call, something like this:

string name = String.Concat (firstName, " ", lastName);
Person person = new Person (name);

The String.Concat method takes a set of strings (or objects) as input and “glues” them into one new string, plain and simple. String.Concat has several overloads: some take a few strings, some take a few variables of type Object (which are converted to strings during concatenation), and some take arrays of strings or arrays of Object. All the overloads do the same thing. Before actually starting to concatenate, String.Concat reads the lengths of all the strings passed to it (at least when you pass strings; if you pass variables of type Object, String.Concat creates a temporary string for each such variable and concatenates that). As a result, at the moment of concatenation String.Concat knows the exact length of the resulting string, so it allocates a buffer of exactly the right size, and there are no unnecessary copy operations.
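The overloads mentioned above can be sketched as follows. The wrapper names (ConcatOverloads, FullName, Describe, GlueAll) are hypothetical; only the String.Concat calls themselves are framework methods.

```csharp
using System;

public static class ConcatOverloads
{
    // String overload: roughly what firstName + " " + lastName compiles to.
    public static string FullName(string firstName, string lastName)
    {
        return String.Concat(firstName, " ", lastName);
    }

    // Object overload: each argument is converted to a string first.
    public static string Describe(string label, object value)
    {
        return String.Concat((object)label, value);
    }

    // Array overload: concatenates every element of the array.
    public static string GlueAll(string[] parts)
    {
        return String.Concat(parts);
    }

    public static void Main()
    {
        Console.WriteLine(FullName("Jon", "Skeet"));             // Jon Skeet
        Console.WriteLine(Describe("Answer: ", 42));             // Answer: 42
        Console.WriteLine(GlueAll(new string[] { "a", "b", "c" })); // abc
    }
}
```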

Compare this algorithm with the second, StringBuilder version. At the time it is created, the StringBuilder does not know the size of the resulting string (we did not “tell” it that size, and if we had, the code would be even less clear), which means that the initial buffer will most likely be too small, and StringBuilder will have to grow it by creating a new one and copying the contents. Moreover, as we recall, StringBuilder doubles the buffer, which means that the buffer will end up much larger than the resulting string requires. We should also not forget the overhead of creating an additional object that the first version does not have (the StringBuilder itself). So in what way is the second version better?

An important difference between the example in this section and the example at the beginning of the article is that here we have all the strings to be concatenated at once, so we can pass them all to String.Concat, which in turn produces the result as efficiently as possible, without any intermediate strings. In the earlier example we do not have access to all the strings at once, so we need temporary storage for intermediate results, which is what StringBuilder is best suited for. In other words, StringBuilder is effective as a container for an intermediate result, because it avoids the internal copying of strings; if all the strings are available up front and there are no intermediate results, StringBuilder brings no benefit.
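One way to observe the extra buffer space is to compare the Length and Capacity properties of a StringBuilder after a few appends. This is only a rough sketch: the helper name SpareCapacity and the sample names are made up, and the exact capacity values depend on the runtime version, so no particular output is claimed.

```csharp
using System;
using System.Text;

public static class BuilderOverhead
{
    // Returns how many characters of buffer space remain unused
    // after appending all the given parts to a default StringBuilder.
    public static int SpareCapacity(params string[] parts)
    {
        StringBuilder builder = new StringBuilder();
        foreach (string part in parts)
        {
            builder.Append(part);
        }
        return builder.Capacity - builder.Length;
    }

    public static void Main()
    {
        // Sample values for illustration only.
        int spare = SpareCapacity("Jon", " ", "Skeet");
        Console.WriteLine("Unused buffer characters: {0}", spare);
    }
}
```

String.Concat, by contrast, allocates exactly the length of the result, so there is no unused space to report.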

Constants

The situation becomes even more striking when it comes to constants (string literals and strings declared as const string). What do you think the expression string x = "hello" + " " + "there"; will be compiled into? It is logical to assume that String.Concat will be called, but that is not the case. In fact, this expression is compiled into string x = "hello there";. The compiler knows that all the components of the string x are compile-time constants, so they are all concatenated when the program is compiled, and the value "hello there" is stored in the compiled code. Rewriting such code to use StringBuilder is inefficient in terms of both memory consumption and CPU usage, not to mention readability.
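The folding of constants can be observed indirectly with a small sketch like the one below. It relies on the fact that identical string literals within one assembly refer to the same string instance, so a folded constant expression and the equivalent literal compare equal by reference. The names ConstantFolding, Greeting and Target are hypothetical.

```csharp
using System;

public static class ConstantFolding
{
    const string Greeting = "hello";
    const string Target = "there";

    // Returns true if the folded constant expression and the literal
    // refer to the same string instance.
    public static bool FoldedIsSameInstance()
    {
        string x = Greeting + " " + Target; // constant expression, folded by the compiler
        string y = "hello there";           // the same value written as a single literal
        return Object.ReferenceEquals(x, y);
    }

    public static void Main()
    {
        Console.WriteLine(FoldedIsSameInstance()); // True
    }
}
```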

Rules of thumb for concatenation

So, when to use StringBuilder, and when "simple" concatenation?

Definitely use StringBuilder when you concatenate strings in a non-trivial loop, and especially when you do not know (at compile time) exactly how many iterations will be performed. For example, reading the contents of a text file one character per loop iteration and appending that character with the += operator will most likely “kill” your application's performance.
Definitely use the + operator if you can specify all the strings to be concatenated in a single statement. If you need to concatenate an array of strings, use an explicit call to String.Concat, and if a separator is needed between the strings, use String.Join.
Do not be afraid to break literals into several parts in your code and join them with +; the result will be the same. If your code contains a long string literal, breaking it into several substrings will improve readability.
If you need the intermediate results of concatenation for something other than being intermediate results (that is, for more than temporary storage that changes on each iteration), then StringBuilder will not help you. For example, if you build a full name by concatenating the first and last name, and then append a third element (for example, a login) to the end of the string, StringBuilder is useful only if you do not need the (first name + last name) string on its own, without the login, somewhere else (as we did in the example, where a Person instance is created from the first and last name).
If you need to concatenate several substrings and you cannot do it in one statement via String.Concat, then the choice between “classic” concatenation and StringBuilder will not make much difference. Here the speed depends on the number of strings involved, on their lengths, and on the order in which they are concatenated. If you believe that concatenation is a performance bottleneck and you definitely want the fastest method, measure the performance of both approaches and only then choose the faster one.
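A few of these rules can be sketched in code. The helper names below (ConcatRules, Glue, GlueWithSeparator, ReadAll) are hypothetical; String.Concat, String.Join, StringBuilder and TextReader.Read are the framework pieces being illustrated.

```csharp
using System;
using System.IO;
using System.Text;

public static class ConcatRules
{
    // An array of strings with no separator: String.Concat.
    public static string Glue(string[] parts)
    {
        return String.Concat(parts);
    }

    // An array of strings with a separator: String.Join.
    public static string GlueWithSeparator(string[] parts, string separator)
    {
        return String.Join(separator, parts);
    }

    // A loop with an unknown number of iterations: StringBuilder.
    public static string ReadAll(TextReader reader)
    {
        StringBuilder builder = new StringBuilder();
        int c;
        // Read() returns -1 when there are no more characters.
        while ((c = reader.Read()) != -1)
        {
            builder.Append((char)c);
        }
        return builder.ToString();
    }

    public static void Main()
    {
        string[] parts = new string[] { "a", "b", "c" };
        Console.WriteLine(Glue(parts));                         // abc
        Console.WriteLine(GlueWithSeparator(parts, ", "));      // a, b, c
        Console.WriteLine(ReadAll(new StringReader("hello")));  // hello
    }
}
```

Reading character by character is used here only to mirror the example in the first rule; it is not meant as a recommended way to read a file.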
