timyrik20 April 27, 2014 at 22:17

Is the string operator + simple?

Introduction

The string data type is one of the fundamental types, along with the numeric (int, long, double) and logical (bool) types. It is hard to imagine a useful program that does not use it.

On the .NET platform, the string type is represented by the immutable String class. It is also tightly integrated into the CLR and has dedicated support in the C# compiler.

In this article, I would like to talk about concatenation, an operation performed on strings as often as addition is performed on numbers. It might seem there is nothing to discuss, since we all know the string + operator, but it turns out to have subtleties of its own.

Language Specification for String Operator +

The C# language specification defines three overloads of operator + for strings:

string operator + (string x, string y)
string operator + (string x, object y)
string operator + (object x, string y)

If one of the operands of a string concatenation is null, an empty string is substituted. Otherwise, any argument that is not a string is converted to its string representation by calling the virtual ToString method. If ToString returns null, an empty string is substituted. Note that, according to the specification, this operation never returns null.
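The rules above can be checked with a small sketch. The ReturnsNull type and the NullRules class are hypothetical names introduced only for this illustration; the sketch checks the observable results, not how the compiler arrives at them.

```csharp
using System;

// Hypothetical type used only for this illustration: its ToString returns null.
class ReturnsNull
{
    public override string ToString() { return null; }
}

static class NullRules
{
    // Both operands are null: the result is an empty string, not null.
    public static string BothNull()
    {
        string a = null, b = null;
        return a + b;
    }

    // The object operand's ToString returns null: an empty string is substituted.
    public static string NullToString()
    {
        object o = new ReturnsNull();
        return "x" + o;
    }

    // A non-string operand is converted through ToString.
    public static string WithNumber() { return "n=" + 42; }

    static void Main()
    {
        Console.WriteLine(BothNull() == "");   // True
        Console.WriteLine(NullToString());      // x
        Console.WriteLine(WithNumber());        // n=42
    }
}
```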

The description of the operator looks clear enough. However, if we look at the implementation of the String class, we find explicit definitions of only two operators: == and !=. A reasonable question arises: what happens behind the scenes of string concatenation? How does the compiler handle the string + operator?

The answer turns out to be fairly simple: take a closer look at the static String.Concat method. String.Concat concatenates one or more instances of String, or the string representations of the values of one or more instances of Object. The following overloads of this method are available:

public static String Concat(String str0, String str1)
public static String Concat(String str0, String str1, String str2)
public static String Concat(String str0, String str1, String str2, String str3)
public static String Concat(params String[] values)
public static String Concat(IEnumerable<String> values)
public static String Concat(Object arg0)
public static String Concat(Object arg0, Object arg1)
public static String Concat(Object arg0, Object arg1, Object arg2)
public static String Concat(Object arg0, Object arg1, Object arg2, Object arg3, __arglist)
public static String Concat<T>(IEnumerable<T> values)

More details

Suppose we have the following expression s = a + b, where a and b are strings. The compiler converts it to a call to the static Concat method, i.e.

s = string.Concat(a, b)

The string concatenation operation, like any other addition operation in C#, is left-associative.

With two strings everything is clear, but what if there are more? Given the left-associativity of the operation, the expression s = a + b + c could be replaced by

s = string.Concat(string.Concat(a, b), c)

however, given the presence of an overload that takes three arguments, it will be converted to

s = string.Concat(a, b, c)

The situation is similar for the concatenation of four strings. To concatenate five or more strings, there is the string.Concat(params string[]) overload, so you need to take into account the overhead of allocating memory for the array.
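As a rough sketch of the five-string case, the two methods below should produce the same result; the second spells out the array allocation that the params overload implies. The ManyStrings class and its method names are hypothetical, and the exact code the compiler emits depends on the compiler and runtime version.

```csharp
using System;

static class ManyStrings
{
    // Five operands joined with the + operator.
    public static string FiveWithOperator(string a, string b, string c, string d, string e)
    {
        return a + b + c + d + e;
    }

    // The same concatenation with the array made explicit:
    // this is the allocation the params string[] overload requires.
    public static string FiveWithArray(string a, string b, string c, string d, string e)
    {
        return string.Concat(new string[] { a, b, c, d, e });
    }

    static void Main()
    {
        Console.WriteLine(FiveWithOperator("a", "b", "c", "d", "e")); // abcde
        Console.WriteLine(FiveWithArray("a", "b", "c", "d", "e"));    // abcde
    }
}
```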

It should also be said that string concatenation is fully associative: the order in which we concatenate the strings does not matter. Therefore the expression s = a + (b + c), despite the explicit grouping, is processed as

s = (a + b) + c = string.Concat(a, b, c)

instead of the expected

s = string.Concat(a, string.Concat(b, c))

To summarize: string concatenation is always evaluated from left to right and is compiled into a call to the static String.Concat method.
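A minimal sketch of the summary above: whichever way the operands are grouped, the result matches a direct call to string.Concat. The ConcatDemo class is a hypothetical name. The sketch only compares results; confirming the actual call the compiler emits requires inspecting the IL with a disassembler.

```csharp
using System;

static class ConcatDemo
{
    // Default left-to-right grouping.
    public static string LeftGrouped(string a, string b, string c) { return a + b + c; }

    // Explicit right grouping with parentheses.
    public static string RightGrouped(string a, string b, string c) { return a + (b + c); }

    // Direct call to the three-argument overload.
    public static string Direct(string a, string b, string c) { return string.Concat(a, b, c); }

    static void Main()
    {
        Console.WriteLine(LeftGrouped("x", "y", "z"));  // xyz
        Console.WriteLine(RightGrouped("x", "y", "z")); // xyz
        Console.WriteLine(Direct("x", "y", "z"));       // xyz
    }
}
```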

Compiler optimizations for literal strings

The C# compiler has optimizations for literal strings. For example, the expression s = "a" + "b" + c, which given the left-associativity of the + operator is equivalent to s = ("a" + "b") + c, is converted to

s = string.Concat("ab", c)

The expression s = c + "a" + "b", despite the left-associativity of the concatenation operation (s = (c + "a") + "b"), is converted to

s = string.Concat(c, "ab")

In general, no matter where the literals are located, the compiler concatenates everything it can, and only then tries to select the appropriate Concat method overload. The expression s = a + "b" + "c" + d is converted to

s = string.Concat(a, "bc", d)

Optimizations related to empty and null strings also deserve a mention. The compiler knows that adding an empty string does not affect the result of concatenation, so the expression s = a + "" + b is converted to

s = string.Concat(a, b),

instead of the expected

s = string.Concat(a, "", b)

Similarly, for a const string whose value is null, we have:

const string nullStr = null;
s = a + nullStr + b;

converted to

s = string.Concat(a, b)

The expression s = a + nullStr is converted to s = a ?? "" if a is a string, and to a call to string.Concat(a) if a is not a string. For example, s = 17 + nullStr is converted to s = string.Concat(17).

There is an interesting effect related to the optimization of literals and the left-associativity of the string + operator.
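The observable side of these optimizations can be sketched as follows. The LiteralFolding class and its members are hypothetical names. The const declaration compiles only because a concatenation of literals is a constant expression, which is one visible sign that the compiler evaluates it itself.

```csharp
using System;

static class LiteralFolding
{
    const string NullStr = null;

    // Allowed only because "b" + "c" is evaluated at compile time.
    const string Folded = "b" + "c";

    // Adjacent literals in the middle of an expression.
    public static string Middle(string a, string d) { return a + "b" + "c" + d; }

    // An empty literal does not change the result.
    public static string WithEmpty(string a, string b) { return a + "" + b; }

    // A null constant behaves like an empty string, even when a itself is null.
    public static string WithNullConst(string a) { return a + NullStr; }

    // A non-string operand next to a null constant is still converted to a string.
    public static string IntWithNullConst() { return 17 + NullStr; }

    static void Main()
    {
        Console.WriteLine(Folded);                    // bc
        Console.WriteLine(Middle("a", "d"));          // abcd
        Console.WriteLine(WithEmpty("a", "b"));       // ab
        Console.WriteLine(WithNullConst(null) == ""); // True
        Console.WriteLine(IntWithNullConst());        // 17
    }
}
```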

Consider the expression:

var s1 = 17 + 17 + "abc";

Given left-associativity, it is equivalent to

var s1 = (17 + 17) + "abc"; // calls string.Concat(34, "abc")

As a result, the numbers are added at compile time, and the result is 34abc.

On the other hand, the expression

var s2 = "abc" + 17 + 17;

is equivalent to

var s2 = ("abc" + 17) + 17; // calls string.Concat("abc", 17, 17)

which results in abc1717.

So what looks like the same concatenation operation produces different results.

String.Concat VS StringBuilder.Append

A few words should be said about this comparison. Consider the following code:

string name = "Timur";
string surname = "Guev";
string patronymic = "Ahsarbecovich";
string fio = surname + name + patronymic;

It can be replaced with code using StringBuilder:

var sb = new StringBuilder();
sb.Append(surname);
sb.Append(name);
sb.Append(patronymic);
string fio = sb.ToString();

But in this situation we gain hardly anything from StringBuilder. Besides making the code less readable, it also makes it less efficient: the Concat method computes the length of the resulting string and allocates memory only once, whereas StringBuilder knows nothing about the length of the resulting string in advance.

Implementation of the Concat method for three strings:
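If StringBuilder is used anyway, it can be given the final length up front through its capacity constructor. The sketch below, with the hypothetical FullName class, assumes non-null arguments. Even with the hint, ToString still has to create the final string from the builder's buffer, so this variant does not beat a single Concat call here.

```csharp
using System;
using System.Text;

static class FullName
{
    // Operator +: compiled into a single String.Concat call.
    public static string WithConcat(string surname, string name, string patronymic)
    {
        return surname + name + patronymic;
    }

    // StringBuilder with a capacity hint, so its buffer does not need to grow.
    // Assumes all three arguments are non-null.
    public static string WithBuilder(string surname, string name, string patronymic)
    {
        int capacity = surname.Length + name.Length + patronymic.Length;
        var sb = new StringBuilder(capacity);
        sb.Append(surname).Append(name).Append(patronymic);
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(WithConcat("Guev", "Timur", "Ahsarbecovich"));  // GuevTimurAhsarbecovich
        Console.WriteLine(WithBuilder("Guev", "Timur", "Ahsarbecovich")); // GuevTimurAhsarbecovich
    }
}
```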

public static string Concat(string str0, string str1, string str2)
 {
   if (str0 == null && str1 == null && str2 == null)
     return string.Empty;
   if (str0 == null)
     str0 = string.Empty;
   if (str1 == null)
     str1 = string.Empty;
   if (str2 == null)
     str2 = string.Empty;
   string dest = string.FastAllocateString(str0.Length + str1.Length + str2.Length); // allocate memory for the result string
   string.FillStringChecked(dest, 0, str0);
   string.FillStringChecked(dest, str0.Length, str1);
   string.FillStringChecked(dest, str0.Length + str1.Length, str2);
   return dest;
 }
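FastAllocateString and FillStringChecked are internal to the String class, so the listing above cannot be compiled outside the framework. The same strategy (replace nulls, compute the total length, allocate once, copy each part to its offset) can be sketched with public API only. ConcatSketch and Concat3 are hypothetical names, and this version pays for one extra copy when the char array is turned into a string.

```csharp
using System;

static class ConcatSketch
{
    public static string Concat3(string str0, string str1, string str2)
    {
        // Null operands are treated as empty strings.
        str0 = str0 ?? string.Empty;
        str1 = str1 ?? string.Empty;
        str2 = str2 ?? string.Empty;

        // A single buffer sized to the final length.
        var buffer = new char[str0.Length + str1.Length + str2.Length];

        // Copy each part to its offset in the buffer.
        str0.CopyTo(0, buffer, 0, str0.Length);
        str1.CopyTo(0, buffer, str0.Length, str1.Length);
        str2.CopyTo(0, buffer, str0.Length + str1.Length, str2.Length);

        return new string(buffer);
    }

    static void Main()
    {
        Console.WriteLine(Concat3("ab", null, "cd"));       // abcd
        Console.WriteLine(Concat3(null, null, null) == ""); // True
    }
}
```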

The + operator in Java

A few words about the string + operator in Java. Although I do not program in Java, it is interesting to know how things work there. The Java compiler optimizes the + operator so that it uses the StringBuilder class and calls to its append method.

The previous code is converted to

String fio = new StringBuilder(String.valueOf(surname)).append(name).append(patronymic).toString();

It is also worth mentioning that C# deliberately rejected such an optimization; Eric Lippert has a post on this subject. The point is that such an optimization is not an optimization as such but a rewrite of the code. In addition, the creators of the C# language believe that developers should know the specifics of working with the String class and will switch to StringBuilder when necessary.

By the way, it was Eric Lippert who worked on the C# compiler optimizations related to string concatenation.

Conclusion

At first glance it may seem strange that the String class does not define the + operator, until we think about the compiler optimizations made possible by seeing a larger piece of code. For example, if the + operator were defined in the String class, the expression s = a + b + c + d would create two intermediate strings, whereas a single call to string.Concat(a, b, c, d) performs the concatenation more efficiently.
