String.Intern makes strings even more interesting
- Translation
Preface from the translator:
When you sit through or conduct interviews, you run into questions that probe a general understanding of how .NET works. In my experience, questions about the garbage collector are the most popular of these, but once I was asked about string interning, and, honestly, it stumped me. A search in Runet returned several articles, but none of them answered the questions I had. I hope my translation of this article by Andrew Stellman (author of "Head First C#") fills the gap. I think the material will be useful to beginner .NET developers and to anyone curious about what string interning in .NET is.
String.Intern makes strings even more interesting
One of the first things every novice C# developer comes across is working with strings. I show the basics of working with strings at the beginning of "Head First C#," as almost every other C# book does. So it should not be surprising that junior and mid-level C# developers feel they have a pretty good handle on strings. But strings are more interesting than they seem. One of the most interesting aspects of strings in C# and .NET is the String.Intern method. Understanding how this method works can improve your C# development skills. In this post, I will give a short tutorial on String.Intern to show you how it works.
Note: At the end of this post, I am going to show something "under the hood" using ILDasm. If you have never worked with ILDasm before, this is a good opportunity to get to know a very useful .NET tool.
Some basics of working with strings
Let's start with a brief overview of what we expect from the System.String class. (I will not go into details. If someone wants a post about the basics of strings in .NET, add a comment or contact me at Building Better Software, and I will be happy to discuss a possible article!)
Create a new console application in Visual Studio. (Everything works the same way from the command line if you want to compile the code with csc.exe, but to keep things easy to follow, let's stick with Visual Studio.) Here is the code for the Main() method, the entry point of the console application:
Program.cs:
using System;

class Program
{
    static void Main(string[] args)
    {
        string a = "hello world";
        string b = a;
        a = "hello";
        Console.WriteLine("{0}, {1}", a, b);
        Console.WriteLine(a == b);
        Console.WriteLine(object.ReferenceEquals(a, b));
    }
}
There should be no surprises in this code. The program prints three lines to the console (remember, if you are working in Visual Studio, use Ctrl-F5 to run the program outside the debugger; a "Press any key..." prompt is also added so that the console window does not close):
hello, hello world
False
False
The first WriteLine() prints the two strings. The second compares them using the equality operator ==, which returns False because the strings do not match. The last one checks whether both variables refer to the same String object. Since they do not, it prints False.
Then add these two lines to the end of the Main() method:
Console.WriteLine((a + " world") == b);
Console.WriteLine(object.ReferenceEquals((a + " world"), b));
And again, you get a pretty obvious answer. The equality operator returns True, since both strings are equal. But when you concatenate the strings "hello" and " world", the + operator returns a new instance of System.String. That is why object.ReferenceEquals() quite reasonably returns False. The ReferenceEquals() method returns True only if both arguments refer to the same object.
This is how objects normally work. Two different objects can have the same value. This behavior is practical and predictable. If you create two "house" objects and set all their properties to the same values, you will have two identical "house" objects, but they will still be different objects.
Does this still seem a bit confusing? If so, I definitely recommend looking at the first few chapters of "Head First C#", which will give you an idea of writing programs, debugging, and using objects and classes. You can download them as free excerpts from the book.
So, as long as we work with string values, everything is fine. But as soon as we start playing with string references, things get a little weird.
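As a rough illustration of the point about two identical "house" objects, here is a minimal sketch. The House class, its properties, and the HouseDemo helper are hypothetical names made up for this example; they do not come from the original article.

```csharp
using System;

// Hypothetical class, used only for this illustration.
class House
{
    public string Address { get; set; } = "";
    public int Rooms { get; set; }
}

static class HouseDemo
{
    // True only when both variables point to the very same object.
    public static bool SameObject(House x, House y)
    {
        return object.ReferenceEquals(x, y);
    }

    static void Main()
    {
        var first = new House { Address = "1 Main St", Rooms = 4 };
        var second = new House { Address = "1 Main St", Rooms = 4 };
        House alias = first; // a second variable for the same object

        Console.WriteLine(SameObject(first, second)); // False: same values, two objects
        Console.WriteLine(SameObject(first, alias));  // True: one object, two variables
    }
}
```

The two houses have identical property values, yet only the alias shares a reference with the first object.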
Something is wrong with this reference...
Create a new console application. Here is the code for it. But before you compile and run it, look at the code carefully and try to guess what it will print to the console.
Program.cs:
using System;

class Program
{
    static void Main(string[] args)
    {
        string hello = "hello";
        string helloWorld = "hello world";
        string helloWorld2 = hello + " world";
        Console.WriteLine("{0}, {1}: {2}, {3}", helloWorld, helloWorld2,
            helloWorld == helloWorld2,
            object.ReferenceEquals(helloWorld, helloWorld2));
    }
}
Now run the program. Here is what it displays in the console:
hello world, hello world: True, False
This is exactly what we expected. The helloWorld and helloWorld2 strings both contain "hello world", so they are equal, but the references are different.
Now add this code to the bottom of your program:
helloWorld2 = "hello world";
Console.WriteLine("{0}, {1}: {2}, {3}", helloWorld, helloWorld2,
    helloWorld == helloWorld2,
    object.ReferenceEquals(helloWorld, helloWorld2));
Run it. This time the code will display the following line in the console:
hello world, hello world: True, True
Wait, so now helloWorld and helloWorld2 refer to the same string? Some may find this behavior strange or, at least, a little unexpected. We did not change the value of helloWorld2 at all. Many people end up thinking something like: "The variable was already equal to 'hello world'. Setting it to 'hello world' one more time shouldn't change anything." So what's the deal? Let's figure it out.
What is String.Intern? (diving into the intern pool...)
When you use strings in C#, the CLR does something clever called string interning. It is a way to store only one copy of any given string. If you store the same value in a hundred or, even worse, a million string variables, memory for that value would otherwise be allocated again and again. String interning is a way around this problem. The CLR maintains a table called the intern pool. This table contains a single, unique reference to every string that is either declared as a literal or interned programmatically while your program runs. And the .NET Framework provides two useful methods for interacting with the intern pool: String.Intern() and String.IsInterned().
The String.Intern() method works in a very simple way. You pass it a string as an argument. If that string is already in the intern pool, the method returns a reference to the pooled string. If it is not there yet, the method adds the string to the pool and returns a reference to it. Here is an example:
Console.WriteLine(object.ReferenceEquals(
    String.Intern(helloWorld),
    String.Intern(helloWorld2)));
This code prints True even if helloWorld and helloWorld2 refer to two different string objects, because both contain the string "hello world".
Stop for a moment. It is worth digging a little deeper into String.Intern(), because the method sometimes gives results that look slightly illogical at first glance. Here is an example of such behavior:
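The same effect can be sketched as a self-contained program. The strings here are built from character arrays at run time, so the two variables are guaranteed to start out as separate objects; the class and method names (InternDemo, Build) are invented for this illustration.

```csharp
using System;

static class InternDemo
{
    // Builds a brand-new string object on every call.
    public static string Build()
    {
        return new string(new[] { 'p', 'o', 'o', 'l' });
    }

    static void Main()
    {
        string a = Build();
        string b = Build();

        Console.WriteLine(a == b);                       // True: same characters
        Console.WriteLine(object.ReferenceEquals(a, b)); // False: two separate objects

        // Both calls return the single pooled reference for this value.
        Console.WriteLine(object.ReferenceEquals(
            String.Intern(a), String.Intern(b)));        // True
    }
}
```

Whichever object ends up in the pool, both String.Intern() calls hand back that one reference, which is why the last comparison succeeds.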
string a = new string(new char[] {'a', 'b', 'c'});
object o = String.Copy(a);
Console.WriteLine(object.ReferenceEquals(o, a));
String.Intern(o.ToString());
Console.WriteLine(object.ReferenceEquals(o, String.Intern(a)));
Running the code prints two lines to the console. The first WriteLine() shows False, which is understandable, since String.Copy() creates a new copy of the string and returns a reference to the new object. But why, after String.Intern(o.ToString()) has been called, does String.Intern(a) return a reference to o? Stop for a moment and think about it. It becomes even more counterintuitive if you add three more lines:
object o2 = String.Copy(a);
String.Intern(o2.ToString());
Console.WriteLine(object.ReferenceEquals(o2, String.Intern(a)));
It looks as if these lines do the same thing, only with a new object variable, o2. But the last WriteLine() prints False. So what is going on?
This little puzzle helps us figure out what happens under the hood of String.Intern() and the intern pool. The first thing to understand is that the ToString() method of a string object always returns a reference to the string itself. The variable o points to a string object containing "abc", so calling its ToString() method returns a reference to that same string. So here is what happens.
At the start, a points to string object #1, which contains "abc". The variable o points to a different object, string #2, which also contains "abc". The call to String.Intern(o.ToString()) adds a reference to string #2 to the intern pool. From then on, since string #2 is in the intern pool, any call to String.Intern() with the value "abc" returns a reference to string #2.
Therefore, when you pass o and String.Intern(a) to ReferenceEquals(), it returns True, because String.Intern(a) returned a reference to string #2. Next we created a new variable, o2, and used String.Copy() to create another String object. This is string #3, which also contains "abc". This time, calling String.Intern(o2.ToString()) does not add anything to the intern pool, because "abc" is already there; it simply returns a reference to string #2.
So this call to Intern() actually returns a reference to string #2, but we discard it instead of assigning it to a variable. We could have written something like string q = String.Intern(o2.ToString()), which would make q a reference to string #2. That is why the last WriteLine() prints False: it compares a reference to string #3 with a reference to string #2.
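The bookkeeping of strings #1, #2 and #3 can be replayed in a small sketch. To be sure the value is not already in the pool, this version uses a freshly generated GUID instead of "abc", and copies strings with new string(char[]) rather than String.Copy(), which newer .NET versions mark as obsolete. The class and variable names are hypothetical, and the sketch assumes, as the article does, that String.Intern() pools the very object it is given when the value is new.

```csharp
using System;

static class InternOrderDemo
{
    // Returns the three comparisons discussed in the text.
    public static bool[] Run()
    {
        string s1 = Guid.NewGuid().ToString();     // string #1, a value the pool has never seen
        string s2 = new string(s1.ToCharArray());  // string #2, same characters, new object
        String.Intern(s2);                         // the pool now refers to #2

        string s3 = new string(s1.ToCharArray());  // string #3
        string q = String.Intern(s3);              // value already pooled, so #2 comes back

        return new[]
        {
            object.ReferenceEquals(s2, String.Intern(s1)), // #2 versus the pooled reference
            object.ReferenceEquals(q, s2),                 // the captured return value is #2
            object.ReferenceEquals(q, s3)                  // ...and not #3
        };
    }

    static void Main()
    {
        foreach (bool result in Run())
        {
            Console.WriteLine(result);
        }
    }
}
```

Under that assumption the program prints True, True, False: capturing the return value of String.Intern() is what gives you the pooled reference, while the argument you passed in stays whatever object it was.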
Use String.IsInterned() to check whether a string is in the intern pool
There is another, somewhat paradoxically named method that is useful when working with interned strings: String.IsInterned(). It takes a reference to a string object. If a string with that value is in the intern pool, it returns a reference to the interned string; if not, it returns null.
The name sounds a little illogical because it starts with "Is", yet the method does not return a Boolean, as many programmers would expect.
When working with IsInterned(), it is convenient to use the null-coalescing operator ?? to show that a string is not in the intern pool. For example:
string o = String.IsInterned(str) ?? "not interned";
Now the variable o receives the result of IsInterned() if it is not null, or the string "not interned" if the string is not in the intern pool.
Without this, Console.WriteLine() would print an empty line (which is what it does when it is passed null).
Here is a simple example of how String.IsInterned() works:
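If a plain yes-or-no answer is more convenient, one option is a small wrapper that turns the null check into a Boolean. The helper name IsInPool is made up for this sketch and is not part of the .NET API.

```csharp
using System;

static class PoolCheck
{
    // Hypothetical helper: true when a string with this value is already interned.
    public static bool IsInPool(string value)
    {
        return String.IsInterned(value) != null;
    }

    static void Main()
    {
        // A fresh GUID is a value that nothing has interned yet.
        string fresh = Guid.NewGuid().ToString();

        Console.WriteLine(IsInPool(fresh)); // False
        String.Intern(fresh);
        Console.WriteLine(IsInPool(fresh)); // True
    }
}
```

This keeps the surprising return type of String.IsInterned() in one place while the rest of the code reads like an ordinary "Is..." check.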
string s = new string(new char[] {'x', 'y', 'z'});
Console.WriteLine(String.IsInterned(s) ?? "not interned");
String.Intern(s);
Console.WriteLine(String.IsInterned(s) ?? "not interned");
Console.WriteLine(object.ReferenceEquals(
    String.IsInterned(new string(new char[] { 'x', 'y', 'z' })), s));
The first WriteLine() statement prints "not interned" because "xyz" is not yet in the intern pool. The second WriteLine() prints "xyz" because by then the intern pool contains "xyz". And the third WriteLine() prints True, since s points to the very object that was added to the intern pool.
Literals are interned automatically
Add just one line to the end of the method and run the program again:
Console.WriteLine(object.ReferenceEquals("xyz", s));
Something completely unexpected happens!
The program never prints "not interned", and the last two WriteLine() calls print False! If we comment out the last line, the program behaves exactly as you expected before. Why?! How can adding code at the end of the program change the behavior of the code above it? This is very, very strange!
It does seem strange the first time you come across it, but it actually makes sense. The behavior of the whole program changes because the code now contains the literal "xyz". When you add a literal to your program, the CLR automatically adds it to the intern pool, even before your code starts running. By commenting out that line, you remove the literal from the program, and the intern pool no longer contains the string "xyz".
Once you understand that "xyz" is already in the intern pool when the program starts, because the string appears as a literal in the code, the change in behavior becomes clear. String.IsInterned(s) no longer returns null. Instead, it returns a reference to the literal "xyz", which also explains why ReferenceEquals() returns False: the string s is never added to the intern pool, because "xyz" is already in the pool and points to a different object.
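A compact way to observe this is to compare a literal with a run-time copy of itself. This is only a sketch: the literal value and the names LiteralDemo and Run are invented for the illustration, and it assumes a JIT-compiled program on the standard CLR, where string literals end up in the same pool that String.IsInterned() consults.

```csharp
using System;

static class LiteralDemo
{
    public static bool[] Run()
    {
        string literal = "intern-demo-literal";            // compiled into the assembly as a literal
        string built = new string(literal.ToCharArray());  // same characters, new object

        return new[]
        {
            // The copy is a separate object...
            object.ReferenceEquals(built, literal),
            // ...but looking its value up in the pool finds the literal's object,
            // even though String.Intern() was never called.
            object.ReferenceEquals(String.IsInterned(built), literal)
        };
    }

    static void Main()
    {
        foreach (bool result in Run())
        {
            Console.WriteLine(result);
        }
    }
}
```

Under those assumptions I would expect False followed by True, which mirrors what happens to "xyz" in the program above.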
The compiler is smarter than you think!
Change the last line of code to this:
Console.WriteLine(
    object.ReferenceEquals("x" + "y" + "z", s));
Run the program. It behaves exactly as if you had used the literal "xyz"! But isn't + an operator? Isn't it a method that the CLR executes at run time? If so, the strings should be concatenated while the program runs, and the literal "xyz" should never be interned.
In fact, that is exactly what happens if you replace "x" + "y" + "z" with String.Format("{0}{1}{2}", 'x', 'y', 'z'). Both expressions produce "xyz". Why, then, does concatenation with the + operator behave as if you had used the literal "xyz", while String.Format() is executed at run time?
The easiest way to answer this question is to look at what we actually get when the code "x" + "y" + "z" is compiled.
Program.cs:
using System;

class Program
{
    public static void Main()
    {
        Console.WriteLine("x" + "y" + "z");
    }
}
The next step is to find out what the compiler actually built into the executable. For this, we will use ILDasm.exe, the MSIL disassembler. This tool is installed with every version of Visual Studio (including the Express editions). Even if you do not know how to read IL, you will be able to understand what is happening.
Run Ildasm.exe. If you are using a 64-bit version of Windows, run the following command: "%ProgramFiles(x86)%\Microsoft SDKs\Windows\v7.0A\Bin\Ildasm.exe" (including the quotation marks), either from the Start >> Run window or from the command line. If you are using a 32-bit version of Windows, run this command instead: "%ProgramFiles%\Microsoft SDKs\Windows\v7.0A\Bin\ildasm.exe".
If you have the .NET Framework 3.5 or earlier
If you have the .NET Framework 3.5 or earlier, you may need to look for ildasm.exe in neighboring folders. Open an Explorer window and go to the Program Files folder. As a rule, the program is located in the "Microsoft SDKs\Windows\vX.X\bin" folder. Alternatively, you can open the "Visual Studio Command Prompt" from the Start menu and type "ILDASM" to launch it.
This is what ILDasm looks like on first run:
Then compile your code into an executable file. Click the project in Solution Explorer; the Properties window should show a Project Folder field. Double-click it and copy the path. Switch to the ILDasm window, select File >> Open from the menu, and paste the path to the folder. Then go to the bin folder. Your executable file will be in either the bin\Debug or the bin\Release folder. Open the executable file. ILDasm should show you the contents of the assembly.
(If you need to brush up on how assemblies are created, see this post on understanding C# and .NET assemblies and namespaces.)
Expand the Program class and double-click the Main() method. The disassembled code of the method should then appear:
You do not need to know IL to spot the literal "xyz" in the code. If you close ILDasm, change the code to use "xyz" instead of "x" + "y" + "z", and disassemble it again, the IL looks exactly the same! This is because the compiler is smart enough to replace "x" + "y" + "z" with "xyz" at compile time, so no extra operations are spent on method calls that would always return "xyz". And once the literal is compiled into the program, the CLR adds it to the intern pool when the program starts.
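The difference between compile-time and run-time concatenation can also be checked without a disassembler. In this sketch (the names FoldingDemo, ConstantConcat, and RuntimeConcat are made up), the first method concatenates only constants, so the compiler can fold them into a single "xyz" literal; the second receives one part as a parameter, so the concatenation has to happen at run time and produces a new, non-interned object.

```csharp
using System;

static class FoldingDemo
{
    // All operands are constants: the compiler emits the single literal "xyz".
    public static bool ConstantConcat()
    {
        return object.ReferenceEquals("x" + "y" + "z", "xyz");
    }

    // The first operand is only known at run time, so a new string is built.
    public static bool RuntimeConcat(string first)
    {
        return object.ReferenceEquals(first + "y" + "z", "xyz");
    }

    static void Main()
    {
        Console.WriteLine(ConstantConcat());   // True
        Console.WriteLine(RuntimeConcat("x")); // False
    }
}
```

Both expressions have the value "xyz", but only the folded one refers to the literal's own object.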
The material in this article should give you a good idea of how string interning works in C# and .NET. In fact, it is more than you strictly need in order to understand string interning. If you are interested in learning more, the Performance Considerations section of the MSDN page for String.Intern is a good starting point.
PS: Thanks to the team for the hard proofreading and objective criticism of the translation.