When is an identifier not an identifier? (Attack of the Mongolian vowel separator)

Original author: Jon Skeet

Translator's notes
In this translation I have allowed myself a few Anglicisms, such as "valid", "native", and "binary". I hope they raise no questions.

An identifier is a term from the C# specification denoting anything that can be referred to by name, such as the name of a class or the name of a variable.

Roslyn is a C# compiler written in C#, created to replace the existing csc.exe. In this text I usually omit the word "compiler".

To start, a few things you might not have heard about:
  • Identifiers in C# may include Unicode escape sequences (such as \u1234).
  • Identifiers in C# may include Unicode characters from the Cf (Other, Format) category, but when identifiers are compared for equality, these characters are ignored.
  • The character "Mongolian vowel separator" (U+180E) belongs, depending on the Unicode version, either to the Cf (Other, Format) category or to the Zs (Separator, Space) category.
  • .NET keeps its own list of Unicode categories, independent of the ones in Win32.
  • Roslyn is a .NET application and therefore uses the Unicode categories shipped with .NET. The native compiler (csc.exe) uses either the system (Win32) categories or its own copy of the Unicode tables.
  • Neither of these Unicode character tables (.NET or Win32) exactly follows any single version of the Unicode standard.
  • Compilers can have bugs.

From all this, several problems arise...

It's all Vladimir's fault


It all started with a discussion at a meeting of the ECMA technical group last week. We were looking at normative references, and in particular at which version of the Unicode standard we would reference. At the moment, the ECMA-335 specification (4th edition) references Unicode 4.0, while Microsoft's C# 5 specification references Unicode 3.0. I don't know for sure whether compiler writers pay attention to such details. In my opinion, it would be better if neither ECMA nor Microsoft pinned a specific Unicode version in their specifications, and compiler writers simply used the latest version of Unicode available at the time. On the other hand, compilers would then have to ship with their own copy of the Unicode tables, which strikes me as a bit odd.

During our discussion, Vladimir Reshetnikov casually mentioned the Mongolian vowel separator (U+180E), which has had a hard life. This character was added in Unicode 3.0.0 in the Cf (Other, Format) category. Then, in Unicode 4.0.0, it was moved to the Zs (Separator, Space) category, and in Unicode 6.3.0 it was returned to the Cf category.

Naturally, I tried to take advantage of this. My initial goal was to show you code that behaves differently depending on the version of the Unicode tables the compiler uses. It turned out that things are actually a little more complicated. But first, let's assume we have a "hypothetical compiler" that contains no bugs and uses whichever version of Unicode we want (which technically violates the current C# specification, but we will leave that subtlety aside).

Hypothetical example 1: valid or invalid


For simplicity, let's forget about the various UTF encodings for a while and use plain ASCII:

using System;

class MvsTest
{
    static void Main()
    {
        string stringx = "a";
        string \u180ex = "b";
        Console.WriteLine(stringx);
    }
}


If the compiler uses Unicode version 6.3 or higher (or lower than 4.0), then U+180E is treated as a Cf character and is therefore allowed in an identifier. And if a character is allowed in an identifier, we can use an escape sequence in its place, and the compiler will happily process it. The identifier in the second line of the method is then considered "identical" to stringx, so "b" is printed.

So what about a compiler that uses a Unicode version between 4.0 and 6.2 inclusive? In that case, U+180E is treated as a Zs character, which makes it whitespace. Whitespace is allowed within C# code, but not inside identifiers. And since the character is not valid in an identifier and does not appear inside a character or string literal, using the escape sequence here is an error, so this code simply does not compile.

Hypothetical example 2: valid, in two different ways


However, we can write the same code without an escape sequence. To do this, create a plain ASCII file:

using System;

class MvsTest
{
    static void Main()
    {
        string stringx = "a";
        stringAAAx = "b"; // AAA is a placeholder; see below
        Console.WriteLine(stringx);
    }
}


Then open it in a hex editor and replace the AAA characters with the bytes E1 A0 8E. The result is a file whose UTF-8 encoding contains the character U+180E in exactly the place where the escape sequence appeared in the first example.
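(If you would rather not reach for a hex editor, here is a minimal sketch of the same patch done programmatically. It is my own illustration, not part of the original post, and it assumes the source file is named MvsTest.cs and contains the AAA marker exactly once.)

using System;
using System.IO;
using System.Text;

class Patch
{
    static void Main()
    {
        // Read the ASCII source; since every byte is ASCII, byte offsets
        // and string offsets coincide.
        byte[] source = File.ReadAllBytes("MvsTest.cs");
        string text = Encoding.ASCII.GetString(source);

        int index = text.IndexOf("AAA");
        if (index < 0)
        {
            Console.WriteLine("Marker not found");
            return;
        }

        // The UTF-8 encoding of U+180E: three bytes replacing the
        // three marker bytes, so the file length is unchanged.
        byte[] mvs = { 0xE1, 0xA0, 0x8E };
        Array.Copy(mvs, 0, source, index, mvs.Length);
        File.WriteAllBytes("MvsTest.cs", source);
    }
}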

A compiler that happily accepted the first example will compile this variant too (assuming you can tell it that the file is UTF-8 encoded), and the result will be exactly the same: "b" is printed, because the second statement in the method is a simple assignment to an existing variable.

However, even a compiler that treats U+180E as whitespace (that is, one that refuses to compile the program from example 1) will have no trouble with this variant: it will parse the second statement in the method as the declaration of a new local variable x with an initial value. You may get a compiler warning about an unused local variable, but the code will compile and print "a".

Reality: Microsoft Compilers


When we talk about the Microsoft C# compiler, we need to distinguish between the native compiler (csc.exe) and Roslyn (rcsc, although I usually just call it Roslyn).

Since csc.exe is written in native code, it either uses the built-in Windows facilities for working with Unicode or simply carries a Unicode character table inside its executable. (I searched all of MSDN for a native Win32 function that determines which Unicode category a character belongs to, but found nothing. A pity; such a function would be very useful...)

Roslyn, meanwhile, is written in C# and, as far as I know, uses char.GetUnicodeCategory() to determine Unicode categories, which relies on the Unicode tables built into mscorlib.dll.

My experiments suggest that whatever the native compiler uses to determine categories, it always treats U+180E as a Cf character. I even dug up old machines (including VM images) that had received no updates since September 2013 (when the Unicode 6.3 standard was published), and they all compiled the program from the first example without errors. I am starting to suspect that csc.exe probably has a copy of the Unicode 3.0 tables built into its binary: it definitely treats U+180E as a formatting character, but it rejects U+0600 and U+00AD in identifiers (U+0600 did not exist before Unicode 4.0, but has always been a formatting character since; U+00AD was a punctuation character, a dash, in Unicode 3.0, but has been a formatting character starting with Unicode 4.0). Those observations are consistent with Unicode 3.0 and, as far as I can tell, with no other version. A quick way to probe a given compiler is sketched below.
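(This probe is my own sketch, not from the original post. Feed it to the compiler under test: if the compiler's Unicode tables classify U+0600 as Cf, the file compiles; if U+0600 is unassigned or not a formatting character in those tables, compilation fails with an invalid-character error.)

// Probe.cs: does this compiler accept U+0600 in an identifier?
class Probe
{
    static void Main()
    {
        // \u0600 (Arabic number sign) has been a Cf character since
        // Unicode 4.0. A compiler with Unicode 3.0 tables should
        // reject this identifier.
        int a\u0600b = 0;
        System.Console.WriteLine(a\u0600b);
    }
}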

However, the tables built into mscorlib.dll have definitely changed across versions of the .NET Framework. Run this program:

using System;

class Test
{
    static void Main()
    {
        Console.WriteLine(Environment.Version);
        Console.WriteLine(char.GetUnicodeCategory('\u180e'));
    }
}


Under CLRv2 it prints "SpaceSeparator", while under CLRv4 (at least on a recently updated system) it prints "Format".

Of course, Roslyn will not run on older versions of the CLR. However, we still have csharppad.com, which runs Roslyn in some environment (of unknown origin; maybe Mono? I'm not sure) that prints "SpaceSeparator" for the program above. I am therefore confident that the program from the first example would not compile there. The second example is harder to check: csharppad.com does not let you upload a source file, and copy/paste gives strange results.

Reality: mcs (the Mono C# compiler)


The Mono compiler also uses the GetUnicodeCategory() method, which makes our experiments much easier, but unfortunately the Mono parser has at least two bugs (a small demonstration of the first follows this list):
  • It allows any escape sequence in an identifier, regardless of whether the escaped character is valid there. For example, from the Mono compiler's point of view, the construct string \u0020x = "" is valid. Reported as bug 24968.
  • It does not allow formatting characters inside identifiers: it accepts characters from the categories Mn, Mc, Nd, and Pc, but not Cf. Reported as bug 24969.
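(A minimal sketch of the first bug, based on the construct from the bug report; it is not from the original post. A conforming compiler must reject this, because U+0020 is a plain space and not valid in an identifier, yet mcs at the time accepted it.)

class MonoEscapeBug
{
    static void Main()
    {
        // U+0020 is a space: the escape sequence below should be a
        // compile-time error, but mcs (per bug 24968) accepts it.
        string \u0020x = "";
    }
}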

Because of these bugs, the program from the first example always compiles under mcs and prints "b". The program from the second example, however, always produces a compilation error, regardless of whether the compiler believes U+180E belongs to Zs or to Cf.

So what version is it?


Next, let's consider the Unicode tables in .NET itself, since it is not clear which version of Unicode the various BCL implementations use. Run this program:

using System;

class Test
{
    static void Main()
    {
        Console.WriteLine(char.GetUnicodeCategory('\u00ad'));
        Console.WriteLine(char.GetUnicodeCategory('\u0600'));
        Console.WriteLine(char.GetUnicodeCategory('\u180e'));
    }
}


On my machine, under CLRv4 this program prints "DashPunctuation, Format, Format", while under Mono (3.3.0) and CLRv2 it prints "DashPunctuation, Format, SpaceSeparator".

This is odd, to say the least. As far as I can tell, this behavior matches no version of the Unicode standard:
  • U+00AD was a Po (Punctuation, Other) character in Unicode 1.x, then Pd (Punctuation, Dash) in 2.x and 3.x, and has been a Cf character since 4.0.
  • U+0600 was introduced in Unicode 4.0 and has always been a Cf character.
  • U+180E was introduced as a Cf character in Unicode 3.0, became a Zs character in Unicode 4.0, and finally returned to the Cf category in Unicode 6.3.

Thus, in both cases the first line of output (U+00AD as Pd) is consistent only with Unicode 3.x or earlier, while the second line (U+0600 as Cf) requires Unicode 4.0 or later, so no single version of the standard matches either output. Now I'm truly baffled...

What about nameof and CallerMemberName?


Identifiers are used not only for comparison; they are also available as strings (C# strings) without any use of reflection. Starting with C# 5, the CallerMemberName attribute lets us do things like this:

// Inside a class, with using System.Runtime.CompilerServices; at the top:
public static void X\u0600y()
{
    ShowCaller();
}

public static void ShowCaller([CallerMemberName] string caller = null)
{
    Console.WriteLine("Called by {0}", caller);
}


And in C# 6 we can write this:

string x\u0600y = "";
Console.WriteLine("nameof = {0}", nameof(x\u0600y));


What will these two programs print? They simply output "Xy" and "xy" as the names, as if the compiler had thrown away all the formatting characters. But what should they print? Bear in mind that in the second case we could just as well have written nameof(xy), and that string would still be equal to the declared identifier.
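(To make "equal" concrete at the string level, here is a sketch of my own, not from the original post, mimicking the C# rule that identifiers compare equal once formatting characters are removed.)

using System;
using System.Globalization;
using System.Linq;

class IdentifierEquality
{
    // The C# specification compares identifiers after removing
    // formatting (Cf) characters; this helper imitates that rule.
    static string StripFormatting(string identifier) =>
        new string(identifier
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.Format)
            .ToArray());

    static void Main()
    {
        // Prints True on runtimes whose tables classify U+0600 as Cf.
        Console.WriteLine(StripFormatting("x\u0600y") == StripFormatting("xy"));
    }
}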

We can't even ask "what is the name of the declared member?", because it can be overloaded with a "different but equal" identifier:

public static void Xy() {}
public static void X\u0600y() {}
public static void X\u070fy() {}
...
Console.WriteLine(nameof(X\u200by));


What should this print? I'm sure you will be relieved to know that the C# designers have a plan here, but this is genuinely one of those scenarios with no obviously right answer. Things get even weirder when the CLI specification comes into play. Section I.8.5.1 of ECMA-335 (6th edition) says:
Assemblies shall follow Annex 7 of Technical Report 15 of the Unicode Standard 3.0 governing the set of characters permitted to start and be included in identifiers, available online at www.unicode.org/unicode/reports/tr15/tr15-18.html. Identifiers shall be in the canonical format defined by Unicode Normalization Form C. For CLS purposes, two identifiers are the same if their lowercase mappings (as specified by the Unicode locale-insensitive, one-to-one lowercase mappings) are the same. That is, for two identifiers to be considered different under the CLS they shall differ in more than simply their case. However, in order to override an inherited definition the CLI requires the precise encoding of the original declaration be used.
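(As a side note on what "Normalization Form C" means in practice, here is a small sketch of my own, not from the original post, using .NET's built-in normalization support.)

using System;
using System.Text;

class NfcDemo
{
    static void Main()
    {
        // "é" written as 'e' followed by a combining acute accent (U+0301).
        string decomposed = "e\u0301";
        // Normalization Form C composes it into the single character U+00E9.
        string composed = decomposed.Normalize(NormalizationForm.FormC);

        Console.WriteLine(decomposed.Length);    // 2
        Console.WriteLine(composed.Length);      // 1
        Console.WriteLine(composed == "\u00e9"); // True
    }
}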

I would like to study the effect of this passage by adding a Cf character into IL, but unfortunately I have not yet figured out how to influence the encoding ilasm uses, in order to convince it that my "fixed" IL is what I actually intend.

Conclusion


As mentioned earlier, text is hard.

It turns out that even if we limit ourselves to identifiers alone, "text is hard". Who would have thought?

From the translator: thanks to user impwx for translating an earlier article by Jon Skeet.
