When is a string not a string?

Original author: Jon Skeet
  • Translation
As part of my “work” on standardizing C# 5 in the ECMA-334 TC49-TG2 technical group, I was lucky enough to see some of the interesting ways in which Vladimir Reshetnikov has been stress-testing C#. This article describes one of the issues he raised. Of course, it will most likely never affect 99.999% of C# developers in any way... but it is still interesting to understand.

Specifications used in the article:

  • The C# 5 language specification
  • ECMA-335 (CLI)
  • The Unicode Standard
What is a string?

How would you describe the string type (or System.String)? I can suggest several answers to this question, from vague to rather specific:

  • “Some text in quotation marks”
  • A sequence of characters
  • A sequence of Unicode characters
  • A sequence of 16-bit characters
  • A sequence of UTF-16 code units

Only the last statement is completely true. The C# 5 specification (section 1.3) states:

C# string handling uses UTF-16. The char type represents a UTF-16 code unit, and the string type represents a sequence of UTF-16 code units.

So far so good. But this is C#. What about IL? What is used there, and does it matter? It turns out that it does... Strings need to be declared in IL as constants, and the nature of that representation matters: not only the encoding, but also how the encoded data is interpreted. In particular, not every sequence of UTF-16 code units can be represented as a UTF-8 byte sequence.

Everything is ill (formed)

For example, take the string literal "X\uD800Y". It represents the following UTF-16 code units:

  • 0x0058 - 'X'
  • 0xD800 - the first (high) part of a surrogate pair
  • 0x0059 - 'Y'

This is a perfectly normal string - it is even a Unicode string according to the specification (definition D80). But it is ill-formed (definition D84). That is because the UTF-16 code unit 0xD800 does not correspond to any Unicode scalar value (definition D76) - surrogate code points are explicitly excluded from the list of scalar values.

For those hearing about surrogate pairs for the first time: UTF-16 uses only 16-bit code units and therefore cannot directly cover all valid Unicode code points, which range from U+0000 to U+10FFFF inclusive. To represent a character above U+FFFF in UTF-16, two code units are used: the first (high) part of a surrogate pair (in the range 0xD800 to 0xDBFF) and the second (low) part (in the range 0xDC00 to 0xDFFF). Thus the first part of a surrogate pair means nothing on its own - it is a valid UTF-16 code unit, but it only acquires meaning when followed by the second part.
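The arithmetic behind surrogate pairs is simple enough to sketch. Here is a small illustration in Python (Python is used here only for brevity; the function name is mine, not from the article):

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into two UTF-16 code units."""
    offset = code_point - 0x10000          # 20 bits remain after subtracting the base
    high = 0xD800 + (offset >> 10)         # first (high) surrogate: 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)        # second (low) surrogate: 0xDC00..0xDFFF
    return high, low

# U+10000 is the first code point that needs a pair at all:
print([f"{u:04x}" for u in to_surrogate_pair(0x10000)])  # ['d800', 'dc00']
```

Running it for U+10FFFF, the very last code point, gives the pair 0xDBFF, 0xDFFF - the upper ends of the two surrogate ranges.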

Show the code!

And how does all this relate to C#? Well, constants have to be represented somehow at the IL level. It turns out there are two ways to represent them: in most cases UTF-16 is used, but for attribute constructor arguments it is UTF-8.

Here is an example:

using System;
using System.ComponentModel;
using System.Text;
using System.Linq;

[Description(Value)]
class Test
{
    const string Value = "X\ud800Y";

    static void Main()
    {
        var description = (DescriptionAttribute)
            typeof(Test).GetCustomAttributes(typeof(DescriptionAttribute), true)[0];
        DumpString("Attribute", description.Description);
        DumpString("Constant", Value);
    }

    static void DumpString(string name, string text)
    {
        var utf16 = text.Select(c => ((uint) c).ToString("x4"));
        Console.WriteLine("{0}: {1}", name, string.Join(" ", utf16));
    }
}

In .NET, the output of this program will be as follows:

Attribute: 0058 fffd fffd 0059
Constant: 0058 d800 0059

As you can see, the “constant” has remained unchanged, but U+FFFD characters (a special code point used to mark broken data when decoding binary values to text) have appeared in the attribute property value. Let's dig deeper and look at the IL that describes the attribute and the constant:

.custom instance void [System]System.ComponentModel.DescriptionAttribute::.ctor(string)
= ( 01 00 05 58 ED A0 80 59 00 00 )
.field private static literal string Value
= bytearray (58 00 00 D8 59 00 )

The format of the constant (Value) is pretty simple - it is UTF-16 with the least significant byte first (little-endian). The attribute format is described in the ECMA-335 specification, section II.23.3. Let us break it down in detail:

  • Prologue (01 00)
  • Fixed arguments (for the selected constructor)
    • 05 58 ED A0 80 59 (one packed string)
      • 05 (PackedLen - a length of 5)
      • 58 ED A0 80 59 (the string value, encoded as UTF-8)
  • Number of named arguments (00 00)
  • The named arguments themselves (none)

The most interesting part here is the “string value, encoded as UTF-8”. It is not a valid UTF-8 string, because the string is ill-formed. The compiler took the first half of the surrogate pair, determined that it was not followed by the second half, and simply encoded it the way it would encode any other code point in the range U+0800 to U+FFFF inclusive.
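What the compiler does here can be mimicked in Python, whose strings can likewise hold lone surrogates; the "surrogatepass" error handler reproduces the “encode it like any other code point” behavior described above (a sketch, not the compiler's actual code path):

```python
value = "X\ud800Y"  # the same ill-formed string as in the article

# "surrogatepass" encodes the lone surrogate as if it were an ordinary
# code point in U+0800..U+FFFF, i.e. as the three bytes ED A0 80.
raw = value.encode("utf-8", "surrogatepass")
print(raw.hex(" "))  # 58 ed a0 80 59 - the same bytes as in the IL dump
```

A strict UTF-8 encoder would reject this string instead, as we will see below.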

It should be noted that if we had a complete surrogate pair, UTF-8 would encode it as one Unicode scalar value, using 4 bytes. For example, change the declaration of Value to the following:

const string Value = "X\ud800\udc00Y";

In this case, at the IL level we get the following byte sequence: 58 F0 90 80 80 59, where F0 90 80 80 is the UTF-8 representation of the code point U+10000. This string is well-formed, and its values in the attribute and in the constant would be identical.
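The well-formed case is easy to verify in Python too (again just an illustrative sketch): chr(0x10000) is the scalar value U+10000, and its two encodings show both the 4-byte UTF-8 form and the surrogate pair in UTF-16.

```python
scalar = chr(0x10000)                 # a single Unicode scalar value

utf8 = scalar.encode("utf-8")
print(utf8.hex(" "))                  # f0 90 80 80 - the four UTF-8 bytes

utf16 = scalar.encode("utf-16-be")    # big-endian, for readability
print(utf16.hex(" "))                 # d8 00 dc 00 - the pair D800 DC00
```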

However, in our original example the value of the constant is decoded without any check that it is well-formed, while an extra check is applied to the attribute value, which detects and replaces the invalid code units.

Encoding behavior

So which approach is right? According to the Unicode specification (conformance clause C10), both are:

When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition rather than interpreting them as characters.

And at the same time:

Processes that conform to this specification must not interpret ill-formed code unit sequences. However, the conformance clauses do not prevent processes from operating on code unit sequences that do not purport to be in a Unicode character encoding form. For example, for performance reasons a low-level string operation may simply operate on code units directly, without interpreting them as characters.

It is not entirely clear to me whether the values of constants and attribute arguments “purport to be in a Unicode character encoding form”. In my experience, specifications rarely state anywhere whether a well-formed string is required or not.

In addition, System.Text.Encoding implementations can be customized to specify the behavior when encoding or decoding ill-formed data. For example:

new UTF8Encoding(true, false).GetBytes(Value)

returns the byte sequence 58 EF BF BD 59 - in other words, it detects the invalid data and replaces it with U+FFFD, and decoding it back works without problems. However:

new UTF8Encoding(true, true).GetBytes(Value)

throws an exception. The first constructor argument specifies whether to generate a BOM; the second, how to deal with invalid data (the EncoderFallback and DecoderFallback properties serve the same purpose).
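Python's codec machinery offers the same two choices, which makes the contrast easy to demonstrate (a sketch; note that the exact number of U+FFFD replacements a lenient decoder emits can differ between implementations - .NET produced two above):

```python
value = "X\ud800Y"

# Strict encoding (the analogue of new UTF8Encoding(true, true)) refuses
# the ill-formed string outright:
try:
    value.encode("utf-8")
    threw = False
except UnicodeEncodeError:
    threw = True
print(threw)  # True

# A replacing decoder (the analogue of the lenient fallback) turns the
# invalid bytes from the IL dump into U+FFFD instead of failing:
decoded = bytes.fromhex("58 ed a0 80 59").decode("utf-8", "replace")
print("\ufffd" in decoded)  # True
```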

Language behavior

So should this code compile at all? At the moment the language specification does not prohibit it - but a specification can be fixed :)

Generally speaking, both the old csc compiler and Roslyn do prohibit ill-formed strings in some attributes, for example DllImportAttribute:

[DllImport(Value)]
static extern void Foo();

This code produces a compiler error if Value is ill-formed:

error CS0591: Invalid value for argument to 'DllImport' attribute

Perhaps there are other attributes with the same behavior - I am not sure.

Given that the value of an attribute argument will not be decoded back to its original form when the attribute instance is created, it seems reasonable to treat this as a compile-time error. (Unless, of course, the runtime were changed to preserve ill-formed string values exactly.)

But what about the constant? Should it be valid? Could it be useful? In the exact form used in the example - hardly, but one can imagine a case where a string must end with the first half of a surrogate pair, so that appending another string starting with the second half produces a well-formed result. Of course, extreme care is required here - Unicode Technical Report #36 (Security Considerations) describes some highly alarming ways this can go wrong.

Consequences of the foregoing

One interesting aspect of all this is that “string encoding arithmetic” may not work the way you think it does:

// Bad code!
string SplitEncodeDecodeAndRecombine(string input, int splitPoint, Encoding encoding)
{
    byte[] firstPart = encoding.GetBytes(input.Substring(0, splitPoint));
    byte[] secondPart = encoding.GetBytes(input.Substring(splitPoint));
    return encoding.GetString(firstPart) + encoding.GetString(secondPart);
}

It may seem that nothing can go wrong here as long as nothing is null and splitPoint is in range. However, if splitPoint lands in the middle of a surrogate pair, things get very sad. There may also be additional problems caused by things like normalization forms - most likely not, but by this point I am not 100% sure of anything.
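The failure is easy to reproduce. The Python sketch below simulates the C# situation by working on the UTF-16 code units directly (as raw little-endian bytes) and splitting between the two halves of a surrogate pair; the round trip silently corrupts the text:

```python
text = "X" + chr(0x1F600) + "Y"            # X, U+1F600 (one surrogate pair), Y
units = text.encode("utf-16-le")           # code units: 0058 D83D DE00 0059

# Split between the high and low surrogates - each half now ends/starts
# with a lone surrogate, exactly the hazard described above.
first = units[:4].decode("utf-16-le", "replace")    # 'X' plus U+FFFD
second = units[4:].decode("utf-16-le", "replace")   # U+FFFD plus 'Y'

recombined = first + second
print(recombined == text)  # False - the emoji has been destroyed
```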

If this example seems contrived, imagine a large piece of text split across several network packets, or files - it does not matter. You may think you are prudent enough to make sure the binary data is never split in the middle of a UTF-16 code unit pair - but even that will not save you. Ugh.

All of this makes me want to give up on text processing entirely. Floating-point numbers are a real nightmare, and dates and times... well, you know what I think of those. I wonder if there are any projects that use only integers which are guaranteed never to overflow? If you have such a project - let me know!


Text is hard!

Translator's note: I found the link to the original of this article in the post “Let's talk about the differences between Mono and MS.NET”. Thanks, DreamWalker! His blog, by the way, also has a small follow-up note on how the same example behaves under Mono.
