inout March 29, 2010 at 12:39 pm

National domain names: from ASCII format to IDN and vice versa

If you need to work with national domain names, the ASCII form that the client sends, such as “xn--abrakatabra.com”, is sufficient in most cases. Sometimes, however, you need the name in its national representation, for example "пример.com".

This article covers converting national domain names between their ASCII form and their IDN form in software, using Microsoft Visual Studio and the ICU library.

History.
If you have already heard the abbreviation IDN, then the next four paragraphs can be safely skipped.

Historically, domain names on the Internet could contain only ASCII characters: “a-z”, “0-9”, and “-”. As the Internet grew, this set became too small (more precisely, short and convenient names ran out), and ICANN announced the need to extend domain names with national alphabets (represented in Unicode).

IDNs (Internationalized Domain Names) are domain names that contain characters from national alphabets, for example "сайт.com".

The many discussions on the few IDN forums boil down to two opinions: “Awesome, I'll take two!” and “they are trying, to put it mildly, to fool us”. The second opinion stems from how the technology is implemented.

New characters are just well-encoded old ones :)

In essence, IDN is a convenient and attractive wrapper around a long and awkward character string. On the client side, national characters are encoded into valid ASCII characters, and that ASCII string is the actual domain name. If you enter "пример.испытание" in the address bar, it is converted to "xn--e1afmkfd.xn--80akhbyknj4f". The conversion uses Punycode, an ASCII Compatible Encoding (ACE) that the multilingual domain name system currently relies on. The Punycode algorithm is quite simple and is described in detail in RFC 3492, which also includes an implementation in C.
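To make the encoding step concrete, here is a minimal C++ sketch of the Punycode encoder from RFC 3492 for a single label. The function name punycode_encode is my own, and the sketch makes two simplifying assumptions: it omits the overflow checks that the RFC requires, and it skips the Nameprep normalization that full IDNA performs before Punycode.

```cpp
#include <cstdint>
#include <limits>
#include <string>

namespace {
// Bootstring parameters for Punycode (RFC 3492, section 5).
constexpr uint32_t kBase = 36, kTMin = 1, kTMax = 26, kSkew = 38,
                   kDamp = 700, kInitialBias = 72, kInitialN = 128;

// Maps a digit value 0..35 to 'a'..'z', '0'..'9'.
char encode_digit(uint32_t d) {
    return d < 26 ? static_cast<char>('a' + d) : static_cast<char>('0' + (d - 26));
}

// Bias adaptation function (RFC 3492, section 6.1).
uint32_t adapt(uint32_t delta, uint32_t numPoints, bool firstTime) {
    delta = firstTime ? delta / kDamp : delta / 2;
    delta += delta / numPoints;
    uint32_t k = 0;
    while (delta > ((kBase - kTMin) * kTMax) / 2) {
        delta /= kBase - kTMin;
        k += kBase;
    }
    return k + (kBase - kTMin + 1) * delta / (delta + kSkew);
}
}  // namespace

// Encodes one label given as Unicode code points.
// Assumption: overflow checks from the RFC are omitted for brevity.
std::string punycode_encode(const std::u32string& input) {
    std::string out;
    for (char32_t c : input)              // copy the basic (ASCII) code points first
        if (c < 0x80) out.push_back(static_cast<char>(c));
    const uint32_t basicCount = static_cast<uint32_t>(out.size());
    uint32_t handled = basicCount;
    if (basicCount > 0) out.push_back('-');  // delimiter between basic and encoded part

    uint32_t n = kInitialN, delta = 0, bias = kInitialBias;
    while (handled < input.size()) {
        // Find the smallest code point not yet handled.
        uint32_t m = std::numeric_limits<uint32_t>::max();
        for (char32_t c : input)
            if (c >= n && c < m) m = c;
        delta += (m - n) * (handled + 1);
        n = m;
        for (char32_t c : input) {
            if (c < n) ++delta;
            if (c == n) {
                // Emit delta as a generalized variable-length integer.
                uint32_t q = delta;
                for (uint32_t k = kBase;; k += kBase) {
                    uint32_t t = k <= bias ? kTMin
                               : (k >= bias + kTMax ? kTMax : k - bias);
                    if (q < t) break;
                    out.push_back(encode_digit(t + (q - t) % (kBase - t)));
                    q = (q - t) / (kBase - t);
                }
                out.push_back(encode_digit(q));
                bias = adapt(delta, handled + 1, handled == basicCount);
                delta = 0;
                ++handled;
            }
        }
        ++delta;
        ++n;
    }
    return out;
}
```

Calling punycode_encode on the label "пример" produces "e1afmkfd"; prepending the ACE prefix "xn--" gives the form that appears in the domain name.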

What encoding and transcoding facilities are at our disposal?

1. Microsoft tools.

The .NET Framework, and therefore Visual Studio, provides the IdnMapping class in the System.Globalization namespace. Among its methods are GetAscii and GetUnicode, which perform the conversion according to the IDNA standard. Not a class but a dream; it could hardly be simpler:

using namespace System::Globalization;
using System::String;

String^ s1 = "привет.пример";
String^ s;
IdnMapping idn;

// Unicode -> ASCII (Punycode)
s = idn.GetAscii(s1, 0, s1->Length);
System::Console::WriteLine(s);

// ASCII (Punycode) -> Unicode
String^ s2 = "xn--b1agh1afp.xn--e1afmkfd";
s = idn.GetUnicode(s2, 0, s2->Length);
System::Console::WriteLine(s);

Result:

xn--b1agh1afp.xn--e1afmkfd
привет.пример

For the same purpose, Microsoft provides two Win32 API functions, IdnToAscii and IdnToUnicode. Unfortunately, the minimum supported client is Windows Vista, which is a pity. An example of using these functions can be found on the Microsoft website.

2. ICU (International Components for Unicode) tools.

ICU is a set of open source C/C++ and Java libraries that provide Unicode and globalization support. The library implements the following domain name conversion functions:

int32_t uidna_toUnicode / uidna_toASCII (const UChar *src, int32_t srcLength, UChar *dest, int32_t destCapacity, int32_t options, UParseError *parseError, UErrorCode *status)

- convert a single label (one dot-separated component of a domain name) from ASCII to IDN or from IDN to ASCII. For example, “www.example.com” consists of three labels: “www”, “example”, and “com”.

int32_t uidna_IDNToUnicode / uidna_IDNToASCII (const UChar *src, int32_t srcLength, UChar *dest, int32_t destCapacity, int32_t options, UParseError *parseError, UErrorCode *status)

- convert a complete domain name, for example "www.example.com", from ASCII to IDN or from IDN to ASCII.
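The relationship between the label-level and the name-level functions can be sketched in a few lines of C++. The helper map_labels below is hypothetical and not part of ICU: it splits a name on the ASCII dot, applies a caller-supplied per-label conversion, and joins the results. It is only an illustration of the idea; IDNA also recognizes a few non-ASCII dot characters as separators, which this sketch ignores.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical helper (not an ICU function): applies convertLabel to every
// dot-separated label of domain and joins the converted labels with dots.
// Assumption: only the ASCII full stop '.' is treated as a separator.
std::u16string map_labels(
    const std::u16string& domain,
    const std::function<std::u16string(const std::u16string&)>& convertLabel) {
    std::u16string result;
    std::size_t start = 0;
    while (true) {
        const std::size_t dot = domain.find(u'.', start);
        const std::size_t end = (dot == std::u16string::npos) ? domain.size() : dot;
        result += convertLabel(domain.substr(start, end - start));
        if (dot == std::u16string::npos) break;  // last label processed
        result += u'.';
        start = dot + 1;
    }
    return result;
}
```

With a per-label converter plugged in as convertLabel, a whole name is handled in one call, which is roughly the convenience that the IDNTo* variants offer over the per-label functions.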

Parameters:

src - pointer to the input string to be converted.
srcLength - length of src. If src is a null-terminated C string, you can pass -1.
dest - pointer to the buffer that receives the converted string.
destCapacity - capacity of dest, in UChar units.
options - bit set of options. It can take the following values:
UIDNA_DEFAULT - default behavior: unassigned code points are not processed and STD3 rules are not applied. If an unassigned code point is found, the call fails with U_UNASSIGNED_ERROR.
UIDNA_ALLOW_UNASSIGNED - if this flag is set, unassigned code points in the input are treated as normal Unicode code points.
UIDNA_USE_STD3_RULES - the domain name syntax must comply with the STD3 ASCII rules for host names. ASCII characters that violate them cause the call to fail with U_IDNA_STD3_ASCII_RULES_ERROR.

parseError - pointer to a UParseError structure. It may be NULL.
status - pointer to a UErrorCode that receives the error code. It must be initialized to U_ZERO_ERROR before the call.
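To show what the STD3 option checks, here is a small C++ sketch of the rule as RFC 3490 states it for an ASCII label: only letters, digits, and hyphens are allowed, and the label may not start or end with a hyphen. The function is_std3_label is a hypothetical name of my own, not an ICU call, and ICU's actual checks may differ in detail.

```cpp
#include <string>

// Hypothetical helper (not an ICU function): returns true if an ASCII label
// contains only letters, digits and hyphens and has no leading or trailing
// hyphen. Assumption: empty labels are simply rejected here.
bool is_std3_label(const std::string& label) {
    if (label.empty()) return false;
    if (label.front() == '-' || label.back() == '-') return false;
    for (unsigned char c : label) {
        const bool allowed = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                             (c >= '0' && c <= '9') || c == '-';
        if (!allowed) return false;
    }
    return true;
}
```

A label such as "xn--e1afmkfd" passes this check, while "my_host" fails because of the underscore.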

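The return value is the length of the converted string. If it is greater than destCapacity, the result did not fit into dest (status is then set to U_BUFFER_OVERFLOW_ERROR), so compare the two values to detect an overflow.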
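One way to act on the returned length is to retry with a buffer of exactly the required size. The C++ sketch below shows that pattern in isolation. Both convert_with_retry and the convert callback are hypothetical: convert stands in for a call such as uidna_IDNToASCII with src, options, and status already bound, and is assumed to write at most capacity units and to return the full length needed. With real ICU calls, status would also have to be reset to U_ZERO_ERROR before the second attempt.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical wrapper: calls convert(dest, capacity) once with a small
// buffer and, if the reported length exceeds the capacity, once more with a
// buffer of the reported size.
template <typename Convert>
std::u16string convert_with_retry(Convert convert, int32_t initialCapacity = 64) {
    std::vector<char16_t> buffer(static_cast<std::size_t>(initialCapacity));
    int32_t needed = convert(buffer.data(), static_cast<int32_t>(buffer.size()));
    if (needed > static_cast<int32_t>(buffer.size())) {
        // The result was truncated: grow the buffer and convert again.
        buffer.resize(static_cast<std::size_t>(needed));
        needed = convert(buffer.data(), static_cast<int32_t>(buffer.size()));
    }
    return std::u16string(buffer.data(), static_cast<std::size_t>(needed));
}
```

The fixed MAX_PATH buffers used in this article are large enough for ordinary domain names, so the retry matters mainly when the input length is not under your control.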

#include <stdio.h>
#include <windows.h>            // MAX_PATH
#include "unicode/utypes.h"
#include "unicode/parseerr.h"
#include "unicode/uidna.h"

// Assumes UChar is a 16-bit wchar_t, as in the Windows build of ICU 4.4,
// so wide string literals can be passed directly.

// IDN -> ASCII
const wchar_t* s1 = L"пример.прием";
wchar_t pPunycode[MAX_PATH];
UErrorCode status = U_ZERO_ERROR;
int32_t len = uidna_IDNToASCII(s1, -1, pPunycode, MAX_PATH, UIDNA_USE_STD3_RULES, NULL, &status);
if (U_FAILURE(status))
    wprintf(L"Error\n");

// ASCII -> IDN
const wchar_t* s2 = L"xn--e1afmkfd.xn--e1afnjf";
wchar_t pUnicode[MAX_PATH];
status = U_ZERO_ERROR;          // reset before reusing the status variable
len = uidna_IDNToUnicode(s2, -1, pUnicode, MAX_PATH, UIDNA_ALLOW_UNASSIGNED, NULL, &status);
if (U_FAILURE(status))
    wprintf(L"Error\n");

The results are similar to the previous example.

Before using the library, you need to build it. The steps (for MS Visual Studio):

1. Select the latest release (I used ICU4C 4.4, released 2010-03-17).
2. Download the sources.
3. Add “\bin\” to the PATH environment variable.
4. Open the solution: “\source\allinone\allinone.sln”
5. Build -> Batch Build... -> Select All -> Rebuild.
6. Build -> Rebuild Solution.

If the build fails, open “\Readme.html”, go to the section "How To Build And Install ICU", and check the steps. If it builds without errors, the library is ready to use.

P.S. I would welcome any comments and corrections.
P.P.S. I would also welcome interesting additions on the topic.
