PatapSmile December 24, 2009 at 16:47

Unicode Data Compression

In one future project, the task was to transfer and store data in VCard format, which contain Cyrillic letters. Since the size of the transmitted information is limited, it was necessary to reduce the size of the data.

There were several options:

Use traditional encodings (for Cyrillic - CP1251).
Use Unicode compression formats. Today it is SCSU and BOCU-1. I give a detailed description of these two formats below.
Use universal compression algorithms (gzip).

A beautiful but useless results graph:

Input data

Business card in VCard 3.0 format (length 260 characters):

BEGIN: VCARD
VERSION: 3.0
N: Pupkin; Vasily
FN: Vasily Pupkin
ORG: Horns and Hooves LLC
TITLE: The Most Important
TEL; TYPE = WORK, VOICE: +380 (44) 123-45-67
ADR; TYPE = WORK:; 1; Khreshchatyk; Kiev ;; 01001; UKRAINE
EMAIL; TYPE = PREF, INTERNET: vasiliy.pupkin@example.com
END: VCARD

All of the results below relate to this example. For Cyrillic business cards, the results should not differ. For business cards in other languages, you probably need additional research, this was not part of the task.

Option using CP1251 dropped immediately. Despite the small size of the finished files, its use (like any other traditional encoding) greatly limits the capabilities of the service.

SCSU

The Standard Compression Scheme for Unicode (SCSU) is based on the development of Reuters.

The main idea of SCSU is to define dynamic windows in the Unicode code space. Characters belonging to small alphabets (for example, Cyrillic) can be encoded with one byte, which indicates the index of the character in the current window. Windows are preinstalled to the blocks that are most often used, so the encoder does not have to set them.

For large alphabets, including Chinese, SCSU allows you to switch between single-byte and Unicode mode, which is actually UTF-16BE.

Control bytes ("tags") are inserted into the text to change windows and modes. Escape tags ("quote" tags) are used to switch to another window only for the next character. This is useful for encoding a single character outside the current window, or a character that conflicts with tags.

Example

... N: P at p to and n ... FN: In and with and l and y ...
... 4E 3A 12 9F C3 BF BA B8 BD ... 46 4E 3A 92 B0 C1 B8 BB B8 B9 ...

Before the first Cyrillic letter, the encoder uses the SC2 tag (hexadecimal 12) to switch to the dynamic window No. 2, which is preinstalled on the Cyrillic block of characters. When reusing Cyrillic characters, the tag is not used.

Benefits

text encoded in SCSU basically takes up as much space as in traditional encoding. Tag overhead is relatively negligible.
SCSU decoder is easy to implement, in comparison with universal compression algorithms.

disadvantages

SCSU uses ASCII control characters, not only as tags, but also as part of regular characters. This makes SCSU an inappropriate format for protocols such as MIME, which interpret control characters without decoding them.

BOCU-1

The Binary Ordered Compression for Unicode (BOCU) concept was developed in 2001 (by Mark Davis and Markus Scherer for the ICU project).

The basic idea of BOCU-1 is to encode each character as the difference (distance in the Unicode table) to the previous character. Small differences occupy fewer bytes than large ones. Coding differences, BOCU-1 achieves the same compression for small alphabets, no matter in which block they are.

BOCU-1 complements the concept with the following rules:

The previous or base value is aligned in the middle of the block, in order to avoid big jumps from the beginning of the block to the end.
32 ASCII control characters, as well as spaces, do not change during encoding, for compatibility with email and for preserving binary order
The space does not cause a change in the base value. This means that when encoding non-Latin words separated by spaces, there is no need to make a big jump to U + 0020 and back.
ASCII control characters reset the base value, making adjacent lines in the file independent.

Example

... N: P at p to and n; V ... FN: V a s and l and th ...
... 9e 8a d3 d3 93 8f 8a 88 8d 4c 11 d3 c6 ... 96 9e 8a d3 c6 80 91 88 8b 88 89 ...

For each block change (transition from Cyrillic to Latin), the encoder needs two bytes (d3 - to switch to the Cyrillic block, 4c - for the colon).

Benefits

ASCII control characters remain during encoding
Like SCSU, BOCU-1 requires little overhead compared to traditional encodings

disadvantages

Latin letters (and all ASCII characters except control characters) change their meaning during encoding. So, for example, an XML parser should receive information about BOCU-1 encoding from a higher-level protocol.
The BOCU-1 algorithm is not formally described anywhere, except for the program code accompanying UTN # 6.

Universal Compression Algorithms

To compare the efficiency, the gzip compression algorithm was chosen. For large texts, gzip shows a large degree of compression compared to SCSU and BOCU-1 (since the number of different characters, even in multilingual documents, is limited). On small texts, like VCard in the example, it is difficult to get an unambiguous result.

A significant drawback of universal compression algorithms is the complexity of the implementation of the compressor and decompressor.
The question of two-stage compression (for example, SCSU + gzip) remains open to me.

results

The results of the SCSU and BOCU-1 algorithms for the source data in comparison with CP1251 and UTF8 are shown in the first column.

The second column represents the number of bytes issued directly by the encoder (without the Marker order of bytes, BOM, which indicates the type of Unicode representation in the file).

The third column represents the compressed gzip file.

The fourth column shows the number of bytes of data in the gzip archive (without a header, 18 bytes).

File	File without BOM	In the archive	Archived Data Length
CP1251	260	260	258	240
SCSU	267	264	266	248
BOCU-1	278	275	273	255
Utf8	329	326	299	281

All files in the archive (5 kb).

Application

SQL Server 2008 R2 uses SCSU to store nchar (n) and nvarchar (n). Symbian OS uses SCSU to serialize strings.

You can use ICU or SC UniPad
to work with SCSU and BOCU-1 .

I was very surprised that not one of the common barcode readers recognizes SCSU or BOCU-1.

What else to read

Unicode Technical Note # 14. A Survey of Unicode Compression
Unicode Technical Standard # 6. A Standard Compression Scheme for Unicode
Unicode Technical Note # 6. BOCU-1: MIME-Compatible Unicode Compression

Tags:

Unicode Data Compression

Input data

SCSU

Example

Benefits

disadvantages

BOCU-1

Example

Benefits

disadvantages

Universal Compression Algorithms

results

Application

What else to read

Also popular now: