PHP 6 will not, not mastered

It’s funny, but I didn’t find mention of it on the hub even in the comments. It is time to eliminate this drawback, because many use only Habr as a source of information.

So PHP 6 will not be, at all. On March 11, 2010, the development team decided to cancel the release of PHP 6 in its current form. As a result, the trunk with PHP 6 was transferred to the brunch, and a new version was formed in the trunk - 5.4, into which the developers transferred all the achievements from PHP 6, except for Unicode.

Below is a short summary of the presentation (pdf) made by Andrei Zmievski at the PHP Community Conference in 2011.

But first, let's look at how Unicode is supported now.

In the source code:
- You can use non-Latin characters in the names of functions and variables that fall under the regularity [a-zA-Z_ \ x7f- \ xff] [a-zA-Z0-9_ \ x7f- \ xff] * . The following code is absolutely valid:function привет(){}
- Documents with UTF-8 BOM are normally processed, but only if there are no empty lines, otherwise headers will be sent. To avoid this, it is recommended to keep the HTML code in the templates, and do not use the closing of the PHP tag in the PHP files themselves ?>
When processing strings:
- Standard string functions normally digest at least UTF-8, but they don’t know how to work with characters, so things like determining the position of a character or the size of a string will not work correctly.
- The standard xml extension can distill text from UTF-8 to ISO-8859-1 and vice versa, but in essence it is useless functionality.
- Extension mbstring (Multibyte String), which supports most encodings , including Unicode. True, it is usually not enabled by default.

General meaning of the presentation

What I wanted to have

So what is the problem, you ask? Have mbstring , use and rejoice.
But the fact is that the developers decided to support Unicode not at the library level, but at the kernel level. This means that any php function, as well as string operators, can potentially accept Unicode and more or less guarantee that Unicode characters will not be distorted, deleted and no errors will occur. In addition, mbstring itself does not implement all standard string functions.

Take a look at the following examples:

and

these are the most obvious examples of what is meant by kernel level support.

Choice of coding

UTF-8 - Advantages: backward compatibility in ASCII, the prevailing encoding on the web, there are many libraries that support it.
UTF-16 - Advantages: internal encoding of the ICU library, which was chosen for Unicode and is used by major players on the web.
UTF-32 - Advantages: direct indexing.

As a result, the choice fell on UTF-16 due to the ICU library. It should be noted that, looking back, the developers said that if they were given a choice again, they would choose UTF-8, since the backward compatibility mode would reduce the amount of work.

Something went wrong

Firstly: there are few specialists who understand the intricacies of Unicode and the use of ICU.
Secondly: it is a technically difficult task to embed Unicode everywhere.
Thirdly: people are simply tired of embedding Unicode support in already perfectly working code.

Result

Since in the culture of a PHP developer it is customary to focus on interesting work, and Unicode support is not an interesting work, the project ran out of steam.

Future

The mbstring extension is inferior. You can write a separate class for working with Unicode, but it will also not be fully functional. Therefore, exactly how you need to support Unicode in PHP is still an open question.

Lessons learned

Rewriting a lot of working code is difficult.
Making people do tedious things is hard.
It is difficult to expect results from the above activities, which lasts a long time.
Be committed.

Tags: