Thoughts on Python 3

I bring to your attention a retelling of a wonderful article by Armin Ronacher, the author of Jinja2, Werkzeug and Flask and a co-author of Sphinx and Pygments. I took great pleasure in digging through the source code of his creations and learned a lot from it. Armin writes excellent frameworks, and he can explain like no one else what the transition from Python 2 to Python 3 is fraught with and why it is not so easy to carry out.





Lately I have been thinking a lot about the state of Python 3. Although it was not love at first sight, I fell in love with Python and am more than pleased with the course it is taking. Python has been with me for ten years now, and that is a big part of my life.

I warn you in advance: this is a very personal article. I counted a hundred instances of a single capital letter, "I", in this text.

This is because I am very grateful for all the opportunities I have been given over the past two years: the ability to travel the world, to talk with people, and to share the spirit of cooperation that allows freely distributed projects such as Python to encourage innovation and delight people. Python has a wonderful community, something I too often forget to say out loud.

On the other hand, although I love Python and like to discuss its ways and solutions, I am not bound to the project by any obligations, despite my devotion to it. When I attend meetings about the language, I immediately understand why my proposals are met with hostility and why I am still seen as a thorn in the side: "He constantly complains and does nothing." There is so much that I would like to see in Python, but in the end I am its user, not its developer.

When you read my comments about Python 3, given that its first version has already been released, you may get the impression that I hate it and do not want to switch to it at all. I do want to switch, just not to Python 3 in the form it has now.

Since, in my experience, people cite articles long after they were written, let me first clarify the state of Python 3 at the time of writing: version 3.2 has been released, the next version will be 3.3, and there are no plans to ever release Python 2.8. Moreover, there is a PEP that says in black and white that there will be no such release. PyPy, although it is developing splendidly, remains a project whose architecture is so far removed from everything else that no one will take it seriously for a long time. In many ways PyPy does things "I wouldn't do," and this seems surprising to me.

Why do we use Python?

Why do we use Python? It seems to me that this is a very good question, one we rarely ask ourselves. Python has a bunch of flaws, but I use it anyway. At the party on the last day of this year's PyCodeConf, I had a long discussion with Nick Coghlan. We were both tipsy, and thanks to this the discussion turned out to be very candid. We agreed that Python is not perfect as a language, that some of its flaws are still being worked on, and that on closer inspection some of them have no excuse. The PEP on "yield from" came up as an example of patching a questionable design (coroutines as generators) to give it a more or less workable shape. But even with the changes adopted in "yield from", all this is still very far from the convenience of greenlets.

This conversation was a continuation of the "Biased Opinion on Programming Languages" talk given by Gary Bernhardt on the same memorable day of the conference. We agreed that Ruby's blocks are an amazing design, but that for many reasons they would not take root in Python (in its current state).

Personally, I do not think that we use Python because it is a perfect and flawless language. Moreover, if you go back in time and look at earlier versions of Python, you will see that it was very, very ugly. It is no surprise that in its early years Python went largely unnoticed.

It seems to me that the reach Python has gained since then can be considered a great miracle. And that, it seems to me, is why we use Python: the evolution of the language was very smooth, and the ideas it embodied were the right ones. Early Python was terrible: it lacked the concept of iterators, and to iterate over a dictionary you had to create an intermediate list of all its keys. At one point exceptions were strings, and string methods were not methods but functions in the module of the same name (string). The exception-catching syntax torments us throughout Python 2, and Unicode arrived too late and, in part, never.

However, there was also much that was right. Flawed as it is, the idea of modules with their own namespaces was amazing. The multimethod-based structure of the language is still largely unparalleled. Every day we benefit from this decision, although we are hardly aware of it. The language always did its job honestly and did not hide what was happening in the interpreter (tracebacks, stack frames, opcodes, code objects, ast, etc.), which, coupled with its dynamic nature, lets the developer debug quickly and solve problems in ways that are not easily available in other languages.
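As a rough illustration of that openness (not taken from the original article), the sketch below pokes at each of the objects listed above on a current CPython 3. The function `add` and the helper `current_function_name` are made up for the example, and `sys._getframe` is a CPython implementation detail.

```python
import ast
import dis
import sys
import traceback


def add(a, b):
    return a + b


# Code object: the compiled form of the function, open to inspection.
code = add.__code__
arg_names = code.co_varnames[:code.co_argcount]

# Opcodes: the bytecode instructions the interpreter will execute.
opnames = [instruction.opname for instruction in dis.get_instructions(add)]

# AST: the parsed source as a tree of ordinary Python objects.
tree = ast.parse("x = 1 + 2")
node_types = [type(node).__name__ for node in ast.walk(tree)]


# Stack frame: running code can look at its own frame.
def current_function_name():
    return sys._getframe().f_code.co_name


# Traceback: a failed call leaves a trail that can be rendered as text.
try:
    add(1, "x")
except TypeError:
    report = traceback.format_exc()
```

None of this needs a debugger or a special build of the interpreter; it is all reachable from ordinary code.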

Indentation-based syntax was often criticized, but the number of new languages adopting this approach (HAML, CoffeeScript, and many others come to mind) shows that it has gained recognition.

Even when I disagree with the way Raymond writes something new for the standard library, the quality of his modules is beyond the slightest doubt, and this is one of the main reasons why I use Python. I cannot imagine working with Python without access to the collections module or itertools.

But the real reason I loved and idolized Python was the anticipation of each new version, like an impatient child waiting for Christmas. Small, barely noticeable improvements thrilled me. Even the ability to specify a start index for the enumerate function made me feel grateful for a new Python release. And all of this came with backward compatibility.

Imports from __future__, which we sometimes hate so much, are exactly what made upgrades easy and painless. I used to use PHP and was not at all happy about its new releases. PHP had no namespaces, but every release brought new built-in functions, and each time I could only hope to avoid name conflicts (I know I could have avoided them by using prefixes, but that was long before I learned the basics of software development).

What has changed?

How did it happen that I stopped looking forward to new Python releases? I can only speak for myself, but I have noticed that others have changed their attitude to new releases as well.

I never questioned what the core developers were doing for the next Python 2.x. Of course, some things were not so well thought out, for example the implementation of abstract classes or certain details of their semantics. But it mostly came down to criticism of high-level functionality.

With the advent of Python 3, external factors appeared that suddenly forced me to change my whole approach to working with the language. Previously, I would not use new language features for a long time, although I was glad to have them, because I mostly wrote libraries. Relying on the newest and the best would have been a mistake. Werkzeug's code is still crammed with hacks that allowed it to run on Python 2.3, although the minimum requirement has since risen to version 2.5. I kept bug fixes for the standard library in my own code, because some vendors (the notorious Apple) never update the interpreter until a critical vulnerability is found in it.

None of this is possible with Python 3. With it, everything turns into developing for either 2.x or 3.x, and no middle ground is in sight.

After Python 3 was announced, Guido always spoke with delight about 2to3 and how it would make porting easier. But it turned out that 2to3 is the worst thing that could have happened to Python.

I had enormous difficulties porting Jinja2 with 2to3, and I later regretted it very much. In the spun-off JSON Jinja project I removed all the hacks written to make 2to3 work correctly, and I will never use it again. Like many others, I now do my best to maintain code that runs on both 2.x and 3.x. Why, you may ask? Because 2to3 is very slow, it integrates poorly into the testing process, its behavior depends on the version of Python 3 in use, and on top of that it can only be customized with black magic. It is a painful process that takes away all the pleasure of writing libraries. I used to enjoy tinkering with Jinja2, but I stopped when the Python 3 port was ready, because I was afraid of breaking something in it.

Now the idea of a shared codebase runs into the fact that I have to support Python versions all the way back to 2.5.

The changes in Python 3 broke all of our code, yet they in no way justify rewriting and upgrading it immediately. In my deeply subjective opinion, Python 3.3 / 3.4 should be more like Python 2, and Python 2.8 should be closer to Python 3. As things stand, Python 3 is the XHTML of programming languages. It is incompatible with what it is trying to replace, and in return it offers practically nothing except being more "correct".

A little bit about Unicode

Obviously, the biggest change in Python 3 is Unicode handling. It may seem that imposing Unicode on each and every one is a blessing. But that is looking at the world through rose-colored glasses, because in the real world we deal not only with bytes and Unicode but also with strings in a known encoding. Worst of all, in many ways Python 3 has become a sort of Fisher-Price of programming languages. Some features were removed because the core developers felt that people could "easily cut themselves" on them. And all this came at the cost of removing widely used functionality.

Here is a specific example: codec operations in 3.x are currently limited to Unicode <-> bytes conversions. There is no bytes <-> bytes and no Unicode <-> Unicode. This looks reasonable, but on closer inspection you will see that the removed functionality is exactly what is vital.

One of the most remarkable features of the codec system in Python 2 was that it was designed to handle a huge variety of encodings and algorithms. You could use a codec to encode and decode strings, and you could also ask the codec for an object that operates on streams and other incomplete data. Moreover, the codec system worked equally well with content encodings and transfer encodings. As soon as you wrote a new codec and registered it, every part of the system learned about it automatically.

Anyone who set out to write an HTTP library in Python was happy to discover that codecs could be used not only to decode UTF-8 (a character encoding) but also, for example, gzip (a compression algorithm). This applies not only to strings but also to generators and file objects, provided of course that you know how to use them.

At the moment, none of this works in Python 3. They not only removed these functions from the string object but also removed the bytes -> bytes encodings, leaving nothing in return. If I am not mistaken, it took three years to acknowledge the problem and start a discussion about bringing this functionality back in 3.3.
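To make the difference concrete, here is a minimal sketch, assuming a current Python 3 in which `codecs.encode` and `codecs.decode` accept the bytes-to-bytes `zlib_codec`; at the time the article was written the situation was less settled. The Python 2 spelling is shown only in comments, and the variable names are made up for the example.

```python
import codecs

payload = b"Hello, World! " * 4

# Python 2 allowed this directly on the string object:
#     compressed = payload.encode("zlib")
#     restored = compressed.decode("zlib")
# Python 3's str.encode/bytes.decode only convert text <-> bytes, so a
# bytes -> bytes transform has to go through the codecs module instead.
compressed = codecs.encode(payload, "zlib_codec")
restored = codecs.decode(compressed, "zlib_codec")

# The same registry also hands out incremental objects for partial data.
# Here a UTF-8 decoder is fed one character split across two chunks.
decoder = codecs.getincrementaldecoder("utf-8")()
first = decoder.decode(b"\xc3")                  # incomplete: nothing yet
second = decoder.decode(b"\xa9", final=True)     # completes U+00E9
```

The incremental decoder is the kind of "object that operates on incomplete data" the text refers to: it buffers the dangling byte instead of failing.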

Next, Unicode was pushed into places where it does not belong at all, among them the file system layer and the URL module. On top of that, a good deal of the Unicode functionality was written from the point of view of a programmer living in the 70s.

UNIX file systems are byte-based. That is how they are built, and nothing can be done about it. Naturally, it would be great to change this, but that is practically impossible without breaking existing code, because changing the encoding is only a small part of what a Unicode-oriented file system requires. Questions of normalization, and of storing case information once normalization has been applied, also remain open. If the bytestring type had survived in Python 3, these problems could have been avoided. But it did not, and its replacement, the bytes type, does not behave the way strings do. It behaves like a data type written to punish people whose byte data also happens to be a string. It does not seem to have been designed as a tool with which programmers could solve these problems.

So if you work with the file system from Python 3, a strange feeling will stay with you, despite the new encoding scheme with surrogate escapes. It is a painful process, painful because there is no tool for sorting out this mess. Python 3 effectively tells you, "Buddy, your file system is Unicode from now on," but it does not explain from which end to start cleaning up. It does not even make clear whether the file system actually supports Unicode or whether Python is faking that support. It discloses nothing about normalization or about how file names should be compared.
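The surrogate escaping mentioned above can be sketched in a few lines. This is an illustration on a current Python 3, not code from the article; the file name is invented, and the `surrogateescape` error handler is the one defined by PEP 383.

```python
# A file name as the kernel stores it: raw bytes that are not valid UTF-8.
raw_name = b"caf\xe9.txt"          # "cafe" with an acute accent, in Latin-1

# Undecodable bytes are smuggled into the str as lone surrogates.
as_text = raw_name.decode("utf-8", "surrogateescape")

# The round trip back to bytes is lossless...
round_trip = as_text.encode("utf-8", "surrogateescape")

# ...but the str is not real text: a strict encode refuses it.
try:
    as_text.encode("utf-8")
    strictly_encodable = True
except UnicodeEncodeError:
    strictly_encodable = False
```

On POSIX systems `os.fsdecode` and `os.fsencode` apply the same handler with the file system encoding, which is why such a name looks like a string but cannot safely be printed, logged, or compared as text.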

It works in the laboratory but breaks down in the field. It so happens that my Mac has an American keyboard layout, an American locale, and almost everything else American, except that dates and numbers are formatted differently. As a result of all this (and, I assume, because I have been upgrading this Mac since the days of Tiger), I ran into the following situation: when I logged in to my remote server, the locale was set to the string value "POSIX". What kind of locale is "POSIX", you ask? Who the hell knows. So Python, just as clueless as I was, decided to work with "ANSI_X3.4_1968". On that memorable day I learned that ASCII goes by many names, and that this is one of them. And there you go: my remote Python interpreter garbled the listing of a directory with internationalized file names. How did they get there? I had dumped Wikipedia articles there under their original names. I did this with Python 3.1, which kept silent about what was happening to the files instead of raising exceptions or resorting to some hack.

But the trouble with file systems is only the beginning. Python also uses environment variables (which, as you know, are full of garbage) to choose the default encoding for files. During the conference I asked a couple of attendees to guess the default encoding for text files in Python 3. More than 90% of this small sample were sure it was UTF-8. But no! It is set depending on the platform's locale. As I said, greetings from the 70s.
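A small sketch of the point, assuming a current Python 3 (newer releases and the UTF-8 mode may change the default): `locale.getpreferredencoding(False)` reports what `open()` falls back to when no encoding is given, and passing `encoding=` explicitly takes the environment out of the picture. The file name and sample text are made up.

```python
import locale
import os
import tempfile

# What open() uses when no encoding is given: a property of the
# environment (LANG, LC_ALL, ...), not of the language.
default_encoding = locale.getpreferredencoding(False)

text = "na\u00efve caf\u00e9"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "note.txt")

    # Naming the encoding on both sides makes the result locale-independent.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    with open(path, "r", encoding="utf-8") as handle:
        read_back = handle.read()

    # The bytes on disk are UTF-8 no matter how the process was started.
    with open(path, "rb") as handle:
        raw = handle.read()
```

Printing `default_encoding` after logging in through different routes is a quick way to reproduce the kind of surprise described here.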

For fun, I logged in to both of the servers I control and found that one of them used latin1 when I logged in through the console, switched to latin15 when I logged in over ssh as root, and used UTF-8 when I logged in with my own user account. Damn entertaining, but I have only myself to blame. I have no doubt that I am not the only one whose server magically switches encodings, given that SSH forwards the locale settings on login by default.

Why am I writing about this here? Because I have to prove again and again that Unicode support in Python 3 gives me much more trouble than it did in Python 2.

In Python 2, Unicode encoding and decoding does not get in the way of anyone who follows the Zen of Python in that "explicit is better than implicit." "Bytes come in, Unicode goes out": this is how the parts of an application that talk to other services work. This can be explained, and explained well, by documenting it. You emphasize that there are reasons to process text internally as Unicode. You tell the user that the world out there is harsh and based on bytes, so you have to encode and decode to communicate with it. The concept may be new to the user. But you only have to find the right words and lay everything out clearly, and there will be one headache less.
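The "decode at the edge" rule can be sketched as two tiny boundary functions. This is an illustration only: the names `read_message` and `write_message` are hypothetical and do not come from any of the libraries mentioned in the article.

```python
def read_message(raw, charset="utf-8"):
    """Boundary in: turn bytes from the outside world into text."""
    return raw.decode(charset)


def write_message(text, charset="utf-8"):
    """Boundary out: turn text back into bytes for the outside world."""
    return text.encode(charset)


incoming = b"caf\xc3\xa9 au lait"       # bytes as they arrive on the wire
message = read_message(incoming)        # explicit decode, exactly once
shout = message.upper()                 # all processing happens on text
outgoing = write_message(shout)         # explicit encode, exactly once
```

Everything between the two calls deals only with Unicode, and the single place where a charset matters is visible in the code.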

Why do I say this with such confidence? Because since 2006 all my software has been pushing Unicode on its users, and the number of questions about Unicode does not compare with the flood of questions about packaging or the import system. Even with distutils2, packaging remains a much bigger problem in the Python world than Unicode.

Hiding Unicode away from the Python 3 user looks like a natural development. But it has turned out to make it harder for people to picture how it all works. Do we really need implicit behavior by default? I am not so sure.

Sure, Python 3 is on the right track now. I have noticed more and more talk about bringing back some of the bytes APIs. My naive idea was a third string type in Python 3, called estr or something like that. It would work exactly like str in Python 2: store bytes and offer the same set of string APIs. However, it would also carry encoding information, which would be used to transparently decode it to a Unicode string or convert it to a bytes object. Such a type would be a holy grail that could make porting much easier.
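Purely as a thought experiment, such a type might look roughly like the sketch below. Everything here is hypothetical: the class name `EStr`, its methods, and the choice to subclass `bytes` are assumptions made for illustration, and only one string API (`upper`) is shown.

```python
class EStr(bytes):
    """Hypothetical 'encoded string': bytes that remember their encoding."""

    def __new__(cls, data, encoding="utf-8"):
        if isinstance(data, str):
            data = data.encode(encoding)
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def to_text(self):
        """Decode to a real Unicode string using the stored encoding."""
        return self.decode(self.encoding)

    def to_bytes(self):
        """Drop the encoding information and return plain bytes."""
        return bytes(self)

    def upper(self):
        # String APIs operate on the decoded text and keep the encoding.
        return EStr(self.to_text().upper(), self.encoding)


name = EStr("caf\u00e9", "latin-1")
shouted = name.upper()
```

A real design would have to answer much harder questions, such as what happens when two values with different encodings are concatenated.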

But it does not exist, and the Python interpreter was not designed with room for yet another string type.

We destroyed their world

Nick talked about how the Python core developers destroyed the world of web programmers. So far, the destruction reaches as far as Python's backward incompatibility does. But our world was destroyed no more than the world of other developers. After all, we share one world. The web is built on encoded bytes, though this mostly concerns the low-level protocols. Communication with most of what sits at the lower levels happens in the language of bytes with encodings.

However, the main change was in the way of thinking required to work at these levels. Python 2 very often used Unicode objects to communicate with the lower levels. Objects were encoded to bytes and back as necessary. A pleasant side effect for us was, for example, the ability to speed up some operations by encoding and decoding data early and handing it to a channel that understands Unicode. In many ways this is what lets the serialization modules in the core work. For example, pickle talks to streams that support both bytes and Unicode. To some extent the same can be said of simplejson. And then Python 3 comes along, in which you suddenly have to keep Unicode and byte streams apart. Many APIs will not survive the move to Python 3 without major changes to their interfaces.

Yes, this is the more correct approach, but in reality it has no advantage other than being more correct.

Working with the I/O functionality in Python 3, I found that it is great. But in practice it does not compare with how things worked in Python 2. It may seem that I am heavily biased, because I have worked so much with Python 2 and so little with Python 3, but writing more code to achieve the same functionality is generally considered bad form. And with Python 3 I have to do exactly that, across the board.

But porting works!

Of course porting to Python 3 works. That has been proven more than once. But just because something is possible and passes all the tests does not mean it is well done. I am a flawed person and make plenty of mistakes. At the same time, I take pride in polishing my favorite APIs until they shine. Sometimes I catch myself rewriting a piece of code over and over to make it more user-friendly. With Flask I spent so much time honing the core functionality that one could start calling it an obsession.

I want it to work perfectly. When I use an API for a common task, I want it to show the same level of excellence as a Porsche design. Yes, it is only the outer layer facing the developer, but the product should be well engineered from start to finish.

I can make my code "work" on Python 3 and still hate it. I want it to work. But I also want to get the same pleasure from using my own and other people's libraries on Python 3 that I get on Python 2.

Jinja2, for example, uses the I/O layer incorrectly on Python 3, because the same code cannot be used on 2.x and 3.x without switching implementations at runtime. Templates are now opened in binary mode on both 2.x and 3.x, because this is the only reliable approach, and Jinja2 then decodes the data from that binary stream itself. This does work, thanks to the normalization of newline separators. But I am more than sure that everyone who works on Windows and does not normalize line separators themselves will sooner or later end up with a mix of different separators without ever noticing.
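The approach described above can be sketched as follows. This is not Jinja2's actual loader: the function name `load_template_source`, the default encoding, and the sample file are assumptions made for the illustration.

```python
import os
import tempfile


def load_template_source(path, encoding="utf-8"):
    """Read a template the same way on 2.x and 3.x: bytes first, then text."""
    with open(path, "rb") as handle:       # binary mode: no implicit decoding
        raw = handle.read()
    text = raw.decode(encoding)
    # Binary mode also skips newline translation, so it is done by hand.
    return text.replace("\r\n", "\n").replace("\r", "\n")


with tempfile.TemporaryDirectory() as folder:
    template_path = os.path.join(folder, "hello.html")
    with open(template_path, "wb") as handle:
        handle.write(b"Hello {{ name }}\r\nBye\r")
    source = load_template_source(template_path)
```

Without the last line of the function, a template saved on Windows would carry its `\r\n` separators straight into the rendered output.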

Embracing Python 3

Python 3 has changed a lot, that is a fact. No doubt it is the future we are heading toward. There is much that is promising in Python 3: a significantly improved import system, the introduction of __qualname__, a new way of distributing Python packages, and a unified in-memory representation of strings.

But for now, porting a library to Python 3 looks like developing the library for Python 2 and then producing (pardon my French) a half-assed version for Python 3 just to prove that it works. Jinja2 on Python 3 is, by any measure, damn ugly. It is terrible, and I should be ashamed of it. For example, in the Python 3 version Jinja2 loaded two one-megabyte regular expressions into memory, and I did not care in the slightest about freeing them. I just wanted it to work somehow.

So why did I have to use megabyte-sized regular expressions in Jinja2? Because the regex engine in Python does not support Unicode categories. Given this restriction, I had to choose the lesser of two evils: either ignore the new Unicode identifiers in Python 3 and restrict myself to ASCII identifiers, or build a huge regular expression by hand that spells out all the necessary definitions.
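The second option can be sketched like this. It is a simplified illustration, not Jinja2's code: the helper `category_class` is hypothetical, only the letter categories are used (real Python 3 identifiers follow the XID_Start and XID_Continue properties), and the scan is cut off at U+3000 to keep the example fast. Covering all of Unicode is what makes the resulting pattern so large.

```python
import re
import sys
import unicodedata


def category_class(categories, limit=sys.maxunicode + 1):
    """Build a regex character class for the given Unicode categories."""
    ranges = []
    for codepoint in range(limit):
        if unicodedata.category(chr(codepoint)) not in categories:
            continue
        if ranges and ranges[-1][1] == codepoint - 1:
            ranges[-1][1] = codepoint            # extend the current run
        else:
            ranges.append([codepoint, codepoint])
    parts = []
    for low, high in ranges:
        if low == high:
            parts.append(re.escape(chr(low)))
        else:
            parts.append(re.escape(chr(low)) + "-" + re.escape(chr(high)))
    return "[" + "".join(parts) + "]"


LETTERS = {"Lu", "Ll", "Lt", "Lm", "Lo"}
letter_class = category_class(LETTERS, limit=0x3000)
identifier_re = re.compile("(?:_|%s)(?:_|[0-9]|%s)*" % (letter_class, letter_class))
```

With an engine that understands Unicode categories, the whole character class would collapse into a short escape such as `\p{L}`; the standard `re` module offers no such escape, hence the generated monster.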

The above is the best example of why I am not yet ready for Python 3: it does not provide the tools for working with its own innovations. Python 3 badly needs Unicode-aware regular expressions, and it needs a locale API that takes Unicode into account. It needs a more advanced path module that exposes the behavior of the underlying file system. It should more firmly enforce a single standard encoding for text files, independent of the runtime environment. It should provide more tools for working with encoded strings. It needs IRI support, not just URLs. It needs something more than yield from. It should have helpers for the transcoding that is necessary when mapping URLs to the file system.

To all of the above you can add the release of a Python 2.8 that would be a little closer to Python 3. For me, there is only one realistic way to move to Python 3: libraries and programs have to become fully Unicode-aware and integrated into the new Python 3 ecosystem.

Do not let amateurs pave your way

The biggest mistake made by Python 3 is its binary incompatibility with Python 2. By this I mean that a Python 2 and a Python 3 interpreter cannot coexist in the same process space. As a result, you cannot run Gimp with the scripting interfaces of both Python 2 and Python 3 at the same time. The same applies to vim and Blender. It simply cannot be done. It is not hard to write a pile of hacks with separate processes and fancy IPC, but nobody needs that.

Thus, the programmers who have to learn Python 3 before anyone else will do so under duress, and there is no guarantee that they are familiar with Python at all. And the reason, in all honesty, is that the money is in Python 2. Even if we pour all our energy into Python 3 at night, during the day we will still go back to Python 2. That is how it will be for the time being. However, if a bunch of graphic designers start writing Blender scripts for Python 3, there you have the adoption that is needed.

I really do not want to see the CheeseShop plagued by an abundance of botched Python 3 ports. I really do not want to see another Jinja2 and, above all, another ugly pile of code designed to run on both 2.x and 3.x: hacks like sys.exc_info()[1] to get around syntactic differences, hacks for converting literals at runtime to stay compatible with 2.x and 3.x, and much, much more. All this is bad not only for runtime performance but also for Python's core tenets: beautiful, readable code without hacks.
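For readers who have not seen these hacks, here is a minimal sketch of both kinds. The functions `parse_port` and `u` are hypothetical examples written for this illustration; compatibility libraries ship helpers in a similar spirit, but this is not their code.

```python
import sys


def parse_port(value):
    """Convert a port string to int, returning (port, error_message)."""
    try:
        return int(value), None
    except ValueError:
        # `except ValueError, exc` is a syntax error on 3.x, and
        # `except ValueError as exc` is a syntax error on 2.5, so code that
        # must parse on both fetches the active exception by hand.
        exc = sys.exc_info()[1]
        return None, str(exc)


def u(literal):
    """Hypothetical helper: produce a text literal on both 2.x and 3.x."""
    if sys.version_info[0] >= 3:
        return literal                           # str is already Unicode
    return literal.decode("unicode-escape")      # 2.x: convert at runtime


port, error = parse_port("8080")
bad_port, bad_error = parse_port("http")
greeting = u("hello")
```

Both tricks work, but every literal now costs a function call and every handler carries an extra line whose only purpose is to satisfy two parsers.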

Recognize Failures, Learn and Adapt

Now is the time for us to get together and discuss everything people do to make their code work on both 2.x and 3.x. Technology evolves at a fast pace, and I would be very sorry to see Python fall apart just because someone overlooked the dark clouds on the horizon.

Python is not "too big to be forgotten." It can lose its popularity very quickly. Pascal and Delphi were pushed into a narrow niche even though they remained amazing languages after the arrival of C# and the .NET framework. More than anything, their decline was brought about by their management. People still develop in Pascal, but how many start new projects in it? Delphi does not run on the iPhone or on Android, and it is not well integrated into the UNIX market. And to be honest, Python is already losing ground in some areas. Python used to be quite popular in computer games, but that train left long ago. In the web community new competitors spring up like mushrooms after the rain, and whether we like it or not, JavaScript is more and more often taking Python's place as a scripting language.

Delphi could not adapt in time, and people simply switched to other technologies. If 2to3 is our transition path to Python 3, then py2js is our transition path to JavaScript.

And here is what I suggest: could we make a list of everything that complicates the transition to Python 3, along with a list of ways to solve those problems? Could we reopen the discussion of a Python 2.8 if it would help with porting? Could we recognize PyPy as a valid Python implementation, powerful enough to influence how we write code?

Armin Ronacher,
December 7, 2011.

From the translator: After reading this article, my first wish was to share it with others; I had an acute feeling that "the world should know." My colleague Irina Pirogova and my wife Aila Mehdiyeva helped me retell the article, for which many thanks to them!
