Dreadatour November 4, 2014 at 22:27

Data Types Strike Back

Translation

This is the second part of my thoughts on “Python, how I would like to see it”, and in it we will take a closer look at the type system. To do this, we will again have to delve into the features of the Python language and its interpreter CPython.

If you are a Python programmer, the data types for you have always remained behind the scenes. Somewhere there they exist on their own and somehow interact with each other, but most often you think about their existence only when an error occurs. And then the exception tells you that one of the data types does not behave as you expected from it.

Python has always been proud of its type system implementation. I remember how many years ago I read the documentation, in which there was a whole section about the benefits of duck typing. Let's be honest: yes, for practical purposes, duck typing is a good solution. If nothing limits you and you do not have to fight a type system, because there is none, you can create very beautiful APIs. Python makes everyday tasks especially easy to solve.

Almost none of the APIs that I implemented in Python would work in other programming languages. Even such a simple thing as a command line interface (the click library) simply does not work in other languages, and the main reason is that you have to constantly struggle with data types.

Not so long ago, the question of adding static typing to Python was raised, and I sincerely hope that the ice has finally broken. I will try to explain why I am against explicit typing, and why I hope that Python will never go this route.

What is a "type system"?

A type system is a set of rules according to which types interact with each other. There is a whole section of computer science devoted exclusively to data types, which in itself is impressive, but even if you are not interested in theory, it will be difficult for you to ignore the type system.

I will not go too deep into the type system for two reasons. Firstly, I myself do not fully understand this area, and secondly, in fact, it is not at all necessary to understand everything in order to “feel” the relationships between data types. It is important for me to take into account their behavior because it affects the architecture of the interfaces, and I will talk about typing not as a theorist, but as a practitioner (using the example of building a beautiful API).

Type systems can have many characteristics, but the most important difference between them is the amount of information that a data type provides about itself when you try to work with it.

Take, for example, Python. There are types in it. Here is the number 42, and if you ask this number what type it is, it will answer that it is an integer. This is comprehensive information, and it allows the interpreter to define a set of rules according to which integers can interact with each other.
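A minimal sketch of this idea, assuming Python 3 (the helper names describe and try_add are made up for illustration): the type a value reports for itself is what decides which interactions are allowed.

```python
def describe(value):
    """Return the name of the type that a value reports for itself."""
    return type(value).__name__


def try_add(left, right):
    """Add two values and report the result type, or the error raised by the type rules."""
    try:
        return type(left + right).__name__
    except TypeError:
        # The types involved define no rule for adding these two values.
        return "TypeError"


print(describe(42))       # the integer knows that it is an int
print(try_add(42, 1))     # int + int
print(try_add(42, 1.5))   # int + float
print(try_add(42, "1"))   # int + str has no rule
```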

However, there is one thing that is missing in Python: composite data types. All data types in Python are primitive, which means that there is only ever one type at play at a time, rather than a type composed of other types.

The simplest composite data type that most programming languages have is the structure. Python has no structures as such, but in many cases libraries need to define their own, for example, ORM models in Django and SQLAlchemy. Each column in the database is represented by a Python descriptor, which corresponds to a field in the structure, and when you say that the primary key is called id and that it is an IntegerField(), you define the model as a composite data type.

Composite types are not limited to structures. When you need to work with more than one number, you use collections (arrays). Python has lists for this, and each list item can have a completely arbitrary data type, as opposed to lists in other programming languages, which have a fixed item type (for example, a list of integers).

The phrase “integer list” always makes more sense than just a list. You can argue with this, because you can always go through the list and see the type of each element, however, what to do with an empty list? When you have an empty list in Python, you cannot determine the type of its data.
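The point about empty lists can be sketched as follows, assuming Python 3. The helper element_types is hypothetical and simply inspects each item, which is the only way to learn the element types of a Python list.

```python
def element_types(items):
    """Collect the type names of all elements by looking at each one."""
    return {type(item).__name__ for item in items}


print(element_types([1, 2, 3]))    # a homogeneous list
print(element_types([1, "two"]))   # a mixed list is just as valid
print(element_types([]))           # nothing to inspect, so nothing is known
```

For the empty list the result is an empty set: the list itself carries no record of what it was meant to hold.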

The same problem occurs when using the value None. Suppose you have a function that accepts a User argument. If you pass None to it, you will never know that it was supposed to be an object of type "User".
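A small sketch of the None problem, assuming Python 3. The User class and the greet functions are hypothetical and exist only for this illustration.

```python
class User:
    """Hypothetical user type, used only for this illustration."""

    def __init__(self, username):
        self.username = username


def greet(user):
    """Expects a User, but nothing in the language records that expectation."""
    return "Hello, " + user.username


def greet_or_error(user):
    """Call greet and turn the failure into a readable message."""
    try:
        return greet(user)
    except AttributeError as exc:
        return "error: " + str(exc)


print(greet_or_error(User("alice")))
print(greet_or_error(None))
```

The error raised for None talks about NoneType and the missing attribute; it has no way of mentioning that a User was expected.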

What is the solution to this problem? Do not have null pointers, and have arrays with explicitly specified element types. Everyone knows that Haskell works this way, but there are other languages that are less hostile to developers. For example, Rust is a programming language that is closer and more understandable to us, since it is very similar to C++. And Rust has a very powerful type system.

How can one pass the value “user not set” if there are no null pointers? In Rust, for example, there are optional types for this. The expression Option<User> is a tagged enum that wraps a value (a specific user in this case), and it means that either some user (Some) or None can be passed. Since a variable can now either have a value or not, all the code working with this variable must correctly handle the None case, otherwise it simply will not compile.

Gray future

Previously, there was a clear separation between interpreted languages with dynamic typing and compiled languages with static typing. New trends are changing the current rules of the game.

The first sign that we are stepping into uncharted territory was the emergence of C#. It is a compiled language with static typing, and at first it was very similar to Java. As C# evolved, new features began to appear in its type system. The most important event was the introduction of generics, which made it possible to strongly type collections that are not provided by the compiler (lists and dictionaries). Then the creators of the language went further and introduced the ability to give up static typing of variables for entire blocks of code. This is very convenient, especially when working with data provided by web services (JSON, XML, etc.), because it allows you to perform potentially unsafe operations, catch exceptions from the type system and inform users about incorrect data.

Nowadays, the C# type system is very powerful and supports generics with covariance and contravariance specifications. It also supports nullable types. For example, to provide default values for objects represented as null, the null-coalescing operator (??) was added. Although C# has already gone too far to get rid of null, at least the weak spots are kept under control.

Other compiled languages with static typing are also trying new approaches. C++ has always been a statically typed language, yet its developers have begun experimenting with type inference at many levels. The days of spelling out iterator types such as MyType::const_iterator are a thing of the past; now in almost all cases you can use auto, and the compiler will substitute the desired data type for you.

In the programming language Rust, type inference is also implemented very well, and this allows you to write programs with static typing, without specifying the types of variables at all:

use std::collections::HashMap;
fn main() {
    let mut m = HashMap::new();
    m.insert("foo", vec!["some", "tags", "here"]);
    m.insert("bar", vec!["more", "here"]);
    for (key, values) in m.iter() {
        // `join` is the current name of the slice method formerly called `connect`.
        println!("{} = {}", key, values.join("; "));
    }
}

I believe that in the future we will see the emergence of powerful type systems. But in my opinion this will not lead to the end of dynamic typing; rather, these systems will develop along the path of static typing with local type inference.

Python and explicit typing

Some time ago, at a conference, someone convincingly argued that static typing is great, and Python really needs it. I don’t remember exactly how this discussion ended, but the result was the mypy project, which, in combination with annotation syntax, was proposed as the gold standard for typing in Python 3.

In case you have not seen this recommendation, it offers the following solution:

from typing import List
def print_all_usernames(users: List[User]) -> None:
    for user in users:
        print(user.username)

I sincerely believe that this is not the best solution. There are many reasons, but the main problem is that the type system in Python, unfortunately, is not so good anymore. In fact, the language has different semantics depending on how you look at it.

For static typing to make sense, the type system must be implemented well. If you have two types, you should always know how these types need to interact with each other. In Python, this is not the case.

Python Type Semantics

If you read the previous article on the slot system, you should remember that types in Python behave differently depending on the level at which they are implemented (C or Python). This is a very specific feature of the language that you will not see anywhere else. At the same time, at an early stage of development, many programming languages implement fundamental data types at the interpreter level.

Python simply does not have “fundamental” types, however there is a whole group of data types implemented in C. And these are not only primitives and fundamental types, it can be anything, without any logic. For example, the collections.OrderedDict class is written in Python, and the collections.defaultdict class from the same module is written in C.

This causes a lot of problems for the PyPy interpreter, which needs to emulate the original types as closely as possible. This is necessary in order to get a good API in which any differences from CPython will not be noticeable. It is very important to understand what the main difference is between the interpreter level written in C and the rest of the language.

Another example is the re module in versions of Python prior to 2.7. In later versions, it was completely rewritten, but the main problem is still relevant: the interpreter does not behave the way the language does.

The re module has a compile function for compiling a regular expression into a pattern. This function takes a string and returns a pattern object. It looks something like this:

>>> re.compile('foobar')
<_sre.SRE_Pattern object at 0x1089926b8>

We see that the pattern object is defined in the _sre module, which is an internal module, and nevertheless it is available to us:

>>> type(re.compile('foobar'))
<type '_sre.SRE_Pattern'>

Unfortunately, this is not so, because the _sre module does not actually contain this object:

>>> import _sre
>>> _sre.SRE_Pattern
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'SRE_Pattern'

Well, this is not the first and not the only time a type lies to us about its location, and in any case it is an internal type. Let's move on. We know the type of the pattern (_sre.SRE_Pattern), and it is a subclass of object:

>>> isinstance(re.compile(''), object)
True

We also know that all objects implement some of the most common methods. For example, instances of such classes have the __repr__ method:

>>> re.compile('').__repr__()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: __repr__

What is going on? The answer is quite unexpected. For reasons unknown to me, in Python prior to version 2.7, the SRE pattern object had its own tp_getattr slot. This slot implemented custom attribute lookup logic, which provided access only to the object's own attributes and methods. If you examine this object using the dir() function, you will notice that many things are simply missing:

>>> dir(re.compile(''))
['__copy__', '__deepcopy__', 'findall', 'finditer', 'match',
 'scanner', 'search', 'split', 'sub', 'subn']

This small study of the behavior of a pattern object leads us to rather unexpected results. This is what really happens.

The data type declares that it inherits from object. This is true in CPython, but in Python itself it is not. At the Python level, this type does not conform to the interface of object. Every call that goes through the interpreter will work, unlike calls that go through the Python language. So, for example, type(x) will work, but x.__class__ will not.
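The old pattern object cannot be reproduced on a current interpreter, but the split between the two lookup paths can be imitated. The following is only a sketch, assuming Python 3: the Pattern class is a hypothetical stand-in that overrides __getattribute__ to play the role of the custom attribute lookup, and probe is an illustrative helper. It is not how the re module was implemented.

```python
class Pattern:
    """Hypothetical stand-in for an object with its own attribute lookup."""

    _visible = ("match", "search")

    def __getattribute__(self, name):
        # Only whitelisted names are reachable through `obj.name`.
        if name in Pattern._visible:
            return object.__getattribute__(self, name)
        raise AttributeError(name)

    def match(self, text):
        return text.startswith("foo")

    def search(self, text):
        return "foo" in text


def probe(obj):
    """Compare interpreter-level operations with Python-level attribute access."""
    results = {
        "isinstance": isinstance(obj, object),    # interpreter-level check
        "type": type(obj).__name__,               # interpreter-level lookup
        "repr_builtin": repr(obj).startswith("<"),  # uses the type slot
    }
    for key, name in (("repr_attr", "__repr__"), ("class_attr", "__class__")):
        try:
            getattr(obj, name)                    # goes through __getattribute__
            results[key] = True
        except AttributeError:
            results[key] = False
    return results


print(probe(Pattern()))
```

In this sketch isinstance(), type() and repr() succeed because they bypass the instance's attribute lookup, while obj.__repr__ and obj.__class__ fail because they go through it.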

What is a subclass

The above example shows us that in Python there may be a class that inherits from another class while its behavior does not correspond to the base class. This is an important issue if we are talking about static typing. For example, in Python 3 you cannot implement the interface of the dict type unless you write the type in C. The reason for this limitation is that the type makes guarantees about the behavior of its view objects that simply cannot be implemented otherwise. It's impossible.

Therefore, when you apply a type annotation and declare that a function accepts a dictionary with string keys and integer values as an argument, it is impossible to tell from the annotation whether the function accepts only a dictionary, any object with dictionary behavior, or also a subclass of dictionary.
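The ambiguity can be sketched like this, assuming Python 3 and the standard typing and collections modules. The function total and the class ConstantDict are hypothetical names chosen for the illustration.

```python
from collections import UserDict
from collections.abc import Mapping
from typing import Dict


def total(scores: Dict[str, int]) -> int:
    """Sum the values of something annotated as a dictionary."""
    return sum(scores.values())


class ConstantDict(dict):
    """Hypothetical dict subclass whose item access ignores the stored data."""

    def __getitem__(self, key):
        return -1


def classify(obj):
    """Report whether obj is a dict and whether it is a Mapping."""
    return isinstance(obj, dict), isinstance(obj, Mapping)


plain = {"a": 1, "b": 2}
subclass = ConstantDict(a=1, b=2)
lookalike = UserDict(a=1, b=2)   # behaves like a dict, but is not a dict subclass

for candidate in (plain, subclass, lookalike):
    print(classify(candidate), total(candidate))

# The subclass disagrees with itself: item access and values() see different data.
print(subclass["a"], list(subclass.values()))
```

At run time all three objects are accepted by total, although only two of them are dict instances, and the subclass answers item access differently from what its values() view reports.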

Undefined behavior

The strange behavior of the regex pattern object was changed in Python 2.7, but the problem remained. As was shown by the example of dictionaries, the language behaves differently, depending on how the code is written, and it is simply impossible to fully understand the exact semantics of the type system.

Very strange behavior of the internals of the Python 2 interpreter can be seen when comparing instances of types. In Python 3, the interfaces were changed and this behavior is no longer relevant, but the fundamental problem can still be found at many levels.

Let's take the ordering of sets as an example. Python sets are a very useful data type, but they behave very strangely when compared. In Python 2, we have the cmp() function, which takes two objects as arguments and returns a numeric value indicating which of the arguments is greater.

Here's what happens if you try to compare two instances of the set object:

>>> cmp(set(), set())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: cannot compare sets using cmp()

Why is that? To be honest, I have no idea. Perhaps the reason is that the comparison operators work in a special way for sets, which does not carry over to cmp(). At the same time, instances of frozenset compare just fine:

>>> cmp(frozenset(), frozenset())
0

Except when one of these sets is not empty; then we get an exception again. Why? The answer is simple: this is an optimization in the CPython interpreter, not Python behavior. An empty frozenset always has the same value (it is an immutable type and we cannot add elements to it), therefore it is always the same object. When two objects have the same address in memory, the cmp() function immediately returns 0. I could not immediately figure out why this happens, since the code of the comparison function in Python 2 is too complicated and confusing, but the function has several code paths that can lead to this result.
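The cmp() function no longer exists in Python 3, but the reason sets do not fit a three-way comparison can still be shown there. A minimal sketch, assuming Python 3 (the helper compare is made up for illustration): the comparison operators on sets perform subset tests, so two sets can be neither smaller, nor equal, nor greater.

```python
def compare(a, b):
    """Return which of the relations <, == and > hold between two values."""
    return {"lt": a < b, "eq": a == b, "gt": a > b}


print(compare({1}, {1, 2}))   # {1} is a proper subset of {1, 2}
print(compare({1}, {1}))      # equal sets
print(compare({1}, {2}))      # unrelated sets: none of the three relations holds
```

A function that must answer with "less", "equal" or "greater" has nothing sensible to return for the last pair.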

The point is not only that it is a bug. The point is that in Python there is no clear understanding of the principles of the interaction of types with each other. Instead, there was always one answer to all the behaviors of the type system in Python: “CPython works like that.”

It is hard to overestimate the amount of work that PyPy has done to reconstruct CPython's behavior. Given that PyPy is written in Python, an interesting problem emerges. If the Python programming language were defined the way the Python-level part of the language is implemented today, PyPy would have far fewer problems.

Instance Level Behavior

Now let's imagine that, hypothetically, we have a version of Python in which all the problems described are fixed. And even in this case, we cannot add static types to the language. The reason is that at the Python level, types do not play a significant role, much more important is how the objects interact with each other.

For example, datetime objects can generally be compared with other objects. But if you want to compare two datetime objects with each other, this can only be done if their time zones are compatible. Also, the result of many operations can be unpredictable until you carefully examine the objects involved in them. The result of concatenating two strings in Python 2 can be either a unicode string or a byte string. Different encoding or decoding APIs from the codec system may return different objects.
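The datetime case can be sketched as follows, assuming Python 3 and the standard datetime module. The helper can_order and the sample values are illustrative.

```python
from datetime import datetime, timezone


def can_order(a, b):
    """Return True if `a < b` can be evaluated, False if it raises TypeError."""
    try:
        a < b
        return True
    except TypeError:
        return False


naive = datetime(2014, 11, 4, 22, 27)                       # no time zone attached
aware = datetime(2014, 11, 4, 22, 27, tzinfo=timezone.utc)  # time zone attached

print(type(naive) is type(aware))   # both are plain datetime objects
print(can_order(naive, naive))
print(can_order(aware, aware))
print(can_order(naive, aware))      # same type, yet the comparison is rejected
```

Both values have exactly the same type, so whether they can be ordered depends on the state of the instances and not on anything a type annotation could express.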

Python, as a language, is too dynamic for type annotations to work well. Just imagine what an important role generators play in a language, and yet they can perform many types of conversion operations in each iteration.
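As a small illustration of how dynamic a generator can be, here is a sketch assuming Python 3. The generator parse_tokens is hypothetical; it converts each input token to whatever type fits, so the type of the yielded value changes from one iteration to the next.

```python
def parse_tokens(tokens):
    """Yield each token converted to the most specific type that fits it."""
    for token in tokens:
        if token.isdigit():
            yield int(token)
        elif token in ("true", "false"):
            yield token == "true"
        else:
            yield token


for value in parse_tokens(["42", "true", "hello"]):
    print(type(value).__name__, value)
```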

The introduction of type annotations will produce, at best, an ambiguous effect. However, it is more likely that this will adversely affect the architecture of the API. At a minimum, if these annotations are not cut out before the programs are launched, they will slow down code execution. Type annotations will never allow efficient static compilation without turning Python into something that Python is not.
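What annotations amount to at run time can be sketched like this, assuming Python 3. The function double is a hypothetical example: its annotations are kept as ordinary data on the function object, and nothing checks them when the function is called.

```python
def double(x: int) -> int:
    """Annotated as int -> int, but the annotation is not enforced."""
    return x * 2


# The annotations are stored on the function object as a plain mapping.
print(sorted(double.__annotations__))

# Calls are not checked against the annotations.
print(double(21))
print(double("ab"))
```

A separate tool such as mypy is what reads the annotations and reports the second call as a mismatch; the interpreter itself runs it without complaint.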

Baggage and semantics

I think that my personal negative attitude towards Python was due to the absurd complexity that this language reached. There are simply no specifications in it, and today the interaction between types has become so confusing that we may never be able to figure it all out. There are so many crutches and all these small behavioral features in it that the only language specification possible today is a detailed description of the CPython interpreter.

In my opinion, taking into account all of the above, the introduction of type annotation makes almost no sense.

If anyone in the future wants to develop a new programming language with predominantly dynamic typing, they should spend additional time on a clear description of how the type system should work. JavaScript does this quite well: all the semantics of the built-in types are described in detail, even in cases where they do not make sense, and in my opinion this is good practice. If you have clearly defined how the semantics of the language work, it will be easy for you in the future to optimize the speed of the interpreter or even add optional static typing.

Maintaining a well-balanced and well-documented language architecture avoids many problems. Architects of future programming languages should definitely avoid all the mistakes made by the developers of the languages PHP, Python and Ruby, when the behavior of the language is ultimately explained by the behavior of the interpreter.

I believe that Python is unlikely to change for the better. It takes too much time and effort to rid the language of all this difficult legacy.

Translated by Dreadatour.
