Let's talk about usernames
A couple of weeks ago I released django-registration 2.4.1. The 2.4.x series will be the last in the django-registration 2.x line; after it, only bug fixes will be released. The main branch is now being prepared for version 3.0, where I plan to remove a pile of obsolete cruft that has accumulated over the last decade of maintenance and to adopt the best practices of modern Django applications.
In the near future I will write more about the new version, but right now I want to talk about a deceptively simple problem that I have to deal with: usernames. Yes, I could write one of those popular "Falsehoods Programmers Believe About X" articles, but I prefer to actually explain why this is more complicated than it seems and offer some advice on how to solve it, rather than just joke without useful context.
An aside: the right way to identify users
Usernames, as implemented on many sites and services and in many popular frameworks (including Django), are almost certainly the wrong way to solve the problem they are meant to solve. What we really need for user identification is some combination of the following:
- System level identifier for foreign keys in the database.
- Login ID to perform credential verification.
- Public identifier to display to other users.
Many systems ask for a username and use that one username for all three of these purposes. This is probably wrong. A better approach is a three-part identity pattern in which each identifier is distinct, and several login identifiers and/or public identifiers can be associated with a single system identifier.
Much of the pain of building and scaling an account system comes from ignoring this model. An annoyingly large number of hacks are used in systems that do not support this pattern to make them look and work as if they did.
So if you are building a system from scratch now, in 2018, I would suggest adopting this model as your foundation. It takes a bit more work up front, but it will give you flexibility and save time later, and one day someone may even create a decent generic reusable implementation (of course, I have thought about doing it for Django, and maybe someday I will).
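To make the three-part model a little more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the names Account and AccountDirectory are hypothetical, the store is an in-memory dict rather than a database, and nothing here comes from Django or django-registration.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Account:
    # System-level identifier: opaque and stable; foreign keys would point here.
    system_id: uuid.UUID = field(default_factory=uuid.uuid4)
    # Login identifiers: anything the user may type to sign in (several allowed).
    login_ids: List[str] = field(default_factory=list)
    # Public identifiers: what other users see (several allowed).
    public_ids: List[str] = field(default_factory=list)


class AccountDirectory:
    """Hypothetical in-memory store mapping login identifiers to accounts."""

    def __init__(self) -> None:
        self._by_login: Dict[str, Account] = {}

    def add_login(self, account: Account, login_id: str) -> None:
        if login_id in self._by_login:
            raise ValueError(f"login identifier already in use: {login_id!r}")
        self._by_login[login_id] = account
        account.login_ids.append(login_id)

    def find_by_login(self, login_id: str) -> Optional[Account]:
        return self._by_login.get(login_id)


directory = AccountDirectory()
alice = Account(public_ids=["Alice L."])
directory.add_login(alice, "alice@example.com")  # first way to log in
directory.add_login(alice, "alice_l")            # second way, same account
```

Because everything else refers to system_id, a login identifier or a public name can be changed or added without touching any foreign keys.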
For the rest of this article, we will assume that you are using the more common implementation, in which a unique username serves at least as the system identifier and the login identifier, and most likely also as the public identifier shown to all users. By "username" I mean essentially any string identifier. You may have usernames like those on forums such as Reddit or Hacker News, or you may use email addresses or some other unique strings. It doesn't matter: you are probably still using some kind of unique string, so you need to know about some problems.
Uniqueness is harder than it seems
You may be asking: how hard can it be? Just create a unique column in the database and you're done! Let's create a users table in Postgres:
CREATE TABLE accounts (
    id SERIAL PRIMARY KEY,
    username TEXT UNIQUE,
    password TEXT,
    email_address TEXT
);
Here is our table with users and a column with unique names. Easy!
Well, it's easy until we start thinking about a real application. If you are registered as john_doe, what happens if I register as JOHN_DOE? It is a different username, but can I make people think that I am you? Will people accept my friend requests and share confidential information with me because they don't realize that, to a computer, the same letters in a different case are different characters?
This is a simple thing that many systems get wrong. While researching this article, I found that Django's auth system does not enforce case-insensitive uniqueness of usernames, even though it gets many other things right. There is a ticket in the bug tracker to make usernames case-insensitive, but it is currently marked WONTFIX, because switching to case-insensitive usernames wholesale would break backward compatibility, and nobody is sure how to do it or whether to do it at all. I will probably consider enforcing this in django-registration 3.0, but I'm not sure it can be done even there: problems would arise on any site where accounts differing only by case already exist.
So if you are building a system from scratch today, enforce case-insensitive uniqueness of usernames from the very start: john_doe, John_Doe and JOHN_DOE should all be considered the same name. Once one of them is registered, the others become unavailable.
But this is only the beginning. We live in a Unicode world, and here deciding whether two names are the same is harder than just evaluating username1 == username2. For a start, there is composition and decomposition of characters: the same character can be written as a single code point or as a base letter followed by combining marks. The two forms differ when compared as sequences of Unicode code points, but look identical on screen. So you need to think about normalization: choose a normalization form (NFC or NFD), and then normalize every username to that form before performing any uniqueness checks.
[Illustration from the article "Normalizing Unicode" (translator's note)]
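The two points above can be sketched in a few lines of Python. This is a minimal illustration, not anyone's production code: UsernameRegistry is a hypothetical in-memory stand-in for a unique database column, and the choice of NFC plus lower() is just one possible combination.

```python
import unicodedata


class UsernameRegistry:
    """Hypothetical in-memory stand-in for a unique database column."""

    def __init__(self):
        self._taken = set()

    @staticmethod
    def _key(username):
        # Pick one normalization form (NFC here) and apply it before comparing,
        # then lowercase so that case differences do not create distinct names.
        return unicodedata.normalize("NFC", username).lower()

    def register(self, username):
        key = self._key(username)
        if key in self._taken:
            return False  # already registered under some spelling
        self._taken.add(key)
        return True


registry = UsernameRegistry()
first = registry.register("john_doe")    # accepted
second = registry.register("JOHN_DOE")   # rejected: same name, different case

composed = "jos\u00e9"      # the accented letter as one code point
decomposed = "jose\u0301"   # 'e' followed by a combining acute accent
third = registry.register(composed)      # accepted
fourth = registry.register(decomposed)   # rejected: same name after NFC
```

Note that composed == decomposed is False in Python even though both render identically, which is exactly why the normalization step has to happen before the uniqueness check.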
Also, when designing case-insensitive uniqueness checks, you have to consider non-ASCII characters. Should the usernames StraßburgJoe and StrassburgJoe be considered identical? The answer often depends on whether you normalize to lower or upper case before comparing. On top of that, Unicode has several kinds of decomposition; you can (and will) get different results for many strings depending on whether you use canonical equivalence or compatibility equivalence.
If all this is confusing (and it is, even if you are a Unicode expert!), I recommend following the advice of Unicode Technical Report 36 and normalizing usernames to form NFKC. If you use Django's UserCreationForm or a subclass of it (django-registration uses subclasses of UserCreationForm), this is already done for you. If you use Python without Django (or don't use UserCreationForm), it can be done in one line with a helper from the standard library:
import unicodedata
username_normalized = unicodedata.normalize('NFKC', username)
For other languages, look for a good Unicode library.
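As a sketch of how normalization and case-insensitivity might be combined into one comparison key, consider the following. The function name username_key is hypothetical, and the use of str.casefold() is my assumption rather than something the article or Django prescribes; it is simply the standard-library method intended for caseless matching.

```python
import unicodedata


def username_key(username: str) -> str:
    """Return a comparison key; keep the original string for display."""
    normalized = unicodedata.normalize("NFKC", username)
    # casefold() is a more aggressive lower(): it maps 'ß' to 'ss', for example.
    folded = normalized.casefold()
    # Case folding can produce text that is no longer normalized, so normalize again.
    return unicodedata.normalize("NFKC", folded)


# Lower-casing keeps 'ß', upper-casing turns it into 'SS': the answer differs.
lower_equal = "StraßburgJoe".lower() == "StrassburgJoe".lower()
upper_equal = "StraßburgJoe".upper() == "StrassburgJoe".upper()

# With the key function, both spellings collide.
key_equal = username_key("StraßburgJoe") == username_key("StrassburgJoe")

# NFKC also folds compatibility characters such as fullwidth Latin letters.
fullwidth_equal = username_key("\uff4a\uff4f\uff48\uff4e") == username_key("john")
```

Whether StraßburgJoe and StrassburgJoe should collide is a policy decision; the point of the sketch is only that the decision should be made once, in one function, and applied before every uniqueness check.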
No, really, uniqueness is harder than it seems
Unfortunately, that is not all. Case-insensitive uniqueness checks on normalized strings are a start, but they do not catch every case that needs catching. For example, consider the following username: jane_doe. Now consider another username: jаne_doe. Are these the same username?
In the font I use for this article, and in every font available on my blog, they look the same. But to software they are completely different, and they remain different after Unicode normalization and case-insensitive comparison (whether you normalize to lower or upper case).
To understand why, look at the second code point. In one of the usernames it is U+0061 LATIN SMALL LETTER A. In the other it is U+0430 CYRILLIC SMALL LETTER A. No amount of Unicode normalization or case folding will make these code points the same, even though they are often visually indistinguishable.
This is the basis of homograph attacks, which first became widely known in the context of internationalized domain names. Solving the problem takes a bit more work.
For network hosts, one option is to display names in Punycode, a representation created to solve exactly this problem by expressing names in any script using only ASCII characters. Applied to the usernames above, it makes the difference between them obvious. If you want to try it yourself, here is a one-liner in Python and its result for the username with the Cyrillic character:
>>> 'jаne_doe'.encode('punycode')
b'jne_doe-2fg'
(If you have trouble copying and pasting non-ASCII characters, this name can be written as the string literal 'j\u0430ne_doe'.) But in practice, displaying usernames in this form is not a good idea. Sure, you could show Punycode everywhere, but that would break the display of many perfectly legitimate usernames containing non-ASCII characters. What we really want is to reject the username above at registration time. How do we do that?
Well, this time we turn to Unicode Technical Report 39 and start reading sections 4 and 5. Sets of code points that are distinct from each other (even after normalization) but visually identical or confusingly similar are called "confusables", and Unicode provides mechanisms for detecting such code points.
The username in our example is what Unicode calls a "mixed-script confusable", and that is what we want to detect. In other words, a username written entirely in Latin script that contains confusable characters is probably fine. A fully Cyrillic username that contains confusable characters is probably fine too. But a username composed mainly of Latin characters plus a single Cyrillic code point that happens to look like a Latin character when rendered ... that is not acceptable.
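The "mixed-script" half of that idea can be roughed out with nothing but the standard library. The sketch below is a crude heuristic of my own, not the mechanism from the technical report: it guesses a letter's script from the first word of its Unicode character name and does not consult the confusables data at all. The function names are hypothetical. It would also flag legitimate names that mix scripts, such as Japanese text combining kana and kanji.

```python
import unicodedata


def scripts_used(text: str) -> set:
    """Guess the scripts in `text` from the first word of each letter's Unicode name."""
    scripts = set()
    for char in text:
        if not char.isalpha():
            continue  # skip digits, underscores, punctuation
        name = unicodedata.name(char, "")
        if name:
            scripts.add(name.split(" ")[0])  # e.g. "LATIN", "CYRILLIC"
    return scripts


def looks_mixed_script(text: str) -> bool:
    """True if the letters in `text` appear to come from more than one script."""
    return len(scripts_used(text)) > 1


latin_only = looks_mixed_script("jane_doe")
with_cyrillic_a = looks_mixed_script("j\u0430ne_doe")
cyrillic_only = looks_mixed_script("\u0438\u0432\u0430\u043d")
```

This catches the example username, but a real check also needs to know which characters are actually confusable with one another.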
Unfortunately, the Python standard library does not provide the access to the full set of Unicode properties and tables needed for such a check. But a kind developer named Victor Felder wrote a suitable library and released it under a free, open source license. Using the confusable_homoglyphs library, we can detect the problem:
>>> from confusable_homoglyphs import confusables
>>> s1 = 'jane_doe'
>>> s2 = 'j\u0430ne_doe'
>>> bool(confusables.is_dangerous(s1))
False
>>> bool(confusables.is_dangerous(s2))
True
The actual result of calling is_dangerous() on the second username is a data structure with detailed information about the potential problems, but the main thing is that it lets us identify a string that mixes scripts and contains confusable code points. That is what we need.
Django allows non-ASCII characters in usernames, but does not check for look-alike characters from different scripts. However, since version 2.3, django-registration depends on confusable_homoglyphs, and its is_dangerous() function is used when validating usernames and email addresses. If you need to implement user registration in Django (or in Python generally) and you cannot or do not want to use django-registration, I recommend using confusable_homoglyphs in the same way.
Did I mention that uniqueness is difficult to achieve?
If we are dealing with confusable Unicode code points, it makes sense to think about what to do with look-alike characters within a single script. For example, paypal and paypa1. In some fonts these are hard to tell apart. So far all my suggestions have been suitable for more or less everyone, but here we enter territory that is specific to particular languages, scripts and geographic regions. Decisions here should be made carefully and with the consequences in mind (for example, banning confusable Latin characters may produce more false positives than you would like). It is worth thinking about. The same goes for usernames that are distinct but still very similar to each other. At the database level you can check this in various ways (for example, Postgres ships with Soundex and Metaphone support, as well as Levenshtein distance and trigram fuzzy matching), but again, this is something you will have to handle case by case rather than with a blanket rule.
I want to mention one more problem with the uniqueness of names. It mostly concerns email addresses, which today are often used as usernames (especially in services that rely on a third-party identity provider and use OAuth or similar protocols). Suppose we need to ensure that email addresses are unique. How many different addresses are listed below?
johndoe@example.com
johndoe+yoursite@example.com
john.doe@example.com
There is no definite answer. Most mail servers have long ignored everything after a plus sign (+) in the local part of the address when determining the recipient. In turn, many people use this feature to add arbitrary text after the plus as an ad hoc system of labels and filters. And Gmail famously also ignores dots (.) in the local part, including for custom domains hosted on its services, so without a DNS lookup it is generally impossible to tell whether someone else's mail server distinguishes johndoe from john.doe.
So if you need unique email addresses, or you use email addresses as user identifiers, you probably need to strip all dots from the local part, as well as the + and any text after it, before performing the uniqueness check. django-registration does not currently do this, but I plan to add it in version 3.x.
In addition, when checking email addresses for confusable Unicode code points, apply the check separately to the local part and to the domain. People cannot always change the script used in the domain, so they should not be penalized for using different scripts in the local part and the domain. If neither the local part nor the domain on its own contains a confusable mixture of scripts, then everything is probably fine (and this is what the django-registration validator checks).
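The dot and plus stripping described above might look like the following sketch. The helper name email_uniqueness_key is hypothetical, and lowercasing both parts is my own assumption, in line with the earlier advice on case-insensitive uniqueness. The key is meant only for uniqueness checks: mail should still be sent to the address exactly as the user entered it.

```python
def email_uniqueness_key(address: str) -> str:
    """Hypothetical helper: collapse plus-tags and dots in the local part."""
    local, _, domain = address.rpartition("@")
    if not local or not domain:
        raise ValueError(f"not an email address: {address!r}")
    stripped = local.split("+", 1)[0]      # drop '+' and everything after it
    stripped = stripped.replace(".", "")   # drop dots, which Gmail ignores
    if not stripped:
        raise ValueError(f"empty local part after stripping: {address!r}")
    # Assumption: compare case-insensitively.
    return f"{stripped.lower()}@{domain.lower()}"


keys = {
    email_uniqueness_key("johndoe@example.com"),
    email_uniqueness_key("johndoe+yoursite@example.com"),
    email_uniqueness_key("john.doe@example.com"),
}
```

Under this policy the three addresses listed earlier all collapse to a single key, so only the first of them could be registered.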
You may run into many other problems with usernames that are too similar to each other to be considered "different", but once you have dealt with case sensitivity, normalization and mixed scripts, you quickly enter the territory of diminishing returns, especially since many rules that depend on language, script or region start to apply. This does not mean you should not think about them. It is just hard to give universal advice that suits everyone.
Let's step back a bit and consider a different kind of problem.
Some names should be reserved
Many sites use a username not only as a field in the login form. Some create a profile page for each user and put the username in the URL. Some create email addresses for each user. Some create subdomains. So a few questions arise:
- If your site puts the username in the URL of the profile page, what happens if I create a user named login? What if I post on my profile the text "Our login page has moved, please click here to log in", with a link to my credential-harvesting site? How many users do you think I could fool?
- If your site creates email addresses from usernames, what happens if I register as webmaster or postmaster? Will I receive email sent to those addresses at your domain? Can I obtain an SSL certificate for your domain with the right username and an automatically generated email address?
- If your site creates subdomains from usernames, what happens if I register as a user named www? Or smtp, or mail?
If you think these are just silly hypotheticals: some of this has actually happened. And not just once, but several times. No, really, these things have happened several times.
You can, and should, take precautions to ensure that, say, an automatically created subdomain for a user account does not conflict with an existing subdomain that you actually use for some purpose, or that automatically generated email addresses do not conflict with important and/or existing addresses.
But for maximum safety, you should probably just prevent certain usernames from being registered. I first saw this advice, along with a list of reserved names and the first two articles mentioned above, in this article by Jeffrey Thomas. Since version 2.1, django-registration has shipped with a list of reserved names, and the list grows with each release; it now has about a hundred entries.
In the django-registration list, the names are divided into several categories, which allows you to create subsets of them depending on your needs (the validator applies them all by default, but you can reconfigure it with only the necessary sets of reserved names specified):
- Host addresses used for autodiscovery / autoconfiguration of some well-known services.
- Host addresses associated with commonly used protocols.
- Email addresses used by certificate authorities to verify domain ownership.
- Email addresses listed in RFC 2142 that do not appear in any other set of reserved names.
- Common no-reply@ addresses.
- Strings matching sensitive file names (for example, cross-domain access policies).
- A long list of other potentially sensitive names such as contact and login.
The django-registration validator will also reject any username that begins with .well-known, to protect anything that uses the RFC 5785 standard for "well-known URIs".
As with confusable characters in usernames, I recommend copying the parts of the django-registration list that you need and adding to it as necessary. That list is, in turn, an extended version of Jeffrey Thomas's list.
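A reserved-name check along these lines can be very small. The sketch below is only an illustration: RESERVED_NAMES is a tiny sample built from the names mentioned in this article, not the list shipped with django-registration, and is_reserved is a hypothetical function rather than that library's validator.

```python
# Small illustrative sample only; not the list shipped with django-registration.
RESERVED_NAMES = frozenset({
    "login", "contact",          # sensitive paths
    "webmaster", "postmaster",   # well-known mailboxes
    "www", "smtp", "mail",       # common host names
})

WELL_KNOWN_PREFIX = ".well-known"  # RFC 5785 "well-known URIs"


def is_reserved(username: str) -> bool:
    """Return True if the username must not be registered."""
    candidate = username.casefold()  # compare case-insensitively
    return candidate in RESERVED_NAMES or candidate.startswith(WELL_KNOWN_PREFIX)


blocked = [name for name in ("WWW", "Postmaster", ".well-known/acme", "jane_doe")
           if is_reserved(name)]
```

In a real system the set would be assembled from the categories listed above, so that a site which never creates subdomains, for example, can leave the host-name sets out.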
This is just the beginning
This is not a list of everything that can be done to validate usernames. If I tried to write a complete list, I would be stuck here forever. It is, however, a good starting point, and I recommend following most or all of this advice. I hope the article has given some idea of the difficulties hidden behind a seemingly "simple" problem like user accounts.
As I already mentioned, Django and/or django-registration already perform most of these checks, and whatever they do not do yet will probably be added, at least in django-registration 3.0. Django itself may not be able to adopt such checks in the near future (or ever) because of strong backward-compatibility concerns. All of the source code is open (under the BSD license), so feel free to copy, adapt and improve it.
If I missed something important, please let me know: you can report a bug or send a pull request to django-registration on GitHub, or just contact me directly.