Stop validating email addresses with regular expressions!

Original author: David Celis
  • Translation
Seriously, stop it. It is a waste of time and effort. Google a regular expression for validating email addresses, take a look at it, and you will want to step outside for some fresh air. It reminds me of a very famous quote:

Some people, when confronted with a problem, think, “I know, I'll use regular expressions.”
Now they have two problems.

Jamie Zawinski, regex.info

Here is a fairly typical piece of code from a Rails application that has some kind of authentication system:

class User < ActiveRecord::Base
  # This regular expression is taken from https://github.com/plataformatec/devise,
  # the most popular authentication library for Rails
  validates_format_of :email, :with => /\A[^@]+@([^@\.]+\.)+[^@\.]+\z/
end

It looks fairly simple (unless you don't know regular expressions at all), but it can get much worse:

class User < ActiveRecord::Base
  validates_format_of :email, :with => /^(|(([A-Za-z0-9]+_+)|([A-Za-z0-9]+\-+)|([A-Za-z0-9]+\.+)|([A-Za-z0-9]+\++))*[A-Za-z0-9]+@((\w+\-+)|(\w+\.))*\w{1,63}\.[a-zA-Z]{2,6})$/i
end

Or really bad:

class User < ActiveRecord::Base
  # validates_with runs the custom validator class defined below
  validates_with EmailValidator
end

class EmailValidator < ActiveModel::Validator
  EMAIL_ADDRESS_QTEXT           = Regexp.new '[^\\x0d\\x22\\x5c\\x80-\\xff]', nil, 'n'
  EMAIL_ADDRESS_DTEXT           = Regexp.new '[^\\x0d\\x5b-\\x5d\\x80-\\xff]', nil, 'n'
  EMAIL_ADDRESS_ATOM            = Regexp.new '[^\\x00-\\x20\\x22\\x28\\x29\\x2c\\x2e\\x3a-\\x3c\\x3e\\x40\\x5b-\\x5d\\x7f-\\xff]+', nil, 'n'
  EMAIL_ADDRESS_QUOTED_PAIR     = Regexp.new '\\x5c[\\x00-\\x7f]', nil, 'n'
  EMAIL_ADDRESS_DOMAIN_LITERAL  = Regexp.new "\\x5b(?:#{EMAIL_ADDRESS_DTEXT}|#{EMAIL_ADDRESS_QUOTED_PAIR})*\\x5d", nil, 'n'
  EMAIL_ADDRESS_QUOTED_STRING   = Regexp.new "\\x22(?:#{EMAIL_ADDRESS_QTEXT}|#{EMAIL_ADDRESS_QUOTED_PAIR})*\\x22", nil, 'n'
  EMAIL_ADDRESS_DOMAIN_REF      = EMAIL_ADDRESS_ATOM
  EMAIL_ADDRESS_SUB_DOMAIN      = "(?:#{EMAIL_ADDRESS_DOMAIN_REF}|#{EMAIL_ADDRESS_DOMAIN_LITERAL})"
  EMAIL_ADDRESS_WORD            = "(?:#{EMAIL_ADDRESS_ATOM}|#{EMAIL_ADDRESS_QUOTED_STRING})"
  EMAIL_ADDRESS_DOMAIN          = "#{EMAIL_ADDRESS_SUB_DOMAIN}(?:\\x2e#{EMAIL_ADDRESS_SUB_DOMAIN})*"
  EMAIL_ADDRESS_LOCAL_PART      = "#{EMAIL_ADDRESS_WORD}(?:\\x2e#{EMAIL_ADDRESS_WORD})*"
  EMAIL_ADDRESS_SPEC            = "#{EMAIL_ADDRESS_LOCAL_PART}\\x40#{EMAIL_ADDRESS_DOMAIN}"
  EMAIL_ADDRESS_PATTERN         = Regexp.new "#{EMAIL_ADDRESS_SPEC}", nil, 'n'
  EMAIL_ADDRESS_EXACT_PATTERN   = Regexp.new "\\A#{EMAIL_ADDRESS_SPEC}\\z", nil, 'n'
  def validate(record)
    unless record.email =~ EMAIL_ADDRESS_EXACT_PATTERN
      record.errors.add(:email, 'is invalid')
    end
  end
end

Yeah. Is something this complex really necessary? If you run the search mentioned at the beginning of the article, you will see that people have spent years writing (or trying to write) regular expressions that validate email addresses according to the RFC. Some of them are ridiculously convoluted, like the last example, and they still reject some valid addresses.
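To make the point about rejected addresses concrete, here is a small sketch in plain Ruby that runs the second pattern from the examples above against a few addresses. The sample addresses and the `strictly_valid?` helper are made up for illustration, and the sketch checks only that one pattern, not the RFC-style validator from the last example.

```ruby
# The second pattern from the examples above, copied verbatim.
STRICT = /^(|(([A-Za-z0-9]+_+)|([A-Za-z0-9]+\-+)|([A-Za-z0-9]+\.+)|([A-Za-z0-9]+\++))*[A-Za-z0-9]+@((\w+\-+)|(\w+\.))*\w{1,63}\.[a-zA-Z]{2,6})$/i

# Hypothetical helper: true when the pattern accepts the address.
def strictly_valid?(address)
  STRICT.match?(address)
end

# Illustrative sample addresses (not taken from the article).
SAMPLES = [
  'user@example.com',                        # plain address
  'user@example.technology',                 # top-level domain longer than 6 letters
  "o'brien@example.com",                     # apostrophe in the local part
  '"Look at all these spaces!"@example.com'  # quoted local part
].freeze

SAMPLES.each do |address|
  verdict = strictly_valid?(address) ? 'accepted' : 'rejected'
  puts "#{address}: #{verdict}"
end
```

Only the first address gets through. The other three fail on details the pattern hard-codes: a top-level domain of at most six letters and a local part limited to letters, digits and a handful of separators.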

What makes an email address valid is described in sections 3.2.4 and 3.4.1 of the RFC. It says that, given backslashes and quotation marks, there is not much that cannot be used in an address. The local part of the address (the string before the @ symbol) may contain the following characters:

! $ & * - = ^ ` | ~ # % ' + / ? _ { }

But you know what? You can use almost any character you like if you enclose the local part in quotation marks. For example, this is a valid address:

"Look at all these spaces!"@example.com

Wonderful!

For this reason, I have lately been checking every email address against the following regular expression:

class User < ActiveRecord::Base
  validates_format_of :email, :with => /@/
end

Simple, isn't it? The address must contain the @ symbol. As a rule, that is all I check. Combined with a field for re-entering the address, these two checks filter out the lion's share of data entry mistakes.

But what if I offered you a way to validate an email address that doesn't use regular expressions at all? It is surprisingly simple, and you are most likely using it already.

Just send the user an email!

No, I'm not joking. Just send the user an email. Sending a message with an activation code is a practice that has been around for years, yet it is almost always paired with a complicated address check. If you are going to send an email to the address anyway, why bother with huge regular expressions?

Imagine this scenario. I register on your site with the following address:

qwiufaisjdbvaadsjghb@gmail.com

Come on! The mailer daemon will probably bounce this garbage, yet the format is perfectly fine: it is a valid email address! To solve this problem, you build a system that, after registration, sends me an email with a link I have to click. This makes sure I really have access to the mailbox I registered with. In that case, why check the address format at all? Sending an email to a malformed address ends the same way: the server will not accept it. If the user entered the wrong address, they will not receive the email and will register again if they really need to. That's all.
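The confirmation flow described above can be sketched in a few lines of plain Ruby. Everything here is an assumption made for illustration: the `Registration` class, the token format and the `https://example.com/confirm` URL are hypothetical, and nothing is actually mailed. In a real Rails application this job is usually delegated to a library, for example Devise's confirmable module.

```ruby
require 'securerandom'

# Hypothetical in-memory stand-in for a user record awaiting confirmation.
class Registration
  attr_reader :email, :token

  def initialize(email)
    @email = email
    # Random, hard-to-guess token that will be embedded in the emailed link.
    @token = SecureRandom.urlsafe_base64(32)
    @confirmed = false
  end

  def confirmed?
    @confirmed
  end

  # Hypothetical link; a real application would send it to #email.
  def confirmation_link
    "https://example.com/confirm?token=#{token}"
  end

  # Called when the user clicks the link. Only the matching token confirms
  # the address, which proves the user can read mail sent to it.
  def confirm(candidate_token)
    @confirmed = true if candidate_token == token
    @confirmed
  end
end

registration = Registration.new('qwiufaisjdbvaadsjghb@gmail.com')
puts registration.confirmation_link
```

Note that the sketch never inspects the address itself: a mistyped or non-existent address simply never produces a click on the link, so the registration stays unconfirmed.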

So do not lean on convoluted regular expressions. If you really want to verify the address right on the registration form, add a field for re-entering it. Yes, some users will simply copy the first field and paste it into the second, but even that is no reason to blow the problem out of proportion. Complex validation with regular expressions does not solve anything extra; it only adds a headache.

If you still cannot rest until you have checked the address for correctness, just check that it contains the @ symbol. And if you feel you can handle more, add a check for a dot:

/.+@.+\..+/i

Anything beyond this is using a cannon to shoot sparrows.
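For comparison, here is a minimal sketch in plain Ruby of how the two lightweight patterns from this article behave on a few strings. The constant names, the `check` helper and the sample inputs are invented for the example.

```ruby
AT_ONLY  = /@/            # the "must contain @" check
WITH_DOT = /.+@.+\..+/i   # additionally requires a dot somewhere after the @

# Hypothetical helper: reports which of the two patterns accept the string.
def check(address)
  { at_only: AT_ONLY.match?(address), with_dot: WITH_DOT.match?(address) }
end

[
  'user@example.com',
  '"Look at all these spaces!"@example.com',
  'user@localhost',
  'not an email'
].each do |address|
  puts "#{address}: #{check(address)}"
end
```

Both patterns accept the first two addresses and reject the last string. They differ only on `user@localhost`, which has no dot after the @: the first pattern lets it through and the second does not.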

Translator's note: I found a link to this article in a comment on another translation. Thanks, jetman!