SLY_G November 9, 2014 at 19:09

Expressive JavaScript: Regular Expressions

Some people, when confronted with a problem, think: "I know, I'll use regular expressions." Now they have two problems.
Jamie Zawinski

Yuan-Ma said: "When you cut against the grain of the wood, much strength is needed. When you program against the grain of the problem, much code is needed."
Master Yuan-Ma, "The Book of Programming"

Programming tools and techniques survive and spread in a chaotic, evolutionary way. It is not always the beautiful or brilliant ones that win, but rather the ones that work well enough in their niche, for example because they happen to be integrated into another successful technology.

In this chapter we will discuss one such tool: regular expressions. They are a way to describe patterns in string data. They form a small, separate language that is part of JavaScript and of many other languages and tools.

Regular expressions are both very strange and extremely useful. Their syntax is cryptic, and the programming interface JavaScript provides for them is clumsy. But they are a powerful tool for inspecting and processing strings. Once you understand them, you will be a more effective programmer.

Creating a regular expression

A regular expression is a type of object. You can create one by calling the RegExp constructor, or by writing the pattern as a literal enclosed in forward slashes.

var re1 = new RegExp("abc");
var re2 = /abc/;

Both of these regular expression objects represent the same pattern: the character "a", followed by the character "b", followed by the character "c".

If you use the RegExp constructor, the pattern is written as a normal string, so the usual rules for backslashes apply.

The second notation, where the pattern appears between forward slashes, treats backslashes somewhat differently. First, since a forward slash ends the pattern, you need to put a backslash before any forward slash that you want to be part of the pattern. In addition, backslashes that are not part of special character codes such as \n are preserved (rather than ignored, as they are in strings) and change the meaning of the pattern. Some characters, such as the question mark and the plus sign, have special meanings in regular expressions and must be preceded by a backslash if they are meant to represent the character itself.

var eighteenPlus = /eighteen\+/;

To know exactly which characters need a backslash in front of them, you would have to learn the list of all special characters in regular expressions. For now that is not realistic, so when in doubt, just put a backslash before any character that is not a letter, digit, or space.
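To make the difference between the two notations concrete, here is a small sketch (the variable names are only for illustration). In a string passed to the RegExp constructor every backslash has to be doubled, because the string literal consumes one level of escaping before the pattern is ever parsed.

```javascript
// The same pattern written both ways: one or more digits, then a plus sign.
var literalPattern = /\d+\+/;
var constructedPattern = new RegExp("\\d+\\+");

console.log(literalPattern.source === constructedPattern.source);
// → true
console.log(constructedPattern.test("18+"));
// → true

// With single backslashes, the string "\d+\+" has already become "d++"
// by the time RegExp sees it, and that is not a valid pattern.
var brokenResult;
try {
  brokenResult = new RegExp("\d+\+");
} catch (error) {
  brokenResult = error.name;
}
console.log(brokenResult);
// → SyntaxError
```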

Check for matches

Regular expression objects have several methods. The simplest one is test. If you pass it a string, it returns a Boolean telling you whether the string contains a match of the pattern.

console.log(/abc/.test("abcde"));
// → true
console.log(/abc/.test("abxde"));
// → false

A regular expression consisting only of non-special characters simply represents that sequence of characters. If abc occurs anywhere in the string we are testing (not just at the start), test returns true.

Matching a set of characters

Finding out whether a string contains abc could just as well be done with indexOf. Regular expressions allow us to go beyond that and express more complicated patterns.

Say we want to match any digit. Putting a set of characters between square brackets makes that part of the expression match any of the characters between the brackets.

Both of the following expressions match all strings that contain a digit.

console.log(/[0123456789]/.test("in 1992"));
// → true
console.log(/[0-9]/.test("in 1992"));
// → true

Within square brackets, a dash between two characters indicates a range of characters, where the ordering is determined by the characters' Unicode numbers. The characters 0 to 9 sit right next to each other in this ordering (codes 48 to 57), so [0-9] covers all of them and matches any digit.

Several groups of characters have their own built-in abbreviations.

\d Any digit
\w An alphanumeric character
\s Any whitespace character (space, tab, line feed, etc.)
\D A character that is not a digit
\W A character that is not alphanumeric
\S A character that is not whitespace
. Any character other than a newline

So you could match a date and time format like 30-01-2003 15:20 with the following expression:

var dateTime = /\d\d-\d\d-\d\d\d\d \d\d:\d\d/;
console.log(dateTime.test("30-01-2003 15:20"));
// → true
console.log(dateTime.test("30-jan-2003 15:20"));
// → false

It looks awful, doesn't it? There are too many backslashes, which make the pattern hard to read. We will improve it slightly later on.

These backslash codes can also be used inside square brackets. For example, [\d.] means any digit or a period character. Note that the period itself, inside square brackets, loses its special meaning and becomes just a period. The same goes for other special characters, such as +.
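A quick way to convince yourself of this is to compare the same character inside and outside a set. The sketch below uses made-up variable names and only the test method introduced above.

```javascript
// Outside brackets, the dot matches any character except a newline.
var dotOutside = /./.test("abc");
// Inside brackets, it matches only a literal period.
var dotInside = /[.]/.test("abc");
// A set can mix backslash codes with ordinary characters.
var digitOrPeriod = /[\d.]/.test("version 3");
// The plus sign is also just a plus sign inside brackets.
var plusInside = /[+]/.test("a+b");

console.log(dotOutside, dotInside, digitOrPeriod, plusInside);
// → true false true true
```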

You can invert a character set — that is, say that you need to find any character other than those in the set — by putting the ^ sign immediately after the opening square bracket.

var notBinary = /[^01]/;
console.log(notBinary.test("1100100010100110"));
// → false
console.log(notBinary.test("1100100010200110"));
// → true

Repeating parts of a pattern

We now know how to match a single digit. What if we want to match a whole number, a sequence of one or more digits?

When you put a plus sign (+) after something in a regular expression, it indicates that the element may be repeated more than once. Thus, /\d+/ matches one or more digit characters.

console.log(/'\d+'/.test("'123'"));
// → true
console.log(/'\d+'/.test("''"));
// → false
console.log(/'\d*'/.test("'123'"));
// → true
console.log(/'\d*'/.test("''"));
// → true

The star (*) has a similar meaning, but also allows the pattern to match zero times. Something with a star after it never prevents a pattern from matching; it will just match zero instances if it cannot find any suitable text to match.

A question mark makes a part of a pattern optional, meaning it may occur zero times or one time. In the following example, the u character is allowed to occur, but the pattern also matches when it is missing.

var neighbor = /neighbou?r/;
console.log(neighbor.test("neighbour"));
// → true
console.log(neighbor.test("neighbor"));
// → true

To indicate that a pattern should occur a precise number of times, use curly braces. Putting {4} after an element requires it to occur exactly four times. It is also possible to specify a range this way: {2,4} means the element must occur at least twice and at most four times.

Here is another version of the date and time pattern that allows single- and double-digit days, months, and hours. It is also slightly more readable.

var dateTime = /\d{1,2}-\d{1,2}-\d{4} \d{1,2}:\d{2}/;
console.log(dateTime.test("30-1-2003 8:45"));
// → true

You can also specify open-ended ranges by omitting the number after the comma: {5,} means five or more times. The lower bound cannot be left out in JavaScript, so write {0,5} for zero to five times; {,5} is not treated as a repetition count.
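The brace forms can be checked the same way as + and * above. This is a minimal sketch with illustrative variable names; note that a missing lower bound is written as {0,5} here, because in JavaScript a brace expression without a lower bound is matched as literal text rather than as a repetition count.

```javascript
// {2,4}: at least two and at most four occurrences.
var tooFew = /'a{2,4}'/.test("'a'");
var inRange = /'a{2,4}'/.test("'aaa'");
var tooMany = /'a{2,4}'/.test("'aaaaa'");
console.log(tooFew, inRange, tooMany);
// → false true false

// {5,}: five or more occurrences.
var fiveOrMore = /'a{5,}'/.test("'aaaaaaa'");
// {0,5}: zero to five occurrences.
var upToFive = /'a{0,5}'/.test("''");
console.log(fiveOrMore, upToFive);
// → true true

// {,5} is not a quantifier: it matches the literal characters "{,5}".
var missingLowerBound = /'a{,5}'/.test("'aaa'");
var literalBraces = /'a{,5}'/.test("'a{,5}'");
console.log(missingLowerBound, literalBraces);
// → false true
```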

Grouping Expressions

To use an operator like * or + on more than one element at a time, you can use parentheses. A part of a regular expression that is enclosed in parentheses counts as a single element as far as the operators following it are concerned.

var cartoonCrying = /boo+(hoo+)+/i;
console.log(cartoonCrying.test("Boohoooohoohooo"));
// → true

The first and second + characters apply only to the second o in boo and hoo, respectively. The third + applies to the whole group (hoo+), matching one or more sequences like that.

The i at the end of the expression makes the regular expression case insensitive, so that the uppercase B in the input matches the lowercase b in the pattern.

Matches and Groups

The test method is the simplest way to match a regular expression. It tells you only whether a match was found, and nothing else. Regular expressions also have an exec method that returns null if no match was found and an object with information about the match otherwise.

var match = /\d+/.exec("one two 100");
console.log(match);
// → ["100"]
console.log(match.index);
// → 8

The object returned by exec has an index property that tells us where in the string the successful match begins. Other than that, the object looks like (and in fact is) an array of strings, whose first element is the string that was matched. In the previous example, this is the sequence of digits that we were looking for.

String values have a match method that behaves similarly.

console.log("one two 100".match(/\d+/));
// → ["100"]

When the regular expression contains subexpressions grouped with parentheses, the text that matched those groups also shows up in the array. The whole match is always the first element. The next element is the part matched by the first group (the one whose opening parenthesis comes first in the expression), then the second group, and so on.

var quotedText = /'([^']*)'/;
console.log(quotedText.exec("she said 'hello'"));
// → ["'hello'", "hello"]

When a group does not end up being matched at all (for example, when it is followed by a question mark), its position in the output array holds undefined. Similarly, when a group is matched multiple times, only the last match ends up in the array.

console.log(/bad(ly)?/.exec("bad"));
// → ["bad", undefined]
console.log(/(\d)+/.exec("123"));
// → ["123", "3"]

Groups are useful for extracting parts of a string. If we don't just want to verify whether a string contains a date but also want to extract it and construct an object that represents it, we can wrap parentheses around the digit patterns and pick the date out of the result of exec.

But first, a brief detour, in which we look at the preferred way to store date and time values in JavaScript.

The date type

JavaScript has a standard object type for representing dates, or rather, points in time. It is called Date. If you simply create a date object using new, you get the current date and time.

console.log(new Date());
// → Sun Nov 09 2014 00:07:57 GMT+0300 (CET)

You can also create an object containing a given time.

console.log(new Date(2015, 9, 21));
// → Wed Oct 21 2015 00:00:00 GMT+0300 (CET)
console.log(new Date(2009, 11, 9, 12, 59, 59, 999));
// → Wed Dec 09 2009 12:59:59 GMT+0300 (CET)

JavaScript uses a convention where month numbers start at zero (so December is 11), yet day numbers start at one. This is confusing and silly. Be careful.

The last four arguments (hours, minutes, seconds, and milliseconds) are optional and are taken to be zero when not given.

Timestamps are stored as the number of milliseconds since the start of 1970 (UTC). Negative numbers are used for times before 1970 (following a convention set by Unix time, which was invented around that time). The getTime method of a date object returns this number. It is big, as you can imagine.


console.log(new Date(2013, 11, 19).getTime());
// → 1387407600000
console.log(new Date(1387407600000));
// → Thu Dec 19 2013 00:00:00 GMT+0100 (CET)

If you give the Date constructor a single numeric argument, that argument is treated as such a millisecond count. You can get the current millisecond count by creating a new Date object and calling getTime on it, or by calling the Date.now function.

Date objects provide methods such as getFullYear, getMonth, getDate, getHours, getMinutes, and getSeconds to extract their components. There is also getYear, which returns the year minus 1900 (such as 93 or 114) and is rather useless.
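As a small illustration of these methods, the sketch below builds a date from local-time components and reads them back. The variable names are only for illustration; because both the constructor and the getters work in local time, the values do not depend on the time zone of the machine.

```javascript
// 19 December 2013, 08:45:30 local time (month 11 is December).
var sampleDate = new Date(2013, 11, 19, 8, 45, 30);

var parts = {
  year: sampleDate.getFullYear(),
  month: sampleDate.getMonth(),   // zero-based
  day: sampleDate.getDate(),      // one-based
  hours: sampleDate.getHours(),
  minutes: sampleDate.getMinutes(),
  seconds: sampleDate.getSeconds()
};
console.log(parts.year, parts.month, parts.day);
// → 2013 11 19
console.log(parts.hours, parts.minutes, parts.seconds);
// → 8 45 30

// A millisecond count round-trips through the single-argument constructor.
var roundTrip = new Date(sampleDate.getTime());
console.log(roundTrip.getTime() === sampleDate.getTime());
// → true
```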

Putting parentheses around the parts of the expression that we are interested in, we can now create a date object directly from a string.

function findDate(string) {
  var dateTime = /(\d{1,2})-(\d{1,2})-(\d{4})/;
  var match = dateTime.exec(string);
  return new Date(Number(match[3]),
                  Number(match[2]) - 1,
                  Number(match[1]));
}
console.log(findDate("30-1-2003"));
// → Thu Jan 30 2003 00:00:00 GMT+0100 (CET)

Word and string boundaries

Unfortunately, findDate will also happily extract the nonsensical date 00-1-3000 from the string "100-1-30000". A match may happen anywhere in the string, so in this case it simply starts at the second character and ends at the second-to-last character.

If we want to enforce that the match must span the whole string, we can add the markers ^ and $. The caret matches the start of the input string, and the dollar sign matches the end. So /^\d+$/ matches a string consisting entirely of one or more digits, /^!/ matches any string that starts with an exclamation mark, and /x^/ does not match any string (there cannot be an x before the start of the string).

If, on the other hand, we just want to make sure the date starts and ends on a word boundary, we can use the marker \b. A word boundary can be the start or end of the string, or any point in the string that has a word character (as in \w) on one side and a nonword character on the other.

console.log(/cat/.test("concatenate"));
// → true
console.log(/\bcat\b/.test("concatenate"));
// → false

Note that a boundary marker does not represent an actual character. It simply enforces that the regular expression matches only when a certain condition holds at the place where it appears in the pattern.
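Applied to the date example, word boundaries might be used as in the sketch below. The function findBoundedDate is a hypothetical variant of findDate, not part of the original code; it also returns null instead of failing when there is no match.

```javascript
// Hypothetical variant of findDate: the date must start and end on a
// word boundary, and a string without a date yields null.
function findBoundedDate(string) {
  var dateTime = /\b(\d{1,2})-(\d{1,2})-(\d{4})\b/;
  var match = dateTime.exec(string);
  if (!match) return null;
  return new Date(Number(match[3]),
                  Number(match[2]) - 1,
                  Number(match[1]));
}

console.log(findBoundedDate("100-1-30000"));
// → null
var found = findBoundedDate("born on 30-1-2003.");
console.log(found.getFullYear(), found.getMonth(), found.getDate());
// → 2003 0 30
```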

Choice patterns

Say we want to know whether a piece of text contains not only a number but a number followed by one of the words pig, cow, or chicken, or any of their plural forms.

We could write three regular expressions and test them in turn, but there is a nicer way. The pipe character (|) denotes a choice between the pattern to its left and the pattern to its right. So we can say this:

var animalCount = /\b\d+ (pig|cow|chicken)s?\b/;
console.log(animalCount.test("15 pigs"));
// → true
console.log(animalCount.test("15 pigchickens"));
// → false

Parentheses can be used to limit the part of the pattern that the pipe operator applies to, and you can put several such operators next to each other to express a choice between more than two patterns.

The mechanics of matching

Regular expressions can be thought of as flow diagrams. The livestock expression from the previous example can be drawn as a row of boxes: a word boundary, a digit (with a path looping back for more digits), a space, a three-way choice between "pig", "cow", and "chicken", an optional "s", and a final word boundary.

The expression matches a string if we can find a path from the left side of the diagram to the right side. We keep track of the current position in the string, and every time we move through a box, we verify that the part of the string right after our current position matches that box.

So if we try to match "the 3 pigs" with our regular expression, our progress through the flowchart looks like this:

- At position 4 there is a word boundary, so we can move past the first box.
- Still at position 4, we find a digit, so we can also move past the second box.
- At position 5, one path loops back to before the second (digit) box, while the other moves forward through the box that holds a single space character. There is a space here, not a digit, so we take the second path.
- We are now at position 6 (the start of "pigs") and at the three-way branch. There is no "cow" or "chicken" here, but there is "pig", so we take that branch.
- At position 9, after the three-way branch, one path skips the "s" box and goes straight to the final word boundary, while the other matches an "s". There is an "s" here, so we go through that box.
- At position 10 we are at the end of the string, and only a word boundary can match. The end of a string counts as a word boundary, so we go through the last box and have successfully matched the pattern.

Conceptually, a regular expression engine looks for a match as follows: it starts at the beginning of the string and tries a match there. In this case, there is a word boundary there, so it gets past the first box, but there is no digit, so it fails at the second box. Then it moves on to the second character in the string and tries to begin a new match there. And so on, until it finds a match or reaches the end of the string, in which case it concludes that there is no match.
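The result of this walk can be observed through exec, whose index property reports the starting position at which the engine finally succeeded. A minimal sketch, with illustrative variable names:

```javascript
var animalCount = /\b\d+ (pig|cow|chicken)s?\b/;

var pigMatch = animalCount.exec("the 3 pigs");
console.log(pigMatch.index);
// → 4
console.log(pigMatch[0]);
// → 3 pigs
console.log(pigMatch[1]);
// → pig

// With no digit anywhere, every starting position fails.
var noMatch = animalCount.exec("the three pigs");
console.log(noMatch);
// → null
```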

Backtracking

The regular expression /\b([01]+b|\d+|[\da-f]+h)\b/ matches either a binary number followed by a b, a decimal number with no suffix, or a hexadecimal number (digits 0 to 9 or letters a to f) followed by an h. In the corresponding diagram, the three alternatives form three parallel branches between the two word boundaries.

When matching this expression, it will often happen that the top (binary) branch is entered even though the input does not actually contain a binary number. When matching the string "103", for example, it becomes clear only at the 3 that we are in the wrong branch. The string does match the expression, just not the branch we are currently in.

So the matcher backtracks. When entering a branch, it remembers its current position (in this case, the start of the string, just past the first word boundary) so that it can go back and try another branch if the current one does not work out. For the string "103", after encountering the 3, it goes back and tries the branch for decimal numbers. This one matches, so a match is reported after all.

The matcher stops as soon as it finds a full match. This means that if multiple branches could potentially match a string, only the first one (in the order in which the branches appear in the regular expression) is used.

Backtracking also happens for repetition operators like + and *. If you match /^.*x/ against "abcxe", the .* part will first try to consume the whole string. The engine then realizes that it also needs an x. Since there is no x past the end of the string, the star operator tries to match one character less. But there is no x after abcx either, so it backtracks again, matching the star operator to just abc. Now it finds an x where it needs one and reports a successful match from positions 0 to 4.
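The backtracking itself is invisible from the outside, but its outcome can be checked. The sketch below uses the number pattern with a + after the hexadecimal character set (so several hexadecimal digits are allowed); the variable names are only for illustration.

```javascript
var numberPattern = /\b([01]+b|\d+|[\da-f]+h)\b/;

// "103" enters the binary branch first, fails at the 3, and is then
// matched by the decimal branch.
var decimal = numberPattern.exec("103")[0];
var binary = numberPattern.exec("101b")[0];
// "7fh" fails in the first two branches and matches the hexadecimal one.
var hex = numberPattern.exec("7fh")[0];
console.log(decimal, binary, hex);
// → 103 101b 7fh

// The star first swallows everything, then gives characters back
// until an x can follow.
var starMatch = /^.*x/.exec("abcxe");
console.log(starMatch[0], starMatch.index);
// → abcx 0
```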

It is possible to write regular expressions that will do a lot of backtracking. This problem occurs when a pattern can match a piece of input in many different ways. For example, if we get confused while writing a regular expression for binary numbers, we might accidentally write something like /([01]+)+b/.

If that tries to match a long series of zeros and ones with no trailing b, the matcher will first go through the inner loop until it runs out of digits. Then it notices there is no b, so it backtracks one position, goes through the outer loop once, and gives up again, trying to backtrack out of the inner loop once more. It will continue to try every possible route through these two loops. This means the amount of work doubles with each additional character. For even a few dozen characters, the match will take a very long time.
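To get a feeling for this growth, one can time the faulty pattern on inputs of increasing length. The helper timeFailure below is a hypothetical name, and the timings depend entirely on the engine and the machine (an engine that does not backtrack would not show the effect), so no expected output is given. The lengths are kept small on purpose.

```javascript
// Time how long the faulty pattern takes to fail on a string of zeros.
function timeFailure(length) {
  var input = new Array(length + 1).join("0"); // a string of `length` zeros
  var start = Date.now();
  var found = /([01]+)+b/.test(input);
  return {found: found, milliseconds: Date.now() - start};
}

[14, 16, 18, 20].forEach(function(length) {
  var result = timeFailure(length);
  console.log(length, result.found, result.milliseconds + "ms");
});
```

In a backtracking engine each extra character should roughly double the time, which is why lengths of a few dozen are already impractical.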

The replace method

String values have a replace method, which can be used to replace part of the string with another string.

console.log("papa".replace("p", "m"));
// → mapa

The first argument can also be a regular expression, in which case the first match of the regular expression is replaced. When a g option (for global) is added to the regular expression, all matches in the string are replaced, not just the first.

console.log("Borobudur".replace(/[ou]/, "a"));
// → Barobudur
console.log("Borobudur".replace(/[ou]/g, "a"));
// → Barabadar

It would have been sensible to choose between replacing one match and replacing all matches through an additional argument to replace, or through a separate method such as replaceAll. But unfortunately, the choice relies on a property of the regular expression instead.

The real power of using regular expressions with replace comes from the fact that we can refer back to matched groups in the replacement string. For example, say we have a big string containing the names of people, one name per line, in the format "Lastname, Firstname". If we want to swap these names and remove the comma to get a "Firstname Lastname" format, we can use the following code:

console.log(
  "Hopper, Grace\nMcCarthy, John\nRitchie, Dennis"
    .replace(/([\w ]+), ([\w ]+)/g, "$2 $1"));
// → Grace Hopper
//   John McCarthy
//   Dennis Ritchie

The $1 and $2 in the replacement string refer to the parenthesized groups in the pattern. $1 is replaced by the text that matched the first group, $2 by the second, and so on, up to $9. The whole match can be referred to with $&.

It is also possible to pass a function, rather than a string, as the second argument to replace. For each replacement, the function is called with the matched groups (as well as the whole match) as arguments, and its return value is inserted into the new string.

Here is a simple example:

var s = "the cia and fbi";
console.log(s.replace(/\b(fbi|cia)\b/g, function(str) {
  return str.toUpperCase();
}));
// → the CIA and FBI

And here is a more interesting one:

var stock = "1 lemon, 2 cabbages, and 101 eggs";
function minusOne(match, amount, unit) {
  amount = Number(amount) - 1;
  if (amount == 1) // only one left, remove the 's' at the end
    unit = unit.slice(0, unit.length - 1);
  else if (amount == 0)
    amount = "no";
  return amount + " " + unit;
}
console.log(stock.replace(/(\d+) (\w+)/g, minusOne));
// → no lemon, 1 cabbage, and 100 eggs

This takes a string, finds all occurrences of a number followed by an alphanumeric word, and returns a string in which every such number is decremented by one.

The (\d+) group ends up as the amount argument to the function, and the (\w+) group is bound to unit. The function converts amount to a number, which always works because it matched \d+, and makes some adjustments in case there is only one or zero left.

Greed

It is not hard to use replace to write a function that removes all comments from a piece of JavaScript code. Here is a first attempt:

function stripComments(code) {
  return code.replace(/\/\/.*|\/\*[^]*\*\//g, "");
}
console.log(stripComments("1 + /* 2 */3"));
// → 1 + 3
console.log(stripComments("x = 10;// ten!"));
// → x = 10;
console.log(stripComments("1 /* a */+/* b */ 1"));
// → 1  1

The part before the or operator matches two slash characters followed by any number of non-newline characters. The part for multiline comments is more involved. We use [^] (any character that is not in the empty set of characters) as a way to match any character. We cannot just use a dot here because block comments can continue on a new line, and dots do not match the newline character.

But the output of the last example appears to have gone wrong. Why?

The [^]* part of the expression will first match as much as it can. If that causes the next part of the pattern to fail, the matcher moves back one character and tries again from there. In the example, the matcher first tries to match the whole rest of the string and then moves back from there. It will find an occurrence of */ after going back four characters and match that. This is not what we wanted: the intention was to match a single comment, not to go all the way to the end of the code and find the end of the last block comment.

Because of this behavior, we say the repetition operators (+, *, ?, and {}) are greedy, meaning they match as much as they can and backtrack from there. If you put a question mark after them (+?, *?, ??, {}?), they become nongreedy and start by matching as little as possible, matching more only when the remaining pattern does not fit the smaller match.

And that is exactly what we want in this case. By having the star match the smallest stretch of characters that brings us to a */, we consume one block comment and nothing more.

function stripComments(code) {
  return code.replace(/\/\/.*|\/\*[^]*?\*\//g, "");
}
console.log(stripComments("1 /* a */+/* b */ 1"));
// → 1 + 1

A lot of bugs in regular expression programs can be traced to unintentionally using a greedy operator where a nongreedy one would work better. When using a repetition operator, consider the nongreedy variant first.
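The difference is easy to see on a small input. The following sketch (the sample text and variable names are made up for illustration) matches an angle-bracketed tag with a greedy and a nongreedy plus:

```javascript
var markup = "<b>bold</b> and <i>italic</i>";

// Greedy: .+ runs to the end and backs up only as far as the last ">".
var greedy = markup.match(/<.+>/)[0];
// Nongreedy: .+? stops at the first ">" it can reach.
var nongreedy = markup.match(/<.+?>/)[0];

console.log(greedy);
// → <b>bold</b> and <i>italic</i>
console.log(nongreedy);
// → <b>
```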

Dynamically creating RegExp objects

There are cases where you might not know the exact pattern you need to match against when you are writing your code. Say you want to look for the user's name in a piece of text and enclose it in underscore characters. Since you will know the name only once the program is actually running, you cannot use the slash-based notation.

But you can build up a string and use the RegExp constructor on that. Here is an example:

var name = "harry";
var text = "And Harry has a scar on his forehead.";
var regexp = new RegExp("\\b(" + name + ")\\b", "gi");
console.log(text.replace(regexp, "_$1_"));
// → And _Harry_ has a scar on his forehead.

When creating the \b boundary markers, we have to use two backslashes because we are writing them in a normal string, not a slash-enclosed regular expression. The second argument to the RegExp constructor contains the options for the regular expression, in this case "gi" for global and case insensitive.

But what if the name is "dea+hl[]rd" (because our user is a wannabe hacker)? That would result in a nonsensical regular expression, which will not actually match the user's name.

To work around this, we can add backslashes before any character that we don't trust. Adding backslashes before letters is a bad idea because things like \b and \n have a special meaning. But escaping everything that is not alphanumeric or whitespace is safe.

var name = "dea+hl[]rd";
var text = "This dea+hl[]rd guy is super annoying.";
var escaped = name.replace(/[^\w\s]/g, "\\$&");
var regexp = new RegExp("\\b(" + escaped + ")\\b", "gi");
console.log(text.replace(regexp, "_$1_"));
// → This _dea+hl[]rd_ guy is super annoying.
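The escaping step can be wrapped in small reusable functions. The names escapeForRegExp and highlight below are hypothetical, and the sketch assumes the name starts and ends with a word character, since otherwise the \b markers around it would not match.

```javascript
// Hypothetical helper: put a backslash before every character that is
// neither a word character nor whitespace.
function escapeForRegExp(text) {
  return text.replace(/[^\w\s]/g, "\\$&");
}

// Hypothetical helper: wrap every whole-word occurrence of name in
// underscores, ignoring case.
function highlight(text, name) {
  var regexp = new RegExp("\\b(" + escapeForRegExp(name) + ")\\b", "gi");
  return text.replace(regexp, "_$1_");
}

console.log(escapeForRegExp("dea+hl[]rd"));
// → dea\+hl\[\]rd
console.log(highlight("This dea+hl[]rd guy is super annoying.", "dea+hl[]rd"));
// → This _dea+hl[]rd_ guy is super annoying.
console.log(highlight("Harry met harry.", "harry"));
// → _Harry_ met _harry_.
```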

The search method

The indexOf method on strings cannot be called with a regular expression. But there is another method, search, which does expect a regular expression. Like indexOf, it returns the first index at which the expression was found, or -1 when it was not found.

console.log("  word".search(/\S/));
// → 2
console.log("    ".search(/\S/));
// → -1

Unfortunately, there is no way to tell search to start looking at a given offset (as can be done with the second argument to indexOf). That would often be useful.
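One possible workaround, sketched here under the assumption that the pattern does not depend on what comes before the offset, is to search a slice of the string and add the offset back. The helper name searchFrom is made up for this example.

```javascript
// Hypothetical helper: like search, but starts looking at index `from`.
// Caveat: markers such as ^ and \b treat the start of the slice as the
// start of the string, so results can differ for patterns that use them.
function searchFrom(string, pattern, from) {
  var result = string.slice(from).search(pattern);
  return result === -1 ? -1 : result + from;
}

console.log(searchFrom("  word  again", /\S/, 0));
// → 2
console.log(searchFrom("  word  again", /\S/, 6));
// → 8
console.log(searchFrom("word  ", /\S/, 4));
// → -1
```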

The lastIndex property

The exec method similarly does not provide a convenient way to start searching from a given position in the string. But it does provide an inconvenient way.

Regular expression objects have properties. One such property is source, which contains the string that the expression was created from. Another is lastIndex, which controls, in some limited circumstances, where the next match will start.

Those circumstances are that the regular expression must have the global (g) option enabled, and the match must happen through the exec method. A more sensible solution would have been to just allow an extra argument to be passed to exec, but being sensible is not a defining characteristic of JavaScript's regular expression interface.

var pattern = /y/g;
pattern.lastIndex = 3;
var match = pattern.exec("xyzzy");
console.log(match.index);
// → 4
console.log(pattern.lastIndex);
// → 5

If the match was successful, the call to exec automatically updates the lastIndex property to point after the match. If no match was found, lastIndex is set back to zero, which is also the value it has in a newly constructed regular expression object.

When using a global regular expression value for multiple exec calls, these automatic updates to lastIndex can cause problems. Your regular expression might accidentally start at an index that was left over from a previous call.

var digit = /\d/g;
console.log(digit.exec("here it is: 1"));
// → ["1"]
console.log(digit.exec("and now: 1"));
// → null

Another interesting effect of the global option is that it changes the way the match method on strings works. When called with a global expression, instead of returning an array similar to that returned by exec, match will find all matches of the pattern in the string and return an array containing the matched strings.

console.log("Banana".match(/an/g));
// → ["an", "an"]

So be cautious with global regular expressions. The cases where they are necessary, calls to replace and places where you explicitly want to use lastIndex, are typically the only places where you want to use them.
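When a global expression does have to be reused across strings, one defensive habit is to reset lastIndex by hand before each search. The sketch below traces the values involved; the variable names are only for illustration.

```javascript
var digit = /\d/g;

var first = digit.exec("here it is: 1");
var indexAfterFirst = digit.lastIndex;   // just past the match
console.log(first[0], indexAfterFirst);
// → 1 13

// The next search starts at 13, beyond the end of this shorter string.
var second = digit.exec("and now: 1");
var indexAfterSecond = digit.lastIndex;  // reset after the failure
console.log(second, indexAfterSecond);
// → null 0

// Resetting lastIndex explicitly makes the search start from the beginning.
digit.lastIndex = 0;
var third = digit.exec("and now: 1");
console.log(third.index);
// → 9
```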

Looping over matches

A common task is to scan through all occurrences of a pattern in a string, in a way that gives us access to the match object in the loop body, by using lastIndex and exec.

var input = "A string with 3 numbers in it... 42 and 88.";
var number = /\b(\d+)\b/g;
var match;
while (match = number.exec(input))
  console.log("Found", match[1], "at", match.index);
// → Found 3 at 14
//   Found 42 at 33
//   Found 88 at 40

This makes use of the fact that the value of an assignment expression is the assigned value. By using match = number.exec(input) as the condition in the while loop, we perform the match at the start of each iteration, save its result in a variable, and stop looping when no more matches are found.
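The same loop can be packaged as a function that collects the matches into an array. This is a sketch with a made-up name, allMatches; it insists on the g option (without it the loop would never end) and nudges lastIndex forward after an empty match for the same reason.

```javascript
// Hypothetical helper: collect the text and position of every match.
function allMatches(pattern, input) {
  if (!pattern.global)
    throw new Error("allMatches needs a pattern with the g option");
  pattern.lastIndex = 0; // do not inherit a position from earlier use
  var results = [], match;
  while (match = pattern.exec(input)) {
    results.push({text: match[0], index: match.index});
    // An empty match does not advance lastIndex, so move it by hand.
    if (match[0] === "") pattern.lastIndex++;
  }
  return results;
}

var numbersFound = allMatches(/\b(\d+)\b/g,
                              "A string with 3 numbers in it... 42 and 88.");
numbersFound.forEach(function(item) {
  console.log("Found", item.text, "at", item.index);
});
// → Found 3 at 14
//   Found 42 at 33
//   Found 88 at 40
```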

Parsing INI Files

To conclude the chapter, we will look at a problem that calls for regular expressions. Imagine we are writing a program to automatically harvest information about our enemies from the Internet. (We will not actually write that program here, just the part that reads the configuration file. Sorry.) The configuration file looks like this:

searchengine=http://www.google.com/search?q=$1
spitefulness=9.7
; comments are preceded by a semicolon
; each section concerns an individual enemy
[larry]
fullname=Larry Doe
type=kindergarten bully
website=http://www.geocities.com/CapeCanaveral/11451
[gargamel]
fullname=Gargamel
type=evil wizard
outputdir=/home/marijn/enemies/gargamel

The exact rules for this format (which is widely used and usually called INI) are as follows:

- blank lines and lines starting with a semicolon are ignored
- lines enclosed in square brackets start a new section
- lines containing an alphanumeric identifier followed by = add a setting to the current section

Anything else is invalid.

Our task is to convert a string like this into an array of objects, each with a name property and an array of settings. We need one such object for each section and one more for the global settings at the top of the file.

Since the format has to be processed line by line, splitting the file into separate lines is a good start. In chapter 6 we used string.split("\n") to do this. Some operating systems, however, separate lines not with a single \n character but with the two characters \r\n. Since the split method also accepts a regular expression as its argument, we can split on the expression /\r?\n/, which allows both \n and \r\n between lines.

function parseINI(string) {
  // Start with an object to hold the top-level settings
  var currentSection = {name: null, fields: []};
  var categories = [currentSection];
  string.split(/\r?\n/).forEach(function(line) {
    var match;
    if (/^\s*(;.*)?$/.test(line)) {
      return;
    } else if (match = line.match(/^\[(.*)\]$/)) {
      currentSection = {name: match[1], fields: []};
      categories.push(currentSection);
    } else if (match = line.match(/^(\w+)=(.*)$/)) {
      currentSection.fields.push({name: match[1],
                                  value: match[2]});
    } else {
      throw new Error("Line '" + line + "' is invalid.");
    }
  });
  return categories;
}

This code goes over every line in the file, updating the current section object (currentSection) as it goes along. First, it checks whether the line can be ignored, using the expression /^\s*(;.*)?$/. Do you see how it works? The part between the parentheses matches comments, and the ? makes sure the expression also matches lines containing only whitespace.

If the line is not a comment, the code then checks whether the line starts a new section. If so, it creates a new current section object, to which subsequent settings will be added.

The last meaningful possibility is that the line is a normal setting, which the code adds to the current section object.

If a line matches none of these forms, the function throws an error.

Note the recurring use of ^ and $ to make sure the expression matches the whole line, not just part of it. Leaving these out results in code that mostly works but behaves strangely for some input, which can be a difficult bug to track down.

The pattern if (match = string.match(...)) is similar to the trick of using an assignment as the condition of a while loop. You often aren't sure that your call to match will succeed, so you can access the resulting object only inside an if statement that tests for this. To not break the pleasant chain of if forms, we assign the result of the match to a variable and immediately use that assignment as the test.
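To see the three expressions at work in isolation, the sketch below applies them to single lines. The function classifyLine and the shape of its return value are made up for this illustration; only the regular expressions are the ones used in parseINI.

```javascript
// Hypothetical helper: report which kind of INI line this is.
function classifyLine(line) {
  var match;
  if (/^\s*(;.*)?$/.test(line)) {
    return {kind: "ignored"};
  } else if (match = line.match(/^\[(.*)\]$/)) {
    return {kind: "section", name: match[1]};
  } else if (match = line.match(/^(\w+)=(.*)$/)) {
    return {kind: "setting", name: match[1], value: match[2]};
  } else {
    return {kind: "invalid"};
  }
}

console.log(classifyLine("; a comment").kind);
// → ignored
console.log(classifyLine("[larry]").name);
// → larry
console.log(classifyLine("fullname=Larry Doe").value);
// → Larry Doe
console.log(classifyLine("no equals sign here").kind);
// → invalid
```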

International characters

Because of JavaScript's initially simplistic implementation, and the fact that this simplistic approach was later set in stone as standard behavior, JavaScript's regular expressions are rather dumb about characters that do not appear in the English language. For example, as far as JavaScript's regular expressions are concerned, a "word character" is only one of the 26 letters of the English alphabet (uppercase or lowercase), a digit, or, for some reason, the underscore. Characters like é or β, which most definitely are letters, do not match \w (and do match \W, the nonword category).

By a strange historical accident, \s (whitespace) does not have this problem and matches all characters that the Unicode standard considers whitespace, including things like the non-breaking space and the Mongolian vowel separator.

Some regular expression implementations in other languages have special syntax for matching specific categories of Unicode characters, such as "all uppercase letters", "all punctuation", or "control characters". There are plans to add such categories to JavaScript, but they will probably not be realized soon.
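The behavior of \w and \s described above is easy to verify. In this sketch the non-English characters are written as Unicode escapes so that the example does not depend on the file's encoding; the variable names are only for illustration.

```javascript
var eAcute = "\u00e9";           // é
var beta = "\u03b2";             // β
var nonBreakingSpace = "\u00a0";

// Neither letter counts as a word character, but the underscore does.
console.log(/\w/.test(eAcute), /\w/.test(beta), /\w/.test("_"));
// → false false true
// Both letters fall into the nonword category instead.
console.log(/\W/.test(eAcute), /\W/.test(beta));
// → true true
// Whitespace matching, on the other hand, goes beyond the ASCII range.
console.log(/\s/.test(nonBreakingSpace));
// → true
```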

Summary

Regexes are objects that represent patterns in strings. They use their own syntax to express these patterns.

/abc/     Sequence of characters
/[abc]/   Any character from the list
/[^abc]/  Any character except the characters from the list
/[0-9]/   Any character from the range
/x+/      One or more occurrences of the pattern x
/x+?/     One or more occurrences, non-greedy
/x*/      Zero or more occurrences
/x?/      Zero or one occurrence
/x{2,4}/  Two to four occurrences
/(abc)/   Group
/a|b|c/   Any from several patterns
/\d/      Any digit
/\w/      Any alphanumeric character ("letter")
/\s/      Any whitespace
/./       Any character except line feeds
/\b/      Word boundary
/^/       Beginning of line
/$/       End of line
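A few entries of the table can be tried out directly. The sample strings below are made up for illustration.

```javascript
var markup = "<b>one</b> and <b>two</b>";

// Greedy + takes as much as it can; non-greedy +? takes as little.
console.log(markup.match(/<b>.+<\/b>/)[0]);   // <b>one</b> and <b>two</b>
console.log(markup.match(/<b>.+?<\/b>/)[0]);  // <b>one</b>

// {2,4} together with ^ and $: the whole string must be 2 to 4 digits.
console.log(/^\d{2,4}$/.test("123"));    // true
console.log(/^\d{2,4}$/.test("12345"));  // false

// Group, alternation, optional character and word boundaries.
console.log(/\b(cat|dog)s?\b/.test("two dogs"));  // true

// Inverted set: is there any non-digit character?
console.log(/[^0-9]/.test("2014"));  // false
```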

Regexes have a test method that checks whether a given string contains a match for the pattern. They also have an exec method which, when a match is found, returns an array containing the whole match followed by all matched groups. That array has an index property holding the position at which the match started.
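As a short illustration of both methods (the pattern and the input string are arbitrary examples):

```javascript
var quoted = /'([^']*)'/;
var input = "she said 'hello'";

console.log(quoted.test(input));  // true

var result = quoted.exec(input);
console.log(result[0]);     // 'hello'   (the whole match)
console.log(result[1]);     // hello     (the first group)
console.log(result.index);  // 9         (where the match starts)

// When nothing matches, exec returns null rather than an array.
console.log(quoted.exec("no quotes here"));  // null
```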

Strings have a match method to match them against a regex, and a search method, which returns only the starting position of the match. Their replace method can replace matches of a pattern with another string. Alternatively, you can pass replace a function, which builds the replacement string from the matched text and the matched groups.
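The string methods can be sketched as follows; the sample strings are chosen only for illustration.

```javascript
var sentence = "one two 100";

console.log(sentence.match(/\d+/)[0]);  // 100
console.log(sentence.search(/\d/));     // 8
console.log(sentence.search(/x/));      // -1 (no match)

// Replacing with a plain string.
console.log("papa".replace(/p/, "m"));  // mapa

// Replacing with a function: it receives the whole match and then
// each group, and returns the replacement text.
var swapped = "Hopper, Grace".replace(/(\w+), (\w+)/,
  function(whole, last, first) {
    return first + " " + last;
  });
console.log(swapped);  // Grace Hopper
```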

Regexes can have options, which are written after the closing slash. The i option makes the match case insensitive, and the g option makes the expression global, which, among other things, causes the replace method to replace all matches instead of just the first.
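The effect of the two options is easy to check on small examples:

```javascript
// i: ignore case.
console.log(/abc/.test("ABC"));   // false
console.log(/abc/i.test("ABC"));  // true

// g: replace every match, not just the first one.
console.log("Borobudur".replace(/[ou]/, "a"));   // Barobudur
console.log("Borobudur".replace(/[ou]/g, "a"));  // Barabadar

// One of the "other things": with g, match returns all matches.
var allNumbers = "a1b22c333".match(/\d+/g);
console.log(allNumbers.join(","));  // 1,22,333
```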

The RegExp constructor can be used to create a regex from a string.
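This is mostly useful when part of the pattern is only known at run time. The sketch below assumes a user name stored in a variable; escapeRegExp is a hypothetical helper written for this example, not a built-in function.

```javascript
// Hypothetical helper: put a backslash before every character that is
// neither alphanumeric nor whitespace, so it loses any special meaning.
function escapeRegExp(text) {
  return text.replace(/[^\w\s]/g, "\\$&");
}

// Backslashes must be doubled inside the string: "\\b" becomes \b.
var userName = "harry";
var nameRegexp = new RegExp("\\b(" + escapeRegExp(userName) + ")\\b", "gi");
console.log("Harry is a suspicious character.".replace(nameRegexp, "_$1_"));
// _Harry_ is a suspicious character.

// A name full of special characters still works after escaping.
var oddName = "dea+hl[]rd";
var oddRegexp = new RegExp("\\b(" + escapeRegExp(oddName) + ")\\b", "gi");
console.log("This dea+hl[]rd guy is super annoying.".replace(oddRegexp, "_$1_"));
// This _dea+hl[]rd_ guy is super annoying.
```

The second argument to RegExp holds the options (here g and i), just as they would appear after the closing slash.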

Regexes are a sharp tool with an awkward handle. They simplify some tasks tremendously, but can quickly become unmanageable when applied to other, more complex problems. Part of knowing how to use them is resisting the temptation to cram into them a task for which they are not intended.

Exercises

While working on these exercises you will almost inevitably get confused, and sometimes frustrated, by the inexplicable behavior of some regex. It can help to enter the expression into an online tool such as debuggex.com, look at its visualization, and compare it with what you intended.

Regexp golf

Code golf is a game in which you try to express a given program in as few characters as possible. Regexp golf, similarly, is the practice of writing the smallest possible regex that matches a given pattern, and only that pattern.

For each of the following items, write a regex that tests whether any of the given substrings occur in a string. The regex should match only strings containing one of the specified substrings. Do not worry about word boundaries unless they are explicitly mentioned. Once your expression works, try to make it smaller.

- car and cat
- pop and prop
- ferret, ferry, and ferrari
- Any word ending with ious
- A space followed by a period, comma, colon or semicolon.
- A word longer than six letters
- A word without the letter e

// Fill in your regexes
verify(/.../,
       ["my car", "bad cats"],
       ["camper", "high art"]);
verify(/.../,
       ["pop culture", "mad props"],
       ["plop"]);
verify(/.../,
       ["ferret", "ferry", "ferrari"],
       ["ferrum", "transfer A"]);
verify(/.../,
       ["how delicious", "spacious room"],
       ["ruinous", "consciousness"]);
verify(/.../,
       ["bad punctuation ."],
       ["escape the dot"]);
verify(/.../,
       ["hottentottententen"],
       ["no", "hotten totten tenten"]);
verify(/.../,
       ["red platypus", "wobbling nest"],
       ["earth bed", "learning ape"]);
function verify(regexp, yes, no) {
  // Ignore unfinished exercises
  if (regexp.source == "...") return;
  yes.forEach(function(s) {
    if (!regexp.test(s))
      console.log("Failure to match '" + s + "'");
  });
  no.forEach(function(s) {
    if (regexp.test(s))
      console.log("Unexpected match for '" + s + "'");
  });
}

Quotes in text

Suppose you have written a story and used single quotation marks throughout to mark pieces of dialogue. Now you want to replace the dialogue quotes with double quotes, while keeping the single quotes used in contractions like aren't.

Come up with a pattern that distinguishes between these two uses of quotation marks, and write a call to the replace method that performs the replacement.

Numbers again

A sequence of digits can be matched by the simple regex /\d+/.

Write an expression that matches only JavaScript-style numbers. It must support an optional minus or plus sign in front of the number, a decimal point, and exponent notation (5e-3 or 1E10), again with an optional sign in front of the exponent. Also note that there do not have to be digits before or after the point, but the number cannot consist of a point alone. That is, .5 and 5. are valid numbers, but a lone point is not.

// Fill in your regex here.
var number = /^...$/;
// Tests:
["1", "-1", "+15", "1.55", ".5", "5.", "1.3e2", "1E-4",
 "1e+12"].forEach(function(s) {
  if (!number.test(s))
    console.log("Failed to match '" + s + "'");
});
["1a", "+-1", "1.2.3", "1+1", "1e4.5", ".5.", "1f5",
 "."].forEach(function(s) {
  if (number.test(s))
    console.log("Incorrectly accepted '" + s + "'");
});
