One way to search for unescaped characters using new JavaScript tools

    1. How it all began


    Recently I needed to write yet another utility that processes text files in a format resembling simplified BBCode: DSL (Dictionary Specification Language), the source format for ABBYY Lingvo dictionaries. (Not to be confused with the other DSL, domain-specific language. It is a curious case of a hyponym that is also a homonym of its hypernym.)

    Suffice it to say that the language uses tags in square brackets and that square brackets can be escaped with a backslash if you want to use them as part of plain text.

    One of the utility's tasks was to find these tags while skipping escaped combinations.

    Since lookbehind assertions have recently become usable in JavaScript regular expressions (for personal purposes, at least), I wondered whether the search could be implemented with this tool, especially since this kind of lookbehind allows variable-length expressions.

    2. Preliminary remarks


    To follow the experiment below, you need to be familiar with a few new JavaScript features.

    1. Template literals - the long-awaited strings with variable interpolation.

    2. String.raw(). This function is comparable to single quotes in Perl and the r'' prefix in Python: all of them help create strings in which the escape character is interpreted literally.

    3. Lookbehind assertions (including how to activate them in Google Chrome and Node.js).
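
    As a small illustration of the first two features, here is a minimal sketch. The variable names (name, cooked, raw) are invented for this example and do not appear in the script that follows.

```javascript
'use strict';

const r = String.raw; // short alias, the same one the article's script uses

const name = 'tag';

// An ordinary template literal interpolates the placeholder and processes
// escape sequences, so \x5c turns into a single backslash character.
const cooked = `[${name}]\x5c`;

// String.raw also interpolates the placeholder, but leaves escape sequences
// untouched, so \x5c stays as four separate characters.
const raw = r`[${name}]\x5c`;

console.log(cooked);        // [tag]\
console.log(raw);           // [tag]\x5c
console.log(cooked.length); // 6
console.log(raw.length);    // 9
```

    The raw form is what makes it convenient to write regular expression fragments as strings without doubling every backslash.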

    3. Implementation


    Script code with a trial (naive) implementation of search and verification:

    /******************************************************************************/
    'use strict';
    /******************************************************************************/
    // Short alias for String.raw, similar to the r'' prefix in Python.
    const r = String.raw;

    // Building blocks; \x5b is "[", \x5c is "\", \x5d is "]".
    const startOfString = '^';
    const notEscapeSymbol = r`[^\x5c]`;
    // One or more backslash pairs after the string start or a non-backslash.
    const escapedEscapeSymbols = r`(?:${startOfString}|${notEscapeSymbol})(?:\x5c{2})+`;
    const tag = r`\x5b[^\x5d]+\x5d`;

    // A tag counts only if one of the three valid predecessors comes before it.
    const tagRE = new RegExp(
      `(?<=${startOfString}|${notEscapeSymbol}|${escapedEscapeSymbols})${tag}`, 'g'
    );

    // Plain tags, tags after escaped backslashes, and escaped tags.
    console.log(r`[tag]text[/tag]`.match(tagRE));
    console.log(r`\\[tag]text\\\\[/tag]`.match(tagRE));
    console.log(r`\[tag]text\\\[/tag]`.match(tagRE));
    /******************************************************************************/
    

    First, we create an alias for String.raw so that it can be used in a short form, like the r'' prefix in Python.

    Then we create the components of the future regular expression.

    I proceeded from the assumption that a valid tag can be preceded by one of three things: the beginning of the string, any character except a backslash, or an escaped backslash (that is, a pair of backslashes). In the last case you must make sure that the escaping backslash is not itself escaped. In other words, the tag can be preceded by an even number of backslashes, which in turn are preceded by either the beginning of the string or any character other than a backslash.

    Thus the composite regular expression needs four key elements: the tag itself and its three valid predecessors, namely the beginning of the string, any character except a backslash, and an escaped backslash repeated one or more times. The third predecessor can be expressed as one of the first two followed by one or more pairs of backslashes.

    To keep the expression from dazzling the eyes, I replaced all literal backslashes and square brackets with hexadecimal escapes ([ is \x5b, \ is \x5c, ] is \x5d).
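
    To see why the substitution is safe, here is a minimal sketch comparing the two spellings. The names withLiterals, withHexCodes and sample are invented for this example.

```javascript
'use strict';

// Two spellings of "a backslash followed by an opening square bracket".
const withLiterals = /\\\[/;
const withHexCodes = /\x5c\x5b/;

// String.raw keeps the backslash, so this is two characters: \ and [
const sample = String.raw`\[`;

console.log(sample.length);             // 2
console.log(withLiterals.test(sample)); // true
console.log(withHexCodes.test(sample)); // true

// Inside character classes the hex form avoids escaping "]" and "\".
console.log(/[^\x5d]/.test(']')); // false
console.log(/[^\x5c]/.test('a')); // true
```

    Both patterns match the same text; the hex form simply never puts a bare backslash or bracket next to the regular expression's own syntax.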

    The regular expression assembled from these parts is equivalent to the following literal (it can replace the entire first part of the script if you assign it to the variable tagRE directly):

    /(?<=^|[^\x5c]|(?:^|[^\x5c])(?:\x5c{2})+)\x5b[^\x5d]+\x5d/g
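
    As a sanity check, the sketch below compares the literal with the pattern assembled from parts and then probes the "even number of backslashes" rule. The names literalRE, composedRE and parity are invented for this example, and it assumes a runtime where lookbehind assertions are enabled.

```javascript
'use strict';

const r = String.raw;

// The literal form quoted above.
const literalRE = /(?<=^|[^\x5c]|(?:^|[^\x5c])(?:\x5c{2})+)\x5b[^\x5d]+\x5d/g;

// The same pattern assembled from parts, as in the script.
const startOfString = '^';
const notEscapeSymbol = r`[^\x5c]`;
const escapedEscapeSymbols = r`(?:${startOfString}|${notEscapeSymbol})(?:\x5c{2})+`;
const tag = r`\x5b[^\x5d]+\x5d`;
const composedRE = new RegExp(
  `(?<=${startOfString}|${notEscapeSymbol}|${escapedEscapeSymbols})${tag}`, 'g'
);

console.log(composedRE.source === literalRE.source); // true

// Parity check: n backslashes in front of "[tag]" should leave the tag
// unescaped only when n is even.
const parity = [];
for (let n = 0; n <= 5; n++) {
  const input = 'x' + '\\'.repeat(n) + '[tag]';
  parity.push(literalRE.test(input));
  literalRE.lastIndex = 0; // reset, because the regex carries the g flag
}
console.log(parity); // [ true, false, true, false, true, false ]
```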

    At the end of the script, the resulting expression is tested on a minimal set of valid and escaped tags. The first test string contains a tag at the beginning of the string and a tag after a character other than a backslash. The second contains tags after escaped backslashes, which are preceded by either the beginning of the string or a character other than a backslash. The third contains escaped tags.

    The following result is displayed in the console:

    [ '[tag]', '[/tag]' ]
    [ '[tag]', '[/tag]' ]
    null

    When evaluating this solution, two caveats should be kept in mind:

    1. This is an implementation for home use, not for mass production (at least until lookbehind assertions come out from behind the flag in Node.js and Google Chrome and are implemented in other browsers).

    2. This expression is not intended to verify the correctness of the contents of the tags themselves, only to distinguish them from escaped combinations.
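
    Regarding the first caveat: where lookbehind is unavailable, one possible fallback is to let the pattern consume escaped pairs itself and capture only what remains. This is a sketch of a different technique, not the approach of this article. The function findUnescapedTags is hypothetical, and I have checked it only against the three test strings above. Note that it returns an empty array where match() returns null.

```javascript
'use strict';

const r = String.raw;

// Hypothetical helper, not part of the original script.
// The first alternative swallows a backslash together with the character it
// escapes; the second captures a bracketed run that was reached outside
// such a pair.
function findUnescapedTags(text) {
  const scanner = /\x5c[\s\S]|(\x5b[^\x5d]+\x5d)/g;
  const tags = [];
  let match;
  // Every alternative consumes at least two characters, so exec() always
  // advances lastIndex and the loop terminates.
  while ((match = scanner.exec(text)) !== null) {
    if (match[1] !== undefined) tags.push(match[1]);
  }
  return tags;
}

console.log(findUnescapedTags(r`[tag]text[/tag]`));       // [ '[tag]', '[/tag]' ]
console.log(findUnescapedTags(r`\\[tag]text\\\\[/tag]`)); // [ '[tag]', '[/tag]' ]
console.log(findUnescapedTags(r`\[tag]text\\\[/tag]`));   // []
```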

    I would be grateful if you pointed out any risks I have overlooked, and for any optimization tips.
