Flex & UTF-8

    “Once upon a time, it seems, last Friday,” I needed a lexical analyzer that could work with Unicode data.

    I wanted to use Flex as the lexer generator, and that turned out to be a real problem.
    Flex itself cannot work with Unicode data: when it builds its automaton, it assumes that characters are 7 or 8 bits wide.
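
    To make the limitation concrete: in UTF-8 every non-ASCII character is encoded as two to four bytes, so an 8-bit scanner sees several input symbols where a reader sees one character. The small sketch below only illustrates this; `hexBytes` is a hypothetical helper name, not part of Flex.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: render each byte of a string as two hex digits,
// separated by spaces, to show what an 8-bit scanner actually receives.
std::string hexBytes(const std::string& s) {
    std::string out;
    char buf[3];  // two hex digits plus the terminating NUL
    for (unsigned char c : s) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(c));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

    For example, the Cyrillic letter "я" (U+044F) is the two bytes `d1 8f` in UTF-8, so `hexBytes("\xd1\x8f")` yields `"d1 8f"`, while the ASCII letter "A" stays a single byte, `"41"`.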

    I came across flex-2.5.4a-unicode-patch, but it only supports 16-bit characters and is tied to one specific Flex version, with everything that implies.

    Meanwhile, there is a simple and perfectly workable solution that does not require getting your hands dirty rebuilding the tool itself.

    Declare the UTF-8 byte classes in the definitions section:
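    %option 8bit
    %option c++
    ...
    alpha   [A-Za-z]
    /* U1: UTF-8 continuation byte */
    U1      [\x80-\xbf]
    /* U2, U3, U4: lead bytes of 2-, 3- and 4-byte sequences */
    U2      [\xc2-\xdf]
    U3      [\xe0-\xef]
    U4      [\xf0-\xf4]
    /* any multi-byte UTF-8 sequence is treated as a letter */
    ualpha  {alpha}|{U2}{U1}|{U3}{U1}{U1}|{U4}{U1}{U1}{U1}
    /* '+' rather than '*', so that a name is never empty */
    uname   ({ualpha}|_)+
    ...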
    
    and voilà, the definitions can be used in the rules section:
    %%
    ...
    {uname} {
      ...
      yylval.str_ = std::string(yytext);
      return XyzParser::ttName;
    }
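
    To check which byte strings the pattern accepts without running Flex, the byte classes above can be mirrored in plain C++. This is only a minimal sketch: `matchOne` and `matchesUname` are hypothetical names introduced here, and the sketch assumes a name must contain at least one character.

```cpp
#include <cstddef>
#include <string>

// Byte classes with the same ranges as the Flex definitions alpha and U1..U4.
static bool isAlpha(unsigned char c) { return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'); }
static bool isU1(unsigned char c) { return c >= 0x80 && c <= 0xbf; }  // continuation byte
static bool isU2(unsigned char c) { return c >= 0xc2 && c <= 0xdf; }  // lead byte, 2-byte sequence
static bool isU3(unsigned char c) { return c >= 0xe0 && c <= 0xef; }  // lead byte, 3-byte sequence
static bool isU4(unsigned char c) { return c >= 0xf0 && c <= 0xf4; }  // lead byte, 4-byte sequence

// Length in bytes of one {ualpha} or '_' starting at s[i], or 0 if none matches.
static std::size_t matchOne(const std::string& s, std::size_t i) {
    const unsigned char c = static_cast<unsigned char>(s[i]);
    if (isAlpha(c) || c == '_') return 1;
    const std::size_t tail = isU2(c) ? 1 : isU3(c) ? 2 : isU4(c) ? 3 : 0;
    if (tail == 0 || s.size() - i <= tail) return 0;  // not a lead byte, or truncated
    for (std::size_t k = 1; k <= tail; ++k)
        if (!isU1(static_cast<unsigned char>(s[i + k]))) return 0;
    return tail + 1;
}

// True if the whole string is a non-empty run of {ualpha} and '_'.
bool matchesUname(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const std::size_t n = matchOne(s, i);
        if (n == 0) return false;
        i += n;
    }
    return !s.empty();
}
```

    With this sketch, `matchesUname("\xd0\xb8\xd0\xbc\xd1\x8f")` (the Cyrillic word "имя") is accepted, while `matchesUname("a1")` is rejected because the pattern has no digit class. Note that the pattern does not distinguish letters from other non-ASCII characters: any well-formed-looking multi-byte sequence, such as "€" (`\xe2\x82\xac`), counts as part of a name.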
    
