senia April 11, 2013 at 13:20

A language within a language, or embedding XPath in Scala

Tutorial

Scala is a great language. You can fall in love with it. The code can be concise yet understandable, flexible yet strongly typed. Carefully thought-out tools let you express your ideas in the language instead of fighting it.

But these same tools also let you write extremely complex code.
Using scalaz-style cleverness or shapeless-style type-level computation practically guarantees that only a handful of people will understand your code.

In this article I will talk about something that is most likely not worth doing.
I will show how to embed a different language in Scala.

Although I will be embedding XPath, the method described here suits any language for which you can build a syntax tree.

Motivation

In Scala you can use every tool for working with XML that exists in Java (and there are many of them). But the resulting code will look like good old Java code. Not a very cheerful prospect.

There is also native XML support built into the language syntax:

scala> 
     |   
     |    text
     | 
res0: scala.xml.Elem = 
 text
scala> res0 \\ "node"
res1: scala.xml.NodeSeq = NodeSeq(,  text)
scala> res1 \\ "@attr"
res2: scala.xml.NodeSeq = NodeSeq(aaa, 111)

This looks like happiness, but it is not. It only remotely resembles XPath, and even slightly complex queries become cumbersome and unreadable.

After some acquaintance with Scala, however, it becomes clear that its creators call it a scalable language for a reason. If something is missing, it can be added.

I set myself the task of getting as close to XPath as possible, with convenient integration into the language.

Result

All the code is here: https://github.com/senia-psm/scala-xpath.git

How to try it out.

If you do not have git and sbt yet, install them and, if necessary, configure a proxy for each (for sbt, there is a special txt file for such options in Program Files (x86)\SBT\).

Clone the repository:

git clone https://github.com/senia-psm/scala-xpath.git

Go to the folder with the repository (scala-xpath) and open the REPL in the project:

sbt console

Many of the examples also assume that the following imports have been performed:

import senia.scala_xpath.macros._, senia.scala_xpath.model._

What and how

The way to reach the goal is determined entirely by the goal itself.
Embedding XPath as a DSL is obviously a dead end: the result would no longer be XPath. An XPath expression can only be placed in Scala as a string.
Which means we need:

Parser combinators. We will have to parse the string to validate it.
String interpolation. For embedding variables and functions in XPath.
Macros. For verification at compile time.

Preparing the object model

We take the XPath 1.0 specification and rewrite it in Scala.
Almost all the logic is expressed through the type system and Scala's inheritance mechanism. The exceptions are a couple of places where restrictions are imposed through require.
It is worth noting the “sealed” keyword here, which forbids inheriting from a class (or implementing an interface) outside the current file. In particular, in pattern matching “sealed” lets the compiler verify that all possible cases have been covered.
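As a minimal sketch of this idea (the names Expr, NumberLit, StringLit and VarRef are hypothetical and are not taken from the repository), a sealed hierarchy with one require-based restriction might look like this:

```scala
object SealedSketch {
  // Hypothetical fragment of an XPath-like model; the names are illustrative only.
  sealed trait Expr
  final case class NumberLit(value: Double) extends Expr
  final case class StringLit(value: String) extends Expr
  final case class VarRef(name: String) extends Expr {
    // A restriction that the type system alone does not express.
    require(name.nonEmpty, "variable name must not be empty")
  }

  // Because Expr is sealed, the compiler can warn if a case is missing here.
  def render(e: Expr): String = e match {
    case NumberLit(v) => v.toString
    case StringLit(v) => "'" + v + "'"
    case VarRef(n)    => "$" + n
  }
}
```

Removing one of the cases from render should produce a "match may not be exhaustive" warning, which is the compiler check mentioned above.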

Parsing XPath

Introduction to Parsers

Parsers are functions that take a sequence of elements and, on success, return the result of processing together with the rest of the sequence.
Unsuccessful results come in two forms: “Failure” and “Error”.
Figuratively speaking, a parser bites off a piece from the beginning of the sequence and converts what it has bitten off into an object of a certain type.

The simplest parser checks that the first element of the sequence equals a predetermined one and returns that element as a successful result. The remainder is the sequence without this element.

To create such a parser from an element, use the accept method, which takes the element. This method is declared implicit, so if the compiler encounters an element where it expects a parser, it inserts a call to this method on the element.
Let's say we parse a sequence of characters:

def elementParser: Parser[Char] = 'c' // before compilation
def elementParser: Parser[Char] = accept('c') // during compilation

So if, when parsers are being combined, you see an element where a parser should be, you know that such an elementary parser is meant.

By and large, this is the only parser that is defined explicitly.
All other parsers are obtained by combining other parsers and transforming their results.
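To make the "function from input to result plus remainder" idea concrete, here is a hand-written sketch that does not use the library. The Parser type alias and this version of accept are assumptions made for illustration, and failure is simplified to None, so the library's distinction between “Failure” and “Error” is not modelled:

```scala
object ElementaryParserSketch {
  // A parser consumes a prefix of the input and returns (result, rest) on success.
  type Parser[A] = List[Char] => Option[(A, List[Char])]

  // Hand-written analogue of accept: succeeds only if the first element equals `expected`.
  def accept(expected: Char): Parser[Char] = {
    case head :: tail if head == expected => Some((head, tail))
    case _                                => None
  }
}
```

Applied to the characters of "cat", accept('c') bites off 'c' and leaves 'a' and 't'; applied to "dog", it fails.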

Combining parsers

A white lie

Strictly speaking, Scala has no operators, only methods, but if you know that, you most likely do not need an explanation of parsers.

The binary operator "~". Combines two parsers on the "and" principle. It succeeds only if the first parser succeeds and then the second succeeds on the remainder left by the first.
Figuratively speaking, the first parser bites off what suits it, and then the second works on the leftovers.
The result is a container holding the results of both parsers.

parser1 ~ parser2

Any number of parsers can be combined this way.
This combinator has two siblings: "~>" and "<~". They work the same way but return the result of only one of the combined parsers.

The binary operator "|". Combines parsers on the "or" principle. It succeeds if at least one of the parsers succeeds on the original input. If the first parser returns “failure” (but not an error), the same input is fed to the second.

rep. Sequence. If you have a parser “myParser”, then the parser created with “rep(myParser)” keeps “biting” the input with “myParser” until the first unsuccessful application. The results of all the “bites” are collected into a collection.
There are related combinators as well.

Transforming the result

If you want to transform the parsing result, the operators ^^^ and ^^ come to the rescue.
^^^ replaces the result with the given constant, and ^^ applies the given function to the result.
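The combinators described above can be imitated in a few lines. This is a hand-rolled sketch, not the library implementation: the class P and the parsers char, digit, number and pair are hypothetical names, and failure is again simplified to None.

```scala
object CombinatorSketch {
  // Simplified: None stands for "failure"; the library's separate "error" case is not modelled.
  final case class P[A](run: List[Char] => Option[(A, List[Char])]) {
    // "and": run this parser, then `that` on the remainder; keep both results.
    def ~[B](that: => P[B]): P[(A, B)] = P[(A, B)] { in =>
      run(in).flatMap { case (a, rest) =>
        that.run(rest).map { case (b, rest2) => ((a, b), rest2) }
      }
    }
    // "or": try this parser; on failure feed the same input to `that`.
    def |(that: => P[A]): P[A] = P[A](in => run(in).orElse(that.run(in)))
    // Transform the result with a function.
    def ^^[B](f: A => B): P[B] = P[B](in => run(in).map { case (a, rest) => (f(a), rest) })
    // Replace the result with a constant.
    def ^^^[B](b: => B): P[B] = this.^^(_ => b)
  }

  // Elementary parser: accepts exactly one given character.
  def char(c: Char): P[Char] = P[Char] {
    case h :: t if h == c => Some((h, t))
    case _                => None
  }

  // Apply `p` until it fails and collect the results; never fails itself.
  def rep[A](p: P[A]): P[List[A]] = P[List[A]] { in =>
    @annotation.tailrec
    def loop(rest: List[Char], acc: List[A]): (List[A], List[Char]) = p.run(rest) match {
      case Some((a, next)) => loop(next, a :: acc)
      case None            => (acc.reverse, rest)
    }
    Some(loop(in, Nil))
  }

  val digit: P[Int] = ('0' to '9').map(c => char(c) ^^^ (c - '0')).reduce(_ | _)
  val number: P[Int] = rep(digit) ^^ (_.foldLeft(0)(_ * 10 + _))
  val pair: P[(Int, Int)] = (number ~ char(',') ~ number) ^^ { case ((a, _), b) => (a, b) }
}
```

Here digit is built from "|" and "^^^", number from rep and "^^", and pair from "~", so one small example touches every combinator mentioned in this section.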

Parser combination (and the literacy of the W3C specification) lets you write the parser almost without thinking.
In effect, we rewrite the specification a second time. The only significant difference is that I replaced recursive definitions with “cyclic” ones (rep and repsep).

For example:

Specification:

[15] PrimaryExpr ::= VariableReference
                   | '(' Expr ')'
                   | Literal
                   | Number
                   | FunctionCall

Parser:

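  // The "|" operators stay at the end of each line so that the expression continues onto the next one
  def primaryExpr: Parser[PrimaryExpr] =
    variableReference |
    `(` ~> expr <~ `)` ^^ { GroupedExpr } |
    functionCall |
    number |
    literal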

The only condition: in a union through "|", the "stricter" parsers must come before the rest. In this example, literal will obviously succeed wherever functionCall succeeds, simply because it successfully parses the function name, so if literal came earlier, parsing would never get to functionCall.
The whole set of parsers fits into about a hundred and fifty lines, which is significantly shorter than the definition of the object model.
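The ordering condition can be illustrated with a small hand-written sketch. The parsers name and call below are hypothetical stand-ins, not the article's literal and functionCall; they only show how a looser alternative placed first shadows a stricter one:

```scala
object OrderingSketch {
  type Parser[A] = String => Option[(A, String)]

  // Loose parser: a non-empty run of letters.
  val name: Parser[String] = in => {
    val (letters, rest) = in.span(_.isLetter)
    if (letters.isEmpty) None else Some((letters, rest))
  }

  // Stricter parser: a name that must be followed by "()".
  val call: Parser[String] = in => name(in).collect {
    case (n, rest) if rest.startsWith("()") => (n + "()", rest.drop(2))
  }

  // Ordered choice: the second parser is tried only if the first one fails.
  def or[A](p: Parser[A], q: Parser[A]): Parser[A] = in => p(in).orElse(q(in))

  val strictFirst: Parser[String] = or(call, name)
  val looseFirst: Parser[String]  = or(name, call)
}
```

On the input "f()", strictFirst consumes the whole call, while looseFirst stops after "f" and leaves "()" unparsed, because name already succeeded and call was never tried.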

Mixing in variables

To add variables to the expression we will use the string interpolation mechanism that appeared in version 2.10.
The mechanism is quite simple: on encountering a string literal immediately preceded (with no space) by a valid method name, the compiler performs a simple transformation:

t"strinf $x interpolation ${ obj.name.toString } "
StringContext("strinf ", " interpolation ", " ").t(x, { obj.name.toString })

The string is split into pieces at the occurrences of variables and expressions, and the pieces are passed to the StringContext factory method. The name preceding the string is used as the method name, and all variables and expressions are passed to that method as parameters.
For methods like "s" and "f" that is the end of it; for methods that StringContext does not have, the compiler looks for an implicit class: a wrapper over StringContext that contains the desired method. Such a search is a general Scala mechanism and is not specific to string interpolation.
Final code:

 new MyStringContextHelper(StringContext("strinf ", " interpolation ", " ")).t(x, { obj.name.toString })
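A minimal runnable sketch of such a wrapper follows. The body of t is invented purely for illustration (it wraps every embedded value in square brackets); only the mechanism of finding t through an implicit class is the point:

```scala
object InterpolatorSketch {
  // Hypothetical wrapper: the compiler finds it when it sees t"..." because
  // StringContext itself has no method named t.
  implicit class MyStringContextHelper(val sc: StringContext) {
    def t(args: Any*): String = {
      // sc.parts holds the literal pieces; args holds the embedded values.
      val pieces = sc.parts.iterator
      val values = args.iterator
      val sb = new StringBuilder(pieces.next())
      while (values.hasNext) {
        sb.append("[").append(values.next()).append("]")
        sb.append(pieces.next())
      }
      sb.toString
    }
  }

  val x = 42
  val demo: String = t"value $x end"
}
```

The literal t"value $x end" is split into the parts "value " and " end" with x as the only argument, so the method sees exactly the pieces described above.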

But what about our parser? We no longer have a continuous sequence of characters. Instead we have a sequence of characters and something else.
Has all the work gone down the drain?
This is where the ability to parse more than just sequences of characters proves useful.
We have a sequence of characters and something else (more on that later). This is exactly what the concept of Either describes. Sigrlami has translated a couple of articles about Either on Habr.
To regain the full power of parsers, we only need to write a couple of auxiliary tools, in particular conversions from Char, String and Regex to the corresponding parsers.
Here is the tool you need: EitherParsers. Note the abstract type R: no assumptions are made about it, so the toolkit suits a representation of variables that is not known in advance.
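As an illustration of what the parser input becomes, the sketch below interleaves the literal parts with the embedded arguments. The function toInput is a hypothetical helper written for this example and is not part of EitherParsers; it relies on the fact that an interpolated string always has exactly one more part than it has arguments:

```scala
object EitherInputSketch {
  // Characters of the literal parts become Left, each embedded argument becomes one Right.
  def toInput[R](parts: Seq[String], args: Seq[R]): List[Either[Char, R]] = {
    val lefts: Seq[List[Either[Char, R]]] =
      parts.map(_.toList.map(c => (Left(c): Either[Char, R])))
    val rights: Seq[List[Either[Char, R]]] =
      args.map(a => List((Right(a): Either[Char, R])))
    // zipAll pads the shorter side (the arguments) with an empty list.
    lefts.zipAll(rights, Nil, Nil).flatMap { case (l, r) => l ++ r }.toList
  }
}
```

For the parts "a=" and " b" with the single argument 1, the result is the characters 'a' and '=', then the argument, then ' ' and 'b', which is the kind of mixed sequence an Either-based parser has to consume.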

Interfering with compilation

In my opinion there is little documentation and few sensible examples on macros. That does not mean I am going to write an exhaustive explanation of what macros are and how to use them.
First of all, you should know that a macro is invoked when the compiler encounters a call to a method implemented with the macro keyword, and that the macro implementation must return a newly built syntax tree.
Let's see what kind of tree we would have to produce for the simplest example:

scala> import scala.reflect.runtime.universe._
import scala.reflect.runtime.universe._
scala> showRaw(reify( "str" -> 'symb ))
res0: String = Expr(Apply(Select(Apply(Select(Ident(scala.Predef), newTermName("any2ArrowAssoc")), List(Literal(Constant("str")))), newTermName("$minus$greater")), List(Apply(Select(Ident(scala.Symbol), newTermName("apply")), List(Literal(Constant("symb")))))))

There is not the slightest desire to build such a thing by hand.
Let's see what Scala offers for keeping things typed and avoiding manual work.
On the one hand, not much: the literal method, which converts a limited set of "basic types" to syntax trees, and reify, which does all the manual work for you, provided that any outside variables are passed to it as trees of the same kind and are then used through the splice method of that tree. splice exists specifically to tell reify that you want to embed an expression of type Expr[T] into the new tree as a part with the resulting type T.
On the other hand, these methods are quite enough. Additional ones can be written on top of them.

Adding the interpolation itself, processed by a macro, is extremely succinct:

  implicit class XPathContext(sc: StringContext) {
    def xp(as: Any*): LocationPath = macro xpathImpl
  }

The macro-processing function is declared as follows:

def xpathImpl(c: Context)(as: c.Expr[Any]*): c.Expr[LocationPath]

It is clear where to get the variables, but how do we get the strings?
For that, you can use the context to “look out” of the function and, so to speak, look around.
More precisely, to look at the expression on which the target xp method is called.
This can be done using c.prefix.
But what will we find there? It was mentioned earlier that there should be an expression of the form StringContext("strinf ", " interpolation ", " ").
Let's look at the corresponding tree:

scala> import scala.reflect.runtime.universe._
import scala.reflect.runtime.universe._
scala> showRaw(reify(StringContext("strinf ", " interpolation ", " ")))
res0: String = Expr(Apply(Select(Ident(scala.StringContext), newTermName("apply")), List(Literal(Constant("strinf ")), Literal(Constant(" interpolation ")), Literal(Constant(" ")))))

As we can see, all the strings can be obtained explicitly, which is what we will do:

    val strings = c.prefix.tree match {
      case Apply(_, List(Apply(_, ss))) => ss
      case _ => c.abort(c.enclosingPosition, "not a interpolation of XPath. Cannot extract parts.")
    }
    val chars = strings.map{
      case c.universe.Literal(Constant(source: String)) => source.map{ Left(_) }
      case _ => c.abort(c.enclosingPosition, "not a interpolation of XPath. Cannot extract string.")
    }

But it is not only the input that has changed. The result of the parser can no longer be an object from our object model: such an object simply cannot be built from a parameter of the form c.Expr[Any] instead of a value.

We change our parser accordingly. If an external variable can end up in a result in any way, the parser can no longer return T and must return c.Expr[T]. To convert non-elementary types to the corresponding Expr, we write auxiliary literal methods based on the available ones, for example:

  def literal(name: QName): lc.Expr[QName] = reify{ QName(literal(name.prefix).splice, literal(name.localPart).splice) }

The principle behind all such functions is very simple: we take the argument apart into sufficiently elementary parts and reassemble it inside reify.

This requires some mechanical work, but our parser does not change much.

The last step is to introduce several parsers that can parse a variable in the input.
Here is the parser for embedding a variable:

    accept("xc.Expr[Any]", { case Right(e) => e } ) ^? ({
        case e: xc.Expr[BigInt] if confirmType(e, tagOfBigInt.tpe) =>
          reify{ CustomIntVariableExpr(VariableReference(QName(None, NCName(xc.literal(nextVarName).splice))), e.splice) }
        case e: xc.Expr[Double] if confirmType(e, xc.universe.definitions.DoubleClass.toType) =>
          reify{ CustomDoubleVariableExpr(VariableReference(QName(None, NCName(xc.literal(nextVarName).splice))), e.splice) }
        case e: xc.Expr[String] if confirmType(e, xc.universe.definitions.StringClass.toType) =>
          reify{ CustomStringVariableExpr(VariableReference(QName(None, NCName(xc.literal(nextVarName).splice))), e.splice) }
      },
      e => s"Int, Long, BigInt, Double or String expression expected, $e found."
      )

The initial parser, accept("xc.Expr[Any]", { case Right(e) => e }), is very simple: it accepts any Right container holding a tree and returns that tree.
The subsequent conversion determines whether the variable can be used as one of the three desired types and, if so, converts it for such use.
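The shape of this "accept, then convert with a partial function or report an error" step can be imitated on ordinary runtime values. The sketch below is only an analogy: the real parser inspects the compile-time type of an expression tree, while this one matches on runtime values, and the names Variable, IntVar, DoubleVar, StringVar, convert and classify are hypothetical.

```scala
object PartialConversionSketch {
  // Hypothetical result type for the illustration.
  sealed trait Variable
  final case class IntVar(value: BigInt) extends Variable
  final case class DoubleVar(value: Double) extends Variable
  final case class StringVar(value: String) extends Variable

  // Analogue of "^?": apply a partial function, or build an error message from the input.
  def convert[A, B](pf: PartialFunction[A, B], error: A => String)(a: A): Either[String, B] =
    pf.lift(a).toRight(error(a))

  // The accepted cases, mirroring the three target types of the article's parser.
  private val accepted: PartialFunction[Any, Variable] = {
    case i: Int    => IntVar(BigInt(i))
    case l: Long   => IntVar(BigInt(l))
    case b: BigInt => IntVar(b)
    case d: Double => DoubleVar(d)
    case s: String => StringVar(s)
  }

  def classify(a: Any): Either[String, Variable] =
    convert[Any, Variable](accepted, v => s"Int, Long, BigInt, Double or String expected, $v found.")(a)
}
```

A value that matches none of the cases falls through to the error function, which is how the message in the REPL session below arises in the real macro.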

As a result, we get the following behavior:

scala> val xml = 
xml: scala.xml.Elem = 
scala> val os = Option("111")
os: Option[String] = Some(111)
scala> xml \\ xp"*[@attr = $os]" // Option[String] does not suit us
:16: error: Int, Long, BigInt, Double or String expression expected, Expr[Nothing](os) found.
              xml \\ xp"*[@attr = $os]"
                     ^
scala> xml \\ xp"*[@attr = ${ os.getOrElse("") } ]" // but a String is fine
res1: scala.xml.NodeSeq = NodeSeq()

The error messages still need to be improved, but variables are already embedded quite conveniently.

Embedding functions required quite a lot of code (23 variants, one for each number of parameters from 0 to 22), and it is not too convenient to use, since the function has to accept Any: mostly a NodeList arrives, but a string or a Double may arrive as well:

scala> import org.w3c.dom.NodeList
import org.w3c.dom.NodeList
scala> val isAllowedAttributeOrText = (_: Any, _: Any) match { // some strange function, possibly not even known in advance
     |   case (a: NodeList, t: NodeList) if a.getLength == 1 && t.getLength == 1 =>
     |     a.head.getTextContent == "aaa" ||
     |     t.head.getTextContent.length > 4
     |   case _ => false
     | }
isAllowedAttributeOrText: (Any, Any) => Boolean = 
scala> val xml = inner text text 
xml: scala.xml.Elem = inner text text 
scala> xml \\ xp"*[$isAllowedAttributeOrText(@attr, text())]"
res0: scala.xml.NodeSeq = NodeSeq(inner text text , inner text)

Here we get the first departure from XPath syntax (apart from the ability to write expressions of the form ${arbitrary code} instead of variables): an embedded function must be preceded by a dollar sign.

Method Implementation

Naturally, the "\" and "\\" methods of scala.xml.NodeSeq that accept an XPath did not appear by magic: they are added through an implicit class in the package object of the model.

Similar methods are added to org.w3c.dom.Node and NodeList.

Applying the resulting XPath, however, raises certain problems.

Unresolved issues

Get rid of java.lang.System.setSecurityManager(null). Judging by the implementation of com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl, there is currently no other way to add a custom function handler.

Compile-time error messages need refinement.
For an incorrectly applied function the error message is perfect (special credit to the compiler's good sense):

scala> xml \\ xp"*[$isAllowedAttributeOrText(@attr)]"
:1: error: type mismatch;
 found   : (Any, Any) => Boolean
 required: Any => Any
              xml \\ xp"*[$isAllowedAttributeOrText(@attr)]"
                           ^

but for all other errors the standard message format is not respected and the position points to the beginning of the string.
Unlike the previous problem, this one can be solved completely.

Performance when working with scala.xml leaves much to be desired. In effect, scala.xml is first converted to w3c.dom through a string, and then the result is converted back.
The only possible solution is to evaluate XPath ourselves.
That would also get rid of the not very convenient typing of functions.

Performance with w3c.dom can be improved slightly. The XPath is currently compiled from a string, although a ready-made object model is available. Converting between the object models directly could speed up XPath creation somewhat.

Conclusion

XPath was integrated into Scala without serious problems or limitations.
Variables and functions from the current scope are valid wherever the specification allows them.
When used with w3c.dom, and with some improvements, even a minor speed-up is possible, because the expression is parsed at compile time.

Everything is much simpler than it seems at first glance.
At first, the very idea of interfering with compilation is terrifying. Yet the result is achieved with minimal effort.
Yes, the compiler API is documented much worse than the main library, but it is logical and understandable.
Yes, IDEA does not understand path-dependent types well, but it provides very convenient navigation, including through the compiler API, and takes implicit conversions into account.
