PHP: How to parse a complex XML file and not drown in your own code

Hello!

XML is used very widely. Along with CSV, JSON, and others, it is one of the most common formats for exchanging data between services, programs, and sites. One example is the CommerceML format, used to exchange goods and orders between 1C "Trade Management" and an online store.

So almost everyone who builds web services has to parse XML documents from time to time. In this post I describe one way to do that as clearly and transparently as possible using XMLReader.

PHP offers several ways to work with XML. Without going into detail, they fall into two groups:

  1. Loading the entire XML document into memory as an object and working with this object
  2. Step-by-step reading of an XML string at the level of tags, attributes, and text content

The first approach is more intuitive and the code is easier to follow. It works well for small files.
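
As a point of comparison, here is a minimal sketch of the first approach using SimpleXML. The sample document and its tag names (shop, record, qty, price) are made up for illustration.

```php
<?php
// Hypothetical sample document; the tag names are assumptions for this sketch.
$xml = '<shop>'
     . '<record><qty>2</qty><price>30</price></record>'
     . '<record><qty>3</qty><price>45</price></record>'
     . '</shop>';

// simplexml_load_string() parses the whole document into an object tree in memory.
$shop = simplexml_load_string($xml);

$total = 0;
foreach ($shop->record as $record) {
    // Child elements are reached as properties; cast them to get plain integers.
    $total += (int)$record->qty * (int)$record->price;
}

echo "Total: {$total}\n"; // prints "Total: 195"
```

The code reads almost like the document itself, but the whole tree has to fit in memory before the loop starts.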

The second approach is lower-level. It gives us a number of advantages but also makes life somewhat harder. Let us look at it in more detail. Pros:

  • Parsing speed. You can read more about it here.
  • Lower memory consumption. We do not hold all the data as an object, which is very expensive in memory.

But we sacrifice code readability. If the goal is, say, to sum the values found at certain places in an XML file with a simple structure, there is no problem.
However, if the file structure is complex, if the way we handle a piece of data depends on the full path to it, and if the result must include many parameters, the code becomes rather chaotic.
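
To make the readability cost concrete, here is a rough sketch of the second approach with plain XMLReader. The sample document and tag names are made up, and the path-tracking scheme (`$path`, `$record`, `$sums`) is just one possible way to do the bookkeeping by hand.

```php
<?php
// Hypothetical sample document; the tag names are assumptions for this sketch.
$xml = '<shop>'
     . '<record><type>product</type><qty>2</qty><price>30</price></record>'
     . '<record><type>service</type><qty>3</qty><price>45</price></record>'
     . '</shop>';

$reader = new XMLReader();
$reader->XML($xml);

$path = array();    // names of the currently open elements
$record = array();  // fields collected for the current record
$sums = array('product' => 0, 'service' => 0);

while ($reader->read()) {
    switch ($reader->nodeType) {
        case XMLReader::ELEMENT:
            if ($reader->isEmptyElement) {
                break; // <x/> has no matching END_ELEMENT, so there is nothing to track
            }
            $path[] = $reader->name;
            if (implode('/', $path) === 'shop/record') {
                $record = array();
            }
            break;
        case XMLReader::TEXT:
            // What to do with the text depends on the full path to it.
            $current = implode('/', $path);
            if ($current === 'shop/record/type') {
                $record['type'] = $reader->value;
            } elseif ($current === 'shop/record/qty') {
                $record['qty'] = (int)$reader->value;
            } elseif ($current === 'shop/record/price') {
                $record['price'] = (int)$reader->value;
            }
            break;
        case XMLReader::END_ELEMENT:
            if (implode('/', $path) === 'shop/record') {
                $sums[$record['type']] += $record['qty'] * $record['price'];
            }
            array_pop($path);
            break;
    }
}

echo "Products: {$sums['product']}, services: {$sums['service']}\n";
```

Even for three fields the logic is spread over three cases of the switch, and every new tag adds another branch.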

So I wrote a class that has since made my life easier. With it, the parsing rules are simpler to write, the programs are much easier to read and several times shorter, and the code looks cleaner.

The main idea is this: we store both the schema of our XML and the way we process it in a single array that mirrors the hierarchy of only the tags we need. For any tag in that array we can also register handler functions for opening the tag, closing it, reading its attributes, reading its text, or any combination of these. The XML structure and the handlers thus live in one place, and a single glance at the processing structure is enough to understand what we do with the file. I should note that on simple tasks (as in the examples below) the gain in readability is small, but it becomes obvious with files of relatively complex structure, for example an exchange format with 1C.
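
The descent-and-return mechanics behind this idea can be sketched without any XML at all. In the sketch below the parser is replaced by a hand-written list of open/close events; `$events` and `$visited` are names invented for this illustration, and the structure is a toy one.

```php
<?php
// Hypothetical structure: only the tags we care about, mirroring their nesting.
$structure = array(
    'root' => array(
        'b' => array(
            'x' => array(),
        ),
    ),
);

// Simulated parser events as [kind, tag name, depth]; a real run would take them from XMLReader.
$events = array(
    array('open',  'root', 0),
    array('open',  'a',    1),  // not in the structure, so it is skipped
    array('close', 'a',    1),
    array('open',  'b',    1),
    array('open',  'x',    2),
    array('close', 'x',    2),
    array('close', 'b',    1),
    array('close', 'root', 0),
);

$visited = array();     // tags that were matched against the structure
$stack = array();       // saved positions, indexed by depth
$node = &$structure;    // current position inside the structure
$skipToDepth = false;   // depth of the unknown tag being skipped, if any

foreach ($events as $event) {
    list($kind, $name, $depth) = $event;
    if ($kind === 'open') {
        if ($skipToDepth !== false) {
            continue;                  // inside an unknown branch: ignore
        }
        if (isset($node[$name])) {
            $stack[$depth] = &$node;   // remember where to return to
            $node = &$node[$name];     // descend into the matching sub-array
            $visited[] = $name;
        } else {
            $skipToDepth = $depth;     // skip until this tag closes
        }
    } elseif ($skipToDepth === false) {
        $node = &$stack[$depth];       // known tag closed: climb back up
    } elseif ($depth === $skipToDepth) {
        $skipToDepth = false;          // the unknown tag closed: stop skipping
    }
}

echo implode(' > ', $visited), "\n"; // prints "root > b > x"
```

Because `$node` is always a reference into the array, looking up the next tag is a single `isset()` call no matter how deep the document is.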

Now the specifics. Here is our class:

Debug version (with the $debug parameter):
class XMLReaderStruct extends XMLReader {
  public function xmlStruct($xml, $structure, $encoding = null, $options = 0, $debug = false) {
    $this->xml($xml, $encoding, $options);
    $stack = array();
    $node = &$structure;
    $skipToDepth = false;
    while ($this->read()) {
      switch ($this->nodeType) {
        case self::ELEMENT:
          if ($skipToDepth === false) {
            // If the current branch is outside the structure, opening tags are simply ignored. Otherwise: if the current structure node
            // contains the current tag, descend into it, first saving the current position on the stack so that we can return when the
            // tag closes. If it does not, enter skip mode until the closing tag at the current depth is met.
            if (isset($node[$this->name])) {
              if ($debug) echo "[ Open  ]: ",$this->name," - found in the structure. Descending.\r\n";
              $stack[$this->depth] = &$node;
              $node = &$node[$this->name];
              if (isset($node["__open"])) {
                if ($debug) echo "           Found an open handler for ",$this->name," - running it.\r\n";
                if (false === $node["__open"]()) return false;
              }
              if (isset($node["__attrs"])) {
                if ($debug) echo "           Found an attribute handler for ",$this->name," - running it.\r\n";
                $attrs = array();
                if ($this->hasAttributes) {
                  while ($this->moveToNextAttribute())
                    $attrs[$this->name] = $this->value;
                  // Move the cursor back to the element: while it sits on an attribute,
                  // isEmptyElement is false and depth is one level deeper.
                  $this->moveToElement();
                }
                if (false === $node["__attrs"]($attrs)) return false;
              }
              if ($this->isEmptyElement) {
                if ($debug) echo "           Element ",$this->name," is empty. Returning up the structure.\r\n";
                if (isset($node["__close"])) {
                  if ($debug) echo "           Found a close handler for ",$this->name," - running it.\r\n";
                  if (false === $node["__close"]()) return false;
                }
                $node = &$stack[$this->depth];
              }
            } elseif ($this->isEmptyElement) {
              // An empty element produces no closing tag, so there is nothing to skip.
              if ($debug) echo "[ Open  ]: ",$this->name," - not found in the structure, empty element. Ignored.\r\n";
            } else {
              $skipToDepth = $this->depth;
              if ($debug) echo "[ Open  ]: ",$this->name," - not found in the structure. Entering skip mode until depth ",$skipToDepth," is reached.\r\n";
            }
          } else {
            if ($debug) echo "( Open  ): ",$this->name," - in skip mode.\r\n";
          }
          break;
        case self::TEXT:
          if ($skipToDepth === false) {
            if ($debug) echo "[ Text  ]: ",$this->value," - in the structure.\r\n";
            if (isset($node["__text"])) {
              if ($debug) echo "           Found a text handler - running it.\r\n";
              if (false === $node["__text"]($this->value)) return false;
            }
          } else {
            if ($debug) echo "( Text  ): ",$this->value," - in skip mode.\r\n";
          }
          break;
        case self::END_ELEMENT:
          if ($skipToDepth === false) {
            // If $skipToDepth is not set, the preceding opening tag was inside the structure,
            // so the current structure node must be rolled back.
            if ($debug) echo "[ Close ]: ",$this->name," - we are in the structure. Climbing up the structure.\r\n";
            if (isset($node["__close"])) {
              if ($debug) echo "           Found a close handler for ",$this->name," - running it.\r\n";
              if (false === $node["__close"]()) return false;
            }
            $node = &$stack[$this->depth];
          } elseif ($this->depth === $skipToDepth) {
            // If $skipToDepth is set, everything deeper is ignored until we reach the closing tag of the skipped element at that depth.
            if ($debug) echo "[ Close ]: ",$this->name," - depth ",$skipToDepth," reached. Leaving skip mode.\r\n";
            $skipToDepth = false;
          } else {
            if ($debug) echo "( Close ): ",$this->name," - in skip mode.\r\n";
          }
          break;
      }
    }
    return true;
  }
}


Release version (without the $debug parameter and comments):
class XMLReaderStruct extends XMLReader {
  public function xmlStruct($xml, $structure, $encoding = null, $options = 0) {
    $this->xml($xml, $encoding, $options);
    $stack = array();
    $node = &$structure;
    $skipToDepth = false;
    while ($this->read()) {
      switch ($this->nodeType) {
        case self::ELEMENT:
          if ($skipToDepth === false) {
            if (isset($node[$this->name])) {
              $stack[$this->depth] = &$node;
              $node = &$node[$this->name];
              if (isset($node["__open"]) && (false === $node["__open"]()))
                return false;
              if (isset($node["__attrs"])) {
                $attrs = array();
                if ($this->hasAttributes) {
                  while ($this->moveToNextAttribute())
                    $attrs[$this->name] = $this->value;
                  $this->moveToElement();
                }
                if (false === $node["__attrs"]($attrs))
                  return false;
              }
              if ($this->isEmptyElement) {
                if (isset($node["__close"]) && (false === $node["__close"]()))
                  return false;
                $node = &$stack[$this->depth];
              }
            } else {
              if (!$this->isEmptyElement)
                $skipToDepth = $this->depth;
            }
          }
          break;
        case self::TEXT:
          if ($skipToDepth === false) {
            if (isset($node["__text"]) && (false === $node["__text"]($this->value)))
              return false;
          }
          break;
        case self::END_ELEMENT:
          if ($skipToDepth === false) {
            if (isset($node["__close"]) && (false === $node["__close"]()))
              return false;
            $node = &$stack[$this->depth];
          } elseif ($this->depth === $skipToDepth) {
            $skipToDepth = false;
          }
          break;
      }
    }
    return true;
  }
}



As you can see, our class extends the standard XMLReader class with one additional method:

xmlStruct($xml, $structure, $encoding = null, $options = 0, $debug = false)

Parameters:

  • $xml, $encoding, $options: as in XMLReader::xml()
  • $structure: an associative array that fully describes how to process the file. It assumes the file's layout is known in advance, so we know exactly which tags we need and what to do with each of them.
  • $debug: (Debug version only) whether to print debugging information (off by default).

The $structure argument.

This is an associative array whose structure mirrors the tag hierarchy of the XML file. In addition, each element of the structure can have handler functions, defined as fields with the corresponding keys:

  • "__open" - called when the tag opens - function()
  • "__attrs" - called with the tag's attributes as an associative array (empty if there are none) - function($assocArray)
  • "__text" - called when the tag has a text value - function($text)
  • "__close" - called when the tag closes - function()

If any handler returns false, parsing is aborted and xmlStruct() returns false. The following examples show how to construct the $structure argument:

Example 1 showing the order in which handlers are called
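Suppose there is an XML file of roughly this shape (any attributes on a are omitted here):

<root>
  <a>Abc</a>
  <b>
    <x>This is node x inside b</x>
  </b>
  <d>
    <x>This is node x inside d</x>
  </d>
</root>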


    $structure = array(
      'root' => array(
        'a' => array(
          "__attrs" => function($array) { echo "ATTR ARRAY IS ",json_encode($array),"\r\n"; },
          "__text" => function($text) { echo "TEXT a {$text}\r\n"; }
        ),
        'b' => array(
          "__open" => function() { echo "OPEN b\r\n"; },
          "__close" => function() { echo "CLOSE b\r\n"; },
          'x' => array(
            "__open" => function() { echo "OPEN x\r\n"; },
            "__text" => function($text) { echo "TEXT x {$text}\r\n"; },
            "__close" => function() { echo "CLOSE x\r\n"; }
          )
        )
      )
    );
    $xmlReaderStruct = new XMLReaderStruct();
    $xmlReaderStruct->xmlStruct($xml, $structure); // $xml holds the document above as a string

The handlers will be called in this order:

attributes root->a
text root->a
open root->b
open root->b->x
text root->b->x
close root->b->x
close root->b

No other elements are processed; in particular, root->d->x is ignored because it is outside the structure.

Example 2 illustrating a simple practical task
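Suppose there is an XML file of roughly this shape:

<shop>
  <record><id>0</id><type>product</type><name>Some product name. ID:0</name><qty>0</qty><price>0</price></record>
  <record><id>1</id><type>service</type><name>Some product name. ID:1</name><qty>1</qty><price>15</price></record>
  <record><id>2</id><type>product</type><name>Some product name. ID:2</name><qty>2</qty><price>30</price></record>
  <record><id>3</id><type>service</type><name>Some product name. ID:3</name><qty>3</qty><price>45</price></record>
  <record><id>4</id><type>product</type><name>Some product name. ID:4</name><qty>4</qty><price>60</price></record>
  <record><id>5</id><type>service</type><name>Some product name. ID:5</name><qty>5</qty><price>75</price></record>
</shop>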

This is a sales receipt containing goods and services.

Each receipt record contains the record identifier, the type (a product, “product”, or a service, “service”), the name, the quantity, and the price.

Task: calculate the receipt total, separately for goods and services.


include_once "xmlreaderstruct.class.php";
$x = new XMLReaderStruct();
$productsSum = 0;
$servicesSum = 0;
$structure = array(
  'shop' => array(
    'record' => array(
      'type'    => array( "__text" => function($text) use (&$currentRecord) {
        $currentRecord['isService'] = $text === 'service';
      } ),
      'qty'     => array( "__text" => function($text) use (&$currentRecord) {
        $currentRecord['qty'] = (int)$text;
      } ),
      'price'   => array( "__text" => function($text) use (&$currentRecord) {
        $currentRecord['price'] = (int)$text;
      } ),
      '__open'  => function() use (&$currentRecord) {
        $currentRecord = array();
      },
      '__close' => function() use (&$currentRecord, &$productsSum, &$servicesSum) {
        $money = $currentRecord['qty'] * $currentRecord['price'];
        if ($currentRecord['isService']) $servicesSum += $money;
        else $productsSum += $money;
      }
    )
  )
);
$x->xmlStruct(file_get_contents('example.xml'), $structure);
echo 'Overall products price: ', $productsSum, ', Overall services price: ', $servicesSum;
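
As a further illustration of the rule that a handler returning false aborts parsing, here is a hedged sketch that stops at the first service record. The sample XML is made up, `$seen` and `$finished` are names introduced for this example, and the class is a condensed copy of the release version, included only so that the snippet runs on its own. The copy assumes two small safeguards: the cursor is moved back to the element after its attributes are read, and unknown empty elements do not start skip mode.

```php
<?php
// Condensed copy of the release class, included only so this sketch is standalone.
class XMLReaderStruct extends XMLReader {
  public function xmlStruct($xml, $structure, $encoding = null, $options = 0) {
    $this->xml($xml, $encoding, $options);
    $stack = array();
    $node = &$structure;
    $skipToDepth = false;
    while ($this->read()) {
      switch ($this->nodeType) {
        case self::ELEMENT:
          if ($skipToDepth !== false) break;
          if (!isset($node[$this->name])) {
            // Unknown tag: skip its subtree (an empty element has no subtree).
            if (!$this->isEmptyElement) $skipToDepth = $this->depth;
            break;
          }
          $stack[$this->depth] = &$node;
          $node = &$node[$this->name];
          if (isset($node["__open"]) && (false === $node["__open"]())) return false;
          if (isset($node["__attrs"])) {
            $attrs = array();
            while ($this->moveToNextAttribute()) $attrs[$this->name] = $this->value;
            $this->moveToElement(); // put the cursor back on the element itself
            if (false === $node["__attrs"]($attrs)) return false;
          }
          if ($this->isEmptyElement) {
            if (isset($node["__close"]) && (false === $node["__close"]())) return false;
            $node = &$stack[$this->depth];
          }
          break;
        case self::TEXT:
          if ($skipToDepth === false && isset($node["__text"])
              && (false === $node["__text"]($this->value))) return false;
          break;
        case self::END_ELEMENT:
          if ($skipToDepth === false) {
            if (isset($node["__close"]) && (false === $node["__close"]())) return false;
            $node = &$stack[$this->depth];
          } elseif ($this->depth === $skipToDepth) {
            $skipToDepth = false;
          }
          break;
      }
    }
    return true;
  }
}

// Hypothetical sample document for this sketch.
$xml = '<shop>'
     . '<record><type>product</type><qty>2</qty></record>'
     . '<record><type>service</type><qty>3</qty></record>'
     . '<record><type>service</type><qty>5</qty></record>'
     . '</shop>';

$seen = 0; // how many type values were inspected before parsing stopped
$structure = array(
  'shop' => array(
    'record' => array(
      'type' => array("__text" => function($text) use (&$seen) {
        $seen++;
        // Returning false aborts parsing; any other return value continues.
        return $text !== 'service';
      }),
    ),
  ),
);

$reader = new XMLReaderStruct();
$finished = $reader->xmlStruct($xml, $structure);

// The handler stopped at the first service record, so the third record was never read.
var_dump($finished, $seen); // bool(false), int(2)
```

This can be handy for large files when only the first matching record is needed, since the rest of the document is never read.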
