Parsing XML with Oracle Database
It would seem, why in general there may be a need to parse XML on the side of the database?
But there can be many reasons for this, and each one can have their own. Some, including me, are not at all disdained by the implementation of applied logic by means of the database, but to some it seems an archaic remnant and the usefulness of tools for working with XML in a DBMS may seem dubious to such people. However, I believe that few will object to the usefulness of having such an opportunity during the operation of the application. For example - the wholesaler’s price list application wasn’t accepted by us — he cursed chaotically at the lack of recoding for some positions. More than 20k positions in XML - go figure it out where the dog rummaged, which specifically confused the application. Agree, because it would be great to be able to present a list of products listed in XML as a data set that can be connected to a conversion table, to reveal in one fell swoop all positions that do not have a conversion? And many similar examples can be given. I had occasion to support an application that integrates with external systems through the exchange of XML messages, and despite the fact that the application itself did not use the tools provided by Oracle, it turned out to be very helpful to me and my colleagues with the support of this product.
In this article, I would like to demonstrate how easy and easy it is to parse XML of varying degrees of complexity using the Oracle Database tools.
I absolutely would not want to touch the DOM parser here. I can only say that it is, it is implemented by the DBMS_XMLDOM package. At times, it can be extremely useful for a developer, and dealing with it will not be difficult for anyone who has previously encountered DOM parsers from other manufacturers.
An innovative feature of Oracle is the XMLType type and the means of working with it. This type is part of the XML DB technology, which is included with Oracle Database since version 9.2.
The source text of the XML document can be passed to the XMLType constructor as values of the CLOB, BLOB, VARCHAR2, BFILE types. Perhaps it’s worth noting that BFILE allows you to download a file from the server’s file system - not the client, because if our XML is on the client side and it is large enough to be transmitted as a string in the request, perhaps you should take care of the possibility of delivering a file with XML content on the server file system.
An example of creating an XMLType instance, with the content passed in a string:
After creating an instance of XMLType, you can try to make the first timid steps to parse our XML. The XMLType type implements the Extract method, which, taking an XPatch expression, returns an XML fragment that matches this expression. An XML fragment (XML Fragment), in contrast to a properly constructed XML (whellformed XML), allows the absence of a root element (or, in other words, allows more than one root element in its composition).
So in the example below, three expressions return three XML fragments. The first returns all occurrences of the word element, the second returns only the first occurrence of it, the third returns a fragment of the text content of the word element, for which the value of the seq attribute is two.
Here, I think it’s worth putting a bold emphasis on the fact that in the third case it is the XML fragment that is returned, not the value of this element. Differences will become noticeable only when this value contains escaped characters like &,>. In order to get the value of an element, use the extractValue function. I’ll mention here that the main XMLType methods are duplicated by SQL functions, or vice versa, the main functions of working with XMLType are implemented in the form of its methods. However, extractValue is an exception. extractValue is presented only as a function. XMLType, unfortunately, does not implement the extractValue method.
Perhaps we should also mention the rules for working with namespaces. Not every intuition leads to a correct understanding of these working mechanisms. The functions (and method) extract, extractValue, as one of the parameters, take a namespace description. The namespaces described in this parameter can be used in an XPath expression. And this is exactly what I want to emphasize. Pay attention to the third case. Namespaces in XML and XPatch expressions have different aliases, but the same URIs, because parsing is successful.
So, having learned to extract values, now we should learn to share them. Let me remind you, in the first case, for the first example, we tried to select all word elements from XML, and we succeeded, we got two word elements, but we got them in one fragment. In order to present a fragment containing several root elements as a sequence of fragments, each of which contains one root element, there is a pipelined XMLSeqence function. The function returns an XMLSequenceType, which is a table of XMLType values.
If suddenly someone has forgotten, I remind you that pipelined functions return collections, as it were, therefore, when called, they wrap themselves with the table expression. The results of these functions are accessed using the virtual column column_value, or the expression value (), and therefore for the table expression (table collection excpression) you must define an alias. If suddenly someone did not know this, I recommend to memorize it as a mantra, understanding will come with time, and then only if necessary.
The simplest example of using XMLSequence:
I’ll try to say what’s going on here, although I’m afraid that in Russian it will turn out to be much more messy, and much less clear than in SQL. In the from statement, we first create an instance of XMLType, passing it a string containing the XML text. Next, using the extract method, we extract all b elements that contain element a into one fragment. The resulting XML fragment is passed as a parameter to the XMLSequence pipeline function, for which, according to the rules of grammar, the table clause is used. The record set described by this sentence is assigned the alias t. In select-list'e we get an instance of the object returned by the table expression t, it has the XMLType type. For each row returned by the table expression, this instance contains one fragment of element b of the source XML. We pass this object as a parameter to the extractValue function. The result is on the face.
In fact, everything is far from being as complicated as it turns out in my presentation. It’s enough to get used to it just a little. But the abuse of the brain is not yet fully completed. What we have at this stage only works for one XML document. If we have the source text of several XML in the label and we need to parse several of them at once, we will have to remember what is left correlation. There is nothing military here either. This thing was invented by Oracle and especially for table expressions (table collection expression). The bottom line is that in a table expression, values (columns) from the data sets defined in the from statement in front of (to the left) of the table expression itself can be used. In practice, this does not look as awful at all as by ear:
Here, in the table expression t, the xml value of the demo3 table is used. An expression will be computed for each row of the demo3 table. This is the very thing called such a pretentious word - left correlation.
The described functionality is enough to parse XML of almost any complexity. Perhaps, only hierarchically presented data of obviously unknown depth of nesting cannot be analyzed by these means. To parse such structures, you will have to resort to XSLT to bring XML to a more readable look. XSLT conversion is performed by the XMLTransform function, which takes the XMLType of the source document, the second XMLType XSL template as the first parameter, and returns the XMLType of the conversion result.
In principle, this can be done with theory. In conclusion, I will only demonstrate an example of extracting elements from different levels of XML nesting. For beginners, this, at times, causes difficulties.
As you can see, there is nothing new. All the same left correlation. The only thing I would like to pay attention to is (+) at the end of the table expression subdtl. As probably not hard to guess, it means that an external connection should be used. If we did not specify it, we would not get a string with detail 3.
So, what appeared before our eyes? We have one object type, a relatively limited set of functions, giving an almost unlimited set of features. I really like this implementation. What amazes me most is that Oracle corp did not have to align the semantics of their SQL to fit XML into it. All the described features - objects, pipeline functions, table expressions are used by this technology, but are not created specifically for it. It turns out that such an implementation could be embodied by anyone. This bold implementation emphasizes the power and flexibility of Oracle's SQL engine.
On this note, I could have finished, however, the question that I anticipate does not give me rest, but with a reproach. “Man, what century do you live in at all, have you looked at the calendar for a long time? The year 2011 is approaching the end in the yard, far from the first production was raised to the 11r2 version of the database, and you all chew on the ninth functional. ” Yes, there is such a sinner behind me. I know very well that in the 10th version they introduced the wonderful XMLTable, which completely puts into the background the functionality I just described. This makes XML parsing even easier and even more uncomplicated. However, with XMLTable, I still do not have enough experience to say anything beyond the already obvious. Therefore, I will limit myself to just a simple demonstration.
I will show on the same example:
It would seem that there are a lot more letters, a fair question may arise ... but what is the profit of innovation? The profit is that the first parameter in XMLTable is no longer XPath expression, but XQuery. So, the union can be done precisely by its means, and not by SQL means. XMLTable promises to be still that yummy, but, alas, I repeat, I have nothing to tell about it so far.
But there can be many reasons for this, and each one can have their own. Some, including me, are not at all disdained by the implementation of applied logic by means of the database, but to some it seems an archaic remnant and the usefulness of tools for working with XML in a DBMS may seem dubious to such people. However, I believe that few will object to the usefulness of having such an opportunity during the operation of the application. For example - the wholesaler’s price list application wasn’t accepted by us — he cursed chaotically at the lack of recoding for some positions. More than 20k positions in XML - go figure it out where the dog rummaged, which specifically confused the application. Agree, because it would be great to be able to present a list of products listed in XML as a data set that can be connected to a conversion table, to reveal in one fell swoop all positions that do not have a conversion? And many similar examples can be given. I had occasion to support an application that integrates with external systems through the exchange of XML messages, and despite the fact that the application itself did not use the tools provided by Oracle, it turned out to be very helpful to me and my colleagues with the support of this product.
In this article, I would like to demonstrate how easy and easy it is to parse XML of varying degrees of complexity using the Oracle Database tools.
I absolutely would not want to touch the DOM parser here. I can only say that it is, it is implemented by the DBMS_XMLDOM package. At times, it can be extremely useful for a developer, and dealing with it will not be difficult for anyone who has previously encountered DOM parsers from other manufacturers.
An innovative feature of Oracle is the XMLType type and the means of working with it. This type is part of the XML DB technology, which is included with Oracle Database since version 9.2.
The source text of the XML document can be passed to the XMLType constructor as values of the CLOB, BLOB, VARCHAR2, BFILE types. Perhaps it’s worth noting that BFILE allows you to download a file from the server’s file system - not the client, because if our XML is on the client side and it is large enough to be transmitted as a string in the request, perhaps you should take care of the possibility of delivering a file with XML content on the server file system.
An example of creating an XMLType instance, with the content passed in a string:
select XMLType(
'Hello world
') XML
from dual
After creating an instance of XMLType, you can try to make the first timid steps to parse our XML. The XMLType type implements the Extract method, which, taking an XPatch expression, returns an XML fragment that matches this expression. An XML fragment (XML Fragment), in contrast to a properly constructed XML (whellformed XML), allows the absence of a root element (or, in other words, allows more than one root element in its composition).
So in the example below, three expressions return three XML fragments. The first returns all occurrences of the word element, the second returns only the first occurrence of it, the third returns a fragment of the text content of the word element, for which the value of the seq attribute is two.
SQL> with demo1 as (
2 select XMLType(
3 '
4 Hello
5 world
6
7 ') xml
8 from dual
9 )
10 select t.xml.extract('//word') case1
11 ,t.xml.extract('//word[position()=1]') case2
12 ,t.xml.extract('//word[@seq=2]/text()') case3
13 from demo1 t;
CASE1 CASE2 CASE3
--------------------------- -------------------------- -------
Hello Hello world
world
Here, I think it’s worth putting a bold emphasis on the fact that in the third case it is the XML fragment that is returned, not the value of this element. Differences will become noticeable only when this value contains escaped characters like &,>. In order to get the value of an element, use the extractValue function. I’ll mention here that the main XMLType methods are duplicated by SQL functions, or vice versa, the main functions of working with XMLType are implemented in the form of its methods. However, extractValue is an exception. extractValue is presented only as a function. XMLType, unfortunately, does not implement the extractValue method.
SQL> with demo2 as (select xmltype('<&hello&>') xml from dual)
2 select t.xml.extract('a/text()').getStringVal() case1
3 ,extractValue(t.xml,'a') case2
4 from demo2 t;
CASE1 CASE2
--------------------------- --------------------------
<&hello&> <&hello&>
Perhaps we should also mention the rules for working with namespaces. Not every intuition leads to a correct understanding of these working mechanisms. The functions (and method) extract, extractValue, as one of the parameters, take a namespace description. The namespaces described in this parameter can be used in an XPath expression. And this is exactly what I want to emphasize. Pay attention to the third case. Namespaces in XML and XPatch expressions have different aliases, but the same URIs, because parsing is successful.
SQL> select extractValue(t.xml,'a') case1
2 ,extractValue(t.xml,'a','xmlns="foo"') case2
3 ,extractValue(t.xml,'y:a/@z:val','xmlns:y="foo" xmlns:z="bar"') case3
4 from (select XMLType('a-text') XMl from dual) t;
CASE1 CASE2 CASE3
--------------------------- -------------------------- --------------------------
a-text a-val
So, having learned to extract values, now we should learn to share them. Let me remind you, in the first case, for the first example, we tried to select all word elements from XML, and we succeeded, we got two word elements, but we got them in one fragment. In order to present a fragment containing several root elements as a sequence of fragments, each of which contains one root element, there is a pipelined XMLSeqence function. The function returns an XMLSequenceType, which is a table of XMLType values.
If suddenly someone has forgotten, I remind you that pipelined functions return collections, as it were, therefore, when called, they wrap themselves with the table expression. The results of these functions are accessed using the virtual column column_value, or the expression value (), and therefore for the table expression (table collection excpression) you must define an alias. If suddenly someone did not know this, I recommend to memorize it as a mantra, understanding will come with time, and then only if necessary.
The simplest example of using XMLSequence:
SQL> select extractValue(value(t),'b') result
2 from table(XMLSequence(XMLType('b1b2').extract('a/b'))) t;
RESULT
-------------------------------------------------------------------------------------
b1
b2
I’ll try to say what’s going on here, although I’m afraid that in Russian it will turn out to be much more messy, and much less clear than in SQL. In the from statement, we first create an instance of XMLType, passing it a string containing the XML text. Next, using the extract method, we extract all b elements that contain element a into one fragment. The resulting XML fragment is passed as a parameter to the XMLSequence pipeline function, for which, according to the rules of grammar, the table clause is used. The record set described by this sentence is assigned the alias t. In select-list'e we get an instance of the object returned by the table expression t, it has the XMLType type. For each row returned by the table expression, this instance contains one fragment of element b of the source XML. We pass this object as a parameter to the extractValue function. The result is on the face.
In fact, everything is far from being as complicated as it turns out in my presentation. It’s enough to get used to it just a little. But the abuse of the brain is not yet fully completed. What we have at this stage only works for one XML document. If we have the source text of several XML in the label and we need to parse several of them at once, we will have to remember what is left correlation. There is nothing military here either. This thing was invented by Oracle and especially for table expressions (table collection expression). The bottom line is that in a table expression, values (columns) from the data sets defined in the from statement in front of (to the left) of the table expression itself can be used. In practice, this does not look as awful at all as by ear:
SQL> with demo3 as(select 1 id, XMLType('b1b2') xml from dual
2 union all select 2 id, XMLType('b3b4') xml from dual)
3 select id xml_id
4 ,extractValue(value(t),'b') result
5 from demo3 s,table(XMLSequence(s.xml.extract('a/b'))) t;
XML_ID RESULT
---------- --------------------------------------------------
1 b1
1 b2
2 b3
2 b4
Here, in the table expression t, the xml value of the demo3 table is used. An expression will be computed for each row of the demo3 table. This is the very thing called such a pretentious word - left correlation.
The described functionality is enough to parse XML of almost any complexity. Perhaps, only hierarchically presented data of obviously unknown depth of nesting cannot be analyzed by these means. To parse such structures, you will have to resort to XSLT to bring XML to a more readable look. XSLT conversion is performed by the XMLTransform function, which takes the XMLType of the source document, the second XMLType XSL template as the first parameter, and returns the XMLType of the conversion result.
In principle, this can be done with theory. In conclusion, I will only demonstrate an example of extracting elements from different levels of XML nesting. For beginners, this, at times, causes difficulties.
SQL> with demo4 as(
2 select XMLType(
3 '
4 mater id
5
6
7 detail 1 id
8
9
10 sub_detail 1.1 id
11
12
13 sub_detail 1.2 id
14
15
16
17
18 detail 2 id
19
20
21 sub_detail 2.1 id
22
23
24 sub_detail 2.2 id
25
26
27
28
29 detail 3 id
30
31
32 '
33 ) xml from dual)
34 select extractValue(s.xml,'master/id') master_id
35 ,extractValue(value(dtl),'detail/id') detail_id
36 ,extractValue(value(subdtl),'sub_detail/id') sub_detail_id
37 from demo4 s
38 ,table(XMLSequence(s.xml.extract('master/details/detail'))) dtl
39 ,table(XMLSequence(value(dtl).extract('detail/sub_details/sub_detail')))(+) subdtl;
MASTER_ID DETAIL_ID SUB_DETAIL_ID
--------------------------- -------------------------- --------------------------
mater id detail 1 id sub_detail 1.1 id
mater id detail 1 id sub_detail 1.2 id
mater id detail 2 id sub_detail 2.1 id
mater id detail 2 id sub_detail 2.2 id
mater id detail 3 id
As you can see, there is nothing new. All the same left correlation. The only thing I would like to pay attention to is (+) at the end of the table expression subdtl. As probably not hard to guess, it means that an external connection should be used. If we did not specify it, we would not get a string with detail 3.
So, what appeared before our eyes? We have one object type, a relatively limited set of functions, giving an almost unlimited set of features. I really like this implementation. What amazes me most is that Oracle corp did not have to align the semantics of their SQL to fit XML into it. All the described features - objects, pipeline functions, table expressions are used by this technology, but are not created specifically for it. It turns out that such an implementation could be embodied by anyone. This bold implementation emphasizes the power and flexibility of Oracle's SQL engine.
On this note, I could have finished, however, the question that I anticipate does not give me rest, but with a reproach. “Man, what century do you live in at all, have you looked at the calendar for a long time? The year 2011 is approaching the end in the yard, far from the first production was raised to the 11r2 version of the database, and you all chew on the ninth functional. ” Yes, there is such a sinner behind me. I know very well that in the 10th version they introduced the wonderful XMLTable, which completely puts into the background the functionality I just described. This makes XML parsing even easier and even more uncomplicated. However, with XMLTable, I still do not have enough experience to say anything beyond the already obvious. Therefore, I will limit myself to just a simple demonstration.
I will show on the same example:
34 select master_id
35 ,details_id
36 ,sub_details_id
37 from demo4 s
38 ,XMLTable('master'
39 passing (s.xml)
40 columns master_id varchar2 (20) path 'id'
41 ,details XMLType path 'details/detail') mstr
42 ,XMLTable('detail'
43 passing (mstr.details)
44 columns details_id varchar2 (20) path 'id'
45 ,sub_details XMLType path 'sub_details/sub_detail')(+) dtl
46 ,XMLTable('sub_detail'
47 passing (dtl.sub_details)
48 columns sub_details_id varchar2 (20) path 'id')(+) sub_dtl;
MASTER_ID DETAILS_ID SUB_DETAILS_ID
--------------------------- -------------------- --------------------
mater id detail 1 id sub_detail 1.1 id
mater id detail 1 id sub_detail 1.2 id
mater id detail 2 id sub_detail 2.1 id
mater id detail 2 id sub_detail 2.2 id
mater id detail 3 id
It would seem that there are a lot more letters, a fair question may arise ... but what is the profit of innovation? The profit is that the first parameter in XMLTable is no longer XPath expression, but XQuery. So, the union can be done precisely by its means, and not by SQL means. XMLTable promises to be still that yummy, but, alas, I repeat, I have nothing to tell about it so far.