XML, XQuery, and triple sadness with performance

  • Tutorial
A trip to Dnepropetrovsk for a meeting with the Dnepr SQL User Group , a chronic lack of sleep over the last couple of days, but a nice bonus on arrival to Kharkov ... Winter weather, which motivates to write something interesting ...

For a long time there were plans to tell about the "pitfalls" when working with XML and XQuery , which can lead to tricky performance issues.

For those who often use SQL Server , XQuery and like to parse values ​​from XML, it is recommended to familiarize yourself with the following material ...

First, we will generate a test XML on which we will conduct experiments:

USE AdventureWorks2012
GO
IF OBJECT_ID('tempdb.dbo.##temp') IS NOT NULL
    DROP TABLE ##temp
GO
SELECT val = (
    SELECT
          [@obj_id] = o.[object_id]
        , [@obj_name] = o.name
        , [@sch_name] = s.name
        , (
            SELECT i.name, i.column_id, i.user_type_id, i.is_nullable, i.is_identity
            FROM sys.all_columns i
            WHERE i.[object_id] = o.[object_id]
            FOR XML AUTO, TYPE
        )
    FROM sys.all_objects o
    JOIN sys.schemas s ON o.[schema_id] = s.[schema_id]
    WHERE o.[type] IN ('U', 'V')
    FOR XML PATH('obj'), ROOT('objects')
)
INTO ##temp
DECLARE @sql NVARCHAR(4000) = 'bcp "SELECT * FROM ##temp" queryout "D:\sample.xml" -S ' + @@servername + ' -T -w -r -t'
EXEC sys.xp_cmdshell @sql
IF OBJECT_ID('tempdb.dbo.##temp') IS NOT NULL
    DROP TABLE ##temp

For those who have xp_cmdshell disabled, you need to do:

EXEC sp_configure 'show advanced options', 1
GO
RECONFIGURE
GO
EXEC sp_configure 'xp_cmdshell', 1
GO
RECONFIGURE
GO

As a result, on the specified path, we will create a file with the following structure:


  ...
  
    ...
  
  ...

Now we’ll start sampling experiments ...

What is the most efficient way to load data from XML ? Probably, you do not need to open the file with notepad, copy the contents and paste into a variable ... I think that it would be more appropriate to use OPENROWSET :

DECLARE @xml XML
SELECT @xml = BulkColumn
FROM OPENROWSET(BULK 'D:\sample.xml', SINGLE_BLOB) x
SELECT @xml

But there is a funny catch. As it turned out, the combination of loading operations and parsing values ​​from XML can lead to a significant decrease in performance. Suppose we need to get obj_id values from a previously created file:

;WITH cte AS 
(
    SELECT x = CAST(BulkColumn AS XML)
    FROM OPENROWSET(BULK 'D:\sample.xml', SINGLE_BLOB) x
)
SELECT t.c.value('@obj_id', 'INT')
FROM cte
CROSS APPLY x.nodes('objects/obj') t(c)

On my machine, this request takes a very long time:

(495 row(s) affected)
Table 'Worktable'. Scan count 0, logical reads 20788, ..., lob logical reads 7817781, ..., lob read-ahead reads 1022368.
 SQL Server Execution Times:
   CPU time = 53688 ms,  elapsed time = 53911 ms.

Let's try to separate download and parsing:

DECLARE @xml XML
SELECT @xml = BulkColumn
FROM OPENROWSET(BULK 'D:\sample.xml', SINGLE_BLOB) x
SELECT t.c.value('@obj_id', 'INT')
FROM @xml.nodes('objects/obj') t(c)

Everything worked out very quickly:

(1 row(s) affected)
Table 'Worktable'. Scan count 0, logical reads 7, ..., lob logical reads 2691, ..., lob read-ahead reads 344.
 SQL Server Execution Times:
   CPU time = 15 ms,  elapsed time = 51 ms.
(495 row(s) affected)
 SQL Server Execution Times:
   CPU time = 47 ms,  elapsed time = 125 ms.

So what was the problem? Let's analyze the execution plan:



As it turned out, the problem lies in type conversion, so try to pass the parameter in the XML type to the nodes function initially . Next, we will consider a typical situation when parsing requires filtering ... In such cases, remember that SQL Server does not optimize function calls for working with XML . For clarity, I will show that in this request, the value function will be executed twice:





SELECT t.c.value('@obj_id', 'INT')
FROM @xml.nodes('objects/obj') t(c)
WHERE t.c.value('@obj_id', 'INT') < 0

(404 row(s) affected)
 SQL Server Execution Times:
   CPU time = 116 ms,  elapsed time = 120 ms.



This nuance can reduce performance, so it is recommended to reduce function calls:

SELECT *
FROM (
    SELECT id = t.c.value('@obj_id', 'INT')
    FROM @xml.nodes('objects/obj') t(c)
) t
WHERE t.id < 0

(404 row(s) affected)
 SQL Server Execution Times:
   CPU time = 62 ms,  elapsed time = 74 ms.



Alternatively, you can filter it like this:

SELECT t.c.value('@obj_id', 'INT')
FROM @xml.nodes('objects/obj[@obj_id < 0]') t(c)

(404 row(s) affected)
 SQL Server Execution Times:
   CPU time = 110 ms,  elapsed time = 119 ms.

but we can’t talk about substantial gains. Although QueryCost says the opposite:



It is shown that the third option is the most optimal ... Let this be another argument not to trust QueryCost , which is just an internal assessment.

And the most interesting example for a snack ... There is another VERY important feature when parsing from XML . Run the request:

SELECT
      t.c.value('../@obj_name', 'SYSNAME')
    , t.c.value('@name', 'SYSNAME')
FROM @xml.nodes('objects/obj/*') t(c)

and look at the runtime, which can only suit those who are in no hurry:

(5273 row(s) affected)
 SQL Server Execution Times:
   CPU time = 66578 ms,  elapsed time = 66714 ms.

Why is this happening? The SQL Server server has problems reading parent nodes from child nodes (to put it simply, SQL Server is hard to “look back”):



How can we be in this case? It's very simple ... start reading from the parent nodes and subtract the children using CROSS / OUTER APPLY :

SELECT
      t.c.value('@obj_name', 'SYSNAME')
    , t2.c2.value('@name', 'SYSNAME')
FROM @xml.nodes('objects/obj') t(c)
CROSS APPLY t.c.nodes('*') t2(c2)

(5273 row(s) affected)
 SQL Server Execution Times:
   CPU time = 156 ms,  elapsed time = 184 ms.

It is also interesting to consider a situation where we need to look at 2 levels above. The problem with reading the parent element did not reproduce for me:

USE AdventureWorks2012
GO
DECLARE @xml XML
SELECT @xml = (
    SELECT
          [@obj_name] = o.name
        , [columns] = (
            SELECT i.name
            FROM sys.all_columns i
            WHERE i.[object_id] = o.[object_id]
            FOR XML AUTO, TYPE
        )
    FROM sys.all_objects o
    WHERE o.[type] IN ('U', 'V')
    FOR XML PATH('obj')
)
SELECT
      t.c.value('../../@obj_name', 'SYSNAME')
    , t.c.value('@name', 'SYSNAME')
FROM @xml.nodes('obj/columns/*') t(c)

I also wanted to mention one interesting feature. OPENXML has no problems reading parents :

DECLARE
      @xml XML
    , @idoc INT
SELECT @xml = BulkColumn
FROM OPENROWSET(BULK 'D:\sample.xml', SINGLE_BLOB) x
EXEC sys.sp_xml_preparedocument @idoc OUTPUT, @xml 
SELECT *
FROM OPENXML(@idoc, '/objects/obj/*') 
WITH (
    name  SYSNAME '../@obj_name', 
    col   SYSNAME '@name'
)
EXEC sys.sp_xml_removedocument @idoc

(5273 row(s) affected)
 SQL Server Execution Times:
   CPU time = 47 ms,  elapsed time = 137 ms.

But you don’t need to think now that OPENXML has clear advantages over XQuery . In OPENXML too short stocks. For example, if we forget to call sp_xml_removedocument , then severe memory leaks may occur.

Everything was tested on SQL Server 2012 SP3 (11.00.6020) .

If you want to share this article with an English-speaking audience:
XML, XQuery & Performance Issues

Also popular now: