String aggregation in the SQL Server world

  • Tutorial
In practice, the tasks of combining strings into one come across quite often. It is very sad, but the T-SQL standard does not provide for the possibility of using string data inside the SUM aggregation function :

Msg 8117, Level 16, State 1, Line 1
Operand data type char is invalid for sum operator.


Although the solution of such problems for MySQL function has been added GROUP_CONCAT , and the Oracle - LISTAGG . In turn, SQL Server does not yet have such built-in functionality.

However, you should not consider this as a drawback, since the capabilities of T-SQLallow string concatenation to be performed more flexibly and efficiently through the use of other constructs, which will be discussed later.

Suppose that we need to combine several lines into a single line, using the data from the following table:

IF OBJECT_ID('dbo.Chars', 'U') IS NOT NULL
    DROP TABLE dbo.Chars
GO
CREATE TABLE dbo.Chars ([Char] CHAR(1) PRIMARY KEY)
INSERT INTO dbo.Chars ([Char])
VALUES ('A'), ('B'), ('C'), ('F'), ('D')

The most obvious solution to this problem is to use the cursor:

DECLARE 
      @Chars VARCHAR(100)
    , @Char CHAR(1)
DECLARE cur CURSOR LOCAL READ_ONLY FAST_FORWARD FOR
    SELECT [Char]
    FROM dbo.Chars
OPEN cur
FETCH NEXT FROM cur INTO @Char
WHILE @@FETCH_STATUS = 0 BEGIN
    SET @Chars = ISNULL(@Chars + ', ' + @Char, @Char) 
    FETCH NEXT FROM cur INTO @Char
END
CLOSE cur
DEALLOCATE cur
SELECT @Chars

However, its use reduces the efficiency of query execution and, at least, does not look too elegant.

To get rid of it, you can concatenate strings by assigning variables:

DECLARE @Chars VARCHAR(100)
SELECT @Chars = ISNULL(@Chars + ', ' + [Char], [Char])   
FROM dbo.Chars
SELECT @Chars

On the one hand, the design turned out to be very simple, on the other hand, its performance on a large sample leaves much to be desired.

For row aggregation, it is also possible to do it through XML using the following construction:

SELECT Chars = STUFF((
	SELECT ', ' + [Char]
	FROM dbo.Chars
	FOR XML PATH(''), TYPE).value('.', 'VARCHAR(MAX)'), 1, 2, '')

If you look at the plan, you will notice the high cost of calling the value method :



To get rid of this operation, you can rewrite the query using the XQuery properties :

SELECT Chars = STUFF(CAST((
	SELECT [text()] = ', ' + [Char]
	FROM dbo.Chars
	FOR XML PATH(''), TYPE) AS VARCHAR(100)), 1, 2, '')

As a result, we get a very simple and quick execution plan:



In principle, concatenating rows in a single column does not cause much difficulty.

A more interesting situation is when you want to perform concatenation on multiple columns at once. For example, we have the following table:

IF OBJECT_ID('dbo.EntityValues', 'U') IS NOT NULL
    DROP TABLE dbo.EntityValues
GO
CREATE TABLE dbo.EntityValues (
      EntityID INT
    , Value1 CHAR(1)
    , Value2 CHAR(1) 
)
CREATE NONCLUSTERED INDEX IX_WorkOut_EntityID 
	ON dbo.EntityValues (EntityID)
GO
INSERT INTO dbo.EntityValues (EntityID, Value1, Value2)
VALUES (1, 'A', 'X'), (2, 'B', 'Y'), (2, 'C', 'Z'), (2, 'F', 'H'), (1, 'D', 'R')

in which it is necessary to group the data as follows:



Alternatively, you can copy the XML calls , but then we get duplicate reads that can significantly affect the efficiency of the query:

SELECT 
      ev.EntityID
    , Values1 = STUFF(CAST((
        SELECT [text()] = ', ' + ev2.Value1
        FROM dbo.EntityValues ev2
        WHERE ev2.EntityID = ev.EntityID
        FOR XML PATH(''), TYPE) AS VARCHAR(100)), 1, 2, '')
    , Values2 = STUFF(CAST((
        SELECT [text()] = ', ' + ev2.Value2
        FROM dbo.EntityValues ev2
        WHERE ev2.EntityID = ev.EntityID
        FOR XML PATH(''), TYPE) AS VARCHAR(100)), 1, 2, '')             
FROM ( 
    SELECT DISTINCT EntityID
    FROM dbo.EntityValues
) ev

You can easily see this if you look at the execution plan:



To reduce repeated readings, you can use a small XML hack:

SELECT
      ev.EntityID
    , Values1 = STUFF(REPLACE(
            CAST([XML].query('for $a in /a return xs:string($a)') AS VARCHAR(100)), ' ,', ','), 1, 1, '') 
    , Values2 = STUFF(REPLACE(
            CAST([XML].query('for $b in /b return xs:string($b)') AS VARCHAR(100)), ' ,', ','), 1, 1, '') 
FROM (
    SELECT DISTINCT EntityID
    FROM dbo.EntityValues
) ev
CROSS APPLY (
    SELECT [XML] = CAST((
        SELECT 
              [a] = ', ' + ev2.Value1
            , [b] = ', ' + ev2.Value2
        FROM dbo.EntityValues ev2
        WHERE ev2.EntityID = ev.EntityID
        FOR XML PATH('')
    ) AS XML)
) t

But this query will also not be optimal due to repeated calling of the query method .

You can use the cursor:

IF OBJECT_ID('tempdb.dbo.#EntityValues') IS NOT NULL
	DROP TABLE #EntityValues
GO
SELECT DISTINCT
	  EntityID
	, Values1 = CAST(NULL AS VARCHAR(100))
	, Values2 = CAST(NULL AS VARCHAR(100))
INTO #EntityValues
FROM dbo.EntityValues
DECLARE
	  @EntityID INT
	, @Value1 CHAR(1)
	, @Value2 CHAR(1)
DECLARE cur CURSOR LOCAL READ_ONLY FAST_FORWARD FOR
    SELECT
	      EntityID
	    , Value1
	    , Value2
    FROM dbo.EntityValues
OPEN cur
FETCH NEXT FROM cur INTO
	  @EntityID
	, @Value1
	, @Value2
WHILE @@FETCH_STATUS = 0 BEGIN
    UPDATE #EntityValues
    SET 
          Values1 = ISNULL(Values1 + ', ' + @Value1, @Value1) 
        , Values2 = ISNULL(Values2 + ', ' + @Value2, @Value2)
    WHERE EntityID = @EntityID
	FETCH NEXT FROM cur INTO
          @EntityID
        , @Value1
        , @Value2
END
CLOSE cur
DEALLOCATE cur
SELECT *
FROM #EntityValues

However, as practice has shown, when working with large ETL packages, the most productive solution is the ability to assign variables in the UPDATE construct :

IF OBJECT_ID('tempdb.dbo.#EntityValues') IS NOT NULL
	DROP TABLE #EntityValues
GO
DECLARE 
      @Values1 VARCHAR(100)
    , @Values2 VARCHAR(100)
SELECT 
      EntityID
    , Value1
    , Value2
    , RowNum = ROW_NUMBER() OVER (PARTITION BY EntityID ORDER BY 1/0)
    , Values1 = CAST(NULL AS VARCHAR(100))
    , Values2 = CAST(NULL AS VARCHAR(100))
INTO #EntityValues
FROM dbo.EntityValues
UPDATE #EntityValues
SET 
      @Values1 = Values1 =
        CASE WHEN RowNum = 1 
            THEN Value1
            ELSE @Values1 + ', ' + Value1 
        END
    , @Values2 = Values2 = 
        CASE WHEN RowNum = 1 
            THEN Value2
            ELSE @Values2 + ', ' + Value2 
        END
SELECT
      EntityID
    , Values1 = MAX(Values1) 
    , Values2 = MAX(Values2)
FROM #EntityValues
GROUP BY EntityID

SQL Server does not have a built-in analogue of the GROUP_CONCAT and LISTAGG functions . However, this does not interfere, depending on the situation, to efficiently perform string concatenation tasks. The purpose of this post is to demonstrate this.

Also popular now: