Description
This document describes a couple of approaches to a common problem encountered in day-to-day Data Services development, namely how to concatenate the values of a column across 2 or more rows into a single row with the values given as a delimited set of values in a single column.
The solution given here is SQL Server specific though I am sure similar solutions are possible in other databases.
To demonstrate the goal, given the data below in the table PAYMENT_METHODS:-
Master_Source | Primary Key | Company Code | Payment Method |
---|---|---|---|
MDS_001 | 00001 | IE01 | C |
MDS_001 | 00001 | IE01 | D |
The desired outcome is :-
Master_Source | Primary Key | Company Code | Payment Methods |
---|---|---|---|
MDS_001 | 00001 | IE01 | C, D |
The 2 approaches that will be described here are specific to SQL Server and they are:-
- Use a User Defined Function that returns a scalar value containing the delimited values
- Use FOR XML Path
References
Much of the material here is based upon Aaron Bertrand's excellent piece http://sqlperformance.com/2014/08/t-sql-queries/sql-server-grouped-concatenation
User Defined Function
The user defined function given below uses the SQL Server COALESE() function to construct a string containing the PAYMENT_METHOD values delimited by [comma][space], returning that string as a scalar.
CREATE FUNCTION dbo.Get_List_ZWELS ( @Master_Source nvarchar(20), @Primary_Key nvarchar(50), @Company_Code nvarchar(4) )
RETURNS NVARCHAR(4000) WITH SCHEMABINDING AS
BEGIN
DECLARE @s NVARCHAR(4000);
SELECT @s = COALESCE(@s + N', ', N'') + PAYMENT_METHOD
FROM
dbo.PAYMENT_METHOD
WHERE
MASTER_SOURCE = @Master_Source and
PRIMARY_KEY = @Primary_Key and
COMPANY_CODE = @Company_Code
ORDER BY PAYMENT_METHOD RETURN (@s);
END
GO
The COALESE() function will return the first non-NULL value encountered in the list of expressions given.
So walking through the example data, the first time the COALESE() is executed @s is NULL so COALESE() returns N'' to this is concatenated the first PAYMENT_METHOD, so @s = 'C'.
On the second execution @s is not NULL so COALESE() returns @s + N', ', to this is concatenated 'D'. So @s = 'C, D'.
There are only 2 rows in the sample dataset so @s is now returned to the caller.
If we were to execute the SQL
SELECT
MASTER_SOURCE, PRIMARY_KEY, COMPANY_CODE,
PAYMENT_METHODS = dbo.Get_List_ZWELS( MASTER_SOURCE, PRIMARY_KEY, COMPANY_CODE )
FROM
PAYMENT_METHOD
Then what would be returned is:
MASTER_SOURCE | PRIMARY_KEY | COMPANY_CODE | PAYMENT_METHODS |
---|---|---|---|
MDS_001 | 00001 | IE01 | C, D |
MDS_001 | 00001 | IE01 | C, D |
Simply adding a GROUP BY to this gives us the desired output:-
SELECT
MASTER_SOURCE, PRIMARY_KEY, COMPANY_CODE, PAYMENT_METHODS = dbo.Get_List_ZWELS( MASTER_SOURCE, PRIMARY_KEY, COMPANY_CODE )
FROM
PAYMENT_METHOD
group by
MASTER_SOURCE, PRIMARY_KEY, COMPANY_CODE
Translating this approach into something usable from Data Services is as simple as:-
- Import the
Get_List_ZWELS()
function into the Datastore that contains the PAYMENT_METHOD table - Create a dataflow using the PAYMENT_METHOD table as a source
- Add a group by to the initial Query, group by MASTER_SOURCE, PRIMARY_KEY and COMPANY_CODE and add the aggregating count(*) to the output schema.
- In the next Query insert the
Get_List_ZWELS()
function call into the Query transform passing the parameters MASTER_SOURCE, PRIMARY_KEY and COMPANY_CODE - Rename the return value as PAYMENT_METHODS
- Output to target table.
This will give the desired output of:
MASTER_SOURCE | PRIMARY_KEY | COMPANY_CODE | PAYMENT_METHODS |
---|---|---|---|
MDS_001 | 00001 | IE01 | C, D |
FOR XML Path
The second method uses the SQL Server FOR XML feature.
SELECT
MASTER_SOURCE, PRIMARY_KEY, COMPANY_CODE, PAYMENT_METHODS = STUFF((SELECT N', ' + PAYMENT_METHOD
FROM
dbo.PAYMENT_METHOD AS p2
WHERE
p2.MASTER_SOURCE = p.MASTER_SOURCE and
p2.PRIMARY_KEY = p.PRIMARY_KEY and
p2.COMPANY_CODE = p.COMPANY_CODE
ORDER BY
PAYMENT_METHOD FOR XML PATH(N''), TYPE).value(N'.[1]', N'nvarchar(max)'), 1, 2, N'')
FROM dbo.PAYMENT_METHOD AS p GROUP BY MASTER_SOURCE, PRIMARY_KEY, COMPANY_CODE
Gives the same output as the final example in the UDF section above.
The difference is in the implementation in Data Services.
To use the FOR XML Path in Data Services it is necessary to specify the above SQL in an SQL transform which serves as the source data in a dataflow.
One of the drawbacks of this FOR XML approach is that if the data being processed contains character(s) that it is not possible to represent in XML then the SELECT will fail. An example of this that I have actually experienced is the character x'1A' embedded in the data. This will result in the error:
Msg 6841, Level 16, State 1, Line 51
FOR XML could not serialize the data for node 'NoName' because it contains a character (0x001A) which is not allowed in XML. To retrieve this data using FOR XML, convert it to binary, varbinary or image data type and use the BINARY BASE64 directive
Summary
Aaron's article - referenced above - indicates that from a performance perspective that the FOR XML approach will out perform the UDF approach.
From a Data Services point of view, I think that the UDF approach is more visible and more maintainable.