Channel: SCN : All Content - Data Services and Data Quality

Concatenating Multiple Row Column Values into Single Row Delimited Value

Description

This document describes two approaches to a common problem in day-to-day Data Services development: how to concatenate the values of a column across two or more rows into a single row, with the values presented as a delimited list in a single column.

The solution given here is SQL Server specific though I am sure similar solutions are possible in other databases.

To demonstrate the goal, given the data below in the table PAYMENT_METHOD:-

Master_Source  Primary Key  Company Code  Payment Method
MDS_001        00001        IE01          C
MDS_001        00001        IE01          D

The desired outcome is :-

Master_Source  Primary Key  Company Code  Payment Methods
MDS_001        00001        IE01          C, D
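
To follow along, the sample data can be set up with a short script. This is only a sketch: the table name dbo.PAYMENT_METHOD is the one used by the queries in this document, the first three column types are assumed from the parameter types of the user defined function, the NVARCHAR(1) type for PAYMENT_METHOD is a guess, and the split of the sample key into MDS_001 and 00001 is an assumption.

```sql
-- Sketch only: column types and the key split are assumptions,
-- not taken from the original system.
CREATE TABLE dbo.PAYMENT_METHOD
(
    MASTER_SOURCE  NVARCHAR(20) NOT NULL,
    PRIMARY_KEY    NVARCHAR(50) NOT NULL,
    COMPANY_CODE   NVARCHAR(4)  NOT NULL,
    PAYMENT_METHOD NVARCHAR(1)  NOT NULL
);

-- The two sample rows shown in the table above.
INSERT INTO dbo.PAYMENT_METHOD (MASTER_SOURCE, PRIMARY_KEY, COMPANY_CODE, PAYMENT_METHOD)
VALUES (N'MDS_001', N'00001', N'IE01', N'C'),
       (N'MDS_001', N'00001', N'IE01', N'D');
```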

The two approaches described here are:-

  • Use a User Defined Function that returns a scalar value containing the delimited values
  • Use FOR XML Path

 

References

 

Much of the material here is based upon Aaron Bertrand's excellent piece http://sqlperformance.com/2014/08/t-sql-queries/sql-server-grouped-concatenation

User Defined Function

The user defined function given below uses the SQL Server COALESCE() function to construct a string containing the PAYMENT_METHOD values delimited by [comma][space], returning that string as a scalar.

CREATE FUNCTION dbo.Get_List_ZWELS
(
     @Master_Source nvarchar(20),
     @Primary_Key   nvarchar(50),
     @Company_Code  nvarchar(4)
)
RETURNS NVARCHAR(4000)
WITH SCHEMABINDING
AS
BEGIN
     -- @s starts as NULL and accumulates the delimited list
     DECLARE @s NVARCHAR(4000);

     -- Append each matching PAYMENT_METHOD, separated by comma and space
     SELECT @s = COALESCE(@s + N', ', N'') + PAYMENT_METHOD
     FROM
          dbo.PAYMENT_METHOD
     WHERE
          MASTER_SOURCE = @Master_Source and
          PRIMARY_KEY = @Primary_Key and
          COMPANY_CODE = @Company_Code
     ORDER BY PAYMENT_METHOD;

     RETURN (@s);
END
GO

The COALESCE() function returns the first non-NULL value in the list of expressions given.

Walking through the example data: the first time COALESCE() is evaluated, @s is NULL, so @s + N', ' is also NULL and COALESCE() returns N''. The first PAYMENT_METHOD is concatenated to this, so @s = 'C'.

On the second evaluation @s is not NULL, so COALESCE() returns @s + N', ', that is 'C, '. To this is concatenated 'D', so @s = 'C, D'.

There are only 2 rows in the sample dataset so @s is now returned to the caller.
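
The walkthrough above can be reproduced outside the function with a small batch. This is a minimal sketch: the table variable @methods is a hypothetical stand-in for the two sample rows and is not part of the original solution.

```sql
-- Hypothetical stand-in for the two sample rows (payment method values only).
DECLARE @methods TABLE (PAYMENT_METHOD NVARCHAR(1));
INSERT INTO @methods (PAYMENT_METHOD) VALUES (N'C'), (N'D');

DECLARE @s NVARCHAR(4000);   -- not initialised, so it starts as NULL

-- First row:  @s is NULL, so @s + N', ' is NULL and COALESCE falls back to N''.
-- Second row: @s is no longer NULL, so COALESCE returns @s + N', ' and the
--             next value is appended to it.
SELECT @s = COALESCE(@s + N', ', N'') + PAYMENT_METHOD
FROM @methods
ORDER BY PAYMENT_METHOD;

SELECT @s AS PAYMENT_METHODS;
```

Note that the order in which rows are applied in this kind of multi-row variable assignment is not something SQL Server guarantees, so the ORDER BY should be read as a request, not a promise.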

If we were to execute the SQL

SELECT
     MASTER_SOURCE, PRIMARY_KEY, COMPANY_CODE,
     PAYMENT_METHODS = dbo.Get_List_ZWELS( MASTER_SOURCE, PRIMARY_KEY, COMPANY_CODE )
FROM
     PAYMENT_METHOD

Then what would be returned is:

MASTER_SOURCE  PRIMARY_KEY  COMPANY_CODE  PAYMENT_METHODS
MDS_001        00001        IE01          C, D
MDS_001        00001        IE01          C, D

Simply adding a GROUP BY to this gives us the desired output:-

SELECT
     MASTER_SOURCE, PRIMARY_KEY, COMPANY_CODE,
     PAYMENT_METHODS = dbo.Get_List_ZWELS( MASTER_SOURCE, PRIMARY_KEY, COMPANY_CODE )
FROM
     PAYMENT_METHOD
GROUP BY
     MASTER_SOURCE, PRIMARY_KEY, COMPANY_CODE

Translating this approach into something usable from Data Services is as simple as:-

  • Import the Get_List_ZWELS() function into the Datastore that contains the PAYMENT_METHOD table
  • Create a dataflow using the PAYMENT_METHOD table as a source
  • In the initial Query, group by MASTER_SOURCE, PRIMARY_KEY and COMPANY_CODE, and add the aggregate count(*) to the output schema.
  • In the next Query insert the Get_List_ZWELS() function call into the Query transform passing the parameters MASTER_SOURCE, PRIMARY_KEY and COMPANY_CODE
  • Rename the return value as PAYMENT_METHODS
  • Output to target table.
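
In plain SQL terms, the two Query transforms described above do roughly the following. This is a sketch of the logic only, not SQL generated by Data Services; the alias g and the column name METHOD_COUNT are hypothetical names chosen for illustration.

```sql
-- Logical equivalent of the two Query transforms (illustrative only).
SELECT
    g.MASTER_SOURCE,
    g.PRIMARY_KEY,
    g.COMPANY_CODE,
    -- Second Query: call the imported function once per grouped row.
    dbo.Get_List_ZWELS(g.MASTER_SOURCE, g.PRIMARY_KEY, g.COMPANY_CODE) AS PAYMENT_METHODS
FROM
(
    -- First Query: group by the key columns and carry a COUNT(*).
    SELECT MASTER_SOURCE, PRIMARY_KEY, COMPANY_CODE, COUNT(*) AS METHOD_COUNT
    FROM dbo.PAYMENT_METHOD
    GROUP BY MASTER_SOURCE, PRIMARY_KEY, COMPANY_CODE
) AS g;
```

Grouping first means the function is called once per key combination and not once per source row.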

This will give the desired output of:

MASTER_SOURCE  PRIMARY_KEY  COMPANY_CODE  PAYMENT_METHODS
MDS_001        00001        IE01          C, D

FOR XML Path

The second method uses the SQL Server FOR XML feature.

SELECT
     MASTER_SOURCE, PRIMARY_KEY, COMPANY_CODE,
     PAYMENT_METHODS = STUFF((
          SELECT N', ' + PAYMENT_METHOD
          FROM
               dbo.PAYMENT_METHOD AS p2
          WHERE
               p2.MASTER_SOURCE = p.MASTER_SOURCE and
               p2.PRIMARY_KEY = p.PRIMARY_KEY and
               p2.COMPANY_CODE = p.COMPANY_CODE
          ORDER BY
               PAYMENT_METHOD
          FOR XML PATH(N''), TYPE).value(N'.[1]', N'nvarchar(max)'), 1, 2, N'')
FROM
     dbo.PAYMENT_METHOD AS p
GROUP BY
     MASTER_SOURCE, PRIMARY_KEY, COMPANY_CODE

This gives the same output as the final example in the UDF section above.
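
The query packs several steps into one expression, so it may help to look at the two main pieces separately. The sketch below uses an inline VALUES list as a hypothetical stand-in for the two sample rows.

```sql
-- Step 1: with an unnamed column and an empty PATH name, FOR XML PATH emits
-- the row values back to back with no element tags. Each value carries a
-- leading N', ', so the list starts with a comma and a space.
SELECT N', ' + v.PAYMENT_METHOD
FROM (VALUES (N'C'), (N'D')) AS v (PAYMENT_METHOD)
ORDER BY v.PAYMENT_METHOD
FOR XML PATH(N'');

-- Step 2: STUFF deletes 2 characters starting at position 1 and inserts N''
-- in their place, which strips the leading comma and space.
SELECT STUFF(N', C, D', 1, 2, N'') AS PAYMENT_METHODS;
```

The TYPE directive together with .value() in the full query returns the result as nvarchar(max) and converts XML entities such as &amp; back into the original characters.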

The difference is in the implementation in Data Services.

To use the FOR XML Path in Data Services it is necessary to specify the above SQL in an SQL transform which serves as the source data in a dataflow.

One of the drawbacks of this FOR XML approach is that if the data being processed contains characters that cannot be represented in XML, the SELECT will fail. An example that I have actually experienced is the character x'1A' embedded in the data. This results in the error:

Msg 6841, Level 16, State 1, Line 51
FOR XML could not serialize the data for node 'NoName' because it contains a character (0x001A) which is not allowed in XML. To retrieve this data using FOR XML, convert it to binary, varbinary or image data type and use the BINARY BASE64 directive
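
One possible way around this particular error is to remove the offending character before the values reach FOR XML. The sketch below is an assumption on my part and not part of the original solution: it only handles the single character 0x001A (NCHAR(26)), other characters that are invalid in XML would need the same treatment, and whether REPLACE matches a control character can depend on the collation in use.

```sql
SELECT
    p.MASTER_SOURCE, p.PRIMARY_KEY, p.COMPANY_CODE,
    PAYMENT_METHODS = STUFF((
        -- Strip 0x001A from each value before it is serialized as XML.
        SELECT N', ' + REPLACE(p2.PAYMENT_METHOD, NCHAR(26), N'')
        FROM dbo.PAYMENT_METHOD AS p2
        WHERE p2.MASTER_SOURCE = p.MASTER_SOURCE
          AND p2.PRIMARY_KEY   = p.PRIMARY_KEY
          AND p2.COMPANY_CODE  = p.COMPANY_CODE
        ORDER BY p2.PAYMENT_METHOD
        FOR XML PATH(N''), TYPE).value(N'.[1]', N'nvarchar(max)'), 1, 2, N'')
FROM dbo.PAYMENT_METHOD AS p
GROUP BY p.MASTER_SOURCE, p.PRIMARY_KEY, p.COMPANY_CODE;
```

Cleaning the data at source, where that is an option, avoids having to repeat this in every query.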

Summary

Aaron's article, referenced above, indicates that from a performance perspective the FOR XML approach will outperform the UDF approach.

From a Data Services point of view, I think that the UDF approach is more visible and more maintainable.
