Deduplication or extracting unique elements using XSL

When searching for an efficient way to de-duplicate elements in a BPEL process, I found several blogs and communities on the web showing examples of how to use the XSL key and generate-id functions to extract unique elements. However, none of them explained the science behind the magic. Here is a more elaborate explanation of how extracting unique elements from an XML message, based on a chosen combination of key elements, works.

All XSL examples and XML results mentioned in this article are based on the file topgear.xml.

<?xml version="1.0" encoding="UTF-8"?>
<ListOfEpisodes xmlns="http://www.petervannes.nl/ListOfEpisodes"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.petervannes.nl/ListOfEpisodes Episodes.xsd">
    
    <Episode>
        <Series>1</Series>
        <EpisodeID>1</EpisodeID>
        <Presenters>Jeremy Clarkson</Presenters>
        <Presenters>Richard Hammond</Presenters>
        <Presenters>James May</Presenters>
        <Presenters>The Stig</Presenters>
        <Guests>Harry Enfield</Guests>
        <Reviews>Citroën Berlingo</Reviews>
        <Reviews>Pagani Zonda</Reviews>
        <Reviews>Lamborghini Murciélago</Reviews>
        <Reviews>Mazda 6</Reviews>
    </Episode>
    <Episode>
        <Series>1</Series>
        <EpisodeID>1</EpisodeID>
        <Presenters>Jeremy Clarkson</Presenters>
        <Presenters>Richard Hammond</Presenters>
        <Presenters>James May</Presenters>
        <Presenters>The Stig</Presenters>
        <Guests>Harry Enfield</Guests>
        <Reviews>Citroën Berlingo</Reviews>
        <Reviews>Pagani Zonda</Reviews>
        <Reviews>Lamborghini Murciélago</Reviews>
        <Reviews>Mazda 6</Reviews>
    </Episode>
    <Episode>
        <Series>1</Series>
        <EpisodeID>2</EpisodeID>
        <Presenters>James May</Presenters>
        <Presenters>The Stig</Presenters>
        <Guests>Harry Enfield</Guests>
        <Reviews>Ford Focus RS</Reviews>
        <Reviews>Noble M12 GTO</Reviews>
    </Episode>
    <Episode>
        <Series>1</Series>
        <EpisodeID>3</EpisodeID>
        <Presenters>Jeremy Clarkson</Presenters>
        <Presenters>Richard Hammond</Presenters>
        <Presenters>James May</Presenters>
        <Presenters>The Stig</Presenters>
        <Guests>Ross Kemp</Guests>
        <Reviews>Mini One</Reviews>
        <Reviews>Toyota Yaris Verso</Reviews>
        <Reviews>Citroën DS</Reviews>
        <Reviews>BMW M1</Reviews>
        <Reviews>BMW M3</Reviews>
        <Reviews>Westfield XTR</Reviews>
        <Reviews>Aston Martin DB7</Reviews>
    </Episode>
    <Episode>
        <Series>2</Series>
        <EpisodeID>1</EpisodeID>
        <Presenters>Jeremy Clarkson</Presenters>
        <Presenters>Richard Hammond</Presenters>
        <Guests>Vinnie Jones</Guests>
        <Reviews>Smart Roadster</Reviews>
        <Reviews>Volkswagen Beetle Cabriolet</Reviews>
        <Reviews>Bowler Wildcat</Reviews>
        <Reviews>Bentley T2</Reviews>
    </Episode>
    <Episode>
        <Series>2</Series>
        <EpisodeID>2</EpisodeID>
        <Presenters>Jeremy Clarkson</Presenters>
        <Presenters>Richard Hammond</Presenters>
        <Presenters>James May</Presenters>
        <Guests>Jamie Oliver</Guests>
        <Reviews>Rolls-Royce Phantom</Reviews>
        <Reviews>Rover P5</Reviews>
        <Reviews>BMW M3</Reviews>
        <Reviews>Audi S4</Reviews>
    </Episode>
    <Episode>
        <Series>2</Series>
        <EpisodeID>3</EpisodeID>
        <Presenters>Jeremy Clarkson</Presenters>
        <Presenters>Richard Hammond</Presenters>
        <Presenters>James May</Presenters>
        <Guests>Jamie Oliver</Guests>
        <Reviews>Zastava 101</Reviews>
        <Reviews>Fiat 128</Reviews>
        <Reviews>BMW M3</Reviews>
        <Reviews>Audi S6</Reviews>
    </Episode>
    <Episode>
        <Series>3</Series>
        <EpisodeID>1</EpisodeID>
        <Presenters>Jeremy Clarkson</Presenters>
        <Presenters>James May</Presenters>
        <Presenters>The Stig</Presenters>
        <Guests>Martin Kemp</Guests>
        <Reviews>Ford GT</Reviews>
        <Reviews>BMW 5-Series</Reviews>
        <Reviews>Rover P5</Reviews>
        <Reviews>Porsche 996 GT3</Reviews>
    </Episode>
    <Episode>
        <Series>3</Series>
        <EpisodeID>2</EpisodeID>
        <Presenters>Jeremy Clarkson</Presenters>
        <Presenters>Richard Hammond</Presenters>
        <Presenters>The Stig</Presenters>
        <Guests>Stephen Fry</Guests>
        <Reviews>BMW M3 CSL</Reviews>
        <Reviews>BMW M1</Reviews>
        <Reviews>BMW M3</Reviews>
        <Reviews>BMW M5</Reviews>
        <Reviews>Porsche Boxster</Reviews>
        <Reviews>BMW Z4</Reviews>
        <Reviews>Honda S2000</Reviews>
    </Episode>
</ListOfEpisodes>

The generate-id function returns a string that uniquely identifies a specific node; for an empty node-set it returns an empty string. A node-set can be supplied as an argument, in which case the id of its first node in document order is returned; when no argument is supplied, the current context node is used. The id returned by the generate-id function is not based on the content of the node: two nodes with exactly the same content result in different ids. The identifier for a node stays the same during a transformation, irrespective of where in the transformation it is generated, but it is not guaranteed to be the same in a subsequent transformation. The example XSL below, Example_generate-id.xsl, adds the attributes prevGeneratedID and generatedID, containing the generated id of the previous and the current Episode element respectively.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
    xmlns:eps="http://www.petervannes.nl/ListOfEpisodes">

    <xsl:template match="/">
        <xsl:element name="ListOfEpisodes">
            <xsl:for-each select="//eps:Episode">
                <xsl:element name="Episode">
                    <xsl:attribute name="prevGeneratedID">
                        <xsl:value-of select="generate-id((preceding-sibling::*)[last()])"/>
                    </xsl:attribute>                    
                    <xsl:attribute name="generatedID">
                        <xsl:value-of select="generate-id(.)"/>
                    </xsl:attribute>
                    <xsl:for-each select="./child::*">
                        <xsl:element name="{name(.)}">
                            <xsl:value-of select="node()"/>
                        </xsl:element>
                    </xsl:for-each>
                </xsl:element>
            </xsl:for-each>
        </xsl:element>
    </xsl:template>
</xsl:stylesheet>

The output of the transformation shows that a different id is generated for each of the first two Episode nodes, although both nodes contain exactly the same data. The attribute prevGeneratedID contains the generated id of the preceding Episode node, which is the same value as the attribute generatedID of that preceding node.

<?xml version="1.0" encoding="utf-8"?>
<ListOfEpisodes>
    <Episode prevGeneratedID="" generatedID="d0e3">
        <Series>1</Series>
        <EpisodeID>1</EpisodeID>
        <Presenters>Jeremy Clarkson</Presenters>
        <Presenters>Richard Hammond</Presenters>
        <Presenters>James May</Presenters>
        <Presenters>The Stig</Presenters>
        <Guests>Harry Enfield</Guests>
        <Reviews>Citroën Berlingo</Reviews>
        <Reviews>Pagani Zonda</Reviews>
        <Reviews>Lamborghini Murciélago</Reviews>
        <Reviews>Mazda 6</Reviews>
    </Episode>
    <Episode prevGeneratedID="d0e3" generatedID="d0e39">
        <Series>1</Series>
        <EpisodeID>1</EpisodeID>
        <Presenters>Jeremy Clarkson</Presenters>
        <Presenters>Richard Hammond</Presenters>
        <Presenters>James May</Presenters>
        <Presenters>The Stig</Presenters>
        <Guests>Harry Enfield</Guests>
        <Reviews>Citroën Berlingo</Reviews>
        <Reviews>Pagani Zonda</Reviews>
        <Reviews>Lamborghini Murciélago</Reviews>
        <Reviews>Mazda 6</Reviews>
    </Episode>
    <Episode prevGeneratedID="d0e39" generatedID="d0e75">
        <Series>1</Series>
        <EpisodeID>2</EpisodeID>
        <Presenters>James May</Presenters>
        <Presenters>The Stig</Presenters>
        <Guests>Harry Enfield</Guests>
        <Reviews>Ford Focus RS</Reviews>
        <Reviews>Noble M12 GTO</Reviews>
    </Episode>
    <Episode prevGeneratedID="d0e75" generatedID="d0e99">
        <Series>1</Series>
        <EpisodeID>3</EpisodeID>
        <Presenters>Jeremy Clarkson</Presenters>
        <Presenters>Richard Hammond</Presenters>
        <Presenters>James May</Presenters>
        <Presenters>The Stig</Presenters>
        <Guests>Ross Kemp</Guests>
        <Reviews>Mini One</Reviews>
        <Reviews>Toyota Yaris Verso</Reviews>
        <Reviews>Citroën DS</Reviews>
        <Reviews>BMW M1</Reviews>
        <Reviews>BMW M3</Reviews>
        <Reviews>Westfield XTR</Reviews>
        <Reviews>Aston Martin DB7</Reviews>
    </Episode>
    <Episode prevGeneratedID="d0e99" generatedID="d0e144">
        <Series>2</Series>
        <EpisodeID>1</EpisodeID>
        <Presenters>Jeremy Clarkson</Presenters>
        <Presenters>Richard Hammond</Presenters>
        <Guests>Vinnie Jones</Guests>
        <Reviews>Smart Roadster</Reviews>
        <Reviews>Volkswagen Beetle Cabriolet</Reviews>
        <Reviews>Bowler Wildcat</Reviews>
        <Reviews>Bentley T2</Reviews>
    </Episode>
    <Episode prevGeneratedID="d0e144" generatedID="d0e174">
        <Series>2</Series>
        <EpisodeID>2</EpisodeID>
        <Presenters>Jeremy Clarkson</Presenters>
        <Presenters>Richard Hammond</Presenters>
        <Presenters>James May</Presenters>
        <Guests>Jamie Oliver</Guests>
        <Reviews>Rolls-Royce Phantom</Reviews>
        <Reviews>Rover P5</Reviews>
        <Reviews>BMW M3</Reviews>
        <Reviews>Audi S4</Reviews>
    </Episode>
    <Episode prevGeneratedID="d0e174" generatedID="d0e207">
        <Series>2</Series>
        <EpisodeID>3</EpisodeID>
        <Presenters>Jeremy Clarkson</Presenters>
        <Presenters>Richard Hammond</Presenters>
        <Presenters>James May</Presenters>
        <Guests>Jamie Oliver</Guests>
        <Reviews>Zastava 101</Reviews>
        <Reviews>Fiat 128</Reviews>
        <Reviews>BMW M3</Reviews>
        <Reviews>Audi S6</Reviews>
    </Episode>
    <Episode prevGeneratedID="d0e207" generatedID="d0e240">
        <Series>3</Series>
        <EpisodeID>1</EpisodeID>
        <Presenters>Jeremy Clarkson</Presenters>
        <Presenters>James May</Presenters>
        <Presenters>The Stig</Presenters>
        <Guests>Martin Kemp</Guests>
        <Reviews>Ford GT</Reviews>
        <Reviews>BMW 5-Series</Reviews>
        <Reviews>Rover P5</Reviews>
        <Reviews>Porsche 996 GT3</Reviews>
    </Episode>
    <Episode prevGeneratedID="d0e240" generatedID="d0e273">
        <Series>3</Series>
        <EpisodeID>2</EpisodeID>
        <Presenters>Jeremy Clarkson</Presenters>
        <Presenters>Richard Hammond</Presenters>
        <Presenters>The Stig</Presenters>
        <Guests>Stephen Fry</Guests>
        <Reviews>BMW M3 CSL</Reviews>
        <Reviews>BMW M1</Reviews>
        <Reviews>BMW M3</Reviews>
        <Reviews>BMW M5</Reviews>
        <Reviews>Porsche Boxster</Reviews>
        <Reviews>BMW Z4</Reviews>
        <Reviews>Honda S2000</Reviews>
    </Episode>
</ListOfEpisodes>
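
Because the generated id depends on the identity of a node and not on its content, comparing two generated ids is a way to test whether two expressions select the same node. Below is a minimal sketch of that idea, assuming topgear.xml as input; the output element names IdentityCheck, SameNode and SameContent are made up for this illustration. For topgear.xml it should report false for SameNode and true for SameContent.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
    xmlns:eps="http://www.petervannes.nl/ListOfEpisodes">

    <xsl:template match="/">
        <!-- The first and second Episode element of the source document -->
        <xsl:variable name="first" select="//eps:Episode[1]"/>
        <xsl:variable name="second" select="//eps:Episode[2]"/>
        <xsl:element name="IdentityCheck">
            <!-- Compares node identity: these are two different nodes -->
            <xsl:element name="SameNode">
                <xsl:value-of select="generate-id($first) = generate-id($second)"/>
            </xsl:element>
            <!-- Compares text content, ignoring differences in whitespace -->
            <xsl:element name="SameContent">
                <xsl:value-of select="normalize-space($first) = normalize-space($second)"/>
            </xsl:element>
        </xsl:element>
    </xsl:template>
</xsl:stylesheet>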

The xsl:key top-level element defines a key which can be used by the key function to select nodes on key values. The xsl:key element has three attributes: name, match and use. The name attribute defines the identifier of the key; its value is a QName. The match attribute holds a pattern selecting the nodes to which the key applies. The use attribute is an expression which specifies the key value; it is evaluated once for each node matching the pattern in the match attribute, with that node as the context node. When the expression returns a node-set, the string value of every node in it is a key value for the matched node, so a single node can have multiple key values. The key function requires two arguments. The first argument is the name of the key, which must correspond with the value of the name attribute of an xsl:key top-level element. The second argument is the key value to look up. The key function returns a node-set containing all nodes, in the same document as the context node, that have the supplied value for the named key.

In the following example XSL, Example_xsl-key.xsl, the key EpisodesOnReviews is defined for Episode nodes, based on the Reviews element. The key function uses the EpisodesOnReviews key to select all Episode nodes with the key value 'BMW M3'.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
    xmlns:eps="http://www.petervannes.nl/ListOfEpisodes">

    <!-- Define the key EpisodesOnReviews -->
    <xsl:key name="EpisodesOnReviews" match="eps:Episode" use="eps:Reviews"/>

    <xsl:template match="/">
        <xsl:element name="ListOfEpisodes">
            <!-- Select all Episode nodes where Reviews='BMW M3' -->
            <xsl:for-each select="key('EpisodesOnReviews','BMW M3')">
                <xsl:element name="Episode">
                    <xsl:for-each select="./child::*">
                        <xsl:element name="{name(.)}">
                            <xsl:value-of select="node()"/>
                        </xsl:element>
                    </xsl:for-each>
                </xsl:element>
            </xsl:for-each>
        </xsl:element>
    </xsl:template>
</xsl:stylesheet>

This transformation results in a ListOfEpisodes element with the Episode elements that contain a Reviews element with the value 'BMW M3'.

<?xml version="1.0" encoding="utf-8"?>
<ListOfEpisodes>
    <Episode>
        <Series>1</Series>
        <EpisodeID>3</EpisodeID>
        <Presenters>Jeremy Clarkson</Presenters>
        <Presenters>Richard Hammond</Presenters>
        <Presenters>James May</Presenters>
        <Presenters>The Stig</Presenters>
        <Guests>Ross Kemp</Guests>
        <Reviews>Mini One</Reviews>
        <Reviews>Toyota Yaris Verso</Reviews>
        <Reviews>Citroën DS</Reviews>
        <Reviews>BMW M1</Reviews>
        <Reviews>BMW M3</Reviews>
        <Reviews>Westfield XTR</Reviews>
        <Reviews>Aston Martin DB7</Reviews>
    </Episode>
    <Episode>
        <Series>2</Series>
        <EpisodeID>2</EpisodeID>
        <Presenters>Jeremy Clarkson</Presenters>
        <Presenters>Richard Hammond</Presenters>
        <Presenters>James May</Presenters>
        <Guests>Jamie Oliver</Guests>
        <Reviews>Rolls-Royce Phantom</Reviews>
        <Reviews>Rover P5</Reviews>
        <Reviews>BMW M3</Reviews>
        <Reviews>Audi S4</Reviews>
    </Episode>
    <Episode>
        <Series>2</Series>
        <EpisodeID>3</EpisodeID>
        <Presenters>Jeremy Clarkson</Presenters>
        <Presenters>Richard Hammond</Presenters>
        <Presenters>James May</Presenters>
        <Guests>Jamie Oliver</Guests>
        <Reviews>Zastava 101</Reviews>
        <Reviews>Fiat 128</Reviews>
        <Reviews>BMW M3</Reviews>
        <Reviews>Audi S6</Reviews>
    </Episode>
    <Episode>
        <Series>3</Series>
        <EpisodeID>2</EpisodeID>
        <Presenters>Jeremy Clarkson</Presenters>
        <Presenters>Richard Hammond</Presenters>
        <Presenters>The Stig</Presenters>
        <Guests>Stephen Fry</Guests>
        <Reviews>BMW M3 CSL</Reviews>
        <Reviews>BMW M1</Reviews>
        <Reviews>BMW M3</Reviews>
        <Reviews>BMW M5</Reviews>
        <Reviews>Porsche Boxster</Reviews>
        <Reviews>BMW Z4</Reviews>
        <Reviews>Honda S2000</Reviews>
    </Episode>
</ListOfEpisodes>
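
An Episode has several Reviews children, so each Reviews value is a separate key value for that Episode, and the same Episode can be found under many key values. To make clear what the key lookup selects, the sketch below counts the nodes returned by the key function and by a plain XPath predicate that expresses the same selection without a key. It assumes topgear.xml as input; the element names Counts, KeyLookup and PredicateLookup are illustrative only. Both counts should be 4 for topgear.xml, matching the four Episode elements in the output above.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
    xmlns:eps="http://www.petervannes.nl/ListOfEpisodes">

    <!-- Same key definition as in Example_xsl-key.xsl -->
    <xsl:key name="EpisodesOnReviews" match="eps:Episode" use="eps:Reviews"/>

    <xsl:template match="/">
        <xsl:element name="Counts">
            <!-- Number of Episode nodes found through the key -->
            <xsl:element name="KeyLookup">
                <xsl:value-of select="count(key('EpisodesOnReviews','BMW M3'))"/>
            </xsl:element>
            <!-- Number of Episode nodes found by testing every Episode -->
            <xsl:element name="PredicateLookup">
                <xsl:value-of select="count(//eps:Episode[eps:Reviews = 'BMW M3'])"/>
            </xsl:element>
        </xsl:element>
    </xsl:template>
</xsl:stylesheet>

The predicate form tests every Episode on each evaluation, while a key allows the processor to build an index once; how much this helps depends on the XSLT processor used.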

Using the generate-id and key functions, a list of elements that is unique on a user-defined key can be extracted from an existing XML message. In the next example XSL, Example_unique1.xsl, a unique list of Series and EpisodeID combinations is extracted from the topgear.xml file. The key EpisodeUnqIdentifier is based on the Series and EpisodeID elements, making it possible to retrieve a node-set by key value. Note that multiple nodes in the source message may have the same key value, so a key value is not necessarily unique. The key function may therefore, intentionally or unintentionally, return a node-set with multiple nodes, as in Example_xsl-key.xsl. In the example below a separation character is added between the two elements of the key, preventing the key value for Series ‘1’ and Episode ‘11’ from being confused with the one for Series ‘11’ and Episode ‘1’.
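
The effect of the separation character can be illustrated with a small sketch that does not read anything from the input document; the values '1' and '11' are hard-coded and the element names KeyCollision, WithoutSeparator and WithSeparator are made up for the illustration. WithoutSeparator yields true, because both concatenations give '111'; WithSeparator yields false.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

    <xsl:template match="/">
        <xsl:element name="KeyCollision">
            <!-- '1' followed by '11' and '11' followed by '1' are both '111' -->
            <xsl:element name="WithoutSeparator">
                <xsl:value-of select="concat('1','11') = concat('11','1')"/>
            </xsl:element>
            <!-- '1-11' and '11-1' are different strings -->
            <xsl:element name="WithSeparator">
                <xsl:value-of select="concat('1','-','11') = concat('11','-','1')"/>
            </xsl:element>
        </xsl:element>
    </xsl:template>
</xsl:stylesheet>

The separator only helps when it cannot occur in the key elements themselves, so pick a character that is not part of the data.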

Selection of the unique Episode elements based on the EpisodeUnqIdentifier key is done in the for-each element.

<xsl:for-each select="//eps:Episode[generate-id() = 
    generate-id(key('EpisodeUnqIdentifier',concat(eps:Series,'-',eps:EpisodeID))[1])]">

The expression assigned to the select attribute of the for-each instruction selects all eps:Episode elements whose generated id matches the generated id of the first element in the node-set returned by the key function. The value for the key argument of the key function is built from the values of Series and EpisodeID available within the current context. In other words, the predicate is evaluated for every eps:Episode element selected by the expression //eps:Episode, which are the same elements that are matched by the xsl:key element. For each eps:Episode element the values of the child elements eps:Series and eps:EpisodeID are used as input for the key function. The generated id of the first node in the node-set returned by the key function is compared with the generated id of the current eps:Episode element. When both generated ids match, the current eps:Episode is the first one in document order with that key value, and it is selected for the next step in the transformation. All later eps:Episode elements with the same key value are skipped, so every key value is processed exactly once.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
    xmlns:eps="http://www.petervannes.nl/ListOfEpisodes">
    
    <!-- Define the key EpisodeUnqIdentifier -->
    <xsl:key name="EpisodeUnqIdentifier" match="eps:Episode" use="concat(eps:Series,'-',eps:EpisodeID)"/>
    
    <xsl:template match="/">
        <xsl:element name="ListOfEpisodesAndSeries">
            <!-- Select all Episodes with a unique user-key -->
            <xsl:for-each select="//eps:Episode[generate-id() = 
                generate-id(key('EpisodeUnqIdentifier',concat(eps:Series,'-',eps:EpisodeID))[1])]">
                <xsl:element name="EpisodeIdentifier">
                    <xsl:element name="Series">
                        <xsl:value-of select="./eps:Series"/>
                    </xsl:element>
                    <xsl:element name="Episode">
                        <xsl:value-of select="./eps:EpisodeID"/>
                    </xsl:element>
                </xsl:element>
            </xsl:for-each>
        </xsl:element>
    </xsl:template>
</xsl:stylesheet>



Transforming topgear.xml with the XSL Example_unique1.xsl results in the output below.

<?xml version="1.0" encoding="utf-8"?>
<ListOfEpisodesAndSeries>
    <EpisodeIdentifier>
        <Series>1</Series>
        <Episode>1</Episode>
    </EpisodeIdentifier>
    <EpisodeIdentifier>
        <Series>1</Series>
        <Episode>2</Episode>
    </EpisodeIdentifier>
    <EpisodeIdentifier>
        <Series>1</Series>
        <Episode>3</Episode>
    </EpisodeIdentifier>
    <EpisodeIdentifier>
        <Series>2</Series>
        <Episode>1</Episode>
    </EpisodeIdentifier>
    <EpisodeIdentifier>
        <Series>2</Series>
        <Episode>2</Episode>
    </EpisodeIdentifier>
    <EpisodeIdentifier>
        <Series>2</Series>
        <Episode>3</Episode>
    </EpisodeIdentifier>
    <EpisodeIdentifier>
        <Series>3</Series>
        <Episode>1</Episode>
    </EpisodeIdentifier>
    <EpisodeIdentifier>
        <Series>3</Series>
        <Episode>2</Episode>
    </EpisodeIdentifier>
</ListOfEpisodesAndSeries>
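
The comparison of generated ids is not the only way to test whether the current node is the first node of its key value. A commonly seen alternative, sketched below as a variation on Example_unique1.xsl and not as part of the original examples, takes the union of the current node and the first node returned by the key function. A node-set never contains the same node twice, so the union counts one node only when both are the same node. With topgear.xml as input this should give the same output as Example_unique1.xsl.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
    xmlns:eps="http://www.petervannes.nl/ListOfEpisodes">

    <!-- Same key definition as in Example_unique1.xsl -->
    <xsl:key name="EpisodeUnqIdentifier" match="eps:Episode" use="concat(eps:Series,'-',eps:EpisodeID)"/>

    <xsl:template match="/">
        <xsl:element name="ListOfEpisodesAndSeries">
            <!-- The union holds one node only if the current Episode is the first of its key value -->
            <xsl:for-each select="//eps:Episode[count(. | 
                key('EpisodeUnqIdentifier',concat(eps:Series,'-',eps:EpisodeID))[1]) = 1]">
                <xsl:element name="EpisodeIdentifier">
                    <xsl:element name="Series">
                        <xsl:value-of select="eps:Series"/>
                    </xsl:element>
                    <xsl:element name="Episode">
                        <xsl:value-of select="eps:EpisodeID"/>
                    </xsl:element>
                </xsl:element>
            </xsl:for-each>
        </xsl:element>
    </xsl:template>
</xsl:stylesheet>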

Another transformation, Example_unique2.xsl, shows how to extract a list of unique guests per season (series) from the episodes in which the BMW M3 is reviewed.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
    xmlns:eps="http://www.petervannes.nl/ListOfEpisodes">

    <xsl:key name="EpisodeWithBMW3UnqQuest" match="eps:Episode[eps:Reviews = 'BMW M3']"
        use="concat(eps:Series,'-',eps:Guests)"/>

    <xsl:template match="/">
        <xsl:element name="ListOfEpisodes">
            <xsl:for-each
                select="//eps:Episode[generate-id() = 
                generate-id(key('EpisodeWithBMW3UnqQuest',concat(eps:Series,'-',eps:Guests))[1])]">
                <xsl:element name="Episode">
                    <xsl:element name="Series">
                        <xsl:value-of select="eps:Series"/>
                    </xsl:element>
                    <xsl:element name="EpisodeID">
                        <xsl:value-of select="eps:EpisodeID"/>
                    </xsl:element>
                    <xsl:element name="Guests">
                        <xsl:value-of select="eps:Guests"/>
                    </xsl:element>
                </xsl:element>
            </xsl:for-each>
        </xsl:element>
    </xsl:template>
</xsl:stylesheet>

This results in:

<?xml version="1.0" encoding="utf-8"?>
<ListOfEpisodes>
    <Episode>
        <Series>1</Series>
        <EpisodeID>3</EpisodeID>
        <Guests>Ross Kemp</Guests>
    </Episode>
    <Episode>
        <Series>2</Series>
        <EpisodeID>2</EpisodeID>
        <Guests>Jamie Oliver</Guests>
    </Episode>
    <Episode>
        <Series>3</Series>
        <EpisodeID>2</EpisodeID>
        <Guests>Stephen Fry</Guests>
    </Episode>
</ListOfEpisodes>
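
The output above keeps only the first Episode for each combination of Series and Guests. When the other Episode elements sharing the same key value are needed as well, the key function can be called again inside the for-each. The sketch below is a variation on Example_unique2.xsl and not part of the original examples; the element names ListOfGroups and Group are made up for this illustration. For topgear.xml the group for Series 2 and Jamie Oliver should list the EpisodeID values 2 and 3.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
    xmlns:eps="http://www.petervannes.nl/ListOfEpisodes">

    <!-- Same key definition as in Example_unique2.xsl -->
    <xsl:key name="EpisodeWithBMW3UnqQuest" match="eps:Episode[eps:Reviews = 'BMW M3']"
        use="concat(eps:Series,'-',eps:Guests)"/>

    <xsl:template match="/">
        <xsl:element name="ListOfGroups">
            <!-- Outer loop: one iteration per unique Series and Guests combination -->
            <xsl:for-each
                select="//eps:Episode[generate-id() = 
                generate-id(key('EpisodeWithBMW3UnqQuest',concat(eps:Series,'-',eps:Guests))[1])]">
                <xsl:element name="Group">
                    <xsl:element name="Series">
                        <xsl:value-of select="eps:Series"/>
                    </xsl:element>
                    <xsl:element name="Guests">
                        <xsl:value-of select="eps:Guests"/>
                    </xsl:element>
                    <!-- Inner loop: all Episode nodes sharing the key value of the current Episode -->
                    <xsl:for-each
                        select="key('EpisodeWithBMW3UnqQuest',concat(eps:Series,'-',eps:Guests))">
                        <xsl:element name="EpisodeID">
                            <xsl:value-of select="eps:EpisodeID"/>
                        </xsl:element>
                    </xsl:for-each>
                </xsl:element>
            </xsl:for-each>
        </xsl:element>
    </xsl:template>
</xsl:stylesheet>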

The zip file xsl_unique_elmts.zip with the examples used can be found here.

