TIL: Force ASCII output with XSLT
Today I learned how to get ASCII output from an XSLT transformation using "Unicode decomposition". In the past I've seen and used long character maps or translate, but they don't seem to cover every use case. This approach seems a little more elegant.
This was a Workday Community question response by Parry.
Something didn't look like it copy+pasted well with Parry's solution. I asked AI to help me clean it up and this is what it came up with:
Cons: This is an "all or nothing" transformation. If a string contains even one character outside Latin-1, the whole string is decomposed, so accented Latin-1 letters elsewhere in it (é, ñ) lose their accents too. That can be acceptable when not crashing is a higher priority than preserving accents.
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:my="http://www.example.com/functions"
exclude-result-prefixes="xs my">
<!-- The function -->
<xsl:function name="my:stripAccents" as="xs:string">
<xsl:param name="input" as="xs:string?"/>
<xsl:choose>
<xsl:when test="not($input)">
<xsl:value-of select="''"/>
</xsl:when>
<xsl:when test="matches($input, '^[&#x20;-&#xFF;]+$')">
<xsl:value-of select="$input"/>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="replace(normalize-unicode($input, 'NFKD'), '[\p{M}]', '')"/>
</xsl:otherwise>
</xsl:choose>
</xsl:function>
<!-- How to call it -->
<xsl:template match="Row">
<xsl:value-of select="my:stripAccents(CustomerName)"/>
</xsl:template>
</xsl:stylesheet>
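To make the "all or nothing" behavior concrete, here is a minimal, self-contained sketch. It assumes an XSLT 2.0 processor and a stylesheet saved as UTF-8; the demo: namespace, the function name, and the sample names are made up for illustration, and the Latin-1 check is written with character references rather than copied from the stylesheet above.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:demo="http://www.example.com/demo"
    exclude-result-prefixes="xs demo">

  <xsl:output method="text"/>

  <!-- Hypothetical helper with the same two-branch logic:
       pass Latin-1-only strings through, decompose everything else -->
  <xsl:function name="demo:stripAccents" as="xs:string">
    <xsl:param name="input" as="xs:string?"/>
    <xsl:sequence select="
      if (matches(string($input), '^[&#x20;-&#xFF;]*$'))
      then string($input)
      else replace(normalize-unicode($input, 'NFKD'), '\p{M}', '')"/>
  </xsl:function>

  <!-- Run against any input document; the input itself is ignored -->
  <xsl:template match="/">
    <!-- Only Latin-1 characters: passes through, accent kept -->
    <xsl:value-of select="demo:stripAccents('José')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- One non-Latin-1 character (ő) sends the whole string through
         decomposition, so é loses its accent as well -->
    <xsl:value-of select="demo:stripAccents('José Erdős')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

The first call prints José unchanged, while the second prints Jose Erdos. That is the trade-off described in the cons above.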
Generally speaking, using translate is not a robust method for doing character encoding. You'll ultimately miss characters you didn't even know existed. For example, you do not handle Hungarian names that have letters like Ő.
I would suggest trying a method called Unicode decomposition. Unicode characters (the default encoding for Workday) have a predefined method of splitting an accented character into atomic parts. In your example above, Â would become A followed by the combining circumflex (^). Once the character is decomposed like this, you can remove the combining marks (in this case the circumflex), which leaves the plain A.
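To see the decomposition itself, a small sketch can print the code points before and after normalization. It assumes an XSLT 2.0 processor; the stylesheet is illustrative only and not part of Parry's answer.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Run against any input document; the input itself is ignored -->
  <xsl:template match="/">
    <!-- Â as a single precomposed character (U+00C2) -->
    <xsl:value-of select="string-to-codepoints('&#xC2;')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- After NFKD: base letter A, then the combining circumflex (U+0302) -->
    <xsl:value-of select="string-to-codepoints(normalize-unicode('&#xC2;', 'NFKD'))"/>
    <xsl:text>&#10;</xsl:text>
    <!-- Dropping the combining marks (\p{M}) leaves the plain letter -->
    <xsl:value-of select="replace(normalize-unicode('&#xC2;', 'NFKD'), '\p{M}', '')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

This prints 194, then 65 770, then A: one code point becomes two, and removing the second one yields the ASCII letter.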
<!-- If the input is entirely ISO-8859-1, let it pass through unchanged -->
<xsl:function name="this:stripAccents">
  <xsl:param name="input" as="xs:string"/>
  <!-- [ -ÿ] is the regex range covering the ISO-8859-1 encoded characters -->
  <!-- Otherwise decompose the string (NFKD) and replace the combining marks with the empty string, i.e. where the accent removal takes place -->
  <xsl:value-of select="if (matches($input, '^[ -ÿ]+$')) then $input else replace(normalize-unicode($input, 'NFKD'), '[\p{M}]', '')"/>
</xsl:function>
Usage would look like this:
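this:stripAccents("Þórr")
Decomposing Þórr gives Þorr: the accent on ó is removed, but Þ, although part of the ISO-8859-1 char set, has no decomposition and is not ASCII, so it stays. (As written, the function passes a string that is entirely ISO-8859-1, like this one, through unchanged; decomposition only runs when at least one character falls outside that range.)
If this does not meet all of your requirements, you can also use the XSLT function replace in conjunction with codepoints-to-string. codepoints-to-string takes Unicode code points (the same numbers used in HTML numeric entities) and converts them to a string. This is useful for representing hard-to-type or difficult-to-encode characters. In the example below, I'm using this to replace EN DASHes ( – ), since they cannot be converted to ASCII using decomposition.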
replace(wd:Name, codepoints-to-string(8211), '-')
"Hello – World" ==> "Hello - World"
Hope that helps! If you require only ASCII, I would suggest first using decomposition and then replacing characters such as Þ with a predefined transliteration that the vendor expects. Removing characters such as Þ or ð could potentially cause issues with legal names.
Parry
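Parry's closing suggestion (decompose first, then transliterate what is left) could be sketched roughly as follows. This assumes an XSLT 2.0 processor and a stylesheet saved as UTF-8. The demo: namespace, the function name, and the Th/th/D/d mappings are placeholders of mine, not something from the thread; the real mappings should come from the vendor.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:demo="http://www.example.com/demo"
    exclude-result-prefixes="xs demo">

  <xsl:output method="text"/>

  <!-- Hypothetical helper: decompose, drop marks, then transliterate leftovers -->
  <xsl:function name="demo:toAscii" as="xs:string">
    <xsl:param name="input" as="xs:string?"/>
    <!-- Step 1: split accented letters and remove the combining marks -->
    <xsl:variable name="stripped"
        select="replace(normalize-unicode($input, 'NFKD'), '\p{M}', '')"/>
    <!-- Step 2: en dash (code point 8211) becomes a plain hyphen -->
    <xsl:variable name="dashed"
        select="replace($stripped, codepoints-to-string(8211), '-')"/>
    <!-- Step 3: assumed transliterations for letters that do not decompose -->
    <xsl:variable name="thorn"
        select="replace(replace($dashed, 'Þ', 'Th'), 'þ', 'th')"/>
    <xsl:sequence
        select="replace(replace($thorn, 'Ð', 'D'), 'ð', 'd')"/>
  </xsl:function>

  <!-- Run against any input document; the input itself is ignored -->
  <xsl:template match="/">
    <xsl:value-of select="demo:toAscii('Þórr – Sigurðr')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

With these placeholder mappings the sample prints Thorr - Sigurdr. Unlike the function earlier in this post, this sketch always decomposes, so it never keeps Latin-1 accents; characters it does not know about (other dashes, non-Latin scripts) still pass through untouched and would need their own rules.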