Lars Nielsen's Discoveries

February 29, 2012

White boxes: handling special ANSI characters in XSL and XML

Filed under: Branding,SharePoint — Lars Nielsen @ 9:43 pm

Using a Data View web part, you can render XML content from a public XML feed (in my case a vacancies list) onto a SharePoint page. You add the URL of the data feed to the XML files section of the data source library. The feed I was using was simply a public URL that returns well-formed XML. The XML content that comes back starts with this:


<?xml version="1.0" encoding="ISO-8859-1"?>

This says that the XML is (or should be) encoded using ISO-8859-1. I wrote my XSL file and configured the Data View web part to reference it. Everything worked fine except that the text on the web page contained small white boxes in place of some of the punctuation marks. On inspecting the raw XML from the feed, I found bytes in the stream that correspond to printable characters in the Windows-1252 code page but not in the ISO-8859-1 standard. The one causing me particular trouble was character #92 (hex), a backwards-facing apostrophe (in Windows-1252, the right single quotation mark) that Microsoft Word commonly inserts into text as a “smart quote”. I suspect that someone originally copied the text directly out of an MS Word document and pasted it into a textbox, so that the character ended up in the database and was rendered out into the XML.
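
To make the mismatch concrete, here is a small Python sketch (not part of my SharePoint setup) that decodes the same bytes under both encodings. The sample bytes are made up for illustration; only the 0x92 byte matters.

```python
import unicodedata

# Hypothetical sample of feed bytes: "Lars", then the stray byte 0x92, then "s".
raw = b"Lars\x92s"

# Decode the way the XML declaration promises (ISO-8859-1).
as_latin1 = raw.decode("iso-8859-1")
# Decode the way the text was most likely typed (Windows-1252).
as_cp1252 = raw.decode("windows-1252")

# Under ISO-8859-1 the byte becomes U+0092, a C1 control character (category "Cc"),
# which has no glyph of its own.
print(hex(ord(as_latin1[4])), unicodedata.category(as_latin1[4]))

# Under Windows-1252 the same byte is a printable punctuation mark.
print(hex(ord(as_cp1252[4])), unicodedata.name(as_cp1252[4]))
```

An XML parser that trusts the declaration ends up with the control character, which would explain the white boxes.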

I wanted the output of my XSL transformation to render as UTF-8, which is the encoding used by the containing SharePoint page. So I tried setting my XSL to encode the output by adding an xsl:output line.

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" encoding="utf-8"/>

I thought this might render the backwards apostrophe as a readable UTF-8 apostrophe, but it didn’t. So I tried switching the encoding to Windows-1252:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" encoding="windows-1252"/>

This rendered the backwards apostrophe correctly on the screen (in IE 8) but broke other characters instead. I realised the real problem was that the original XML should not be using the #92 character at all; it should use a regular single quote character (#27 hex), which is what the XML character entity “&apos;” translates to. So in my XSL I added a template to replace character #92 with #27. To represent these characters in XSL or XML you can use a numeric character reference: an ampersand and a hash (&#), followed by the decimal code for the character, followed by a semicolon. (A hex code also works if you prefix it with x, as in &#x92;.) #92 hex is 146 decimal, and #27 hex is 39 decimal. So I created an XSL template and called it, passing in the original string as a parameter, like this:

<p class="description">
   <xsl:call-template name="CleanUpUnreadableChars">
      <xsl:with-param name="str" select="job_description" />
   </xsl:call-template>
</p>

.....

<xsl:template name="CleanUpUnreadableChars">
   <xsl:param name="str" />
   <!-- Replace ANSI char 146 (backwards apostrophe) with char 39 (single quote) so that it renders as a readable char -->
   <xsl:value-of select='translate($str,"&#146;","&#039;")' />
</xsl:template>

This works OK and transforms the “unprintable” backwards apostrophe into a standard single quote that renders in the browser as expected. Because it uses the XSL translate function, I can add other character transformations later if I find other spurious characters in the XML feed.
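
As a rough sketch of how that table of replacements might grow, here is the same idea in Python, whose str.translate behaves much like the XSL translate function. The extra code points (0x91, 0x93, 0x94) and the function names are my own assumptions for illustration; so far I have only seen 0x92 in the feed.

```python
# Stray Windows-1252 code points (hex) mapped to plain ASCII stand-ins.
# Only 0x92 comes from the feed; the others are assumed likely companions.
REPLACEMENTS = {
    0x91: "'",   # left single smart quote
    0x92: "'",   # right single smart quote / apostrophe
    0x93: '"',   # left double smart quote
    0x94: '"',   # right double smart quote
}


def clean_up_unreadable_chars(text: str) -> str:
    """Replace each stray code point with its ASCII stand-in."""
    return text.translate(REPLACEMENTS)


def numeric_ref(code_point: int) -> str:
    """Build the decimal numeric character reference for a code point."""
    return f"&#{code_point};"


print(clean_up_unreadable_chars("It\x92s \x93fine\x94"))
print(numeric_ref(0x92), numeric_ref(0x27))
```

The numeric_ref helper is only there to double-check the hex-to-decimal conversion before pasting a reference into the XSL.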
