XML for Java Developers

G22.3033-002

Dr. Jean-Claude Franchitti

New York University

Computer Science Department

Courant Institute of Mathematical Sciences

 

Session 2: XSL Transformations (XSLT)

 


 

XSLT Overview

 

The W3C's XSL Working Group has released an update to the XSL Working Draft specification. The new XSL draft moves virtually every feature (with the exception of formatting objects) into a new working draft called XSL Transformations, or simply XSLT.

 

XSL now refers specifically to XSL formatting objects, which are to XML what Cascading Style Sheets are to HTML.

 

XSLT retains features such as tree processing, patterns, and templates, and adds a number of new ones.

 

The pattern syntax has been expanded and a new syntax called location paths has been introduced.

 

The most dramatic change, however, is the addition of a complete expression language, which looks much like a small programming language.

 

The following covers the salient points of XSLT expressions and gives you an idea of how you can use them. Iteration and conditional processing in XSLT are also covered.

 

To illustrate the technology, the following shows how you can take any number of database records in XML format, sort them in either ascending or descending order, and transform them to HTML for presentation.

 

Tree Processing in XSLT

 

An XML document can be broken down into a collection of objects that are ordered in a hierarchical fashion. This hierarchical representation is called a tree.

 

Tree structures are useful because they express relationships between your XML elements in a very simple way. In fact, hierarchical data structures are excellent for organizing data because they have a singular, unambiguous point of view, making their semantics very powerful.

 

To process your XML documents, including transforming XML into HTML, you need a way to go through this tree and select particular elements. Once you have an element in hand, you can do all kinds of things to it in preparation for output:

 

·       Add formatting to a headline style

·       Generate text for a text element

·       Create an entirely new element

·       Process the same element in different ways, based on a set of constraints

·       Process a set of elements with a common structure (as with database records)

 

To quickly review, an XML processor takes a marked-up document and produces a tree-like structure containing the elements, attributes, entities, and so on, of your document. At this point, you can access the "objects" in this tree through any Document Object Model (DOM) API. You can also invoke an XSL processor to apply formatting to these objects and output them in just about any manner you can think of. The XSL processor takes the tree generated in XML, called the "source tree," and creates a new "result tree" that includes all of the objects to be output along with pertinent formatting information.
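As a minimal sketch of the first step described above, the following Java program parses a fragment of a news story into a DOM tree and prints an indented outline of its elements. It assumes the JAXP DOM parser that ships with the JDK; the class name SourceTreeDemo and the shortened story text are illustrative only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class SourceTreeDemo {

    // Parse an XML string into a DOM tree (the "source tree").
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Return one line per element, indented by its depth in the tree.
    static String outline(Node node, int depth) {
        StringBuilder out = new StringBuilder();
        int childDepth = depth;
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            for (int i = 0; i < depth; i++) {
                out.append("  ");
            }
            out.append(node.getNodeName()).append('\n');
            childDepth = depth + 1;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            out.append(outline(child, childDepth));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<Story>"
                + "<Headline>New Web Graphics Standard Emerges</Headline>"
                + "<BodyText ID=\"P1\"><DropCap>W</DropCap>hile XML has been used for text</BodyText>"
                + "</Story>";
        // Prints Story, then Headline and BodyText one level in, then DropCap two levels in.
        System.out.print(outline(parse(xml), 0));
    }
}
```

The outline makes the hierarchy explicit: DropCap is a child of BodyText, which in turn is a child of the Story element.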

 

As a style-sheet developer, you can control the creation of the result tree process through template rules. You define a template rule in a style sheet using the <xsl:template> element. Your rule generally consists of two parts: a pattern that's used to match with elements, attributes, and other nodes in the source tree; and the template that generates part of the result tree. For example, the template rule in the following example looks for paragraph elements in the source tree.

 

Example 1:

 

<xsl:template match="para">
  <fo:block font-size="10pt" space-before="12pt">
    <xsl:apply-templates/>
  </fo:block>
</xsl:template>

 

 

When the processor finds such an element, the formatting portion of the template rule is applied to the paragraph content.

 

Your template rules are placed inside an <xsl:stylesheet> element, as shown in the following example.

 

Example 2:

 

<xsl:stylesheet xmlns:xsl="http://www.w3.org/TR/WD-xsl"
                xmlns:fo="http://www.w3.org/TR/WD-xsl"
                result-ns="fo">
  <xsl:template match="para">
    <fo:block font-size="10pt" space-before="12pt">
      <xsl:apply-templates/>
    </fo:block>
  </xsl:template>
</xsl:stylesheet>

 

The style-sheet element has several optional attributes. (Table 1, which follows, summarizes the pattern syntax covered later under "Patterns in XSL.")

 

CHARACTER   DESCRIPTION
|           The or operator (|) expresses alternatives; for example, (emph|b)
/           Used to compose a longer pattern
//          Matches descendants instead of children
.           Selects the current node
..          Selects the parent of the current node
*           Wildcard character; matches all elements

Table 1: XSL Pattern Syntax

 

However, Example 2 simply defines the namespace, xmlns:xsl, which is a required step. Note that the namespace must point to the URI shown in the example. Next, a namespace for the result object is defined. Example 2 creates a namespace, fo, for formatting objects and assigns it to the result namespace.

 

There are many unresolved questions related to the implementation of formatting objects in a device-independent manner. So, the formatting-objects portion of the XSL proposal is still in limbo.

 

Asked whether Microsoft supported them, the company's XML evangelist, Adam Bosworth, responded that Microsoft does not support the formatting-objects portion of the W3C proposal, nor does it have plans to support them in the future.

 

Marie Wieck, director of IBM's technology network computing software division, concedes that there are still ambiguities in the proposal and that more work needs to be done. However, IBM will support the XSL standard when it solidifies, assuming customer demand warrants it.

 

Fortunately, the result namespace in Example 2 can refer to something other than formatting objects. For example, later in this session we will transform an XML document and output HTML. The first step in setting up such a transformation is defining the following namespace:

 

xmlns="http://www.w3.org/TR/REC-html40"

result-ns=""

 

In this case, the result-ns is optional and is included for illustration purposes. As you develop your style sheets, keep in mind that XSL-defined elements are recognized only in the style sheet, not in the source document.
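To see a template rule run end to end, here is a minimal sketch in Java using the JAXP transformation API bundled with the JDK. It rests on several assumptions that are not part of the working draft discussed here: the processor implements the later XSLT 1.0 Recommendation, so the style sheet uses the namespace http://www.w3.org/1999/XSL/Transform rather than the draft URI shown in Example 2; an HTML P element stands in for fo:block so the result is easy to inspect; and the class name TemplateRuleDemo is illustrative.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class TemplateRuleDemo {

    // One template rule: the pattern "para" plus the template that wraps its content.
    // Assumption: XSLT 1.0 namespace, not the working-draft namespace of Example 2.
    static final String STYLESHEET =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
          + "<xsl:template match='para'>"
          + "<P><xsl:apply-templates/></P>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    // Apply a style sheet to a source document and return the serialized result tree.
    static String transform(String stylesheet, String sourceXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter result = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(sourceXml)),
                    new StreamResult(result));
            return result.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String source = "<doc><para>First</para><para>Second</para></doc>";
        // Each para element in the source tree becomes a P element in the result tree.
        System.out.println(transform(STYLESHEET, source));
    }
}
```

The doc element has no rule of its own, so the processor's built-in rule simply descends into its children, where the para rule takes over.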

 

Patterns in XSL

 

Patterns can be used to locate objects within an XML document tree. From there, you can specify template rules that let you format these objects. 

 

Patterns

Returning to our source tree, it would be nice to be able to locate any node and then apply specific formatting to that node. That's where patterns come in.

 

You use a pattern to select a node or set of nodes in the source document. In this way, you can control how the XSL processor processes your document.

 

The syntax for creating patterns is straightforward and resembles the paths used in directory structures. Therefore, it's helpful to remember that the patterns you specify are always in relation to your current position in the tree. The simplest pattern is an element type. For example, a pattern <xsl:template match = "chapter"> matches any child element that's a chapter.

 

There are a number of operators that let you control how to search for patterns within the tree. As mentioned earlier, pattern syntax resembles the syntax used for traversing directory structures. For instance, a period represents the current node in the tree, just as it would represent the current directory in a directory structure. Likewise, two periods (..) refer to the parent of the current node. The slash character (/) lets you compose longer patterns one step at a time. For example, chapter/title/paragraph would start at the current node, look for a chapter child, then a title child of that chapter, and ultimately match any paragraph children of that title. Again, this is very much like using directory paths, so it should feel intuitive.

You can also use the wildcard character (*) to match all elements. For example, */subhead selects all subhead grandchildren of the current node. On the other hand, chapter/* matches any element that has a chapter parent. Another operator, //, matches descendants instead of children. For example, chapter//subhead matches all subhead elements with a chapter ancestor.

 

You can create alternative paths through the tree using the or operator (|). For instance, you could select either a chapter or an appendix using chapter|appendix. You can also string longer patterns together. As you construct your patterns, however, keep in mind that / binds more tightly than |. For example, */chapter/title | ../preface/title would select either the title of any chapter grandchild of the current node, or the title of a preface child of the current node's parent. Note the white space between the selectors on either side of the or operator. White space is not significant, so you can break things as you like for better readability.
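The path operators above can be tried out directly. The sketch below evaluates several of them with the XPath 1.0 engine in the JDK (javax.xml.xpath), on the assumption that XPath 1.0 location paths behave like the draft patterns for these simple cases. The book document and the class name PathPatternDemo are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PathPatternDemo {

    // Hypothetical sample document, written without white space between elements.
    static final String BOOK =
            "<book>"
          + "<chapter><title>One</title><subhead>A</subhead>"
          + "<section><subhead>B</subhead></section></chapter>"
          + "<chapter><title>Two</title></chapter>"
          + "<appendix><title>Extra</title><subhead>C</subhead></appendix>"
          + "</book>";

    // Parse the XML and return its document element to use as the current node.
    static Node documentElement(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Text content of every node the path selects, relative to the context node.
    static List<String> select(Node context, String path) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(path, context, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + path, e);
        }
    }

    public static void main(String[] args) {
        Node book = documentElement(BOOK);
        System.out.println(select(book, "chapter/title"));     // titles of chapter children
        System.out.println(select(book, "*/subhead"));         // subhead grandchildren: A and C
        System.out.println(select(book, "chapter//subhead"));  // subheads below a chapter: A and B
        System.out.println(select(book, "chapter/title | appendix/title")); // all three titles

        Node firstChapter = book.getFirstChild();              // now the first chapter is current
        System.out.println(select(firstChapter, "./title"));            // its own title
        System.out.println(select(firstChapter, "../appendix/title"));  // up to book, then down
    }
}
```

Note how the same relative path gives different results depending on the current node, exactly as a relative directory path would.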

 

Other Node Types

 

So far, we have described the syntax for accessing element nodes within the source tree. But a node can contain other objects as well. To distinguish these other objects, you'll have to identify them for the XSL processor.

 

For example, to identify an object as an attribute, you must prefix the attribute name with the @ symbol. The pattern syntax is pretty much the same, though. The figure/@caption pattern selects the caption attribute of the figure element, which is a child of the current node. The @* pattern selects all attributes.

 

You can select comments in the source tree using the comment() pattern. Using the comment() pattern without any arguments selects all comment nodes. Similarly, the pi() pattern matches all processing-instruction child nodes. In addition, you can specify an argument that indicates a target for the processing instruction, such as pi("xml-stylesheet").
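The following sketch selects attributes, comments, and processing instructions using the XPath 1.0 engine in the JDK. One difference from the draft syntax described above should be kept in mind: XPath 1.0 spells the processing-instruction test as processing-instruction() rather than pi(). The sample document and the class name NodeTypeDemo are assumptions made for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeTypeDemo {

    // Hypothetical document with a processing instruction, a comment, and attributes.
    static final String DOC =
            "<?xml-stylesheet type=\"text/xsl\" href=\"news.xsl\"?>"
          + "<doc><!--draft--><figure caption=\"Figure1\" width=\"40\"/></doc>";

    // Evaluate an expression with the document element as the current node
    // and return the result converted to a string.
    static String eval(String xml, String expression) {
        try {
            Node current = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
            return XPathFactory.newInstance().newXPath().evaluate(expression, current);
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(eval(DOC, "figure/@caption"));   // Figure1
        System.out.println(eval(DOC, "count(figure/@*)"));  // 2 (caption and width)
        System.out.println(eval(DOC, "comment()"));         // draft
        // XPath 1.0 name for the draft's pi("xml-stylesheet"); the leading / starts at the root.
        System.out.println(eval(DOC, "/processing-instruction('xml-stylesheet')"));
    }
}
```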

 

Tests, Comparisons, And Refinement

 

You can refine the result returned by a pattern by specifying a test within square brackets ([ ]) after the pattern. For example, list[@type] matches list elements with a type attribute, and book[editor] matches child book elements that have at least one editor child element.

 

Another thing XSL lets you do is compare patterns to strings. For example, list[@type="ordered"] matches list elements whose type attribute has the value ordered, and figure[@caption="Figure1"] matches figure elements whose caption attribute is "Figure1". Finally, contact[name="Joe Butler"] selects contact child elements that have a name child with the value "Joe Butler".
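The bracketed tests can be checked with a short Java sketch. It again assumes the XPath 1.0 engine in the JDK, whose predicate syntax matches the draft for these cases. The sample document, the phone numbers, and the class name PredicateDemo are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class PredicateDemo {

    // Hypothetical document exercising each kind of test.
    static final String DOC =
            "<doc>"
          + "<list type=\"ordered\"/><list type=\"bullet\"/><list/>"
          + "<book><editor>Ed</editor><title>T1</title></book>"
          + "<book><title>T2</title></book>"
          + "<contact><name>Joe Butler</name><phone>555-0100</phone></contact>"
          + "<contact><name>Ann Lee</name><phone>555-0101</phone></contact>"
          + "</doc>";

    // Evaluate an expression with the document element as the current node
    // and return the result converted to a string.
    static String eval(String xml, String expression) {
        try {
            Node current = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
            return XPathFactory.newInstance().newXPath().evaluate(expression, current);
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(eval(DOC, "count(list[@type])"));                // 2
        System.out.println(eval(DOC, "count(list[@type='ordered'])"));      // 1
        System.out.println(eval(DOC, "book[editor]/title"));                // T1
        System.out.println(eval(DOC, "contact[name='Joe Butler']/phone"));  // 555-0100
    }
}
```

In each case the test narrows the set of elements selected by the pattern in front of the brackets; it does not change which kind of node is returned.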

 

XSL also lets you test for positions relative to a sibling. In particular, you can select the first and last child elements in a branch, as well as the first and last elements of their respective types.

 

The options are listed in the following table:

 

TEST              DESCRIPTION
first-of-any()    Succeeds if the node is the first element child
last-of-any()     Succeeds if the node is the last element child
first-of-type()   Succeeds if the node is the first element child of its type
last-of-type()    Succeeds if the node is the last element child of its type

Table 2: Selecting the position of a node
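The four tests in Table 2 belong to the working draft and are not available in the XPath 1.0 engine shipped with the JDK. As a rough sketch, and on the assumption that positional predicates are an acceptable stand-in, the same selections can be written with [1] and [last()]. The sample section element and the class name PositionDemo are illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class PositionDemo {

    // Hypothetical section with four element children.
    static final String SECTION =
            "<section><title>T</title><para>p1</para><para>p2</para><note>n</note></section>";

    // Evaluate an expression with the document element as the current node
    // and return the result converted to a string.
    static String eval(String xml, String expression) {
        try {
            Node current = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
            return XPathFactory.newInstance().newXPath().evaluate(expression, current);
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(eval(SECTION, "name(*[1])"));       // title: first child of any type
        System.out.println(eval(SECTION, "name(*[last()])"));  // note: last child of any type
        System.out.println(eval(SECTION, "para[1]"));          // p1: first para child
        System.out.println(eval(SECTION, "para[last()]"));     // p2: last para child
    }
}
```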

 

 

Putting Patterns to Work

 

Some additional syntactical details have been omitted. However, you should have the foundation to create some very powerful patterns. This will let you access and ultimately format and output virtually any object within your source tree.

 

To demonstrate in a real-world sense, let us create an XML document and an accompanying XSL style sheet that will transform our document into HTML.

 

The XSL style sheet shows how you can combine CSS style rules with XSL to format HTML. This is, in fact, how most Web developers will handle XML.

 

Listing One below presents news.xml, an XML document containing a news story.

 

Listing One:

 


<?xml version="1.0"?>

<Story>
   <SectionTitle>News&amp;Views</SectionTitle>
   <Headline>New Web Graphics Standard Emerges</Headline>
   <Deck>Vector graphics allows images to be resized,
         cropped and printed at different resolutions</Deck>
   <Dateline>March 1, 1999</Dateline>
   <Byline Email="jubutler@xyz.com">Joe Butler</Byline>

<BodyText ID="P1">
<DropCap>W</DropCap>hile XML has primarily been used for text, the World Wide
Web Consortium (W3C) released the first public working draft of the
<bold>Scalable Vector Graphics (SVG)</bold> format, which is defined in XML.
SVG is intended to be a vendor-neutral, cross-platform format for XML vector
graphics over the Web. The working draft status indicates that the W3C is
making the proposal public and openly soliciting feedback.
</BodyText>

<BodyText ID="P2">
The use of vector graphics means that Web designers will be able to reuse
images more effectively and that images can be easily resized, cropped and
printed at different resolutions. <Pullquote>Because it is defined in
XML</Pullquote>, the SVG format can be read by <italic>any</italic> existing
XML parser, and programmers and script developers will be able to access SVG
documents through any DOM API to, for example, create animations. Text within
images, such as figure captions, will be maintained as text, so it can easily
be searched by search engines. And Webmasters will be able to apply style
sheets equally well to XML text and SVG.
</BodyText>

<BodyText ID="P3">
Members of the W3C's SVG Working Group include Adobe, IBM, Apple, Microsoft,
Sun, HP, Corel, Macromedia, Netscape, and  Quark. For those interested, a
public mailing list, www-svg@w3.org, has been started. You can get more
information on SVG at <Anchor myURL="http://www.w3.org/Graphics/SVG/">www.w3.org/Graphics/SVG/</Anchor>.
</BodyText>

</Story>

   

The root element, Story, contains the other elements for this document. Elements were created for the section in which the story runs, along with elements for the title, dek (subtitle), byline, dateline, and so on. The BodyText element contains the content for the news story, and includes additional markup to create a dropcap for the leading character in the first paragraph, and some bold and italics. Since we are not interested in validating this document, no document type definition (DTD) is specified.

 

The style sheet for this document, news.xsl in Listing Two below, contains the template rules to process Listing One. When a rule maps to a source element, the rule's template is instantiated. The templates may contain literal "result" elements, character data, and instructions for creating a portion of the result tree. So, after creating the namespaces for the <xsl:stylesheet> element as detailed earlier, Listing Two creates a template rule to process the root element, Story. The root node is a special case, so Listing Two uses the / pattern to get the root element. (If you need to access the document element, you can use the pattern: /*.)

Next, the template includes some "literal" HTML that will be passed directly to the result tree. Note that this includes the CSS style rules for formatting the document.

 

After the style rules, we find an HTML <TITLE> element, which specifies a title for the document. The title comes from the SectionTitle of the XML document, so we need to process a portion of it to get this title. We will use <xsl:apply-templates>, which processes descendant nodes.

 

You specify the nodes to be processed by using a pattern in a select attribute, in this case, select="Story/SectionTitle". If no select attribute is included, then apply-templates would process the immediate children of the current node. Now, the "News&Views" title will appear in the title bar of the browser.

 

Next, Listing Two creates the Body of the HTML document, using apply-templates and a select pattern to similarly process the story title, dek (subtitle), byline, and text elements of the document. HTML SPAN and DIV elements are used to apply the CSS styles. We have included additional templates to handle specific elements within the document. There are template rules to handle paragraph formatting for the BodyText, format the dropcap, handle bold and italics, and to create a mailto anchor that's included in the author's byline.
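The approach just described can be sketched in a few lines. The example below is not the style sheet from Listing Two; it is a cut-down illustration written for the XSLT 1.0 processor in the JDK, which means it uses the 1.0 namespace and an attribute value template ({@Email}) that are assumptions relative to the draft. It shows a root rule that pulls the section title into the HTML TITLE with a select pattern, and a separate rule that turns the Byline into a mailto anchor. The class name NewsTransformDemo is illustrative.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class NewsTransformDemo {

    // Shortened version of the news story in Listing One.
    static final String NEWS =
            "<Story>"
          + "<SectionTitle>News&amp;Views</SectionTitle>"
          + "<Headline>New Web Graphics Standard Emerges</Headline>"
          + "<Byline Email=\"jubutler@xyz.com\">Joe Butler</Byline>"
          + "</Story>";

    // Hypothetical style sheet (XSLT 1.0 syntax): a root rule plus a Byline rule.
    static final String STYLESHEET =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
          + "<xsl:template match='/'>"
          + "<HTML><HEAD><TITLE><xsl:apply-templates select='Story/SectionTitle'/></TITLE></HEAD>"
          + "<BODY><H1><xsl:apply-templates select='Story/Headline'/></H1>"
          + "<xsl:apply-templates select='Story/Byline'/></BODY></HTML>"
          + "</xsl:template>"
          + "<xsl:template match='Byline'>"
          + "<A href='mailto:{@Email}'><xsl:apply-templates/></A>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    // Apply a style sheet to a source document and return the serialized result tree.
    static String transform(String stylesheet, String sourceXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter result = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(sourceXml)),
                    new StreamResult(result));
            return result.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // The TITLE comes from SectionTitle; the Byline becomes a mailto anchor.
        System.out.println(transform(STYLESHEET, NEWS));
    }
}
```

Because only the selected nodes are processed, the order of elements in the output is decided by the root template, not by the order of elements in the source document.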

 

You can test this yourself using tools from your Server Side XML workbench. You'll need to install the XML4J parser and LotusXSL (See http://www.alphaWorks.ibm.com, and installation instructions).

 

From there you can run this example from the command line.

 

Summarizing the use of patterns

 

The key to applying templates to document elements is patterns. Patterns let you access any object in your document tree. Without question, there's a lot more to template rules than covered here. For example, template rules let you create new elements, attributes and attribute sets, comments, and more.

 

Ultimately, though, XSL's contribution to Web developers is its tremendous ability to transform XML documents. By adding new style sheets for different output formats, you can create transformations for virtually any medium you desire. You can even generate custom transformations for specific browsers. That means you can create Web sites that push the envelope with new features without leaving behind users with older browsers. It also means you can render XML in any browser, even Lynx.

 

Expressions in XSLT

 

As mentioned earlier, you may want to process a set of elements with a common structure (as with database records) using XSLT. This is where expressions come in.

 

The XSLT expression language lets you select one or more elements, specify conditions for processing nodes, and generate new elements that can be inserted into the result tree.

 

The expression language provides some general purpose functions that let you, for example, determine the number of nodes in a tree fragment, get the position of the current node, and so on.

 

There are functions that support Boolean operations, and functions to manipulate strings and numbers. When an expression is evaluated, you get back an object whose type is either a string, number, boolean, node-set, or a result tree fragment. The string type refers to a Unicode character string. A boolean is represented as either a true or false value. The number type represents a floating-point real number. A node-set refers to nodes in the source tree, and result tree fragment refers to elements in the result tree.

 

Of course, an expression can simply be a pattern, as described earlier. In that case, the expression returns the set of nodes selected by the pattern. However, XSLT provides various functions that let you manipulate these different object types. For example, Table 3 presents a list of the proposed functions for handling strings.

 

Expression           Description
string()             Converts an object to a Unicode character string.
concat()             Concatenates two strings and returns the result.
contains()           Determines whether a substring is contained within a string.
starts-with()        Takes two string arguments: if the first string starts with the characters in the second string, this function returns true.
substring-before()   Returns the substring preceding a specified character.
substring-after()    Returns the substring following a specified character.
normalize()          Removes leading/trailing white space and reduces extra white space to a single space.
translate()          Translates a string of characters.
format-number()      Formats number strings.

Table 3: String Functions

 

The basic function, string(), converts an object of another type to a string. For instance, if the object was originally a number type, string() performs a conversion and returns a string in the form of a real number. If the number is negative, a negative sign (-) precedes the string. Boolean values are converted to the strings true and false. If the object is a node-set, the first node (in document order) is selected and that value is converted to a string. An empty string is returned if the set is empty. A result tree fragment is converted to a string by treating it as a single document fragment node. In all cases, the argument defaults to the current node if the argument is omitted.

 

A complementary function, number(), does the opposite of string(): It takes a string that represents a numeric value and converts it to that value as per Table 4.

 

 

Expression   Description
div          Operator that divides two numbers and returns a floating-point value.
quo          Operator that divides two numbers and returns an integer.
mod          Operator that returns the remainder of a floating-point division operation.
number()     Function that converts the value specified in its argument to a number.
sum()        Function that returns the sum of the values of the nodes in the argument node-set.
round()      Function that returns the round of a number as an integer.
floor()      Function that returns the largest integer not greater than the argument.
ceiling()    Function that returns the smallest integer not smaller than the argument.

Table 4: Floating-point functions and operators

 

If the string does not represent a number, then the function returns a value of 0. The input string may contain white space, and Boolean values are converted to 1 (true), or 0 (false). If the argument contains a node-set or a result tree fragment, it's converted to a string and then evaluated as just described. Finally, if you don't supply an input string in the argument, the current node is used.

 

Another useful function in Table 3 is concat(), which takes two strings and concatenates them. For example, let's say your application performs a database lookup and needs to add a label to one field in a record. You might use concat("Name: ", "Joe Butler"). The result would be Name: Joe Butler.

 

Another function, contains(), could be useful in searching for substrings. For example, contains("AfterHTML.com", "ML") returns true. What's unclear from the draft specification is whether comparisons are case-sensitive. A case-sensitive comparison would mean, for example, that contains("AfterHTML.com", "ml") would return false. Two related functions, substring-before() and substring-after(), are illustrated in Example 3.

 

Example 3:

 

(a) substring-before("AfterHTML.com", "HTML") returns "After"

 

(b) substring-after("AfterHTML.com", "HTML") returns ".com"
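The string functions can be exercised from Java with the XPath 1.0 engine in the JDK. Treat this as a sketch under stated assumptions: it reflects the final XPath 1.0 function library rather than the draft, so normalize() appears under its later name normalize-space(), format-number() is left out because it belongs to XSLT rather than to plain XPath, and comparisons are case-sensitive. The class name StringFunctionDemo is illustrative.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

class StringFunctionDemo {

    // Evaluate an expression against an empty document and return the result as a string.
    static String eval(String expression) {
        try {
            Document empty = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            return XPathFactory.newInstance().newXPath().evaluate(expression, empty);
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(eval("concat('Name: ', 'Joe Butler')"));             // Name: Joe Butler
        System.out.println(eval("contains('AfterHTML.com', 'ML')"));            // true
        System.out.println(eval("contains('AfterHTML.com', 'ml')"));            // false (case-sensitive)
        System.out.println(eval("starts-with('AfterHTML.com', 'After')"));      // true
        System.out.println(eval("substring-before('AfterHTML.com', 'HTML')"));  // After
        System.out.println(eval("substring-after('AfterHTML.com', 'HTML')"));   // .com
        System.out.println(eval("normalize-space('  XML   Toolbox ')"));        // XML Toolbox
        System.out.println(eval("translate('xml4j', 'xml', 'XML')"));           // XML4j
        System.out.println(eval("string(-12.50)"));                             // -12.5
    }
}
```

In this engine the string being searched always comes first and the substring second, which is the same order used in Example 3.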

 

XSLT provides several other functions and operators for handling numbers. The div operator divides two numbers and returns a floating-point number as specified by the IEEE 754 specification. The quo operator also divides two numbers, but truncates the result and returns an integer. The mod operator similarly divides two numbers, but returns the remainder as an integer. For example, 10 quo 3 returns the value 3, while 10 mod 3 returns the value 1. The sum() function takes a node-set and returns the sum of the values of the nodes in the set. round() returns an integer after the value has been rounded off. The floor() function returns an integer representing the largest number not greater than the argument value. And the ceiling() function returns the smallest integer that is not less than the argument value.
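A short Java sketch makes the numeric functions concrete. It uses the XPath 1.0 engine in the JDK, which differs from the draft in two ways worth flagging: the quo operator does not exist, so floor(10 div 3) is used in its place, and number() applied to a non-numeric string yields NaN rather than 0. The order document and the class name NumberFunctionDemo are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NumberFunctionDemo {

    // Hypothetical document with two numeric price fields.
    static final String ORDER = "<order><price>10</price><price>2.5</price></order>";

    // Evaluate an expression with the document element as the current node
    // and return the result converted to a number.
    static double number(String xml, String expression) {
        try {
            Node current = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
            return (Double) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, current, XPathConstants.NUMBER);
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(number(ORDER, "10 div 4"));         // 2.5
        System.out.println(number(ORDER, "floor(10 div 3)"));  // 3.0, standing in for "10 quo 3"
        System.out.println(number(ORDER, "10 mod 3"));         // 1.0
        System.out.println(number(ORDER, "round(2.6)"));       // 3.0
        System.out.println(number(ORDER, "floor(2.6)"));       // 2.0
        System.out.println(number(ORDER, "ceiling(2.1)"));     // 3.0
        System.out.println(number(ORDER, "sum(price)"));       // 12.5
        System.out.println(number(ORDER, "number(' 42 ')"));   // 42.0, white space is allowed
        System.out.println(number(ORDER, "number('abc')"));    // NaN in XPath 1.0
    }
}
```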

 

Booleans

 

Booleans are particularly useful when comparing two values. XSLT provides five functions, along with a set of comparison and logical operators, that let you make these comparisons, as indicated in Table 5.

 

 

Expression         Description
boolean()          Evaluates its argument value and returns a Boolean value.
not()              Returns the opposite Boolean value.
true()             Returns a value of true.
false()            Returns a value of false.
lang()             Compares the language of the context node.
=, <, >, <=, >=    Converts each operand to a number and compares the two numbers.
or                 Returns true if either operand is true.
and                Returns true only when both operands are true.

Table 5: Boolean Functions and Operators

 

The boolean() function simply evaluates its argument and converts it to a Boolean. The argument can be a number, node-set, result tree fragment, or a string. The next function, not(), negates whatever Boolean value the argument would normally return. Thus, not() returns a value of true when its argument is false. The true() function forces a true value to be returned, and likewise, false() always returns false.

 

The Boolean operators listed in Table 5 directly test the values on either side of the operator. For <, >, <=, or >=, each operand is converted to a number and then the two numbers are compared. For example, 1 < 2 returns a value of true and 2 <= 1 returns false. The = operator is treated differently depending on the argument type. Number types are treated as just described for the other operators. However, if the argument is not of type number, the operands are converted to strings and the string values are compared. The or operator evaluates each operand and converts it to a Boolean, then compares the two Boolean values. The result of the operation is true if either value is true. The and operator is evaluated similarly. However, both operands must be true for the result to return true.
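The comparison rules above are easy to verify with a minimal Java sketch. It assumes the XPath 1.0 engine in the JDK, whose rules for these operators match the description for numbers and strings. The class name BooleanDemo is illustrative.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

class BooleanDemo {

    // Evaluate an expression against an empty document and return the result as a Boolean.
    static boolean test(String expression) {
        try {
            Document empty = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            return (Boolean) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, empty, XPathConstants.BOOLEAN);
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(test("1 < 2"));               // true
        System.out.println(test("2 <= 1"));              // false
        System.out.println(test("not(2 <= 1)"));         // true
        System.out.println(test("boolean('')"));         // false: the empty string
        System.out.println(test("boolean('text')"));     // true: a non-empty string
        System.out.println(test("'1.0' = 1"));           // true: compared as numbers
        System.out.println(test("'1.0' = '1'"));         // false: compared as strings
        System.out.println(test("true() and false()"));  // false
        System.out.println(test("true() or false()"));   // true
    }
}
```

The pair of = tests shows why the operand types matter: the same characters compare equal as numbers but not as strings.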

 

XML lets you specify the language for elements using an xml:lang attribute. The lang() function examines this value for the current node and compares it to the language specified in its argument. If the xml:lang attribute was not specified for the current node, the lang() function looks up the tree for ancestors that have specified the xml:lang attribute and uses that value. If no attribute was specified, the lookup fails and the lang() function returns false.
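The inheritance behavior of lang() can be sketched as follows, again using the XPath 1.0 engine in the JDK. Two assumptions apply: the parser is switched to namespace-aware mode so that the xml:lang attribute is recognized as belonging to the XML namespace, and the test is written as a predicate so that each para element becomes the current node in turn. The sample document and the class name LangDemo are made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class LangDemo {

    // The first para inherits "en" from doc; the second overrides it with "fr".
    static final String DOC =
            "<doc xml:lang=\"en\"><para>Hello</para><para xml:lang=\"fr\">Bonjour</para></doc>";

    // Return true if the expression selects at least one node in the document.
    static boolean matches(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed so xml:lang carries its namespace
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return (Boolean) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.BOOLEAN);
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(matches(DOC, "/doc/para[1][lang('en')]")); // true, inherited from doc
        System.out.println(matches(DOC, "/doc/para[2][lang('en')]")); // false, overridden by fr
        System.out.println(matches(DOC, "/doc/para[2][lang('fr')]")); // true
        // With no xml:lang anywhere, the lookup fails and lang() is false.
        System.out.println(matches("<doc><para>Hello</para></doc>", "/doc/para[lang('en')]"));
    }
}
```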

 

Extension Functions

 

While the expression language contains features found in a programming language, it's not intended to be one. Instead, XSLT provides an extension mechanism that lets you access languages such as JavaScript, VBScript, and Java. The specification doesn't require an XSLT processor to support extensions for any particular language, so you'll want to check the documentation for your specific processor for this support.

 

Putting XSLT to Work

 

XSLT provides a number of additional features that make it easier to process elements. For example, XSLT provides a for-each element that instructs the processor to perform iterative processing. This is particularly useful when you need to process a large number of elements that have the same structure. A typical example is when you have a collection of elements that represent records in a database. Consider the XML document in Listing Three, which represents some of the tools in an XML tools database.

 

Listing Three:

 

<?xml version="1.0"?>

<productDB>

   <product>
      <name>XML Toolbox</name>
      <company>AfterHTML</company>
      <version>1.0</version>
      <price>99.95</price>
      <sys-requirements>Any Java Platform</sys-requirements>
      <description>
         XML Toolbox is a collection of tools for creating
         and processing XML documents
      </description>
   </product>

   <product>
      <name>xml4j</name>
      <company>IBM</company>
      <version>1.1.1.4</version>
      <price>Freely available</price>
      <sys-requirements>Any Java platform</sys-requirements>
      <description>
         xml4j is an XML processor that is compliant
         with the XML 1.0 working draft specification
      </description>
   </product>

   <product>
      <name>MSXML</name>
      <company>Microsoft/DataChannel</company>
      <version>N/A</version>
      <price>Freely available</price>
      <sys-requirements>Any Java platform</sys-requirements>
      <description>
         MSXML is an XML processor that is compliant
         with the XML 1.0 working draft specification
      </description>
   </product>

</productDB>

 

The document represents a database table called productDB. Each record is referenced as product. The rest of the elements represent field names. For simplicity, Listing Three presents just three records.

 

The goal of this example is to publish some of the fields from each record as a summary within an HTML table. The summary could be a hit list resulting from searching the database. In any case, we would like to transform some (but not all) of the XML record elements in Listing Three into HTML and publish each summary in a row of the table. The columns represent the product's name, version, and price, respectively. Let's further stipulate that we'd like to sort each entry in ascending order based on the product's name.

 

Listing Four presents the XSL style sheet to execute our transformation. The style sheet contains a single template rule that uses a step pattern to select the document element, productDB. Next, the template generates some preliminary HTML elements including the page TITLE, appropriate labels, and the start of the HTML table. The first row of the table contains the headings for each column.

 

Listing Four:

 

<?xml version="1.0"?>

<xsl:stylesheet xmlns:xsl="http://www.w3.org/TR/WD-xsl">

  <xsl:template match="/productDB">
    <HTML>
      <HEAD>
        <TITLE>
          XML Tools Database Search
        </TITLE>
      </HEAD>
      <BODY>
        <H1>XML Tools Database Search Results</H1>

        <TABLE Border="1" width="100%">
          <TR>
            <TD>Product Name</TD>
            <TD>Product Version</TD>
            <TD>Product Price</TD>
          </TR>
          <xsl:for-each select="product">
            <xsl:sort select="name"/>
            <TR style="color:green">

              <xsl:for-each select="name">
                <TD>
                  <xsl:apply-templates/>
                </TD>

                <xsl:for-each select="../version">
                  <TD><xsl:apply-templates/></TD>
                </xsl:for-each>

                <xsl:for-each select="../price">
                  <TD>

                    <!-- This test only works with a processor that implements XSLT
                      <xsl:if test='number()'>
                        $
                      </xsl:if>
                    -->

                    <xsl:apply-templates/></TD>
                </xsl:for-each>
              </xsl:for-each>

            </TR>
          </xsl:for-each>
        </TABLE>
      </BODY>
    </HTML>
  </xsl:template>

</xsl:stylesheet>

 

 

Next, the style sheet uses a for-each element to process each product record. Without such a construct, we'd have to write a separate transformation for every record in the database. Not only would this be tedious, but we have no way of knowing in advance how many records the search will return. The XSL processor, with the help of the for-each element, will figure that out for us.

 

Prior to doing anything else, Listing Four immediately calls <xsl:sort> to sort the product nodes in ascending order. The select attribute identifies the sort key. The sort element takes some additional attributes including order (specifies the sort order), lang (identifies the language of the sort keys), and data-type (determines the data type of the element nodes). By default, the sort order is ascending, and the data type is text, so we are leaving these out.

 

There are three additional <xsl:for-each> nested elements within the first -- one for each field we want to process. After selecting the appropriate field element, we create a column in the table and call <xsl:apply-templates> to process the element.

 

We have thrown in a curve on the last element. Some products in the database are priced at a specific dollar value, but others are available for free. We could simply enter $0 for freeware items, but we decided to write out "Freely available" in the record. The style sheet deals with this using <xsl:if> to test whether this is a numeric value (the test is commented out in Listing Four because it only works with a processor that implements XSLT). If so, it adds a dollar sign in front of the value. Otherwise, it is text and the template leaves it alone.
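For readers who want to run the iteration, sorting, and conditional steps together, the following is a simplified sketch rather than a copy of Listing Four. It assumes the XSLT 1.0 processor bundled with the JDK, so it uses the 1.0 namespace, replaces the nested for-each elements with xsl:value-of, and enables the xsl:if test on the price. The product records are trimmed to the three fields that are published, and the class name ProductTableDemo is illustrative.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class ProductTableDemo {

    // The three records of Listing Three, reduced to name, version, and price.
    static final String PRODUCTS =
            "<productDB>"
          + "<product><name>XML Toolbox</name><version>1.0</version><price>99.95</price></product>"
          + "<product><name>xml4j</name><version>1.1.1.4</version><price>Freely available</price></product>"
          + "<product><name>MSXML</name><version>N/A</version><price>Freely available</price></product>"
          + "</productDB>";

    // Hypothetical XSLT 1.0 style sheet: one table row per product, sorted by name.
    static final String STYLESHEET =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
          + "<xsl:template match='/productDB'>"
          + "<TABLE>"
          + "<xsl:for-each select='product'>"
          + "<xsl:sort select='name'/>"
          + "<TR><TD><xsl:value-of select='name'/></TD>"
          + "<TD><xsl:value-of select='version'/></TD>"
          // number(price) is NaN for text such as "Freely available", and NaN tests as false.
          + "<TD><xsl:if test='number(price)'>$</xsl:if><xsl:value-of select='price'/></TD></TR>"
          + "</xsl:for-each>"
          + "</TABLE>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    // Apply a style sheet to a source document and return the serialized result tree.
    static String transform(String stylesheet, String sourceXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter result = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(sourceXml)),
                    new StreamResult(result));
            return result.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // MSXML sorts ahead of the two names beginning with X; only 99.95 gets a dollar sign.
        System.out.println(transform(STYLESHEET, PRODUCTS));
    }
}
```

The relative order of "XML Toolbox" and "xml4j" depends on how the processor collates upper and lower case, which is one reason xsl:sort offers additional attributes beyond the sort key.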

 

Summarizing the Use of Expressions in XSLT

 

There is, of course, a great deal more to XSLT, including location paths and extensions to external languages. However, the other part of the equation, XSL, has yet to be addressed. While we have a bright and shiny new XSL working-draft specification, there's not a processor on the planet that currently supports it.