XSM – XPath-based String Manipulation

Why XSM?

XSM modifies (manipulates) XML documents by means of string operations. Its special feature is that the modifications are themselves described in XML and XPath and passed to the processor as an XSM sheet. In this respect, the XSM process works much like an XSLT process. The crucial difference is that an XML document that is copied and partly manipulated with XSLT is parsed and then rewritten. As a result, any information not contained in the DOM tree gets lost, for example the exact form of the XML declaration and the DOCTYPE declaration, or entity references that the parser expands.

If such information is required, the XML document must not be transformed with XSLT. The XSM process was developed for this case. The document is parsed in the same way as in XSLT so that the XPath expressions in the XSM sheet can be evaluated, but the manipulations themselves are performed at string level. As a result, the information mentioned above is preserved in the manipulated document.
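The loss caused by parsing and rewriting can be reproduced with any parse-and-serialize round trip. The following minimal sketch uses Python's standard xml.etree.ElementTree module as a stand-in for an XSLT processor. This is an assumption made purely for illustration: the module is not part of XSM, and the shortened sample document (SOURCE) and the round_trip helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened, namespace-free variant of the kind of document discussed here
# (hypothetical sample, not taken from the XSM distribution).
SOURCE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE html [\n'
    '<!ENTITY nbsp "&#x00A0;" >] >\n'
    '<html><body><p>Niccol\u00f2&nbsp;Machiavelli</p></body></html>\n'
)


def round_trip(source: str) -> str:
    """Parse the document into a tree and serialize the tree again."""
    root = ET.fromstring(source.encode("utf-8"))
    return ET.tostring(root, encoding="unicode")


result = round_trip(SOURCE)
print(result)
# The XML declaration and the DOCTYPE declaration are gone, and the
# entity reference &nbsp; has been expanded to the character U+00A0.
```

The tree only stores elements, attributes and text, so everything else has to be regenerated or dropped when the document is written again.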

Restrictions

Learning XSM

A small example demonstrates how an XSM sheet is built and how the XSM processor is called. To follow along, please download the XSM processor from the Download page.

The example document

This XHTML example document handles the lang and xml:lang attributes incorrectly:

1 <?xml version="1.0" encoding="UTF-8"?>
2 <!DOCTYPE html [
3 <!ENTITY nbsp "&#x00A0;" >] >
4 <html xmlns="http://www.w3.org/1999/xhtml">
5 <head>
6 <title>Citations Niccolò Machiavelli</title>
7 </head>
8 <body>
9 <h1>Citations <span lang="it">Niccolò&nbsp;Machiavelli</span></h1>
10 <p xml:lang="en">Wars begin when you will, but they do not end when you please.</p>
11 <p lang="it" xml:lang="en">Può la disciplina nella guerra più che il furore.</p>
12 </body>
13 </html>

The document violates the W3C XHTML Recommendation for the following reasons:

  1. The root element contains neither the lang attribute nor the xml:lang attribute.

  2. The <span> element in line 9 is missing the xml:lang attribute.

  3. The <p> element in line 10 has no lang attribute. Moreover, the xml:lang attribute is redundant since the language of the paragraph does not differ from the main language of the document.

  4. The <p> element in line 11 contains contradictory information in the attributes lang and xml:lang.

A simple XSM sheet will fix these problems: it adds the missing attributes to the root element and to the <span> element in line 9, deletes the xml:lang attribute of the paragraph in line 10, and corrects (replaces) the xml:lang attribute of the paragraph in line 11.

Step 1: first XSM sheet and call of the XSM processor

The root element of the XSM sheet is <xsm:manipulator> in the namespace http://www.schematron-quickfix.com/manipulator/process. In contrast to XSLT, the XSM sheet can specify the source document to be manipulated directly. The document attribute on the root element serves this purpose.

<?xml version="1.0" encoding="UTF-8"?>
<xsm:manipulator xmlns:xsm="http://www.schematron-quickfix.com/manipulator/process" document="lang.xhtml">
   [...]
</xsm:manipulator>

The document attribute refers to the lang.xhtml file introduced above.

This is already a valid XSM sheet that the XSM processor can process. Assuming the sheet is saved as lang1.xsm, the processor is called from the command line as follows:

java -jar xsm.jar lang1.xsm -o lang-result.xhtml

The processor now uses the source document referenced in lang1.xsm (lang.xhtml) and writes the result to lang-result.xhtml. If the -o option is omitted, the processor prints the result to the command line. Alternatively, a source document can be passed to the processor with -i. In this case, the source document referenced in the document attribute is ignored.
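When the processor is called from a script, the options can be assembled programmatically. The following sketch builds the argument list in Python. Only the -o and -i options and the file names are taken from the text above. The helper name build_xsm_command and the position of -i after the sheet name are assumptions and should be checked against the processor's own usage message.

```python
from typing import List, Optional


def build_xsm_command(
    sheet: str,
    output: Optional[str] = None,
    source: Optional[str] = None,
    jar: str = "xsm.jar",
) -> List[str]:
    """Assemble the command line for the XSM processor (hypothetical helper)."""
    command = ["java", "-jar", jar, sheet]
    if source is not None:
        # Assumed placement: -i overrides the document attribute of the sheet.
        command += ["-i", source]
    if output is not None:
        # Without -o the result is printed to the command line.
        command += ["-o", output]
    return command


command = build_xsm_command("lang1.xsm", output="lang-result.xhtml")
print(" ".join(command))
# To actually run it (requires Java and xsm.jar in the working directory):
#   import subprocess
#   subprocess.run(command, check=True)
```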

The lang-result.xhtml file should look as follows:

1 <?xml version="1.0" encoding="UTF-8"?>
2 <!DOCTYPE html [
3 <!ENTITY nbsp "&#x00A0;" >] >
4 <html xmlns="http://www.w3.org/1999/xhtml">
5 <head>
6 <title>Citations Niccolò Machiavelli</title>
7 </head>
8 <body>
9 <h1>Citations <span lang="it">Niccolò&nbsp;Machiavelli</span></h1>
10 <p xml:lang="en">Wars begin when you will, but they do not end when you please.</p>
11 <p lang="it" xml:lang="en">Può la disciplina nella guerra più che il furore.</p>
12 </body>
13 </html>

Since no manipulations have been defined yet, it is hardly surprising that nothing has changed. What matters here, however, is that really nothing has changed: the XML declaration and the DOCTYPE declaration are identical, and the entity reference was not resolved to the no-break space character. Such a result would not be possible with XSLT.

Step 2: deleting

The simplest manipulation is the deletion of a node. It is sufficient to specify the node to be deleted. The <xsm:delete> element has a node attribute that addresses the nodes to be deleted with an XPath expression.

<?xml version="1.0" encoding="UTF-8"?>
<xsm:manipulator xmlns:htm="http://www.w3.org/1999/xhtml" xmlns:xsm="http://www.schematron-quickfix.com/manipulator/process" document="lang.xhtml">
    <xsm:delete node="/htm:html/htm:body[1]/htm:p[1]/@xml:lang"/>
</xsm:manipulator>

Please note: If the addressed nodes belong to a namespace, prefixes must be used in the XPath expression. A default namespace cannot be defined for XPath expressions. Each prefix used must be bound to a namespace that is in scope for the respective XSM element. This example uses the htm prefix, so it is bound to a namespace on the root element.
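To illustrate what a deletion at string level can look like, the following Python sketch parses the document only to find the position of a start tag and then cuts the attribute out of the raw bytes. It is not the XSM implementation. The helper names are hypothetical, the element is selected by tag name and position instead of XPath, and the search for the end of the start tag naively assumes that no attribute value contains ">".

```python
import re
import xml.parsers.expat

# Shortened variant of the example document (the <head> element is omitted).
SOURCE = '''<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html [
<!ENTITY nbsp "&#x00A0;" >] >
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<h1>Citations <span lang="it">Niccolò&nbsp;Machiavelli</span></h1>
<p xml:lang="en">Wars begin when you will, but they do not end when you please.</p>
</body>
</html>
'''.encode("utf-8")


def start_tag_offsets(source: bytes, tag: str) -> list:
    """Parse the document only to learn where each <tag ...> start tag begins."""
    offsets = []
    parser = xml.parsers.expat.ParserCreate()  # no namespace processing

    def on_start(name, attrs):
        if name == tag:
            # Byte offset of the "<" that opens the current start tag.
            offsets.append(parser.CurrentByteIndex)

    parser.StartElementHandler = on_start
    parser.Parse(source, True)
    return offsets


def delete_attribute(source: bytes, tag: str, index: int, attr: str) -> bytes:
    """Remove attr from the index-th tag element by splicing the raw bytes."""
    start = start_tag_offsets(source, tag)[index]
    end = source.index(b">", start) + 1  # naive end of the start tag
    pattern = (
        rb"\s+" + re.escape(attr.encode("utf-8"))
        + rb"""\s*=\s*("[^"]*"|'[^']*')"""
    )
    new_tag = re.sub(pattern, b"", source[start:end], count=1)
    # Everything outside the start tag is copied byte for byte.
    return source[:start] + new_tag + source[end:]


result = delete_attribute(SOURCE, "p", 0, "xml:lang")
print(result.decode("utf-8"))
```

Because only the bytes of one start tag are touched, the XML declaration, the DOCTYPE declaration and the &nbsp; reference survive unchanged.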

Step 3: replacing

Replacing differs from deleting in that the deleted node is replaced by another node. The <xsm:replace> element therefore also has a node attribute, which specifies the node to be replaced. In addition, it has an <xsm:content> child element whose content model is <xs:any> and <xs:anyAttribute>, so it may carry any attributes and contain any nodes.

The following applies: If an attribute is to be replaced, it is replaced by the attributes of the <xsm:content> element. If any other node is to be replaced, it is replaced by all child nodes of the <xsm:content> element.

<?xml version="1.0" encoding="UTF-8"?>
<xsm:manipulator xmlns:htm="http://www.w3.org/1999/xhtml" xmlns:xsm="http://www.schematron-quickfix.com/manipulator/process" document="lang.xhtml">
    <xsm:delete node="/htm:html/htm:body[1]/htm:p[1]/@xml:lang"/>
    <xsm:replace node="/htm:html/htm:body[1]/htm:p[2]/@xml:lang">
        <xsm:content xml:lang="it"/>
    </xsm:replace>
</xsm:manipulator>

So, the xml:lang attribute with the value en is replaced by an xml:lang attribute with the value it.

Step 4: adding

Adding, in turn, differs from replacing in that no node is deleted. Instead, one or more nodes are added relative to a context node. The node attribute of the <xsm:add> element selects the context node or nodes.

Unlike the other manipulations, the add manipulation also needs to know on which (XPath) axis of the context node the nodes are to be added. The axis attribute serves this purpose. Nodes can be added as first child nodes (child), as last child nodes (last-child), on the preceding-sibling axis (preceding), on the following-sibling axis (following), and on the attribute axis (@). In the last case, the added nodes must of course be attributes.

<?xml version="1.0" encoding="UTF-8"?>
<xsm:manipulator xmlns:htm="http://www.w3.org/1999/xhtml" xmlns:xsm="http://www.schematron-quickfix.com/manipulator/process" document="lang.xhtml">
    <xsm:delete node="/htm:html/htm:body[1]/htm:p[1]/@xml:lang"/>
    <xsm:replace node="/htm:html/htm:body[1]/htm:p[2]/@xml:lang">
        <xsm:content xml:lang="it"/>
    </xsm:replace>
    <xsm:add axis="@" node="/htm:html/htm:body[1]/htm:h1[1]/htm:span[1]">
        <xsm:content xml:lang="it"/>
    </xsm:add>
    <xsm:add axis="@" node="/htm:html">
        <xsm:content xml:lang="en" lang="en"/>
    </xsm:add>
</xsm:manipulator>

In this example, attributes are added to two elements. The <span> element in line 9 gets an xml:lang attribute with the value it. The <html> root element even gets two attributes: lang and xml:lang, each with the value en.

If the completed XSM sheet is now passed to the processor, the result is as follows:
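One way to picture the axis values is as insertion positions in the source string. The following Python sketch maps each axis value to a byte offset relative to the first element with a given name and splices the new text in at that offset. It is only an interpretation of the description above, not the XSM implementation. The helpers element_span and add are hypothetical, elements are selected by tag name instead of XPath, and the code assumes a non-empty element whose attribute values contain no ">".

```python
import xml.parsers.expat


def element_span(source: bytes, tag: str):
    """Return (offset of start tag, offset of end tag) of the first tag element."""
    open_starts, spans = [], []
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        open_starts.append(parser.CurrentByteIndex)

    def on_end(name):
        start = open_starts.pop()
        if name == tag:
            spans.append((start, parser.CurrentByteIndex))

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.Parse(source, True)
    return min(spans)  # smallest start offset = first in document order


def add(source: bytes, tag: str, axis: str, new_text: bytes) -> bytes:
    """Splice new_text into source at the position implied by axis."""
    start, end_tag = element_span(source, tag)
    start_tag_close = source.index(b">", start)  # ">" of the start tag
    end_tag_close = source.index(b">", end_tag)  # ">" of the end tag
    offsets = {
        "preceding": start,               # directly before the element
        "@": start_tag_close,             # inside the start tag, before ">"
        "child": start_tag_close + 1,     # as first child
        "last-child": end_tag,            # as last child
        "following": end_tag_close + 1,   # directly after the element
    }
    at = offsets[axis]
    return source[:at] + new_text + source[at:]


DOC = '<h1>Citations <span lang="it">Niccolò</span></h1>'.encode("utf-8")

with_attribute = add(DOC, "span", "@", b' xml:lang="it"')
print(with_attribute.decode("utf-8"))
```

For the attribute axis the inserted text is the serialized attribute including its leading space. For the other axes it would be the serialized nodes.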

1 <?xml version="1.0" encoding="UTF-8"?>
2 <!DOCTYPE html [
3 <!ENTITY nbsp "&#x00A0;" >] >
4 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
5 <head>
6 <title>Citations Niccolò Machiavelli</title>
7 </head>
8 <body>
9 <h1>Citations <span lang="it" xml:lang="it">Niccolò&nbsp;Machiavelli</span></h1>
10 <p>Wars begin when you will, but they do not end when you please.</p>
11 <p lang="it" xml:lang="it">Può la disciplina nella guerra più che il furore.</p>
12 </body>
13 </html>

All manipulations have been executed as expected. Apart from these modifications, the source document remains unchanged.
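The claim that nothing else has changed can be checked mechanically with a line-by-line comparison of source and result. The following Python sketch embeds both listings from this page (without the listing line numbers) and reports the numbers of the lines that differ. The helper changed_lines is hypothetical and assumes that both documents have the same number of lines.

```python
SOURCE = '''<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html [
<!ENTITY nbsp "&#x00A0;" >] >
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Citations Niccolò Machiavelli</title>
</head>
<body>
<h1>Citations <span lang="it">Niccolò&nbsp;Machiavelli</span></h1>
<p xml:lang="en">Wars begin when you will, but they do not end when you please.</p>
<p lang="it" xml:lang="en">Può la disciplina nella guerra più che il furore.</p>
</body>
</html>
'''

RESULT = '''<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html [
<!ENTITY nbsp "&#x00A0;" >] >
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Citations Niccolò Machiavelli</title>
</head>
<body>
<h1>Citations <span lang="it" xml:lang="it">Niccolò&nbsp;Machiavelli</span></h1>
<p>Wars begin when you will, but they do not end when you please.</p>
<p lang="it" xml:lang="it">Può la disciplina nella guerra più che il furore.</p>
</body>
</html>
'''


def changed_lines(before: str, after: str) -> list:
    """Return the 1-based numbers of the lines that differ."""
    pairs = zip(before.splitlines(), after.splitlines())
    return [number for number, (old, new) in enumerate(pairs, start=1) if old != new]


print(changed_lines(SOURCE, RESULT))  # → [4, 9, 10, 11]
```

Only the four lines targeted by the XSM sheet differ. The XML declaration, the DOCTYPE declaration and all other lines are identical.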

© Copyright 2014-2018 Nico Kutscherauer (last update 2018-07-17)
