XSM – XPath-based String Manipulation

Why XSM?

The XSM process modifies XML documents by means of string operations. Its special feature is that the modifications are described with XML and XPath and passed to the processor in an XSM sheet. In this respect the XSM process works much like an XSLT process. The crucial difference is that an XML document that is copied and partly manipulated with XSLT is parsed and rewritten. As a result, any information not contained in the DOM tree is lost or changed:

The XML declaration is not preserved.
The DTD declaration is not preserved.
Entities are resolved.
Whitespace outside the root element is lost.
Whitespace within start tags or between attributes is lost.
Attributes defined by default values in the DTD are written into the output document.
CDATA sections are replaced by their escaped content.

If such information must be preserved, the XML document must not be transformed with XSLT. The XSM process was developed for this case. The document is parsed in the same way as in XSLT so that the XPath expressions given in the XSM sheet can be evaluated, but the manipulations themselves are performed at string level. As a result, all of the information listed above is preserved in the manipulated document.
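The loss of information described above is not specific to XSLT. It can be reproduced with any tool that parses a document into a tree and serializes it again. The following minimal sketch uses Python's xml.etree.ElementTree as a stand-in for such a tool (an assumption for illustration only, since XSM itself is not involved); the tiny doc element is made up.

```python
import xml.etree.ElementTree as ET

# A made-up document containing several of the constructs listed above:
# XML declaration, DTD declaration with an entity, extra whitespace in the
# start tag, an entity reference and a CDATA section.
source = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE doc [<!ENTITY nbsp "&#160;">]>\n'
    '<doc   id="a"  >x&nbsp;y<![CDATA[1 < 2]]></doc>\n'
)

# Parse into a tree and serialize the tree again, as a tree-based tool would.
root = ET.fromstring(source.encode("utf-8"))
round_trip = ET.tostring(root, encoding="unicode")

print(round_trip)
```

The serialized string no longer contains the XML declaration, the DTD declaration, the entity reference or the CDATA section, and the whitespace inside the start tag has been normalized.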

Restrictions

The main goal of the development was to keep the document well-formed despite the string operations. The functionality was therefore restricted to simple manipulations. There are only three operations: <xsm:delete>, <xsm:replace> and <xsm:add>. A wrap and an unwrap operation would also be conceivable.
Since the parts of the document to be modified are selected with XPath, only those parts that are covered by the XPath data model can be manipulated. The XML declaration, the DTD declaration, etc. are preserved but cannot be modified. This will not change in the future.
At the moment, XPath 1.0 is supported. An adapter for XPath 2.0 based on XSLT is in the planning stage.
The entire process is based on string operations. In the replace manipulation, for example, a substring of the source document is simply replaced by a substring of the XSM sheet. Only afterwards is the result tested for well-formedness. The measures for keeping the document well-formed during the process are still rudimentary. It is planned, for example, to add missing namespace declarations automatically.
Because of the string operations, XML nodes are no longer treated as objects of the DOM tree and lose their namespace context. As a result, a node may end up in a different namespace after the manipulation than the one it belonged to in the XSM sheet. The developer of the XSM sheet is responsible for the correct namespaces in the result document.
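The two points above (well-formedness is only tested after the string replacement, and inserted markup takes on the namespace context of its new position) can be illustrated with a minimal Python sketch. It is not the XSM implementation. The function replace_substring and the note element are hypothetical, and a plain substring search stands in for the offsets that an XPath match would provide.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"


def replace_substring(source, old, new):
    """Replace the first occurrence of `old` at string level.

    Returns the new text and a flag telling whether it is still well-formed.
    The well-formedness test happens only after the replacement.
    """
    start = source.index(old)  # stand-in for the offsets of an XPath match
    result = source[:start] + new + source[start + len(old):]
    try:
        ET.fromstring(result)
        return result, True
    except ET.ParseError:
        return result, False


source = (
    '<!DOCTYPE html [<!ENTITY nbsp "&#160;">]>\n'
    '<html xmlns="' + XHTML + '"><body><p>old</p></body></html>'
)

# Case 1: the fragment is written without any namespace declaration ...
result, ok = replace_substring(source, "<p>old</p>", "<note>new</note>")
# ... but after insertion it sits inside the default XHTML namespace.
inserted = ET.fromstring(result)[0][0]
print(ok, inserted.tag)

# Case 2: an unbalanced fragment is only detected after the replacement.
broken, still_ok = replace_substring(source, "<p>old</p>", "<p>new")
print(still_ok)
```

In the first case the DTD declaration is carried over untouched because everything outside the replaced substring is copied verbatim.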

An XSLT-based adapter would also be conceivable for the last two aspects.
At the moment, there is no strategy for handling conflicts. If two manipulation instructions match the same node, the last instruction in the XSM sheet is executed and all others are ignored. Unlike the template rules of XSLT, the specificity of the XPath expression plays no role here. A priority attribute would be an option but has not yet been implemented.
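The conflict rule described above ("the last instruction wins, regardless of how specific its XPath expression is") can be sketched in a few lines of Python. The data model is purely hypothetical: each instruction is represented as a pair of the matched node and the operation, listed in sheet order.

```python
def resolve_conflicts(instructions):
    """Keep only the last instruction for every matched node.

    `instructions` is a list of (matched_node, operation) pairs in the
    order in which they appear in the sheet (hypothetical representation).
    """
    chosen = {}
    for matched_node, operation in instructions:
        chosen[matched_node] = operation  # later entries overwrite earlier ones
    return chosen


instructions = [
    ("/html/body/p[1]", "replace"),  # selected by a very specific expression
    ("/html/body/p[1]", "delete"),   # selected by a broad expression, but listed last
]

print(resolve_conflicts(instructions))
```

Here the delete operation is the one that is kept for the paragraph, simply because it comes last.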

Learning XSM

A small example demonstrates how an XSM sheet is built and how the XSM processor is called. To follow along, please download the XSM processor from the Download page.

The example document

This HTML example document contains errors in its use of the lang and xml:lang attributes:

1	<?xml version="1.0" encoding="UTF-8"?>
2	<!DOCTYPE html [
3	<!ENTITY nbsp "&#160;" >] >
4	<html xmlns="http://www.w3.org/1999/xhtml">
5	<head>
6	<title>Citations Niccolò Machiavelli</title>
7	</head>
8	<body>
9	<h1>Citations <span lang="it">Niccolò Machiavelli</span></h1>
10	<p xml:lang="en">Wars begin when you will, but they do not end when you please.</p>
11	<p lang="it" xml:lang="en">Può la disciplina nella guerra più che il furore.</p>
12	</body>
13	</html>

This document violates the XHTML Recommendation of the W3C for the following reasons:

The root element contains neither the lang attribute nor the xml:lang attribute.
The element in line 9 is missing the xml:lang attribute.
The element in line 10 has no lang attribute. Moreover, the xml:lang attribute is redundant since the language of the paragraph does not differ from the main language of the document.
The element in line 11 contains contradictory information in the attributes lang and xml:lang.

A simple XSM sheet will be used to add the missing attributes to the root element and to the element in line 9, to delete the xml:lang attribute of the paragraph in line 10 and to correct (replace) the xml:lang attribute of the paragraph in line 11.

Step 1: first XSM sheet and call of the XSM processor

The root element of the XSM sheet is <xsm:manipulator> in the namespace http://www.schematron-quickfix.com/manipulator/process. In contrast to XSLT, an XSM sheet can specify the source document to be manipulated directly. For this purpose, the document attribute can be added to the root element.

<?xml version="1.0" encoding="UTF-8"?>
<xsm:manipulator xmlns:xsm="http://www.schematron-quickfix.com/manipulator/process" document="lang.xhtml">
	[...]
</xsm:manipulator>

The document attribute refers to the lang.xhtml file introduced above.

This is already a valid XSM sheet that can be processed by the XSM processor. The processor is called from the command line as follows:

java -jar xsm.jar lang1.xsm -o lang-result.xhtml

The processor now uses the source document referenced in lang1.xsm (lang.xhtml) and writes the result to lang-result.xhtml. If the -o option is omitted, the processor prints the result to the command line. Alternatively, a source document can be passed to the processor with -i. In this case, the source document referenced in the document attribute is ignored.
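If the processor is to be called from a script, the command line can be assembled as in the following Python sketch. Only the -o and -i options and the file names xsm.jar, lang1.xsm and lang-result.xhtml come from the text above. The helper xsm_command, the file name other.xhtml and the position of -i within the command are assumptions made for illustration.

```python
def xsm_command(sheet, output=None, source=None, jar="xsm.jar"):
    """Build the argument list for a call of the XSM processor.

    Assumption: -i may be placed after the sheet in the same way as -o.
    """
    command = ["java", "-jar", jar, sheet]
    if source is not None:
        command += ["-i", source]   # overrides the document attribute of the sheet
    if output is not None:
        command += ["-o", output]   # without -o the result goes to the command line
    return command


# The call shown above:
print(" ".join(xsm_command("lang1.xsm", output="lang-result.xhtml")))

# The same sheet applied to another, hypothetical source document:
print(" ".join(xsm_command("lang1.xsm", output="lang-result.xhtml", source="other.xhtml")))
```

The resulting list can be handed to subprocess.run from the Python standard library in order to actually start the processor.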

The lang-result.xhtml file should look as follows:

1	<?xml version="1.0" encoding="UTF-8"?>
2	<!DOCTYPE html [
3	<!ENTITY nbsp "&#160;" >] >
4	<html xmlns="http://www.w3.org/1999/xhtml">
5	<head>
6	<title>Citations Niccolò Machiavelli</title>
7	</head>
8	<body>
9	<h1>Citations <span lang="it">Niccolò Machiavelli</span></h1>
10	<p xml:lang="en">Wars begin when you will, but they do not end when you please.</p>
11	<p lang="it" xml:lang="en">Può la disciplina nella guerra più che il furore.</p>
12	</body>
13	</html>

Since no manipulations have been defined yet, it is hardly surprising that nothing has changed. What matters here, however, is that really nothing has changed: the XML declaration and the DTD declaration are identical, and the entity was not resolved to the non-breaking space character. Such a result would not be possible with XSLT.

Step 2: deleting

The simplest manipulation is the deletion of a node. It only requires the node to be deleted to be specified. The <xsm:delete> element has a node attribute that addresses the nodes to be deleted by means of an XPath expression.

<?xml version="1.0" encoding="UTF-8"?>
<xsm:manipulator xmlns:htm="http://www.w3.org/1999/xhtml" xmlns:xsm="http://www.schematron-quickfix.com/manipulator/process" document="lang.xhtml">
	<xsm:delete node="/htm:html/htm:body[1]/htm:p[1]/@xml:lang"/>
</xsm:manipulator>

Please note: if the addressed nodes belong to a namespace, prefixes must be used in the XPath expression. A default namespace cannot be defined for the XPath expressions. Each prefix used must be bound to a namespace that is in scope for the XSM element concerned. This example uses the htm prefix, so it is bound to a namespace on the root element.
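The need for prefixes is not an XSM peculiarity. The same behaviour can be observed in other path implementations. The following minimal Python sketch uses xml.etree.ElementTree, which supports only a subset of XPath, as a stand-in (an assumption for illustration; XSM itself is not involved). The htm prefix is the one used in the XSM sheet above.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><p xml:lang="en">Wars begin when you will.</p></body>'
    '</html>'
)

# An unprefixed name selects elements in no namespace, so nothing is found.
unprefixed = doc.find("body/p")

# With a prefix bound to the XHTML namespace, the paragraph is found.
namespaces = {"htm": "http://www.w3.org/1999/xhtml"}
prefixed = doc.find("htm:body/htm:p", namespaces)

print(unprefixed)
print(prefixed.get("{http://www.w3.org/XML/1998/namespace}lang"))
```

The prefix itself is arbitrary. Only the namespace it is bound to has to match the namespace of the source document.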

Step 3: replacing

Replacing differs from deleting in that the deleted node is replaced by another node. The <xsm:replace> element therefore also has a node attribute, which selects the node to be replaced. In addition, it has an <xsm:content> child element whose content model is <xs:any> and <xs:anyAttribute>. Consequently, <xsm:content> may carry any attributes and contain any nodes.

The following rule applies: if an attribute is to be replaced, it is replaced by the attributes of the <xsm:content> element. If any other node is to be replaced, it is replaced by all child nodes of the <xsm:content> element.

<?xml version="1.0" encoding="UTF-8"?>
<xsm:manipulator xmlns:htm="http://www.w3.org/1999/xhtml" xmlns:xsm="http://www.schematron-quickfix.com/manipulator/process" document="lang.xhtml">
	<xsm:delete node="/htm:html/htm:body[1]/htm:p[1]/@xml:lang"/>
	<xsm:replace node="/htm:html/htm:body[1]/htm:p[2]/@xml:lang">
		<xsm:content xml:lang="it"/>
	</xsm:replace>
</xsm:manipulator>

So, the xml:lang attribute with the value en is replaced by an xml:lang attribute with the value it.

Step 4: adding

Adding, in turn, differs from replacing in that no node is deleted. Instead, one or more nodes are added relative to a context node. The node attribute of the <xsm:add> element selects the context node or nodes.

Unlike the other manipulations, the add manipulation also needs to know on which (XPath) axis of the context node the nodes are to be added. The axis attribute is used for this purpose. Nodes can be added as first child nodes (child), as last child nodes (last-child), on the preceding-sibling axis (preceding), on the following-sibling axis (following) and on the attribute axis (@). In the last case, the added nodes must of course be attributes.
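At string level, each of the five axis values corresponds to one insert position relative to the start tag and end tag of the context element. The following Python sketch illustrates this mapping. It is not the XSM implementation. The function add_on_axis is hypothetical, and a regular expression stands in for the XPath match, so it only handles the first occurrence of a simple, non-nested, non-empty element whose attribute values contain no ">".

```python
import re


def add_on_axis(source, name, axis, text):
    """Insert `text` relative to the first element called `name`."""
    start_tag = re.search(rf"<{re.escape(name)}\b[^>]*>", source)
    end_tag = re.search(rf"</{re.escape(name)}\s*>", source)
    if start_tag is None or end_tag is None:
        raise ValueError("element not found")
    positions = {
        "@": start_tag.end() - 1,       # just before the '>' of the start tag
        "child": start_tag.end(),       # directly after the start tag
        "last-child": end_tag.start(),  # directly before the end tag
        "preceding": start_tag.start(), # directly before the element
        "following": end_tag.end(),     # directly after the element
    }
    at = positions[axis]
    if axis == "@":
        text = " " + text  # separate the new attribute from the previous one
    return source[:at] + text + source[at:]


source = '<body><p lang="it">text</p></body>'
for axis in ("child", "last-child", "preceding", "following"):
    print(axis, add_on_axis(source, "p", axis, "<b/>"))
print("@", add_on_axis(source, "p", "@", 'xml:lang="it"'))
```

Everything outside the insert position is copied unchanged, which is why the rest of the document keeps its exact original form.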

<?xml version="1.0" encoding="UTF-8"?>
<xsm:manipulator xmlns:htm="http://www.w3.org/1999/xhtml" xmlns:xsm="http://www.schematron-quickfix.com/manipulator/process" document="lang.xhtml">
	<xsm:delete node="/htm:html/htm:body[1]/htm:p[1]/@xml:lang"/>
	<xsm:replace node="/htm:html/htm:body[1]/htm:p[2]/@xml:lang">
		<xsm:content xml:lang="it"/>
	</xsm:replace>
	<xsm:add axis="@" node="/htm:html/htm:body[1]/htm:h1[1]/htm:span[1]">
		<xsm:content xml:lang="it"/>
	</xsm:add>
	<xsm:add axis="@" node="/htm:html">
		<xsm:content xml:lang="en" lang="en"/>
	</xsm:add>
</xsm:manipulator>

In this example, attributes are added to two elements. The element in line 9 gets an xml:lang attribute with the value it. The <html> root element gets two attributes, lang and xml:lang, each with the value en.

If the completed XSM sheet is now passed to the processor, the result is as follows:

1	<?xml version="1.0" encoding="UTF-8"?>
2	<!DOCTYPE html [
3	<!ENTITY nbsp "&#160;" >] >
4	<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
5	<head>
6	<title>Citations Niccolò Machiavelli</title>
7	</head>
8	<body>
9	<h1>Citations <span lang="it" xml:lang="it">Niccolò Machiavelli</span></h1>
10	<p>Wars begin when you will, but they do not end when you please.</p>
11	<p lang="it" xml:lang="it">Può la disciplina nella guerra più che il furore.</p>
12	</body>
13	</html>

All manipulations have been executed as expected. Apart from these modifications, the source document remains unchanged.
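The four violations listed for the example document can also be checked mechanically before and after the manipulation. The following Python sketch is one possible check under simplifying assumptions: the function language_problems is hypothetical, the two documents are abridged copies of the listings above (without XML declaration, DTD declaration and head), and the rule is reduced to "lang and xml:lang must agree, and the root element must carry both".

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def language_problems(xml_text):
    """Return the local names of elements with inconsistent language attributes."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    problems = []
    for element in root.iter():
        lang = element.get("lang")
        xml_lang = element.get(XML_LANG)
        missing_on_root = element is root and lang is None
        if lang != xml_lang or missing_on_root:
            problems.append(element.tag.split("}")[-1])
    return problems


before = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<h1>Citations <span lang="it">Niccolò Machiavelli</span></h1>
<p xml:lang="en">Wars begin when you will, but they do not end when you please.</p>
<p lang="it" xml:lang="en">Può la disciplina nella guerra più che il furore.</p>
</body>
</html>"""

after = """<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<body>
<h1>Citations <span lang="it" xml:lang="it">Niccolò Machiavelli</span></h1>
<p>Wars begin when you will, but they do not end when you please.</p>
<p lang="it" xml:lang="it">Può la disciplina nella guerra più che il furore.</p>
</body>
</html>"""

print(language_problems(before))
print(language_problems(after))
```

The check reports the root element, the span and both paragraphs for the source document and nothing for the result document.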
