Open-Source Parser For Practical XML (pXML)

Christian Neumanns
From XML to pXML

Introduction

My previous article Suggestion For a Better XML/HTML Syntax suggests a new XML/HTML syntax called practicalXML (pXML). pXML is less verbose than XML, and has other advantages.

In this article I introduce a parser for pXML. The parser is written in Java, open-sourced under the MIT license, and the source code is available on GitHub. Examples used in this article are on GitHub too.

More information about pXML can be found on its website.

pXML Syntax Refresh

If you have never heard of pXML you might want to read Suggestion For a Better XML/HTML Syntax first. That article introduces pXML and explains its rationale.

Here is a copy of chapter Syntax Comparison:

Empty element:

XML:  <br />
pXML: [br]

Element with text content:

XML:  <summary>text</summary>
pXML: [summary text]

Element with child elements:

XML:  <ul>
          <li>
              <div>A <i>friendly</i> dog</div>
          </li>
      </ul>

pXML: [ul
          [li
              [div A [i friendly] dog]
          ]
      ]

Attributes:

XML:  <div id="unplug_warning" class="warning big-text">Unplug power cord before opening!</div>
pXML: [div (id=unplug_warning class="warning big-text")Unplug power cord before opening!]

Escaping:

XML:  <note>Watch out for &lt;, &gt;, &quot;, &apos;, &amp;, [, ], and \ characters</note>
pXML: [note Watch out for <, >, ", ', &, \[, \], and \\ characters]    

Comments:

Single comment:
XML:  <!-- text -->
pXML: [- text -]

Nested comments:
XML:  not supported
pXML: [- text [- nested -] -]
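
The escaping rules in the comparison above can be made concrete with a small Java sketch. It is not taken from the parser's source code: the class and method names are made up for illustration, and it only covers the characters listed in the "Escaping" example.

```java
// Hypothetical helper, not part of the pXML library: escapes plain text
// according to the rules listed in the "Escaping" comparison above.
final class EscapeSketch {

    // pXML text: only [, ] and \ are prefixed with a backslash.
    static String escapePXMLText ( String text ) {
        StringBuilder sb = new StringBuilder();
        for ( char c : text.toCharArray() ) {
            if ( c == '[' || c == ']' || c == '\\' ) sb.append ( '\\' );
            sb.append ( c );
        }
        return sb.toString();
    }

    // XML text: the five predefined entities are used instead.
    static String escapeXMLText ( String text ) {
        StringBuilder sb = new StringBuilder();
        for ( char c : text.toCharArray() ) {
            switch ( c ) {
                case '<':  sb.append ( "&lt;" );   break;
                case '>':  sb.append ( "&gt;" );   break;
                case '&':  sb.append ( "&amp;" );  break;
                case '"':  sb.append ( "&quot;" ); break;
                case '\'': sb.append ( "&apos;" ); break;
                default:   sb.append ( c );
            }
        }
        return sb.toString();
    }

    public static void main ( String[] args ) {
        String text = "a < b [x] \\ & \"q\"";
        System.out.println ( escapePXMLText ( text ) );
        System.out.println ( escapeXMLText ( text ) );
    }
}
```

Text that is full of <, > and & characters (source code snippets, for example) passes through the pXML variant almost unchanged, which is one reason the pXML form tends to be easier to read.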

Usage Examples

Before explaining how the parser is implemented, let's first see what you can do with the parser, by looking at some high-level usage examples. The aim is to show how XML technology can be used with pXML.

From pXML to XML And Back Again

To honor the 'p' in pXML ('p' stands for practical), we obviously need to be able to convert pXML to XML, and XML to pXML. This chapter shows examples of how to do this.

Hello World

From pXML to XML

The code below illustrates the simplest possible pXML document - an empty root element with name hello:

[hello]

To convert pXML to XML, there is a utility class PXMLToXMLConverter in package dev.pxml.core.utilities. This class contains method pXMLFileToXMLFile with the following signature:

public static void pXMLFileToXMLFile ( @NotNull File pXMLFile, @NotNull File XMLFile ) throws Exception

The method is overloaded. The input parameters can be of type File (as shown above), Path or String.

Suppose the above pXML [hello] code is stored in file hello.pxml. The following instruction converts hello.pxml to hello.xml:

pXMLFileToXMLFile ( "hello.pxml", "hello.xml" );

As expected, the resulting file hello.xml contains the following code:

<?xml version="1.0" encoding="UTF-8"?>
<hello />

A complete test suite with all source code examples used in this article is available in a GitHub repo. That repo uses the Gradle build tool.

The parser's Java API documentation is available on pXML's website.

If you want to try out the above example in your own environment, you can proceed as follows:

  • If not done already, install Java version 11 or later.

  • Create a Java application with the tool of your choice (e.g. Gradle, IntelliJ IDEA, Eclipse), or just with raw Java.

  • Visit pXML's downloads page, download the latest .jar file, and add it as a dependency to your Java project.

  • Adapt the main class so that it contains the following code:

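    package tests.pxml.hello;
    
    import static dev.pxml.core.utilities.PXMLToXMLConverter.*;
    
    // The class name is arbitrary; use the name of your own main class.
    public class Hello {
    
        public static void main ( String[] args ) {
    
            try {
                // Convert the pXML input file to an XML output file.
                pXMLFileToXMLFile ( "input/hello.pxml", "output/hello.xml" );
            } catch ( Exception e ) {
                e.printStackTrace();
            }
        }
    }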
    Note

    Adapt tests.pxml.hello, as well as the paths of the two files if necessary. Absolute and relative file paths are accepted. Relative file paths are relative to your working directory.

  • Create file input/hello.pxml with [hello] as content.

  • Create directory output.

  • Execute the application.

  • Open the resulting file output/hello.xml in your editor to verify its content.

From XML to pXML

Converting from XML to pXML is easy too. It's done with method XMLFileToPXMLFile in class dev.pxml.core.utilities.XMLToPXMLConverter. Hence, the following two Java statements are required to convert an XML file into a pXML file:

import static dev.pxml.core.utilities.XMLToPXMLConverter.*;

XMLFileToPXMLFile ( "input/hello.xml", "output/hello.pxml" );

Executing this code converts file input/hello.xml with this content:

<?xml version="1.0" encoding="UTF-8"?>
<hello />

... into output/hello.pxml with the following pXML code:

[hello]

Any Reader/Writer

As we have seen, methods pXMLFileToXMLFile and XMLFileToPXMLFile accept file paths as input/output arguments. If we want to read/write XML/pXML documents from/to other sources like URLs, strings, etc., we can:

  • Use PXMLToXMLConverter.pipePXMLReaderToXMLWriter to read any pXML source (URL, File, String, etc.) and write to any XML destination (URL, File, String, etc.). For example we could read pXML code from a URL and store the resulting XML code as a string.

    This is possible because pipePXMLReaderToXMLWriter takes a standard java.io.Reader to read pXML, and a java.io.Writer to write XML.

  • Analogously, XMLToPXMLConverter.pipeXMLReaderToPXMLWriter can be used to read any XML source and write the result to any pXML destination.
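
The Reader/Writer wiring itself only needs standard JDK classes. The following is a minimal sketch of converting a string to a string through any Reader-to-Writer conversion. The Pipe interface and the class name are hypothetical and exist only to keep the example self-contained; the placeholder lambda simply copies its input. With the pXML library on the classpath, the lambda body would presumably call pipePXMLReaderToXMLWriter instead, but its exact parameter list is not shown in this article and should be checked in the API documentation.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

final class ReaderWriterSketch {

    // Hypothetical stand-in for any conversion that reads from a Reader
    // and writes to a Writer.
    interface Pipe {
        void run ( Reader in, Writer out ) throws Exception;
    }

    // Feeds a string into the pipe and collects the output as a string.
    static String convertString ( String input, Pipe pipe ) throws Exception {
        try ( Reader reader = new StringReader ( input );
              StringWriter writer = new StringWriter() ) {
            pipe.run ( reader, writer );
            return writer.toString();
        }
    }

    public static void main ( String[] args ) throws Exception {
        // Placeholder pipe: copies the input unchanged (Reader.transferTo, Java 10+).
        // A real converter call would go here instead.
        String result = convertString ( "[hello]", ( in, out ) -> in.transferTo ( out ) );
        System.out.println ( result );
    }
}
```

The same pattern works for other sources and destinations: replacing StringReader with a FileReader or an InputStreamReader on a URL stream changes where the data comes from without touching the conversion code.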

Login Form

Let's create a more useful example showing some commonly used XML features.

We will convert pXML code to XML, and then convert the resulting XML back to pXML. If everything works fine, the final pXML code must be the same as the initial one.

From pXML to XML

Here is a pXML document using nested elements, attributes, comments, and namespaces:

[form
    [title Login]
    [note Characters \[, \], < and > are not allowed]
    [fields
        [- Two text fields: user and password -]
        [text_entry (id=user) User]
        [text_entry (id=password) Password]
    ]
    [buttons
        [button (type=submit) Ok]
        [button (type=cancel color="light red") Cancel]
    ]

    [ch:checks (xmlns:ch="http://www.example.com")
        [ch:check user.size >= 2]
        [ch:check password.size >= 8]
    ]
]
File input/login_form.pxml

As seen before, we can convert this file to output/login_form.xml with:

pXMLFileToXMLFile ( "input/login_form.pxml", "output/login_form.xml" );

After executing the above statement, the content of output/login_form.xml is:

<?xml version="1.0" encoding="UTF-8"?>
<form>
    <title>Login</title>
    <note>Characters [, ], &lt; and &gt; are not allowed</note>
    <fields>
        <!-- Two text fields: user and password -->
        <text_entry id="user">User</text_entry>
        <text_entry id="password">Password</text_entry>
    </fields>
    <buttons>
        <button type="submit">Ok</button>
        <button type="cancel" color="light red">Cancel</button>
    </buttons>

    <ch:checks xmlns:ch="http://www.example.com">
        <ch:check>user.size &gt;= 2</ch:check>
        <ch:check>password.size &gt;= 8</ch:check>
    </ch:checks>
</form>
File output/login_form.xml

The following syntax differences can be observed:

  • pXML: [title Login]
    XML:  <title>Login</title>

    This illustrates the most important difference between pXML and XML, as explained in Suggestion For a Better XML/HTML Syntax.

  • pXML: [note Characters \[, \], < and > are not allowed]
    XML:  <note>Characters [, ], &lt; and &gt; are not allowed</note>

    Here we can see how the escape rules of both dialects are applied during the conversion. pXML uses \ as escape character (like most programming languages), while XML uses entities.

  • pXML: [- Two text fields: user and password -]
    XML:  <!-- Two text fields: user and password -->

    Example of converting a comment.

  • pXML: [text_entry (id=user) User]
    XML:  <text_entry id="user">User</text_entry>

    Example of converting an attribute.

    Note the space after ) in the pXML code, which does not appear in the resulting XML. The pXML parser allows an optional space after ) which is ignored. This allows us to write:

    [text_entry (id=user) User]

    ... instead of:

    [text_entry (id=user)User]

    ... which is a bit less readable (but still valid pXML code).

    Writing:

    [text_entry(id=user)User]

    ... would also be parsed correctly.

  • pXML: [ch:checks (xmlns:ch="http://www.example.com")
              [ch:check user.size >= 2]
    
    XML:  <ch:checks xmlns:ch="http://www.example.com">
              <ch:check>user.size &gt;= 2</ch:check>

    XML namespaces are supported in the pXML parser.

From XML to pXML

After copying the result file output/login_form.xml to input/login_form.xml we can convert back from XML to pXML with:

XMLFileToPXMLFile ( "input/login_form.xml", "output/login_form.pxml" );

Here is the content of output/login_form.pxml:

[form 
    [title Login]
    [note Characters \[, \], < and > are not allowed]
    [fields 
        [- Two text fields: user and password -]
        [text_entry (id="user") User]
        [text_entry (id="password") Password]
    ]
    [buttons 
        [button (type="submit") Ok]
        [button (color="light red" type="cancel") Cancel]
    ]

    [ch:checks (xmlns:ch="http://www.example.com") 
        [ch:check user.size >= 2]
        [ch:check password.size >= 8]
    ]
]
File output/login_form.pxml

As we can see, the content is equivalent to the content of our initial file input/login_form.pxml.

However, there are two small syntax differences, neither of which changes the data stored in the files. First, in the new file quotes are always used to surround attribute values, even if they could be omitted (e.g. id="user" instead of id=user). The reason is that, by default, the pXML writer used in this example always encloses attribute values in quotes. It does not check whether the value may be written without quotes, as that would reduce performance. In a future version of the writer, a parameter could be added to tell the writer to omit quotes if possible. Second, the attributes of the Cancel button appear in a different order (color before type). This is harmless, because the order of attributes is not significant in XML.

XML Technology Used With pXML

The most powerful feature of the pXML parser is its ability to read a pXML document into a standard org.w3c.dom.Document Java object.

Since we have a Java Document object in memory we can use the whole set of XML extensions supported natively in Java or provided by third party libraries and frameworks. For example, we can:

  • validate a document with XML Schema (W3C), RELAX NG, or Schematron

  • programmatically traverse the document and extract data

  • insert, modify, and delete elements and attributes, and save the result as a new XML or pXML document

  • query the document (search for values, compute aggregates, etc.) with XQuery/XPath

  • convert the document using an XSL transformer (e.g. create a differently structured XML or pXML document, create a plain text document, etc.)

We cannot cover everything in a single article, so let's just have a look at some examples to see how it works.

Loading/Saving a 'Document'

The key to using XML technology with pXML is method pXMLToXMLDocument in class PXMLToXMLConverter. This method reads a pXML document from any source (file, URL, string, etc.), and loads it into a standard Java org.w3c.dom.Document object. The method's signature is:

public static Document pXMLToXMLDocument (
    @NotNull Reader pXMLReader, Object pXMLResource ) throws Exception

As shown, this method uses a Java Reader to read pXML code, and returns a Document object. Input argument pXMLResource is just an optional argument used to include the resource's name in error messages (e.g. "Error in file foo/bar.pxml").

If anything goes wrong, an exception is thrown.

Once the data is loaded, we can do everything we could do with an XML document: validate, query, modify, transform, etc.

The counterpart to method pXMLToXMLDocument is XMLDocumentToPXML in class XMLToPXMLConverter. The method is defined as:

public static void XMLDocumentToPXML (
    @NotNull Document XMLDocument, @NotNull Writer pXMLWriter ) throws Exception

The method reads a standard Java Document object and writes the pXML data to any Java Writer (e.g. FileWriter, StringWriter, etc.).
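
To give an idea of what "everything we could do with an XML document" means in practice, here is a minimal sketch that queries a Document with the JDK's built-in XPath support. To keep it runnable without the pXML library, the Document is built from XML text with the JDK's DocumentBuilder; this is only a stand-in, and with pXML the Document would come from pXMLToXMLDocument instead. The class and method names are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class DocumentQuerySketch {

    // Stand-in for pXMLToXMLDocument: builds the Document from XML text.
    static Document loadXML ( String xml ) throws Exception {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse ( new InputSource ( new StringReader ( xml ) ) );
    }

    // Plain DOM access: counts all 'book' elements in the document.
    static int countBooks ( Document doc ) {
        return doc.getElementsByTagName ( "book" ).getLength();
    }

    // XPath query: sums the prices of all books.
    static double totalPrice ( Document doc ) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (Double) xpath.evaluate (
            "sum(/books/book/price)", doc, XPathConstants.NUMBER );
    }

    public static void main ( String[] args ) throws Exception {
        Document doc = loadXML (
            "<books>"
          + "<book><title>Effective Java</title><price>44.10</price></book>"
          + "<book><title>Code Complete</title><price>45.32</price></book>"
          + "</books>" );
        System.out.println ( countBooks ( doc ) + " books, total price " + totalPrice ( doc ) );
    }
}
```

Nothing in countBooks or totalPrice depends on how the Document was created, which is the point: code written for XML documents can be reused for pXML documents once they are loaded into a Document.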

Validation

A common way to validate XML data is to use an XML Schema. An XML schema is itself an XML document containing rules that must be respected by the XML data document.

Here is a simple example of an XML document defining a list of books:

<?xml version="1.0" encoding="UTF-8"?>
<books>
    <book>
        <isbn>978-0135957059</isbn>
        <title>The Pragmatic Programmer: Your Journey to Mastery</title>
        <price>41.41</price>
    </book>
    <book>
        <isbn>978-0735619678</isbn>
        <title>Code Complete: A Practical Handbook of Software Construction</title>
        <price>45.32</price>
    </book>
    <book>
        <isbn>978-0134685991</isbn>
        <title>Effective Java</title>
        <price>44.10</price>
    </book>
</books>
File input/books.xml

The same data, defined with pXML looks like this:

[books
    [book
        [isbn 978-0135957059]
        [title The Pragmatic Programmer: Your Journey to Mastery]
        [price 41.41]
    ]
    [book
        [isbn 978-0735619678]
        [title Code Complete: A Practical Handbook of Software Construction]
        [price 45.32]
    ]
    [book
        [isbn 978-0134685991]
        [title Effective Java]
        [price 44.10]
    ]
]
File input/books.pxml

The above XML can be validated with this XML schema:

<?xml version="1.0" encoding="UTF-8"?>

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="books">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="book" type="booktype" minOccurs="1" maxOccurs="unbounded"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>

    <xs:complexType name="booktype">
        <xs:sequence>
            <xs:element name="isbn" type="xs:string"/>
            <xs:element name="title" type="xs:string"/>
            <xs:element name="price" type="xs:decimal"/>
        </xs:sequence>
    </xs:complexType>

</xs:schema>
File input/books.xsd

Because an XML Schema is itself a standard XML document, we can define the schema with pXML too, like this:

[xs:schema (xmlns:xs=http://www.w3.org/2001/XMLSchema)

    [xs:element (name=books)
        [xs:complexType
            [xs:sequence
                [xs:element (name=book type=booktype minOccurs=1 maxOccurs=unbounded)]
            ]
        ]
    ]

    [xs:complexType (name=booktype)
        [xs:sequence
            [xs:element (name=isbn type=xs:string)]
            [xs:element (name=title type=xs:string)]
            [xs:element (name=price type=xs:decimal)]
        ]
    ]
]
File input/books.pxsd

Hence, there are four possible combinations to validate data:

Data Format    Schema Format
XML            XML
XML            pXML
pXML           XML
pXML           pXML

An example of each combination is included in the examples repo.

Class dev.pxml.core.utilities.XMLSchemaValidator provides static methods to validate data. For example, validating pXML data with a pXML schema document (e.g. validate books.pxml with books.pxsd) can be done with the following one-liner:

XMLSchemaValidator.validatePXMLFileWithPXMLSchemaFile (
    new File ( "input/books.pxml" ),
    new File ( "input/books.pxsd" ) );

An exception is thrown if the data is invalid. For example, if a book uses tag ibn instead of isbn, the following error is reported:

Invalid content was found starting with element 'ibn'. One of '{isbn}' is expected.
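
For readers curious about what such a validation involves at the Document level, the following sketch validates one Document against another using only the JDK's javax.xml.validation API. It is an illustration under assumptions, not the implementation of XMLSchemaValidator: the class name, the helper methods, and the shortened schema are made up, and the Documents are parsed from XML text as a stand-in for Documents loaded from pXML.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class SchemaValidationSketch {

    // Shortened version of the books schema (isbn and title only).
    static final String SCHEMA =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='books'><xs:complexType><xs:sequence>"
      + "<xs:element name='book' maxOccurs='unbounded'><xs:complexType><xs:sequence>"
      + "<xs:element name='isbn' type='xs:string'/>"
      + "<xs:element name='title' type='xs:string'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    // Stand-in for loading a Document from pXML.
    static Document parse ( String xml ) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware ( true );   // needed for schema processing
        return factory.newDocumentBuilder()
            .parse ( new InputSource ( new StringReader ( xml ) ) );
    }

    // Returns null if the data is valid, otherwise the validator's error message.
    static String validate ( String dataXML, String schemaXML ) throws Exception {
        Schema schema = SchemaFactory
            .newInstance ( XMLConstants.W3C_XML_SCHEMA_NS_URI )
            .newSchema ( new DOMSource ( parse ( schemaXML ) ) );
        Document data = parse ( dataXML );
        try {
            schema.newValidator().validate ( new DOMSource ( data ) );
            return null;
        } catch ( SAXException e ) {
            return e.getMessage();
        }
    }

    public static void main ( String[] args ) throws Exception {
        String valid   = "<books><book><isbn>1</isbn><title>T</title></book></books>";
        String invalid = "<books><book><ibn>1</ibn><title>T</title></book></books>";
        System.out.println ( validate ( valid, SCHEMA ) );
        System.out.println ( validate ( invalid, SCHEMA ) );
    }
}
```

Because both the data and the schema are passed as DOMSource objects, it makes no difference whether either of them was originally written in XML or in pXML.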

Transformation

XML transformation is another very useful XML feature. It is used to transform an XML document to another document. The output document can be another XML document, an HTML document, or any other plain text document. The transformation process is described with a transformation language. The most popular transformation language is XSLT; an XSLT stylesheet is itself an XML document.

For example, let's re-use the books data from the previous 'validation' example.

[books
    [book
        [isbn 978-0135957059]
        [title The Pragmatic Programmer: Your Journey to Mastery]
        [price 41.41]
    ]
    [book
        [isbn 978-0735619678]
        [title Code Complete: A Practical Handbook of Software Construction]
        [price 45.32]
    ]
    [book
        [isbn 978-0134685991]
        [title Effective Java]
        [price 44.10]
    ]
]
File input/books.pxml

Now we want to create an HTML document that displays the books in a table. We could use the following XSLT document, written in XML:

<?xml version="1.0" encoding="UTF-8"?>

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

<xsl:output method="html" />

<xsl:template match="/">
<html>
    <head>
        <title>Books</title>
        <style>
            table, th, td {
                border: 1px solid #ddd;
                border-collapse: collapse;
            }
            th, td {
                padding: 0.5em;
            }
        </style>
    </head>

    <body>
        <h2>Books</h2>
        <table>
            <tr><th>ISBN</th><th>Title</th><th>Price</th></tr>
            <xsl:for-each select="books/book">
                <tr>
                    <td><xsl:value-of select="isbn"/></td>
                    <td><xsl:value-of select="title"/></td>
                    <td><xsl:value-of select="price"/></td>
                </tr>
            </xsl:for-each>
        </table>
    </body>
</html>
</xsl:template>

</xsl:stylesheet>
File input/books.html.xslt

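Because an XSLT document is itself an XML document, we can define it in pXML too. Note that in this version the HTML tags are plain text rather than elements, since < and > need no escaping in pXML; accordingly, the output method is text instead of html: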

[xsl:stylesheet (xmlns:xsl=http://www.w3.org/1999/XSL/Transform version=1.0)

[xsl:output (method=text)]

[xsl:template (match=/)
<html>
    <head>
        <title>Books</title>
        <style>
            table, th, td {
                border: 1px solid #ddd;
                border-collapse: collapse;
            }
            th, td {
                padding: 0.5em;
            }
        </style>
    </head>

    <body>
        <h2>Books</h2>
        <table>
            <tr><th>ISBN</th><th>Title</th><th>Price</th></tr>
            [xsl:for-each (select=books/book)
                <tr>
                    <td>[xsl:value-of (select=isbn)]</td>
                    <td>[xsl:value-of (select=title)]</td>
                    <td>[xsl:value-of (select=price)]</td>
                </tr>
            ]
        </table>
    </body>
</html>
]
]
File input/books.html.pxslt

Class dev.pxml.core.utilities.XSLTTransformer provides static methods to transform data. For example, we can transform the above pXML books data with the above pXML XSLT document like this:

XSLTTransformer.transformPXMLFileWithPXMLXSLTFile (
    new File ( "input/books.pxml" ),
    new File ( "input/books.html.pxslt" ),
    new File ( "output/books.html" ) );

Executing the above statement creates file output/books.html with this content:


<html>
    <head>
        <title>Books</title>
        <style>
            table, th, td {
                border: 1px solid #ddd;
                border-collapse: collapse;
            }
            th, td {
                padding: 0.5em;
            }
        </style>
    </head>

    <body>
        <h2>Books</h2>
        <table>
            <tr><th>ISBN</th><th>Title</th><th>Price</th></tr>
            
                <tr>
                    <td>978-0135957059</td>
                    <td>The Pragmatic Programmer: Your Journey to Mastery</td>
                    <td>41.41</td>
                </tr>
            
                <tr>
                    <td>978-0735619678</td>
                    <td>Code Complete: A Practical Handbook of Software Construction</td>
                    <td>45.32</td>
                </tr>
            
                <tr>
                    <td>978-0134685991</td>
                    <td>Effective Java</td>
                    <td>44.10</td>
                </tr>
            
        </table>
    </body>
</html>

The result in a web browser looks like this:

Figure: Book table in browser
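
At the Document level, a transformation like the one above can be sketched with the JDK's javax.xml.transform API. The code below is not the implementation of XSLTTransformer; the class name, helper methods, and the tiny stylesheet are assumptions chosen to keep the example short. Both Documents are parsed from XML text here, as a stand-in for Documents loaded from pXML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class TransformSketch {

    // Text-output stylesheet that emits "title=price;" for each book.
    static final String STYLESHEET =
        "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'>"
      + "<xsl:for-each select='books/book'>"
      + "<xsl:value-of select='title'/>=<xsl:value-of select='price'/>;"
      + "</xsl:for-each>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Stand-in for loading a Document from pXML.
    static Document parse ( String xml ) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware ( true );   // XSLT elements live in a namespace
        return factory.newDocumentBuilder()
            .parse ( new InputSource ( new StringReader ( xml ) ) );
    }

    // Applies the stylesheet Document to the data Document.
    static String transform ( Document data, Document stylesheet ) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer ( new DOMSource ( stylesheet ) );
        StringWriter out = new StringWriter();
        transformer.transform ( new DOMSource ( data ), new StreamResult ( out ) );
        return out.toString();
    }

    public static void main ( String[] args ) throws Exception {
        Document data = parse (
            "<books>"
          + "<book><title>Effective Java</title><price>44.10</price></book>"
          + "<book><title>Code Complete</title><price>45.32</price></book>"
          + "</books>" );
        System.out.println ( transform ( data, parse ( STYLESHEET ) ) );
    }
}
```

The result is written to a java.io.Writer wrapped in a StreamResult, so the output can go to a string, a file, or any other destination.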

Using XML Technology With PML

Chapter pXML Predecessor in a previous article explains that the pXML syntax originated from the Practical Markup Language (PML). PML is a markup language designed to create web articles and books.

Now we can say that PML uses the pXML syntax. It also supports lenient parsing, but internally the AST is stored in pXML format. In the future, the pXML parser described in this article will be used in PML. Hence, all XML technology illustrated in the previous chapter can then be used with PML documents.

For example, one could:

  • use XQuery to extract all links in a PML document

  • use an XML transformer to save all links in a CSV file that can be read by a tool (written in any language) to check for broken links.

  • create filters that consume the AST created by the PML parser, and then transform the AST (add/remove/change nodes) before letting PML produce the HTML output.

It's easy to imagine all kinds of useful PML extensions users will be able to create and share.

Parser (Reader)

The preceding chapter showed what we can do with the pXML parser. Now we'll dive deeper and see how it works, and how you can use and customize the parser for your own specific needs.

Note

The parser's source code is available on GitHub, and its API is documented on pXML's website.

Event-Based

The parser is event-based. It reads a pXML document and generates a stream of events. The parser itself doesn't do anything with the parsed data. Each type of event (e.g. onNodeStart, onNodeEnd, etc.) is handled by a callback function. All callback functions are part of an events handler object. Before parsing, the client code must pass an events handler object to the parser. The events handler is an interface containing one callback function for each type of event. It is defined as follows:

package dev.pxml.core.reader.parser.eventHandler;

import dev.pxml.core.data.node.PXMLNode;
import dev.pxml.core.reader.reader.TextLocation;
import dev.pxml.core.utilities.annotations.NotNull;

public interface IParserEventsHandler<N, R> {

    void onStart() throws Exception;
    void onStop() throws Exception;

    N onRootNodeStart ( @NotNull PXMLNode rootNode ) throws Exception;
    void onRootNodeEnd ( N rootNode ) throws Exception;

    N onNodeStart ( @NotNull PXMLNode node, @NotNull N parentNode ) throws Exception;
    void onNodeEnd ( N node ) throws Exception;

    void onText ( @NotNull String text, @NotNull N parentNode, TextLocation location ) throws Exception;

    // the delimiters [- and -] are included in the comment text
    void onComment ( String comment, @NotNull N parentNode, TextLocation location ) throws Exception;

    R getResult() throws Exception;
}

Type parameter N defines the type of the nodes generated by this events handler. Type parameter R defines the type of the final result created when parsing is terminated.

The following implementations of IParserEventsHandler are included in the core library:

  • CreateDOM_ParserEventHandler

    This handler creates a standard Java org.w3c.dom.Document object. It's the handler used in the previous chapter when we validated or transformed a pXML document.

  • CreateAST_ParserEventHandler

    Besides creating a Document object, we can create a pXML specific AST with this handler. The end result is a PXMLNode.

  • WriteXML_ParserEventHandler

    If we just need to convert pXML to XML then the most efficient way is to use this handler. Instead of loading the whole pXML document into an internal tree structure, each item (name, attribute, text, etc.) is immediately written to a Java Writer, as soon as it is parsed. Hence, very large documents can be converted quickly and without eating up internal memory.

  • Logger_ParserEventHandler

    This is a utility handler that writes logging data to a Java Writer (by default, standard output). It can be used for debugging purposes.

  • DoNothing_ParserEventHandler

    As the name suggests, this handler doesn't do anything. It's useful in these cases:

    • We just want to know if an error is reported by the parser (e.g. malformed pXML document)

    • We don't want to handle all events. In that case we can create an events handler that inherits from this one, and overrides the functions we care about.

  • Timer_ParserEventHandler

    This events handler inherits from DoNothing_ParserEventHandler, and overrides functions onStart and onStop to measure the total parsing time.

Customized Parsing

If none of the above handlers suits your needs, you can create your own customized events handler by creating a class that implements IParserEventsHandler and passing it to a parser that implements AEventStreamParser. To get started you can have a look at the implementations mentioned in the previous chapter.

A parser uses an ITokenizer to read pXML tokens (name, text, comment, etc.). For maximum customization, you could provide your own tokenizer and/or parser and use it with pXML's core library.
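
The general shape of a custom handler can be sketched without the library. The interface below is a simplified, hypothetical stand-in for IParserEventsHandler (node names are plain strings instead of PXMLNode objects, and several callbacks are left out), and the events are fired by hand instead of by a parser. A real handler would implement the full interface shown earlier.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class HandlerSketch {

    // Simplified, hypothetical stand-in for IParserEventsHandler<N, R>.
    interface EventsHandler<N, R> {
        N onNodeStart ( String name, N parentNode );
        void onNodeEnd ( N node );
        void onText ( String text, N parentNode );
        R getResult();
    }

    // Counts how often each node name occurs.
    // N is the node's depth, R is the map of counts.
    static final class NameCounter
        implements EventsHandler<Integer, Map<String, Integer>> {

        private final Map<String, Integer> counts = new LinkedHashMap<>();

        public Integer onNodeStart ( String name, Integer parentDepth ) {
            counts.merge ( name, 1, Integer::sum );
            return parentDepth == null ? 0 : parentDepth + 1;
        }

        public void onNodeEnd ( Integer depth ) {}

        public void onText ( String text, Integer parentDepth ) {}

        public Map<String, Integer> getResult() { return counts; }
    }

    // Fires the events a parser might produce for: [ul [li A] [li B]]
    static <N, R> R simulateParse ( EventsHandler<N, R> handler ) {
        N ul = handler.onNodeStart ( "ul", null );
        N li1 = handler.onNodeStart ( "li", ul );
        handler.onText ( "A", li1 );
        handler.onNodeEnd ( li1 );
        N li2 = handler.onNodeStart ( "li", ul );
        handler.onText ( "B", li2 );
        handler.onNodeEnd ( li2 );
        handler.onNodeEnd ( ul );
        return handler.getResult();
    }

    public static void main ( String[] args ) {
        System.out.println ( simulateParse ( new NameCounter() ) );  // prints {ul=1, li=2}
    }
}
```

The two type parameters play the same roles as in the real interface: the value returned by onNodeStart is handed back as the parent of nested nodes, and getResult delivers whatever the handler has accumulated once parsing is done.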

Parser Properties And Features

  • The parser is in a proof-of-concept state and not yet ready to be used in production.

  • Written in Java.

  • Free and open-sourced under MIT license.

  • No dependencies.

  • Just one .jar file of about 55 kB

  • Fast (no regexes used)

  • Event-based. Therefore low memory footprint, even when reading large documents.

  • Customized event handlers can be provided, which increases versatility.

  • Able to load pXML into a standard Java org.w3c.dom.Document object. Therefore all XML technology based on Document can be used (validation, querying, transformation etc.).

  • Uses standard Java Reader / Writer for flexible input/output configurations.

XML Features Not Yet Supported

The following features are not yet supported in the current implementation:

  • CDATA sections

  • processing instructions

  • DTD (replaced by XML Schema; will not be supported in pXML)

Writer

Besides a reader, the core library also includes a writer that implements interface IPXMLWriter. A writer is created by passing a standard Java java.io.Writer to the constructor of class PXMLWriter. Then methods like writeEmptyNode, writeTextNode, writeComment, etc. can be used to write pXML to any destination (file, string, URL, etc.). The writer takes care of using escape sequences when needed.

Indenting must be done manually. A future version might include a pretty printing mode.
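
The following sketch shows how escaping and manual indentation could be layered on top of a java.io.Writer. It is a hypothetical mini writer written for this illustration, not the library's PXMLWriter; its method names and parameters are assumptions and do not reflect the IPXMLWriter interface.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical mini writer, not the library's PXMLWriter.
final class MiniPXMLWriterSketch {

    private final Writer out;

    MiniPXMLWriterSketch ( Writer out ) { this.out = out; }

    // The backslash is escaped first, otherwise the backslashes added
    // for [ and ] would be escaped a second time.
    static String escape ( String text ) {
        return text.replace ( "\\", "\\\\" )
                   .replace ( "[", "\\[" )
                   .replace ( "]", "\\]" );
    }

    // Manual indentation: four spaces per nesting level.
    private void indent ( int depth ) throws IOException {
        out.write ( "    ".repeat ( depth ) );
    }

    void startNode ( int depth, String name ) throws IOException {
        indent ( depth );
        out.write ( "[" + name + "\n" );
    }

    void endNode ( int depth ) throws IOException {
        indent ( depth );
        out.write ( "]\n" );
    }

    void textNode ( int depth, String name, String text ) throws IOException {
        indent ( depth );
        out.write ( "[" + name + " " + escape ( text ) + "]\n" );
    }

    // Writes a small document into a string.
    static String demo() throws IOException {
        StringWriter target = new StringWriter();
        MiniPXMLWriterSketch writer = new MiniPXMLWriterSketch ( target );
        writer.startNode ( 0, "form" );
        writer.textNode ( 1, "title", "Login" );
        writer.textNode ( 1, "note", "Characters [ and ] are escaped" );
        writer.endNode ( 0 );
        return target.toString();
    }

    public static void main ( String[] args ) throws IOException {
        System.out.print ( demo() );
    }
}
```

Passing the nesting depth explicitly is the simplest way to indent by hand; a pretty-printing mode would instead track the depth inside the writer.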

Summary

The pXML parser can be used to:

  • read pXML documents

  • convert pXML to XML

  • convert XML to pXML

  • use XML technology with pXML documents (validate, query, change, and transform documents)

To maximize versatility, the parser produces an event stream which can be consumed by customized event handlers.

The core library also contains a writer to write pXML documents programmatically.