Open-Source Parser For Practical XML (pXML)
Introduction
My previous article Suggestion For a Better XML/HTML Syntax suggests a new XML/HTML syntax called practicalXML (pXML). pXML is less verbose than XML, and has other advantages.
In this article I introduce a parser for pXML. The parser is written in Java, open-sourced under the MIT license, and the source code is available on GitHub. The examples used in this article are on GitHub too.
More information about pXML can be found on its website.
pXML Syntax Refresh
If you have never heard of pXML, you might want to read Suggestion For a Better XML/HTML Syntax first. That article introduces pXML and explains its rationale.
Here is a copy of chapter Syntax Comparison:
Empty element:
XML: <br />
pXML: [br]
Element with text content:
XML: <summary>text</summary>
pXML: [summary text]
Element with child elements:
XML: <ul>
<li>
<div>A <i>friendly</i> dog</div>
</li>
</ul>
pXML: [ul
[li
[div A [i friendly] dog]
]
]
Attributes:
XML: <div id="unplug_warning" class="warning big-text">Unplug power cord before opening!</div>
pXML: [div (id=unplug_warning class="warning big-text")Unplug power cord before opening!]
Escaping:
XML: <note>Watch out for &lt;, &gt;, &quot;, &apos;, &amp;, [, ], and \ characters</note>
pXML: [note Watch out for <, >, ", ', &, \[, \], and \\ characters]
Comments:
Single comment:
XML: <!-- text -->
pXML: [- text -]
Nested comments:
XML: not supported
pXML: [- text [- nested -] -]
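The escaping rules are small enough to sketch in code. The class below is a hypothetical helper, not part of the pXML library: it assumes that only [, ], and \ need a backslash in pXML text, and that the five predefined XML entities are used on the XML side.

```java
final class EscapeSketch {

    // Hypothetical helper: backslash-escape the characters that are special in pXML text.
    static String escapePxmlText(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 8);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '[' || c == ']' || c == '\\') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Hypothetical helper: replace XML's special characters with the predefined entities.
    static String escapeXmlText(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '&':  sb.append("&amp;");  break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String raw = "if a < b then x[0] = 1";
        System.out.println(escapePxmlText(raw)); // only the brackets get a backslash
        System.out.println(escapeXmlText(raw));  // only the < becomes an entity
    }
}
```

Note how the same raw text needs different treatment in each syntax: pXML leaves < untouched, while XML leaves the brackets untouched.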
Usage Examples
Before explaining how the parser is implemented, let's first see what you can do with the parser, by looking at some high-level usage examples. The aim is to show how XML technology can be used with pXML.
From pXML to XML And Back Again
To honor the 'p' in pXML ('p' stands for practical), we obviously need to be able to convert pXML to XML, and XML to pXML. This chapter shows examples of how to do this.
Hello World
From pXML to XML
The code below illustrates the simplest possible pXML document, an empty root element named hello:
[hello]
To convert pXML to XML, there is a utility class PXMLToXMLConverter in package dev.pxml.core.utilities. This class contains method pXMLFileToXMLFile with the following signature:
public static void pXMLFileToXMLFile ( @NotNull File pXMLFile, @NotNull File XMLFile ) throws Exception
The method is overloaded. The input parameters can be of type File (as shown above), Path, or String.
Suppose the above pXML code [hello] is stored in file hello.pxml. The following instruction converts hello.pxml to hello.xml:
pXMLFileToXMLFile ( "hello.pxml", "hello.xml" );
As expected, the resulting file hello.xml contains the following code:
<?xml version="1.0" encoding="UTF-8"?>
<hello />
A complete test suite with all source code examples used in this article is available in a Github repo. That repo uses the Gradle build tool.
The parser's Java API documentation is available on pXML's website.
If you want to try out the above example in your own environment, you can proceed as follows:
- If not done already, install Java version 11 or later.
- Create a Java application with the tool of your choice (e.g. Gradle, IntelliJ IDEA, Eclipse), or just with raw Java.
- Visit pXML's downloads page, download the latest .jar file, and add it as a dependency to your Java project.
- Adapt the main class so that it contains the following code:
  package tests.pxml.hello;

  import static dev.pxml.core.utilities.PXMLToXMLConverter.*;

  public class Main {

      public static void main ( String[] args ) {
          try {
              // Convert the pXML input file to an XML output file
              pXMLFileToXMLFile ( "input/hello.pxml", "output/hello.xml" );
          } catch ( Exception e ) {
              e.printStackTrace();
          }
      }
  }
  Note: Adapt the package name tests.pxml.hello and the class name (Main is just a placeholder), as well as the paths of the two files if necessary. Absolute and relative file paths are accepted. Relative file paths are relative to your working directory.
- Create file input/hello.pxml with [hello] as content.
- Create directory output.
- Execute the application.
- Open the resulting file output/hello.xml in your editor to verify its content.
From XML to pXML
Converting from XML to pXML is easy too. It's done with method XMLFileToPXMLFile in class dev.pxml.core.utilities.XMLToPXMLConverter. Hence, the following two Java statements are required to convert an XML file into a pXML file:
import static dev.pxml.core.utilities.XMLToPXMLConverter.*;
XMLFileToPXMLFile ( "input/hello.xml", "output/hello.pxml" );
Executing this code converts file input/hello.xml with this content:
<?xml version="1.0" encoding="UTF-8"?>
<hello />
... into output/hello.pxml with the following pXML code:
[hello]
Any Reader/Writer
As we have seen, methods pXMLFileToXMLFile and XMLFileToPXMLFile accept file paths as input/output arguments. If we want to read/write XML/pXML documents from/to other sources like URLs, strings, etc., we can:
- Use PXMLToXMLConverter.pipePXMLReaderToXMLWriter to read any pXML source (URL, File, String, etc.) and write to any XML destination (URL, File, String, etc.). For example, we could read pXML code from a URL and store the resulting XML code as a string.
  This is possible because pipePXMLReaderToXMLWriter takes a standard java.io.Reader to read pXML, and a java.io.Writer to write XML.
- Analogously, XMLToPXMLConverter.pipeXMLReaderToPXMLWriter can be used to read any XML source and write the result to any pXML destination.
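Because these methods work on java.io.Reader and java.io.Writer, any source or destination that the JDK can wrap will do. The sketch below uses only JDK classes to show how such readers and writers are typically obtained. The hypothetical copy method merely stands in for the converter call (whose exact parameter list is not shown in this article), so the snippet runs without the pXML library.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class ReaderWriterSketch {

    // Stand-in for a converter call: copies all characters from the reader to the writer.
    static void copy(Reader in, Writer out) {
        try {
            char[] buffer = new char[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // String source, String destination.
    static String stringToString(String source) {
        StringWriter out = new StringWriter();
        copy(new StringReader(source), out);
        return out.toString();
    }

    // File source, String destination.
    static String fileToString(Path file) {
        try (Reader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            StringWriter out = new StringWriter();
            copy(in, out);
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // A URL source can be wrapped in the same way, for example with
    // new InputStreamReader(url.openStream(), StandardCharsets.UTF_8).

    public static void main(String[] args) {
        System.out.println(stringToString("[hello]"));
    }
}
```

In real code, the call to copy would be replaced by the converter method, with the same reader and writer objects passed in.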
Login Form
Let's create a more useful example showing some commonly used XML features.
We will convert pXML code to XML, and then convert the resulting XML back to pXML. If everything works fine, the final pXML code must be the same as the initial one.
From pXML to XML
Here is a pXML document using nested elements, attributes, comments, and namespaces:
[form
[title Login]
[note Characters \[, \], < and > are not allowed]
[fields
[- Two text fields: user and password -]
[text_entry (id=user) User]
[text_entry (id=password) Password]
]
[buttons
[button (type=submit) Ok]
[button (type=cancel color="light red") Cancel]
]
[ch:checks (xmlns:ch="http://www.example.com")
[ch:check user.size >= 2]
[ch:check password.size >= 8]
]
]
As seen before, we can convert this file to output/login_form.xml with:
pXMLFileToXMLFile ( "input/login_form.pxml", "output/login_form.xml" );
After executing the above statement, the content of output/login_form.xml is:
<?xml version="1.0" encoding="UTF-8"?>
<form>
<title>Login</title>
<note>Characters [, ], &lt; and &gt; are not allowed</note>
<fields>
<!-- Two text fields: user and password -->
<text_entry id="user">User</text_entry>
<text_entry id="password">Password</text_entry>
</fields>
<buttons>
<button type="submit">Ok</button>
<button type="cancel" color="light red">Cancel</button>
</buttons>
<ch:checks xmlns:ch="http://www.example.com">
<ch:check>user.size >= 2</ch:check>
<ch:check>password.size >= 8</ch:check>
</ch:checks>
</form>
The following syntax differences can be observed:
- pXML: [title Login]
  XML: <title>Login</title>
  This illustrates the most important difference between pXML and XML, as explained in Suggestion For a Better XML/HTML Syntax.
- pXML: [note Characters \[, \], < and > are not allowed]
  XML: <note>Characters [, ], &lt; and &gt; are not allowed</note>
  Here we can see how the escape rules of both dialects are applied during the conversion. pXML uses \ as escape character (like most programming languages), while XML uses entities.
- pXML: [- Two text fields: user and password -]
  XML: <!-- Two text fields: user and password -->
  Example of converting a comment.
- pXML: [text_entry (id=user) User]
  XML: <text_entry id="user">User</text_entry>
  Example of converting an attribute.
  Note the space after ) in the pXML code, which does not appear in the resulting XML. The pXML parser allows an optional space after ), which is ignored. This allows us to write:
  [text_entry (id=user) User]
  ... instead of:
  [text_entry (id=user)User]
  ... which is a bit less readable (but still valid pXML code).
  Writing:
  [text_entry(id=user)User]
  ... would also be parsed correctly.
- pXML: [ch:checks (xmlns:ch="http://www.example.com") [ch:check user.size >= 2]
  XML: <ch:checks xmlns:ch="http://www.example.com"> <ch:check>user.size >= 2</ch:check>
  XML namespaces are supported in the pXML parser.
From XML to pXML
After copying the result file output/login_form.xml to input/login_form.xml we can convert back from XML to pXML with:
XMLFileToPXMLFile ( "input/login_form.xml", "output/login_form.pxml" );
Here is the content of output/login_form.pxml:
[form
[title Login]
[note Characters \[, \], < and > are not allowed]
[fields
[- Two text fields: user and password -]
[text_entry (id="user") User]
[text_entry (id="password") Password]
]
[buttons
[button (type="submit") Ok]
[button (color="light red" type="cancel") Cancel]
]
[ch:checks (xmlns:ch="http://www.example.com")
[ch:check user.size >= 2]
[ch:check password.size >= 8]
]
]
As we can see, the content is the same as the content of our initial file input/login_form.pxml.
However, there is one small syntax difference, which does not change the data stored in the files. In the new file, quotes are always used to surround attribute values, even if they could be omitted (e.g. id="user" instead of id=user). The reason is that, by default, the pXML writer used in this example always encloses attribute values in quotes. It does not check whether the value is allowed to be written without quotes, as that would reduce performance. In a future version of the writer, a parameter could be added to tell the writer to omit quotes if possible. The attributes of the second button also appear in a different order (color before type), but attribute order is not significant in XML.
XML Technology Used With pXML
The most powerful feature of the pXML parser is its ability to read a pXML document into a standard org.w3c.dom.Document Java object.
Since we have a Java Document object in memory, we can use the whole set of XML extensions supported natively in Java or provided by third-party libraries and frameworks. For example, we can:
- validate a document with XML Schema (W3C), RELAX NG, or Schematron
- programmatically traverse the document and extract data
- insert, modify, and delete elements and attributes, and save the result as a new XML or pXML document
- query the document (search for values, compute aggregates, etc.) with XQuery/XPath
- convert the document using an XSL transformer (e.g. create a differently structured XML or pXML document, create a plain text document, etc.)
We cannot cover everything in a single article, so let's just have a look at some examples to see how it works.
Loading/Saving a 'Document'
The key to using XML technology with pXML is method pXMLToXMLDocument in class PXMLToXMLConverter. This method reads a pXML document from any source (file, URL, string, etc.), and loads it into a standard Java org.w3c.dom.Document object. The method's signature is:
public static Document pXMLToXMLDocument (
@NotNull Reader pXMLReader, Object pXMLResource ) throws Exception
As shown, this method uses a Java Reader to read pXML code, and returns a Document object. Input argument pXMLResource is an optional argument used to include the resource's name in error messages (e.g. "Error in file foo/bar.pxml").
If anything goes wrong, an exception is thrown.
Once the data is loaded, we can do everything we could do with an XML document: validate, query, modify, transform, etc.
The counterpart to method pXMLToXMLDocument is XMLDocumentToPXML in class XMLToPXMLConverter. The method is defined as:
public static void XMLDocumentToPXML (
@NotNull Document XMLDocument, @NotNull Writer pXMLWriter ) throws Exception
The method reads a standard Java Document object and writes the pXML data to any Java Writer (e.g. FileWriter, StringWriter, etc.).
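Once a Document is in memory, plain JDK APIs are enough to query it. The sketch below is independent of the pXML library: to stay runnable without the pXML jar, the hypothetical parse helper builds the Document from an XML string with the JDK's DocumentBuilder. In a real application the Document would come from pXMLToXMLDocument instead. The class name, method names, and the XPath expression are illustrative assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DocumentQuerySketch {

    // Stand-in for pXMLToXMLDocument: builds a Document from XML text with the JDK parser.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Returns the 'id' attribute of every text_entry element found by the XPath query.
    static List<String> fieldIds(Document doc) {
        try {
            NodeList attributes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//text_entry/@id", doc, XPathConstants.NODESET);
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < attributes.getLength(); i++) {
                ids.add(attributes.item(i).getNodeValue());
            }
            return ids;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Invalid XPath expression", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<form><fields>"
            + "<text_entry id=\"user\">User</text_entry>"
            + "<text_entry id=\"password\">Password</text_entry>"
            + "</fields></form>");
        System.out.println(fieldIds(doc));
    }
}
```

The query code never needs to know whether the Document was originally written as XML or as pXML.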
Validation
A common way to validate XML data is to use an XML Schema. An XML schema is itself an XML document containing rules that must be respected by the XML data document.
Here is a simple example of an XML document defining a list of books:
<?xml version="1.0" encoding="UTF-8"?>
<books>
<book>
<isbn>978-0135957059</isbn>
<title>The Pragmatic Programmer: Your Journey to Mastery</title>
<price>41.41</price>
</book>
<book>
<isbn>978-0735619678</isbn>
<title>Code Complete: A Practical Handbook of Software Construction</title>
<price>45.32</price>
</book>
<book>
<isbn>978-0134685991</isbn>
<title>Effective Java</title>
<price>44.10</price>
</book>
</books>
The same data, defined with pXML looks like this:
[books
[book
[isbn 978-0135957059]
[title The Pragmatic Programmer: Your Journey to Mastery]
[price 41.41]
]
[book
[isbn 978-0735619678]
[title Code Complete: A Practical Handbook of Software Construction]
[price 45.32]
]
[book
[isbn 978-0134685991]
[title Effective Java]
[price 44.10]
]
]
The above XML can be validated with this XML schema:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="books">
<xs:complexType>
<xs:sequence>
<xs:element name="book" type="booktype" minOccurs="1" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:complexType name="booktype">
<xs:sequence>
<xs:element name="isbn" type="xs:string"/>
<xs:element name="title" type="xs:string"/>
<xs:element name="price" type="xs:decimal"/>
</xs:sequence>
</xs:complexType>
</xs:schema>
Because an XML Schema is itself a standard XML document, we can define the schema with pXML too, like this:
[xs:schema (xmlns:xs=http://www.w3.org/2001/XMLSchema)
[xs:element (name=books)
[xs:complexType
[xs:sequence
[xs:element (name=book type=booktype minOccurs=1 maxOccurs=unbounded)]
]
]
]
[xs:complexType (name=booktype)
[xs:sequence
[xs:element (name=isbn type=xs:string)]
[xs:element (name=title type=xs:string)]
[xs:element (name=price type=xs:decimal)]
]
]
]
Hence, there are four possible combinations to validate data:
Data Format | Schema Format |
---|---|
XML | XML |
XML | pXML |
pXML | XML |
pXML | pXML |
An example of each combination is included in the examples repo.
Class dev.pxml.core.utilities.XMLSchemaValidator provides static methods to validate data. For example, validating pXML data with a pXML schema document (e.g. validate books.pxml with books.pxsd) can be done with the following one-liner:
XMLSchemaValidator.validatePXMLFileWithPXMLSchemaFile (
new File ( "input/books.pxml" ),
new File ( "input/books.pxsd" ) );
An exception is thrown if the data is invalid. For example, if a book uses tag ibn instead of isbn, the following error is reported:
Invalid content was found starting with element 'ibn'. One of '{isbn}' is expected.
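What happens during such a validation can be approximated with the JDK's javax.xml.validation API. The sketch below is not the source code of XMLSchemaValidator. It is a library-independent illustration that validates XML text against the books schema shown above (with minOccurs omitted, since 1 is the default). The class name, the validate method, and its convention of returning the error message are assumptions made for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class SchemaValidationSketch {

    // The books schema from this chapter, written as a Java string.
    static final String BOOKS_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='books'><xs:complexType><xs:sequence>"
        + "<xs:element name='book' type='booktype' maxOccurs='unbounded'/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "<xs:complexType name='booktype'><xs:sequence>"
        + "<xs:element name='isbn' type='xs:string'/>"
        + "<xs:element name='title' type='xs:string'/>"
        + "<xs:element name='price' type='xs:decimal'/>"
        + "</xs:sequence></xs:complexType>"
        + "</xs:schema>";

    // Returns null if the XML is valid, otherwise the error message reported by the JDK.
    static String validate(String xml, String xsd) {
        try {
            Schema schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(xsd)));
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String invalid =
            "<books><book><ibn>978-0134685991</ibn>"
            + "<title>Effective Java</title><price>44.10</price></book></books>";
        System.out.println(validate(invalid, BOOKS_XSD));
    }
}
```

With pXML input, the only additional step is to load the data and the schema into Document objects first, as described in the previous chapter.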
Transformation
XML transformation is another very useful XML feature. It is used to transform an XML document to another document. The output document can be another XML document, an HTML document, or any other plain text document. The transformation process is described with a transformation language. The most popular transformation language is XSLT, which is defined as an XML document.
For example, let's re-use the books data from the previous 'validation' example.
[books
[book
[isbn 978-0135957059]
[title The Pragmatic Programmer: Your Journey to Mastery]
[price 41.41]
]
[book
[isbn 978-0735619678]
[title Code Complete: A Practical Handbook of Software Construction]
[price 45.32]
]
[book
[isbn 978-0134685991]
[title Effective Java]
[price 44.10]
]
]
Now we want to create an HTML document that displays the books in a table. We could use the following XSLT document, written in XML:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="html" />
<xsl:template match="/">
<html>
<head>
<title>Books</title>
<style>
table, th, td {
border: 1px solid #ddd;
border-collapse: collapse;
}
th, td {
padding: 0.5em;
}
</style>
</head>
<body>
<h2>Books</h2>
<table>
<tr><th>ISBN</th><th>Title</th><th>Price</th></tr>
<xsl:for-each select="books/book">
<tr>
<td><xsl:value-of select="isbn"/></td>
<td><xsl:value-of select="title"/></td>
<td><xsl:value-of select="price"/></td>
</tr>
</xsl:for-each>
</table>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
Because an XSLT document is itself an XML document, we can define it in pXML too:
[xsl:stylesheet (xmlns:xsl=http://www.w3.org/1999/XSL/Transform version=1.0)
[xsl:output (method=text)]
[xsl:template (match=/)
<html>
<head>
<title>Books</title>
<style>
table, th, td {
border: 1px solid #ddd;
border-collapse: collapse;
}
th, td {
padding: 0.5em;
}
</style>
</head>
<body>
<h2>Books</h2>
<table>
<tr><th>ISBN</th><th>Title</th><th>Price</th></tr>
[xsl:for-each (select=books/book)
<tr>
<td>[xsl:value-of (select=isbn)]</td>
<td>[xsl:value-of (select=title)]</td>
<td>[xsl:value-of (select=price)]</td>
</tr>
]
</table>
</body>
</html>
]
]
Class dev.pxml.core.utilities.XSLTTransformer
provides static methods to transform data. For example, we can transform the above pXML books data with the above pXML XSLT document like this:
XSLTTransformer.transformPXMLFileWithPXMLXSLTFile (
new File ( "input/books.pxml" ),
new File ( "input/books.pxslt" ),
new File ( "output/books.html" ) );
Executing the above statement creates file output/books.html
with this content:
<html>
<head>
<title>Books</title>
<style>
table, th, td {
border: 1px solid #ddd;
border-collapse: collapse;
}
th, td {
padding: 0.5em;
}
</style>
</head>
<body>
<h2>Books</h2>
<table>
<tr><th>ISBN</th><th>Title</th><th>Price</th></tr>
<tr>
<td>978-0135957059</td>
<td>The Pragmatic Programmer: Your Journey to Mastery</td>
<td>41.41</td>
</tr>
<tr>
<td>978-0735619678</td>
<td>Code Complete: A Practical Handbook of Software Construction</td>
<td>45.32</td>
</tr>
<tr>
<td>978-0134685991</td>
<td>Effective Java</td>
<td>44.10</td>
</tr>
</table>
</body>
</html>
The result in a web browser looks like this: [Image: the Books table with columns ISBN, Title, and Price, and one row per book]
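For readers who want to see the moving parts, here is a minimal sketch of an XSLT transformation using only the JDK's javax.xml.transform API. It is not the implementation of XSLTTransformer. The class name, the transform helper, and the small stylesheet (which lists book titles as plain text) are assumptions chosen to keep the example short.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class XsltSketch {

    // Illustrative stylesheet: writes every book title followed by a semicolon.
    static final String TITLES_XSLT =
        "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'>"
        + "<xsl:for-each select='books/book'>"
        + "<xsl:value-of select='title'/><xsl:text>;</xsl:text>"
        + "</xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to the XML text and returns the result as a string.
    static String transform(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String books =
            "<books>"
            + "<book><title>Effective Java</title></book>"
            + "<book><title>Code Complete</title></book>"
            + "</books>";
        System.out.println(transform(books, TITLES_XSLT));
    }
}
```

The JDK transformer also accepts a DOMSource, so a Document loaded from pXML can be passed in place of the StreamSource used here.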
Using XML technology with PML
Chapter pXML Predecessor in a previous article explains that the pXML syntax originated from the Practical Markup Language (PML). PML is a markup language designed to create web articles and books.
Now we can say that PML uses the pXML syntax. It also supports lenient parsing, but internally the AST is stored in pXML format. In the future, the pXML parser described in this article will be used in PML. Hence, all XML technology illustrated in the previous chapter can then be used with PML documents.
For example, one could:
- use XQuery to extract all links in a PML document
- use an XML transformer to save all links in a CSV file that can be read by a tool (written in any language) to check for broken links
- create filters that consume the AST created by the PML parser, and then transform the AST (add/remove/change nodes) before letting PML produce the HTML output
It's easy to imagine all kinds of useful PML extensions users will be able to create and share.
Parser (Reader)
The preceding chapter showed what we can do with the pXML parser. Now we'll dive deeper and see how it works, and how you can use and customize the parser for your own specific needs.
Event-Based
The parser is event-based. It reads a pXML document and generates a stream of events. The parser itself doesn't do anything with the parsed data. Each type of event (e.g. onNodeStart, onNodeEnd, etc.) is handled by a callback function. All callback functions are part of an events handler object. Before parsing, the client code must pass an events handler object to the parser. The events handler is an interface containing one callback function for each type of event. It is defined as follows:
package dev.pxml.core.reader.parser.eventHandler;
import dev.pxml.core.data.node.PXMLNode;
import dev.pxml.core.reader.reader.TextLocation;
import dev.pxml.core.utilities.annotations.NotNull;
public interface IParserEventsHandler<N, R> {
void onStart() throws Exception;
void onStop() throws Exception;
N onRootNodeStart ( @NotNull PXMLNode rootNode ) throws Exception;
void onRootNodeEnd ( N rootNode ) throws Exception;
N onNodeStart ( @NotNull PXMLNode node, @NotNull N parentNode ) throws Exception;
void onNodeEnd ( N node ) throws Exception;
void onText ( @NotNull String text, @NotNull N parentNode, TextLocation location ) throws Exception;
// [- and -] is included in comment
void onComment ( String comment, @NotNull N parentNode, TextLocation location ) throws Exception;
R getResult() throws Exception;
}
Type parameter N defines the type of the nodes generated by this events handler. Type parameter R defines the type of the final result created when parsing is terminated.
The following implementations of IParserEventsHandler are included in the core library:
- This handler creates a standard Java org.w3c.dom.Document object. It's the handler used in the previous chapter when we validated or transformed a pXML document.
- Besides creating a Document object, we can create a pXML-specific AST with this handler. The end result is a PXMLNode.
- If we just need to convert pXML to XML, then the most efficient way is to use this handler. Instead of loading the whole pXML document into an internal tree structure, each item (name, attribute, text, etc.) is immediately written to a Java Writer as soon as it is parsed. Hence, very large documents can be converted quickly and without eating up internal memory.
- This is a utility handler that writes logging data to a Java Writer (the default is the standard output device of the OS). It can be used for debugging purposes.
- DoNothing_ParserEventHandler: As the name suggests, this handler doesn't do anything. It's useful in these cases:
  - We just want to know if an error is reported by the parser (e.g. malformed pXML document).
  - We don't want to handle all events. In that case we can create an events handler that inherits from this one and overrides the functions we care about.
- This events handler inherits from DoNothing_ParserEventHandler, and overrides functions onStart and onStop to measure the total parsing time.
Customized Parsing
If none of the above handlers suits your needs, you can create your own customized events handler by creating a class that implements IParserEventsHandler, and pass it to a parser that implements AEventStreamParser. To get started you can have a look at the implementations mentioned in the previous chapter.
A parser uses an ITokenizer to read pXML tokens (name, text, comment, etc.). For maximum customization, you could provide your own tokenizer and/or parser and use it with pXML's core library.
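To illustrate the role of the two type parameters, here is a deliberately simplified sketch. MiniEventsHandler is a hypothetical stand-in for IParserEventsHandler (nodes are reduced to their names, and locations, comments, and checked exceptions are left out) so that the example runs without the pXML library. The handler and the simulated event sequence are assumptions for illustration only.

```java
// Simplified, hypothetical stand-in for IParserEventsHandler.
interface MiniEventsHandler<N, R> {
    N onNodeStart(String name, N parentNode); // parentNode is null for the root node
    void onNodeEnd(N node);
    void onText(String text, N parentNode);
    R getResult();
}

// Custom handler: N is the nesting depth of a node, R is an indented text outline.
final class OutlineHandler implements MiniEventsHandler<Integer, String> {

    private final StringBuilder outline = new StringBuilder();

    private void indent(int depth) {
        for (int i = 0; i < depth; i++) {
            outline.append("  ");
        }
    }

    @Override
    public Integer onNodeStart(String name, Integer parentDepth) {
        int depth = (parentDepth == null) ? 0 : parentDepth + 1;
        indent(depth);
        outline.append(name).append('\n');
        return depth; // passed back as parentNode for the events of child items
    }

    @Override
    public void onNodeEnd(Integer depth) {
        // Nothing to do: the outline only records where nodes start.
    }

    @Override
    public void onText(String text, Integer parentDepth) {
        indent(parentDepth + 1);
        outline.append(text).append('\n');
    }

    @Override
    public String getResult() {
        return outline.toString();
    }
}

final class EventSketchDemo {

    // Simulates the events a parser could emit for: [ul [li A friendly dog]]
    static String run() {
        OutlineHandler handler = new OutlineHandler();
        Integer ul = handler.onNodeStart("ul", null);
        Integer li = handler.onNodeStart("li", ul);
        handler.onText("A friendly dog", li);
        handler.onNodeEnd(li);
        handler.onNodeEnd(ul);
        return handler.getResult();
    }

    public static void main(String[] args) {
        System.out.print(run());
    }
}
```

The value returned by onNodeStart is whatever the handler wants to associate with a node. The parser never inspects it and only hands it back as the parent of subsequent events, which is why the same parser can build a DOM tree, write XML directly, or produce an outline like the one above.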
Parser Properties And Features
- The parser is in a proof-of-concept state, not yet ready to be used in production.
- Written in Java.
- Free and open-sourced under the MIT license.
- No dependencies.
- Just one .jar file of about 55 kB.
- Fast (no regexes used).
- Event-based, and therefore has a low memory footprint, even when reading large documents.
- Customized event handlers can be provided, which increases versatility.
- Able to load pXML into a standard Java org.w3c.dom.Document object. Therefore all XML technology based on Document can be used (validation, querying, transformation, etc.).
- Uses standard Java Reader/Writer for flexible input/output configurations.
XML Features Not Yet Supported
The following features are not yet supported in the current implementation:
- CDATA sections
- processing instructions
- DTD (replaced by XML Schema; will not be supported in pXML)
Writer
Besides a reader, the core library also includes a writer that implements interface IPXMLWriter. A writer is created by passing a standard Java java.io.Writer to the constructor of class PXMLWriter. Then methods like writeEmptyNode, writeTextNode, writeComment, etc. can be used to write pXML to any destination (file, string, URL, etc.). The writer takes care of using escape sequences when needed.
Indenting must be done manually. A future version might include a pretty printing mode.
Summary
The pXML parser can be used to:
- read pXML documents
- convert pXML to XML
- convert XML to pXML
- use XML technology with pXML documents (validate, query, change, and transform documents)
To maximize versatility, the parser produces an event stream which can be consumed by customized event handlers.
The core library also contains a writer to write pXML documents programmatically.