parse-xml
A fast, safe, compliant XML parser for Node.js and browsers.
Installation
npm install @rgrove/parse-xml
Or, if you like living dangerously, you can load the minified UMD bundle in a
browser via Unpkg and use the parseXml global.
Features
Not Features
This parser is not a complete implementation of the XML specification because
parts of the spec aren't very useful or aren't safe when the XML being parsed
comes from an untrusted source. However, those parts of XML that are
implemented behave as defined in the spec.
The following XML features are ignored by the parser and are not exposed in the
document tree:
- XML declarations
- Document type definitions
- Processing instructions
In addition, the only supported character encoding is UTF-8.
Examples
Basic Usage
const parseXml = require('@rgrove/parse-xml');
parseXml('<kittens fuzzy="yes">I like fuzzy kittens.</kittens>');
Output
{
type: "document",
children: [
{
type: "element",
name: "kittens",
attributes: {
fuzzy: "yes"
},
children: [
{
type: "text",
text: "I like fuzzy kittens."
}
]
}
]
}
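Because the result is a plain object tree, it can be walked with ordinary recursion. The following is a minimal sketch that assumes the node shape shown above. `findElements` is a hypothetical helper, not part of parse-xml, and the `tree` literal stands in for the value returned by parseXml().

```javascript
// Hypothetical helper (not part of parse-xml): collect every element node
// with the given name, in document order.
function findElements(node, name, found = []) {
  if (node.type === 'element' && node.name === name) {
    found.push(node);
  }
  // Text nodes have no children, so fall back to an empty list.
  for (const child of node.children || []) {
    findElements(child, name, found);
  }
  return found;
}

// Stand-in for the tree returned by parseXml() in the example above.
const tree = {
  type: 'document',
  children: [
    {
      type: 'element',
      name: 'kittens',
      attributes: { fuzzy: 'yes' },
      children: [{ type: 'text', text: 'I like fuzzy kittens.' }]
    }
  ]
};

const kittens = findElements(tree, 'kittens');
console.log(kittens.length, kittens[0].attributes.fuzzy);
```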
Friendly Errors
When something goes wrong, parse-xml throws an error that tells you exactly what
happened and shows you where the problem is so you can fix it.
parseXml('<foo><bar>baz</foo>');
Output
Error: Missing end tag for element bar (line 1, column 14)
<foo><bar>baz</foo>
             ^
In addition to a helpful message, error objects have the following properties:
- column Number
  Column where the error occurred (1-based).
- excerpt String
  Excerpt from the input string that contains the problem.
- line Number
  Line where the error occurred (1-based).
- pos Number
  Character position where the error occurred relative to the beginning of the
  input (0-based).
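The properties above make it straightforward to turn a thrown error into structured data. This is a rough sketch, not part of parse-xml: `tryParse` is a hypothetical wrapper, the parser function is passed in as `parse`, and treating "has numeric line and column" as the test for a parse error is an assumption of this example.

```javascript
// Hypothetical wrapper (not part of parse-xml). `parse` is the parser
// function, e.g. the export of @rgrove/parse-xml, supplied by the caller.
function tryParse(parse, xml) {
  try {
    return { ok: true, document: parse(xml) };
  } catch (err) {
    // Assumption: only errors that carry position info are parse errors.
    // Anything else is rethrown untouched.
    if (!err || typeof err.line !== 'number' || typeof err.column !== 'number') {
      throw err;
    }
    return {
      ok: false,
      message: err.message,
      location: { line: err.line, column: err.column, pos: err.pos },
      excerpt: err.excerpt
    };
  }
}
```

With the package installed, a call might look like tryParse(require('@rgrove/parse-xml'), '<foo><bar>baz</foo>'), which would return an object with ok set to false instead of throwing.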
API
parseXml(xml: string, options?: object) => object
Parses an XML document and returns an object tree.
Options
The following options may be provided as properties of the options argument:
- ignoreUndefinedEntities Boolean (default: false)
  When true, an undefined named entity like &bogus; will be left as is instead
  of causing a parse error.
- preserveCdata Boolean (default: false)
  When true, CDATA sections will be preserved in the document tree as nodes of
  type cdata. Otherwise CDATA sections will be represented as nodes of type
  text.
- preserveComments Boolean (default: false)
  When true, comments will be preserved in the document tree as nodes of type
  comment. Otherwise comments will not be included in the document tree.
- resolveUndefinedEntity Function
  When an undefined named entity is encountered, this function will be called
  with the entity as its only argument. It should return a string value with
  which to replace the entity, or null or undefined to treat the entity as
  undefined (which may result in a parse error depending on the value of
  ignoreUndefinedEntities).
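As an illustration of how these options might be combined, here is a sketch of a table-driven resolver. `makeEntityResolver` and the entity table are hypothetical, and the exact form of the argument (with or without the surrounding & and ;) is an assumption, so the sketch accepts both.

```javascript
// Hypothetical factory (not part of parse-xml): builds a function suitable
// for the resolveUndefinedEntity option from a lookup table.
function makeEntityResolver(table) {
  return function resolve(entity) {
    // Assumption: the entity may arrive as "&name;" or as a bare "name",
    // so strip the delimiters before the lookup.
    const name = entity.replace(/^&/, '').replace(/;$/, '');
    // Own-property check so inherited names like "toString" do not match.
    return Object.prototype.hasOwnProperty.call(table, name)
      ? table[name]
      : undefined; // still undefined; ignoreUndefinedEntities decides the rest
  };
}

const options = {
  ignoreUndefinedEntities: true,
  preserveComments: true,
  resolveUndefinedEntity: makeEntityResolver({ copy: '\u00A9', nbsp: '\u00A0' })
};
```

The resulting object would be passed as the second argument, as in parseXml(xml, options).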
Nodes
An XML document is parsed into a tree of node objects. Each node has the
following common properties:
Each node also has a toJSON() method that returns a serializable
representation of the node without the parent property (in order to avoid
circular references). This means you can safely pass any node to
JSON.stringify() to serialize it and its children as JSON.
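To see why dropping the parent property matters, the sketch below builds a small circular tree by hand. The nodes are stand-ins for parsed nodes, and this toJSON() is an illustrative guess at the behavior described above, not the library's actual implementation.

```javascript
// Attach a toJSON() that returns a copy of the node without `parent`
// (and without the method itself). Illustrative only.
function withToJSON(node) {
  node.toJSON = function () {
    const { parent, toJSON, ...rest } = this;
    return rest;
  };
  return node;
}

// Hand-built stand-ins; real nodes come from parseXml().
const root = withToJSON({
  type: 'element',
  name: 'root',
  attributes: {},
  children: [],
  parent: null
});
const child = withToJSON({ type: 'text', text: 'hi', parent: root });
root.children.push(child);

// child.parent.children[0] === child, so the raw graph is circular, but
// JSON.stringify() calls toJSON() on each node and never follows `parent`.
const json = JSON.stringify(root);
console.log(json);
```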
cdata
A CDATA section. Only emitted when the preserveCdata option is true (by
default, CDATA sections become text nodes).
Properties
Example
<![CDATA[kittens are fuzzy & cute]]>
{
type: "cdata",
text: "kittens are fuzzy & cute",
parent: { ... }
}
comment
A comment. Only emitted when the preserveComments option is true.
Properties
-
content String
Comment text.
Example
{
type: "comment",
content: "I'm a comment!",
parent: { ... }
}
document
The top-level node of an XML document.
Properties
-
children Object[]
Array of child nodes.
Example
<root />
{
type: "document",
children: [
{
type: "element",
name: "root",
attributes: {},
children: [],
parent: { ... }
}
],
parent: null
}
element
An element.
Note that since parse-xml doesn't implement XML Namespaces, no special
treatment is given to namespace prefixes in element and attribute names.
In other words, <foo:bar foo:baz="quux" /> will result in the element name
"foo:bar" and the attribute name "foo:baz".
Properties
- attributes Object
  Hash of attribute names to values.
  Attribute names in this object are always in alphabetical order regardless
  of their order in the document, and values are normalized and unescaped.
  Values are always strings.
- children Object[]
  Array of child nodes.
- name String
  Name of the element as given in the start and/or end tags.
- preserveWhitespace Boolean?
  This property will be set to true if the special xml:space attribute on this
  element or on the closest parent with an xml:space attribute has the value
  "preserve". This indicates that whitespace in the text content of this
  element should be preserved rather than normalized.
  If neither this element nor any of its ancestors has an xml:space attribute
  set to "preserve", or if the closest xml:space attribute is set to
  "default", this property will not be defined.
Example
<kittens description="fuzzy &amp; cute">I &lt;3 kittens</kittens>
{
type: "element",
name: "kittens",
attributes: {
description: "fuzzy & cute"
},
children: [
{
type: "text",
text: "I <3 kittens",
parent: { ... }
}
],
parent: { ... }
}
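Since prefixes are left in names and whitespace handling is left to the caller, both can be dealt with in application code. The helpers below are hypothetical (not part of parse-xml), and the whitespace-collapsing rule in `directText` is this sketch's own choice of what "normalized" means.

```javascript
// Split a raw name such as "foo:bar" into prefix and local part. Names
// without a colon get an empty prefix.
function splitName(name) {
  const i = name.indexOf(':');
  return i === -1
    ? { prefix: '', local: name }
    : { prefix: name.slice(0, i), local: name.slice(i + 1) };
}

// Concatenate the direct text children of an element, collapsing runs of
// whitespace unless the parser flagged the element with preserveWhitespace.
function directText(element) {
  const raw = element.children
    .filter((child) => child.type === 'text')
    .map((child) => child.text)
    .join('');
  return element.preserveWhitespace ? raw : raw.replace(/\s+/g, ' ').trim();
}
```

For the element named "foo:bar" mentioned earlier, splitName would return the prefix "foo" and the local part "bar". Mapping that prefix to a namespace URI would still require reading the relevant xmlns attributes yourself.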
text
Text content inside an element.
Properties
-
text String
Unescaped text content.
Example
kittens are fuzzy &amp; cute
{
type: "text",
text: "kittens are fuzzy & cute",
parent: { ... }
}
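Text can end up in both text and cdata nodes depending on the preserveCdata option, so code that gathers character data may want to handle both. A minimal sketch, assuming the node shapes documented above; `textContent` is a hypothetical helper, not part of parse-xml.

```javascript
// Gather all character data beneath a node. `text` nodes always contribute;
// `cdata` nodes only exist when preserveCdata is true. Comment nodes have
// no `children` and store their value in `content`, so they add nothing.
function textContent(node) {
  if (node.type === 'text' || node.type === 'cdata') {
    return node.text;
  }
  return (node.children || []).map(textContent).join('');
}
```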
Why another XML parser?
There are many XML parsers for Node, and some of them are good. However, most of
them suffer from one or more of the following shortcomings:
- Native dependencies.
- Loose, non-standard, "works for me" parsing behavior that can lead to
  unexpected or even unsafe results when given input the author didn't
  anticipate.
- Kitchen sink APIs that tightly couple a parser with DOM manipulation
  functions, a stringifier, or other tooling that isn't directly related to
  parsing.
- Stream-based parsing. This is great in the rare case that you need to parse
  truly enormous documents, but can be a pain to work with when all you want
  is an object tree.
- Poor error handling.
- Too big or too Node-specific to work well in browsers.
parse-xml's goal is to be a small, fast, safe, reasonably compliant,
non-streaming, non-validating, browser-friendly parser, because I think this is
an under-served niche.
I think parse-xml demonstrates that it's not necessary to jettison the spec
entirely or to write complex code in order to implement a small, fast XML
parser.
Also, it was fun.
Benchmark
Here's how parse-xml stacks up against two comparable libraries, libxmljs
(which is based on the native libxml library) and xmldoc (which is based on
sax-js).
Node.js v10.1.0 / Darwin x64
Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
Small document (291 bytes)
27,143 op/s » libxmljs (native)
67,938 op/s » parse-xml
35,749 op/s » xmldoc (sax-js)
Medium document (72081 bytes)
571 op/s » libxmljs (native)
436 op/s » parse-xml
236 op/s » xmldoc (sax-js)
Large document (1162464 bytes)
50 op/s » libxmljs (native)
33 op/s » parse-xml
21 op/s » xmldoc (sax-js)
Suites: 3
Benches: 9
Elapsed: 15,383.87 ms
See the parse-xml-benchmark
repo for instructions on running this benchmark yourself.
License
ISC License