saxes
A sax-style non-validating parser for XML.
Saxes is a fork of sax 1.2.4. All mentions
of sax in this project's documentation are references to sax 1.2.4.
Designed with node in mind, but should work fine in the
browser or other CommonJS implementations.
Notable Differences from Sax.
-
Saxes aims to be much stricter than sax with regards to XML
well-formedness. Sax, even in its so-called "strict mode", is not strict. It
silently accepts structures that are not well-formed XML. Projects that need
better compliance with well-formedness constraints cannot use sax as-is.
Saxes aims for conformance with XML 1.0 fifth
edition and XML Namespaces 1.0
third edition.
Consequently, saxes does not support HTML, or pseudo-XML, or bad XML.
-
Saxes is much much faster than sax, mostly because of a substantial redesign
of the internal parsing logic. The speed improvement is not merely due to
removing features that were supported by sax. That helped a bit, but saxes
adds some expensive checks in its aim for conformance with the XML
specification. Redesigning the parsing logic is what accounts for most of the
performance improvement.
-
Saxes does not aim to support antiquated platforms. We will not pollute the
source or the default build with support for antiquated platforms. If you want
support for IE 11, you are welcome to produce a PR that adds a new build
transpiled to ES5.
-
Saxes handles errors differently from sax: it provides a default onerror
handler which throws. You can replace it with your own handler if you want. If
your handler does nothing, parsing simply continues: there is no resume
method to call.
-
There's no Stream API. A revamped API may be introduced later. (It is still
a "streaming parser" in the general sense that you write a character stream to
it.)
-
Saxes does not have facilities for limiting the size of the data chunks passed
to event handlers. See the FAQ entry for more details.
Limitations
This is a non-validating parser so it only verifies whether the document is
well-formed. We do aim to raise errors for all malformed constructs encountered.
However, this parser does not parse the contents of DTDs. So malformedness
errors caused by errors in DTDs cannot be reported.
Also, the parser continues to parse even upon encountering errors, and does its
best to continue reporting errors. You should heed all errors
reported.
HOWEVER, ONCE AN ERROR HAS BEEN ENCOUNTERED YOU CANNOT RELY ON THE DATA
PROVIDED THROUGH THE OTHER EVENT HANDLERS.
After an error, saxes tries to make sense of your document, but it may interpret
it incorrectly. For instance <foo a=bc="d"/> is invalid XML. Did you mean to
have <foo a="bc=d"/> or <foo a="b" c="d"/> or some other variation?
Saxes takes an honest stab at figuring out your mangled XML. That's as good as
it gets.
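The advice above (heed every error, and do not trust event data once an error has been reported) can be sketched as follows. This is only an illustration: collectText is a hypothetical helper, not part of saxes, and fakeParser is a stand-in object used so the snippet runs without the library. The only thing assumed about a real parser is that handlers are installed through the on<eventname> properties described in this README.

```javascript
// Hypothetical helper (not part of saxes): installs handlers on any object
// that exposes on<eventname> properties, and refuses to hand back data once
// an error has been reported.
function collectText(parser) {
  const errors = [];
  const chunks = [];
  // Replacing the default throwing handler lets parsing continue, so that
  // every error can be recorded.
  parser.onerror = (err) => {
    errors.push(err);
  };
  parser.ontext = (text) => {
    chunks.push(text);
  };
  return {
    errors,
    // Only return the collected text if no error was reported at all.
    result() {
      if (errors.length > 0) {
        throw new Error(
          `document is not well-formed: ${errors.length} error(s) reported`
        );
      }
      return chunks.join("");
    },
  };
}

// Stand-in for a parser, used only to exercise the helper.
const fakeParser = {};
const collected = collectText(fakeParser);
fakeParser.ontext("Hello, ");
fakeParser.onerror(new Error("unquoted attribute value"));
fakeParser.ontext("world");
```

Here the text collected after the error is deliberately discarded: calling collected.result() throws instead of returning possibly misinterpreted data.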
Regarding <!DOCTYPEs and <!ENTITYs
The parser will handle the basic XML entities in text nodes and attribute
values: &amp; &lt; &gt; &apos; &quot;. It's possible to define additional
entities in XML by putting them in the DTD. This parser doesn't do anything with
that. If you want to listen to the ondoctype event, and then fetch the
doctypes, and read the entities and add them to parser.ENTITIES, then be my
guest.
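As a rough illustration of the do-it-yourself approach described above, the sketch below pulls simple internal entity declarations out of a doctype string. extractInternalEntities is a hypothetical helper, not part of saxes. It assumes the doctype text you obtain contains the internal subset, handles only general entities with a quoted literal value, skips parameter and external entities, and does not expand references inside values or ignore declarations that sit inside comments.

```javascript
// Hypothetical helper, not part of saxes: collect simple internal general
// entity declarations of the form <!ENTITY name "value"> from doctype text.
function extractInternalEntities(doctype) {
  const entities = {};
  // Name, then a double-quoted or single-quoted literal. Parameter entities
  // (<!ENTITY % ...>) and external entities (SYSTEM/PUBLIC) do not match.
  const declaration =
    /<!ENTITY\s+([^\s%"'>]+)\s+(?:"([^"]*)"|'([^']*)')\s*>/g;
  for (const match of doctype.matchAll(declaration)) {
    const name = match[1];
    const value = match[2] !== undefined ? match[2] : match[3];
    // In XML the first declaration of an entity is the binding one.
    if (!Object.prototype.hasOwnProperty.call(entities, name)) {
      entities[name] = value;
    }
  }
  return entities;
}

const doctype = ` doc [
  <!ENTITY copy "(c)">
  <!ENTITY project 'saxes'>
  <!ENTITY copy "ignored duplicate">
  <!ENTITY % param "ignored">
  <!ENTITY chapter SYSTEM "chapter.xml">
]`;
const found = extractInternalEntities(doctype);
```

The resulting object could then be merged into parser.ENTITIES from an ondoctype handler, for instance with Object.assign. Whether that is adequate depends on how much of the DTD syntax your documents actually use.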
Documentation
The source code contains JSDOC comments. Use them.
PAY CLOSE ATTENTION TO WHAT IS PUBLIC AND WHAT IS PRIVATE.
The elements of code that do not have JSDOC documentation, or have documentation
with the @private tag, are private.
If you use anything private, that's at your own peril.
If there's a mistake in the documentation, raise an issue. If you just assume,
you may assume incorrectly.
Summary Usage Information
Example
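var saxes = require("./lib/saxes"),
  parser = new saxes.SaxesParser();
parser.onerror = function (e) {
  // an error was reported; e is the error.
};
parser.ontext = function (t) {
  // got some text; t is the string of text.
};
parser.onopentag = function (node) {
  // opened a tag; node is the tag object.
};
parser.onend = function () {
  // the parser is done with the document.
};
parser.write('<xml>Hello, <who name="world">world</who>!</xml>').close();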
Constructor Arguments
Pass the following arguments to the SaxesParser constructor. All are optional.
opt - Object bag of settings regarding string formatting.
Settings supported:
- xmlns - Boolean. If true, then namespaces are supported. Default is false.
- position - Boolean. If false, then don't track line/col/position. Unset is
treated as true. Default is unset.
- fileName - String. Set a file name for error reporting. This is useful only
when tracking positions. You may leave it unset, in which case the file name
in error messages will be undefined.
Methods
write - Write bytes onto the stream. You don't have to do this all at
once. You can keep writing as much as you want.
close - Close the stream. Once closed, no more data may be written until it is
done processing the buffer, which is signaled by the end event.
Properties
The parser has the following properties:
line, column, position - Indications of the position in the XML document
where the parser currently is looking.
closed - Boolean indicating whether or not the parser can be written to. If
it's true, then wait for the ready event to write again.
opt - Any options passed into the constructor.
xmlDecl - The XML declaration for this document. It contains the fields
version, encoding and standalone. They are all undefined before
encountering the XML declaration. If they are undefined after the XML
declaration, the corresponding value was not set by the declaration. There is no
event associated with the XML declaration. In a well-formed document, the XML
declaration may be preceded only by an optional BOM. So by the time any event
generated by the parser happens, the declaration has been processed if present
at all. Otherwise, you have a malformed document, and as stated above, you
cannot rely on the parser data!
Events
To listen to an event, override on<eventname>. The list of supported events
is also in the exported EVENTS array.
See the JSDOC comments in the source code for a description of each supported
event.
FAQ
Q. Why has saxes dropped support for limiting the size of data chunks passed to
event handlers?
A. With sax you could set MAX_BUFFER_LENGTH to cause the parser to limit the
size of data chunks passed to event handlers. So if you ran into a span of text
above the limit, multiple text events with smaller data chunks were fired
instead of a single event with a large chunk.
However, that functionality had some problematic characteristics. It had an
arbitrary default value. It was library-wide so all parsers created from a
single instance of the sax library shared it. This could potentially cause
conflicts among libraries running in the same VM but using sax for different
purposes.
These issues could have been easily fixed, but there were larger issues. The
buffer limit arbitrarily applied to some events but not others. It would split
text, cdata and script events. However, if a comment, doctype, attribute or
processing instruction were more than the limit, the parser would generate an
error and you were left picking up the pieces.
It was not intuitive to use. You'd think setting the limit to 1K would prevent
chunks bigger than 1K from being passed to event handlers. But that was not the
case. A comment in the source code told you that you might go over the limit if
you passed large chunks to write. So if you want a 1K limit, don't pass 64K
chunks to write. Fair enough. You know what limit you want so you can
control the size of the data you pass to write. So you limit the chunks to
write to 1K at a time. Even if you do this, your event handlers may get data
chunks that are 2K in size. Suppose on the previous write the parser has
just finished processing an open tag, so it is ready for text. Your write
passes 1K of text. You are not above the limit yet, so no event is generated
yet. The next write passes another 1K of text. It so happens that sax checks
buffer limits only once per write, after the chunk of data has been
processed. Now you've hit the limit and you get a text event with 2K of
data. So even if you limit your write calls to the buffer limit you've set,
you may still get events with chunks at twice the buffer size limit you've
specified.
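The scenario above can be reproduced with a toy model. The class below is not sax's code; OncePerWriteChunker is a made-up name, and the only behavior it imitates is the one described in this answer: the buffer limit is checked once per write, after the whole chunk has been appended.

```javascript
// Toy model (not sax's implementation): text accumulates in a buffer and the
// size limit is checked only once per write, after the chunk was processed.
class OncePerWriteChunker {
  constructor(limit, ontext) {
    this.limit = limit;
    this.ontext = ontext;
    this.buffer = "";
  }

  write(chunk) {
    this.buffer += chunk;
    // Single check per write: an event fires only once the buffer is
    // strictly above the limit.
    if (this.buffer.length > this.limit) {
      this.ontext(this.buffer);
      this.buffer = "";
    }
    return this;
  }
}

const sizes = [];
const chunker = new OncePerWriteChunker(1024, (text) => sizes.push(text.length));
chunker.write("a".repeat(1024)); // exactly at the limit, not above: no event
chunker.write("a".repeat(1024)); // now above the limit: one event is fired
// sizes is now [2048]: a single event carrying twice the configured limit.
```

Even though no write exceeds 1K, the only event delivered carries 2K of text.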
We may consider reinstating an equivalent functionality, provided that it
addresses the issues above and does not cause a huge performance drop for
use-case scenarios that don't need it.
2.0.0 (2018-07-23)
Bug Fixes
- "X" is not a valid hex prefix for char references (465038b)
- add namespace checks (9f94c4b)
- always run in strict mode (ed8b0b1)
- check that the characters we read are valid char data (7611a85)
- disallow spaces after open waka (da7f76d)
- drop the lowercase option (987d4bf)
- emit CDATA on empty CDATA section too (95d192f)
- emit empty comment (b3db392)
- entities are always strict (0f6a30e)
- fail on colon at start of QName (507addd)
- harmonize error messages and initialize flags (9a20cad)
- just one error for text before the root, and text after (101ea50)
- more namespace checks (a1add21)
- move namespace checks to their proper place (4a1c99f)
- only accept uppercase CDATA to mark the start of CDATA (e86534d)
- prevent colons in pi and entity names when xmlns is true (4327eec)
- prevent empty entities (04e1593)
- raise an error if the document does not have a root (f2de520)
- raise an error on ]]> in character data (2964381)
- raise an error on < in attribute values (4fd67a1)
- raise an error on multiple root elements (45047ae)
- raise error on CDATA before or after root (604241f)
- raise error on character reference outside CHAR production (30fb540)
- remove broken or pointless examples (1a5b642)
- report an error on duplicate attributes (ee4e340)
- report an error on whitespace at the start of end tag (c13b122)
- report processing instructions that do not have a target (c007e39)
- treat ?? in processing instructions correctly (bc1e1d4)
- trim URIs (78cc6f3)
- use xmlchars for checking names (2c939fe)
- verify that character references match the CHAR production (369afde)
Code Refactoring
- adjust the names used for processing instructions (3b508e9)
- convert code to ES6 (fe81170)
- drop attribute event (c7c2e80)
- drop buffer size checks (9ce2f7a)
- drop normalize (9c6d84c)
- drop opencdata and on closecdata (3287d2c)
- drop SGML declaration parsing (4aaf2d9)
- drop the parser function, rename SAXParser (0878a6c)
- drop trim (c03c7d0)
- pass the actual tag to onclosetag (7020e64)
- provide default no-op implementation for events (a94687f)
- remove the API based on Stream (ebb659a)
- simplify namespace processing (2d4ce0f)
Features
- drop the resume() method; and have onerror() throw (ac601e5)
- handle XML declarations (5258939)
- revamped error messages (cf9c589)
- the flush method returns its parser (68c2020)
BREAKING CHANGES
- Sax was only passing the tag name to onclosetag. We pass the whole tag object.
- The API no longer takes a strict argument anywhere. This also
effectively removes support for HTML processing, and for processing
without errors anything which is less than full XML. It also removes
special processing of script elements.
- attribute is not a particularly useful event for parsing XML. The only thing
it adds over looking at attributes on tag objects is that you get the order of
the attributes from the source, but attribute order in XML is irrelevant.
- The opencdata and closecdata events became redundant once we removed the buffer
size limitations. So we remove these events.
- The parser function is removed. Just create a new instance with new.
- SAXParser is now SaxesParser. So new (require("saxes").SaxesParser)(...).
-
The API based on Stream is gone. There were multiple issues with it. It was
Node-specific. It used an ancient Node API (the so-called "classic
streams"). Its behavior was idiosyncratic.
-
Sax had no default error handler but if you wanted to continue calling
write() after an error you had to call resume(). We do away with
resume() and instead install a default onerror which throws. Replace it
with a no-op handler if you want to continue after errors.
-
The "processinginstruction" event now produces a "target" field instead of a
"name" field. The nomenclature "target" is the one used in the XML literature.
-
The ns field is no longer using the prototype trick that sax used. The
ns field of a tag contains only those namespaces that the tag declares.
-
We no longer have opennamespace and closenamespace events. The
information they provide can be obtained by examining the tags passed to tag
events.
-
SGML declaration is not supported by XML. This is an XML parser. So we
remove support for SGML declarations. They now cause errors.
-
We removed support for the code that checked buffer sizes and would
raise errors if a buffer was close to an arbitrary limit or emitted
multiple text or cdata events in order to avoid passing strings
greater than an arbitrary size. So MAX_BUFFER_LENGTH is gone.
The feature always seemed a bit awkward. Client code could limit the
size of buffers to 1024K, for instance, and not get a text event
with a text payload greater than 1024K... so far so good but if the
same document contained a comment with more than 1024K that would
result in an error. Hmm.... why? The distinction seems entirely
arbitrary.
The upshot is that client code needs to be ready to handle strings of
any length supported by the platform.
If there's a clear need to reintroduce it, we'll reassess.
- It is no longer possible to load the library as-is through a
script element. It needs building.
The library now assumes a modern runtime. It no longer contains any
code to polyfill what's missing. It is up to developers using this
code to deal with polyfills as needed.
- We drop the trim option. It is up to client code to trim text if
it needs it.
- We no longer support the normalize option. It is up to client code
to perform whatever normalization it wants.
- The lowercase option makes no sense for XML. It is removed.
- Remove support for strictEntities. Entities are now always strict, as
required by the XML specification.
- By default, parsers now have a no-op implementation for each
event they support. This would break code that determines whether a
custom handler was added by checking whether there's any handler at
all. This removes the necessity for the parser implementation to check
whether there is a handler before calling it.
In the process of making this change, we've removed support for the
on... properties on stream objects. Their existence was not
warranted by any standard API provided by Node. (EventEmitter does
not have on... properties for events it supports, nor does
Stream.) Their existence was also undocumented. And their
functioning was awkward. For instance, with sax, this:
const s = sax.createStream();
const handler = () => console.log("moo");
s.on("cdata", handler);
console.log(s.oncdata === handler);
would print false. If you examine s.oncdata you see it is glue
code instead of the handler assigned. This is just bizarre, so we
removed it.
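The consequence of the default no-op handlers can be illustrated with a toy class. ToyParser and hasCustomHandler are made-up names for this sketch, and placing the defaults on the prototype is an assumption made for the illustration, not a statement about how saxes is implemented. The point is only that a truthiness check such as "if (parser.ontext)" no longer tells you whether a custom handler was installed.

```javascript
// Toy stand-in, not saxes itself: every supported event has a default
// handler that does nothing.
class ToyParser {
  ontext() {}
  oncomment() {}

  // Because a handler always exists, it can be called without checking.
  emitText(text) {
    this.ontext(text);
  }
}

// Hypothetical check: a handler counts as custom only if it differs from
// the default no-op. Testing for mere existence would always succeed.
function hasCustomHandler(parser, eventName) {
  const prop = `on${eventName}`;
  return parser[prop] !== ToyParser.prototype[prop];
}

const toy = new ToyParser();
toy.emitText("dropped by the default no-op handler");

const seen = [];
toy.ontext = (text) => seen.push(text);
toy.emitText("hi");

const textIsCustom = hasCustomHandler(toy, "text");
const commentIsCustom = hasCustomHandler(toy, "comment");
```

Client code that needs to know whether it installed a handler is probably better off tracking that itself than inspecting the parser.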