@rubensworks/saxes


An evented streaming XML parser in JavaScript

Version: 6.0.1 (latest)

Version published: 2 years ago

Weekly downloads: 8.8K; decreased by 31.96%

Maintainers: 1

Created: 2 years ago

saxes

A sax-style non-validating parser for XML.

This is a fork of Saxes, which was in turn forked from sax.

This fork was created as the Saxes project appeared to be unmaintained, so I have created this project to resolve major blockers in my own projects. I do not aim to add features to this project. This fork will be closed as soon as Saxes maintenance picks up again.

Designed with node in mind, but should work fine in the browser or other CommonJS implementations.

Saxes does not support Node versions older than 10.

Notable Differences from Sax.

Saxes aims to be much stricter than sax with regards to XML well-formedness. Sax, even in its so-called "strict mode", is not strict. It silently accepts structures that are not well-formed XML. Projects that need better compliance with well-formedness constraints cannot use sax as-is.

Consequently, saxes does not support HTML, or pseudo-XML, or bad XML. Saxes will report well-formedness errors in all these cases but it won't try to extract data from malformed documents like sax does.
Saxes is much much faster than sax, mostly because of a substantial redesign of the internal parsing logic. The speed improvement is not merely due to removing features that were supported by sax. That helped a bit, but saxes adds some expensive checks in its aim for conformance with the XML specification. Redesigning the parsing logic is what accounts for most of the performance improvement.
Saxes does not aim to support antiquated platforms. We will not pollute the source or the default build with support for antiquated platforms. If you want support for IE 11, you are welcome to produce a PR that adds a new build transpiled to ES5.
Saxes handles errors differently from sax: it provides a default onerror handler which throws. You can replace it with your own handler if you want. If your handler does nothing, there is no resume method to call.
There's no Stream API. A revamped API may be introduced later. (It is still a "streaming parser" in the general sense that you write a character stream to it.)
Saxes does not have facilities for limiting the size of the data chunks passed to event handlers. See the FAQ entry for more details.

Conformance

Saxes supports:

Limitations

This is a non-validating parser so it only verifies whether the document is well-formed. We do aim to raise errors for all malformed constructs encountered. However, this parser does not thoroughly parse the contents of DTDs, so most malformedness errors caused by errors in DTDs cannot be reported.

Regarding `<!DOCTYPE` and `<!ENTITY`

The parser will handle the basic XML entities in text nodes and attribute values: &amp;amp; &amp;lt; &amp;gt; &amp;apos; &amp;quot;. It's possible to define additional entities in XML by putting them in the DTD. This parser doesn't do anything with that. If you want to listen to the doctype event, and then fetch the doctypes, and read the entities and add them to parser.ENTITIES, then be my guest.
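As a rough illustration of the do-it-yourself route mentioned above, the sketch below pulls simple internal entity declarations out of a doctype string. The function name extractEntities and its regular expression are assumptions made for this example and are not part of saxes. The sketch ignores parameter entities, external entities and entity references nested inside values, and it assumes the doctype handler receives the internal subset as plain text.

```javascript
// Hypothetical helper, not part of saxes: collect simple internal general
// entities of the form <!ENTITY name "value"> from a doctype string.
function extractEntities(doctype) {
  const entities = {};
  // Matches <!ENTITY name "value"> or <!ENTITY name 'value'>.
  // Parameter entities (<!ENTITY % ...>) and external entities
  // (SYSTEM / PUBLIC) deliberately do not match.
  const pattern = /<!ENTITY\s+([^\s%"']+)\s+(?:"([^"]*)"|'([^']*)')\s*>/g;
  for (const match of doctype.matchAll(pattern)) {
    const [, name, doubleQuoted, singleQuoted] = match;
    entities[name] = doubleQuoted !== undefined ? doubleQuoted : singleQuoted;
  }
  return entities;
}

// Sample doctype content, invented for this example.
const doctype = ` doc [
  <!ENTITY copy "(c)">
  <!ENTITY author 'Jane Doe'>
  <!ENTITY % param "ignored">
  <!ENTITY ext SYSTEM "ext.xml">
]`;
const found = extractEntities(doctype);
```

The resulting object could then be merged into parser.ENTITIES, for instance with Object.assign, once the doctype event has delivered the string.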

Documentation

The source code contains JSDOC comments. Use them. What follows is a brief summary of what is available. The final authority is the source code.

PAY CLOSE ATTENTION TO WHAT IS PUBLIC AND WHAT IS PRIVATE.

The move to TypeScript makes it so that everything is now formally private, protected, or public.

If you use anything not public, that's at your own peril.

If there's a mistake in the documentation, raise an issue. If you just assume, you may assume incorrectly.

Summary Usage Information

Example

var saxes = require("./lib/saxes"),
  parser = new saxes.SaxesParser();

parser.on("error", function (e) {
  // an error happened.
});
parser.on("text", function (t) {
  // got some text.  t is the string of text.
});
parser.on("opentag", function (node) {
  // opened a tag.  node has "name" and "attributes"
});
parser.on("end", function () {
  // parser stream is done, and ready to have more stuff written to it.
});

parser.write('<xml>Hello, <who name="world">world</who>!</xml>').close();

Constructor Arguments

Settings supported:

xmlns - Boolean. If true, then namespaces are supported. Default is false.
position - Boolean. If false, then don't track line/col/position. Unset is treated as true. Default is unset. Currently, setting this to false only results in a cosmetic change: the errors reported do not contain position information. sax-js would literally turn off the position-computing logic if this flag was set to false. The notion was that it would optimize execution. In saxes at least it turns out that continually testing this flag causes a cost that offsets the benefits of turning off this logic.
fileName - String. Set a file name for error reporting. This is useful only when tracking positions. You may leave it unset.
fragment - Boolean. If true, parse the XML as an XML fragment. Default is false.
additionalNamespaces - A plain object whose key, value pairs define namespaces known before parsing the XML file. It is not legal to pass bindings for the namespaces "xml" or "xmlns".
defaultXMLVersion - The default version of the XML specification to use if the document contains no XML declaration. If the document does contain an XML declaration, then this setting is ignored. Must be "1.0" or "1.1". The default is "1.0".
forceXMLVersion - Boolean. A flag indicating whether to force the XML version used for parsing to the value of defaultXMLVersion. When this flag is true, defaultXMLVersion must be specified. If unspecified, the default value of this flag is false.

Example: suppose you are parsing a document that has an XML declaration specifying XML version 1.1.

If you set defaultXMLVersion to "1.0" without setting forceXMLVersion then the XML declaration will override the value of defaultXMLVersion and the document will be parsed according to XML 1.1.

If you set defaultXMLVersion to "1.0" and set forceXMLVersion to true, then the XML declaration will be ignored and the document will be parsed according to XML 1.0.
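The interaction between the two settings can be modeled with a small function. This is only a sketch of the rule as described above, not code taken from saxes; the function name effectiveXMLVersion and the declaredVersion parameter are made up for the illustration.

```javascript
// Illustrative model of the rule described above; not code from saxes.
// declaredVersion is the version found in the XML declaration, or undefined
// when the document has no XML declaration.
function effectiveXMLVersion(options, declaredVersion) {
  const { defaultXMLVersion = "1.0", forceXMLVersion = false } = options;
  if (forceXMLVersion) {
    // The text above states that defaultXMLVersion must accompany this flag.
    if (options.defaultXMLVersion === undefined) {
      throw new Error("forceXMLVersion requires defaultXMLVersion");
    }
    return defaultXMLVersion;
  }
  // Without forcing, the XML declaration wins when it is present.
  return declaredVersion !== undefined ? declaredVersion : defaultXMLVersion;
}

// Document declares 1.1, default is 1.0, nothing forced: parsed as 1.1.
const versionNotForced = effectiveXMLVersion({ defaultXMLVersion: "1.0" }, "1.1");
// Same document, but the version is forced: parsed as 1.0.
const versionForced = effectiveXMLVersion(
  { defaultXMLVersion: "1.0", forceXMLVersion: true },
  "1.1"
);
// No declaration and no options: falls back to 1.0.
const versionNoDeclaration = effectiveXMLVersion({}, undefined);
```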

Methods

write - Write bytes onto the stream. You don't have to pass the whole document in one write call. You can read your source chunk by chunk and call write with each chunk.

close - Close the stream. Once closed, no more data may be written until it is done processing the buffer, which is signaled by the end event.

Properties

The parser has the following properties:

line, column, columnIndex, position - Indications of the position in the XML document where the parser currently is looking. The columnIndex property counts columns as if indexing into a JavaScript string, whereas the column property counts Unicode characters.
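The difference between the two ways of counting shows up as soon as a line contains a character outside the Basic Multilingual Plane. The sketch below is an illustration in plain JavaScript, not code from saxes, and the function names are made up. It only compares the counting units and says nothing about where saxes starts counting.

```javascript
// Illustration only: two ways of measuring how long a piece of a line is.
// Not code from saxes; the function names are made up for this example.

// Counts UTF-16 code units, the way JavaScript string indexing does.
function countCodeUnits(line) {
  return line.length;
}

// Counts whole code points: iterating a string yields astral characters
// (which occupy two code units) as a single item.
function countCodePoints(line) {
  let count = 0;
  for (const _char of line) count += 1;
  return count;
}

const sampleLine = "a\u{1F4A9}b";
const byCodeUnits = countCodeUnits(sampleLine); // 4
const byCodePoints = countCodePoints(sampleLine); // 3
```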

closed - Boolean indicating whether or not the parser can be written to. If it's true, then wait for the ready event to write again.

opt - Any options passed into the constructor.

xmlDecl - The XML declaration for this document. It contains the fields version, encoding and standalone. They are all undefined before encountering the XML declaration. If they are undefined after the XML declaration, the corresponding value was not set by the declaration. There is no event associated with the XML declaration. In a well-formed document, the XML declaration may be preceded only by an optional BOM. So by the time any event generated by the parser happens, the declaration has been processed if present at all. Otherwise, you have a malformed document, and as stated above, you cannot rely on the parser data!

Error Handling

The parser continues to parse even upon encountering errors, and does its best to continue reporting errors. You should heed all errors reported. After an error, however, saxes may interpret your document incorrectly. For instance <foo a=bc="d"/> is invalid XML. Did you mean to have <foo a="bc=d"/> or <foo a="b" c="d"/> or some other variation? For the sake of continuing to provide errors, saxes will continue parsing the document, but the structure it reports may be incorrect. It is only after the errors are fixed in the document that saxes can provide a reliable interpretation of the document.

That leaves you with two rules of thumb when using saxes:

Pay attention to the errors that saxes reports. The default onerror handler throws, so by default, you cannot miss errors.
ONCE AN ERROR HAS BEEN ENCOUNTERED, STOP RELYING ON THE EVENT HANDLERS OTHER THAN onerror. As explained above, when saxes runs into a well-formedness problem, it makes a guess in order to continue reporting more errors. The guess may be wrong.
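One way to follow the second rule when the throwing handler has been replaced is to wrap the other handlers so that they go quiet after the first error. The helper below is a hypothetical sketch and not part of saxes; the events are simulated by calling the handlers directly rather than by running a real parser.

```javascript
// Hypothetical helper, not part of saxes: wrap handlers so that they stop
// doing work once an error has been reported.
function createErrorGuard() {
  const errors = [];
  return {
    errors,
    // Use as the error handler: record the error instead of throwing.
    onError(error) {
      errors.push(error);
    },
    // Wrap any other handler: it runs only while no error has been seen.
    guard(handler) {
      return (...args) => {
        if (errors.length === 0) handler(...args);
      };
    },
  };
}

// Simulated sequence of events, standing in for a real parser.
const guardState = createErrorGuard();
const seenText = [];
const onText = guardState.guard((text) => seenText.push(text));
onText("before the error");
guardState.onError(new Error("simulated well-formedness error"));
onText("after the error"); // ignored: an error was already recorded
```

With a real parser the two functions would presumably be registered the same way as in the example earlier in this document, for instance parser.on("error", guardState.onError) and parser.on("text", onText).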

Events

To listen to an event, override on<eventname>. The list of supported events is also available in the exported EVENTS array.

See the JSDOC comments in the source code for a description of each supported event.

Parsing XML Fragments

The XML specification does not define any method by which to parse XML fragments. However, there are usage scenarios in which it is desirable to parse fragments. In order to allow this, saxes provides three initialization options.

If you pass the option fragment: true to the parser constructor, the parser will expect an XML fragment. It essentially starts with a parsing state equivalent to the one it would be in if parser.write("<foo>") had been called right after initialization. In other words, it expects content which is acceptable inside an element. This also turns off well-formedness checks that are inappropriate when parsing a fragment.

The option additionalNamespaces allows you to define additional prefix-to-URI bindings known before parsing starts. You would use this over resolvePrefix if you have at the ready a series of namespace bindings to use.

The option resolvePrefix allows you to pass a function which saxes will use if it is unable to resolve a namespace prefix by itself. You would use this over additionalNamespaces in a context where getting a complete list of defined namespaces is onerous.

Note that you can use additionalNamespaces and resolvePrefix together if you want. additionalNamespaces applies before resolvePrefix.

The options additionalNamespaces and resolvePrefix are really meant to be used for parsing fragments. However, saxes won't prevent you from using them with fragment: false. Note that if you do this, your document may parse without errors and yet be malformed because the document can refer to namespaces which are not defined in the document.

Of course, additionalNamespaces and resolvePrefix are used only if xmlns is true. If you are parsing a fragment that does not use namespaces, there's no point in setting these options.
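The lookup order described in this section can be pictured with a small model. This is a sketch, not the saxes implementation: the function lookupNamespace is invented for the illustration, and it assumes that bindings declared in the document itself are consulted first, which the text above implies but does not spell out.

```javascript
// Illustrative model of the lookup order; not code from saxes.
// declared: bindings found in the document itself (prefix -> URI).
// additionalNamespaces: bindings supplied up front by the caller.
// resolvePrefix: optional fallback function, tried last.
function lookupNamespace(prefix, declared, additionalNamespaces = {}, resolvePrefix) {
  const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);
  if (has(declared, prefix)) return declared[prefix];
  if (has(additionalNamespaces, prefix)) return additionalNamespaces[prefix];
  return resolvePrefix ? resolvePrefix(prefix) : undefined;
}

const additional = { svg: "http://www.w3.org/2000/svg" };
const fallback = (prefix) =>
  prefix === "xlink" ? "http://www.w3.org/1999/xlink" : undefined;

// Found in additionalNamespaces, so the fallback is never called.
const fromAdditional = lookupNamespace("svg", {}, additional, fallback);
// Not in additionalNamespaces, so the fallback function supplies it.
const fromFallback = lookupNamespace("xlink", {}, additional, fallback);
// Unknown everywhere: stays unresolved.
const unresolved = lookupNamespace("nope", {}, additional, fallback);
```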

Performance Tips

saxes works faster on files that use newlines (\u000A) as end of line markers than files that use other end of line markers (like \r or \r\n). The XML specification requires that conformant applications behave as if all characters that are to be treated as end of line characters are converted to \u000A prior to parsing. The optimal code path for saxes is a file in which all end of line characters are already \u000A.
Don't split Unicode strings you feed to saxes across surrogates. When you naively split a string in JavaScript, you run the risk of splitting a Unicode character into two surrogates. e.g. In the following example a and b each contain half of a single Unicode character: const a = "\u{1F4A9}"[0]; const b = "\u{1F4A9}"[1] If you feed such split surrogates to versions of saxes prior to 4, you'd get errors. Saxes version 4 and over are able to detect when a chunk of data ends with a surrogate and carry over the surrogate to the next chunk. However this operation entails slicing and concatenating strings. If you can feed your data in a way that does not split surrogates, you should do it. (Obviously, feeding all the data at once with a single write is fastest.)
Don't set event handlers you don't need. Saxes has always aimed to avoid doing work that will just be tossed away but future improvements hope to do this more aggressively. One way saxes knows whether or not some data is needed is by checking whether a handler has been set for a specific event.
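The first two tips can be applied before any data reaches the parser. The helpers below are hypothetical and not part of saxes: normalizeEOL applies the XML 1.0 end-of-line conversion to the whole text, and splitSafely cuts a string into chunks without separating the two halves of a surrogate pair. Treat this as a sketch under those assumptions.

```javascript
// Hypothetical helpers, not part of saxes.

// Convert \r\n and lone \r to \n (the XML 1.0 end-of-line rule). Apply this
// to the whole text before chunking, so that a \r\n pair is never split.
function normalizeEOL(text) {
  return text.replace(/\r\n?/g, "\n");
}

// Split text into chunks of at most `size` UTF-16 code units without cutting
// between the two halves of a surrogate pair. `size` must be at least 2 so
// that every iteration makes progress.
function splitSafely(text, size) {
  if (size < 2) throw new RangeError("size must be at least 2");
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + size, text.length);
    const last = text.charCodeAt(end - 1);
    // If the chunk would end on a high surrogate, end one unit earlier.
    if (end < text.length && last >= 0xd800 && last <= 0xdbff) end -= 1;
    chunks.push(text.slice(start, end));
    start = end;
  }
  return chunks;
}

const rawText = "<a>\r\nx\u{1F4A9}y\r</a>";
const safeChunks = splitSafely(normalizeEOL(rawText), 6);
```

Each element of safeChunks could then be passed to write in turn. As noted above, a single write with the whole text remains the fastest option when the data fits in memory.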

FAQ

Q. Why has saxes dropped support for limiting the size of data chunks passed to event handlers?

A. With sax you could set MAX_BUFFER_LENGTH to cause the parser to limit the size of data chunks passed to event handlers. So if you ran into a span of text above the limit, multiple text events with smaller data chunks were fired instead of a single event with a large chunk.

However, that functionality had some problematic characteristics. It had an arbitrary default value. It was library-wide so all parsers created from a single instance of the sax library shared it. This could potentially cause conflicts among libraries running in the same VM but using sax for different purposes.

These issues could have been easily fixed, but there were larger issues. The buffer limit arbitrarily applied to some events but not others. It would split text, cdata and script events. However, if a comment, doctype, attribute or processing instruction were more than the limit, the parser would generate an error and you were left picking up the pieces.

It was not intuitive to use. You'd think setting the limit to 1K would prevent chunks bigger than 1K from being passed to event handlers. But that was not the case. A comment in the source code told you that you might go over the limit if you passed large chunks to write. So if you want a 1K limit, don't pass 64K chunks to write. Fair enough. You know what limit you want so you can control the size of the data you pass to write. So you limit the chunks to write to 1K at a time. Even if you do this, your event handlers may get data chunks that are 2K in size. Suppose on the previous write the parser has just finished processing an open tag, so it is ready for text. Your write passes 1K of text. You are not above the limit yet, so no event is generated yet. The next write passes another 1K of text. It so happens that sax checks buffer limits only once per write, after the chunk of data has been processed. Now you've hit the limit and you get a text event with 2K of data. So even if you limit your write calls to the buffer limit you've set, you may still get events with chunks at twice the buffer size limit you've specified.
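The scenario above can be reproduced with a toy model. The class below is not sax's code; it is a deliberately simplified stand-in, with invented names, that buffers text and checks the limit once per write, after the chunk has been consumed.

```javascript
// Toy model of the behaviour described above; this is NOT sax's actual code.
class ToyBufferedParser {
  constructor(limit, onText) {
    this.limit = limit;
    this.onText = onText;
    this.buffer = "";
  }

  write(chunk) {
    this.buffer += chunk;
    // The limit is checked once per write, after the chunk is consumed.
    if (this.buffer.length > this.limit) {
      this.onText(this.buffer);
      this.buffer = "";
    }
  }
}

const eventSizes = [];
const toy = new ToyBufferedParser(1024, (text) => eventSizes.push(text.length));
toy.write("x".repeat(1024)); // not above the limit yet: no event
toy.write("x".repeat(1024)); // now above the limit: a single 2048-unit event
```

Even though neither write exceeds the limit, the only text event delivered carries twice the limit.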

We may consider reinstating an equivalent functionality, provided that it addresses the issues above and does not cause a huge performance drop for use-case scenarios that don't need it.

6.0.1 (2023-06-05)

Bug Fixes

"X" is not a valid hex prefix for char references (465038b)
add fragment and additionalNamespaces to SaxesOption typing (02d8275)
add namespace checks (9f94c4b)
always run in strict mode (ed8b0b1)
CDATA end in attributes must not cause an error (a7495ac)
check that the characters we read are valid char data (7611a85)
correct typo (97bc5da)
detect unclosed tags in fragments (5642f36)
disallow BOM characters at the beginning of subsequent chunks (66d07b6)
disallow spaces after open waka (da7f76d)
don't serialize the fileName as undefined: when not present (4ff2365)
drop the lowercase option (987d4bf)
emit CDATA on empty CDATA section too (95d192f)
emit empty comment (b3db392)
entities are always strict (0f6a30e)
fail on colon at start of QName (507addd)
fix a bug in EOL handling (bed38a8)
fix bug with initial eol characters (7b3db75)
fix corrupted attribute values when there is no text handler (e135f11), closes #38
fix some typing mistakes (f2a1d5e)
fixing linting errors for eslint 8 (cd4b5c9)
generate an error on prefix with empty local name (89a3b86), closes #5
handle column computation over characters in the astral plane (cefc8f7)
handling of end of line characters (f13247a)
harmonize error messages and initialize flags (9a20cad)
implement attribute normalization (be51114), closes #24
just one error for text before the root, and text after (101ea50)
more namespace checks (a1add21)
move eslint to devDependencies (d747538)
move namespace checks to their proper place (4a1c99f)
normalize \r\n and \r followed by something else to \n (d7b1abe), closes #2
npm audit warning (a6c9ba8)
only accept uppercase CDATA to mark the start of CDATA (e86534d)
pay attention to comments and processing instructions in DTDs (52ffd90), closes #19
prevent colons in pi and entity names when xmlns is true (4327eec)
prevent empty entities (04e1593)
raise an error if the document does not have a root (f2de520)
raise an error on ]]> in character data (2964381)
raise an error on < in attribute values (4fd67a1)
raise an error on multiple root elements (45047ae)
raise error on CDATA before or after root (604241f)
raise error on character reference outside CHAR production (30fb540)
remove broken or pointless examples (1a5b642)
report an error on duplicate attributes (ee4e340)
report an error on whitespace at the start of end tag (c13b122)
report processing instructions that do not have a target (c007e39)
resolve is now part of the public API (bb4bed5)
treat ?? in processing instructions correctly (bc1e1d4)
trim URIs (78cc6f3)
typings: "selfClosing" => "isSelfClosing" (d96a2bd)
use isNameChar for later chars in PI target (83d2b61)
use the latest xmlchars (b30a714)
use xmlchars for checking names (2c939fe)
verify that character references match the CHAR production (369afde)
we don't support node 10 anymore (f2aa1a8)

Code Refactoring

adjust the names used for processing instructions (3b508e9)
convert code to ES6 (fe81170)
drop attribute event (c7c2e80)
drop buffer size checks (9ce2f7a)
drop normalize (9c6d84c)
drop opencdata and on closecdata (3287d2c)
drop SGML declaration parsing (4aaf2d9)
drop the parser function, rename SAXParser (0878a6c)
drop trim (c03c7d0)
pass the actual tag to onclosetag (7020e64)
provide default no-op implementation for events (a94687f)
remove the API based on Stream (ebb659a)
simplify namespace processing (2d4ce0f)

Features

add forceXMLVersion (1eedbf8)
add makeError method (50fa39a)
add support for parsing fragments (1ff2d6a)
add the resolvePrefix option (90301fb)
add xmldecl event (a2e677f)
drop the resume() method; and have onerror() throw (ac601e5)
formal method for setting event listeners (f346150)
handle XML declarations (5258939)
process the xmlns attribute the customary way (2c9672a)
reinstating the attribute events (7c80f7b)
revamped error messages (cf9c589)
saxes handles chunks that "break" unicode (1272448)
saxes is now implemented in TS (664ba69)
stronger check on bad cdata closure (d416760)
support for XML 1.1 (36704fb)
the flush method returns its parser (68c2020)

Performance Improvements

add emitNodes to skip checking text buffer more than needed (9d5e357)
add topNS for faster namespace processing (1a33a57)
capture names in the name field (c7dffd5)
check the most common case first (40a34d5)
concatenate openWakaBang just once (07345bf)
don't check twice if this.textNode is set (00536cc)
don't depend on limit to know when we hit the end of buffer (ad4ab53)
don't increment a column number (490fc24)
don't repeatedly read this.i in the getCode methods (d3f196c)
drop the originalNL flag in favor of a NL_LIKE fake character (f690725)
dump isNaN; it is very costly (7d97e1a)
eliminate extra buffers (3412fcb)
improve performance of text handling (9c13099)
improve some more the speed of ]]> detection (a0216cd)
improve text node checking speed (f270e8b)
improve the check for ]]> in character data (21df9b5)
inline closeText (07a3b51)
introduce a specialized version of captureWhile (04855d6)
introduce captureTo and captureToChar (76eb95a)
make the most common path of getCode functions the shortest (4d66bbb)
minimize concatenation by adding the capability to unget codes (27fa8b9)
minor optimizations (c7e36bf)
move more common/valid cases first (a65586e)
reduce the frequency at which we clear attribValue (1570615)
reduce the number of calls to closeText (3e68df5)
remove an unnecessary variable (ac03a1c)
remove handler check (fbe35ff)
remove more extra buffers (b5ee774)
remove skipWhitespace (c8b7ae2)
remove some redundant buffer resets (5ded326)
simplify captureWhile (bb2085c)
simplify the skip functions (c7b8c3b)
split sText into two specialized loops (732325e)
the c field has been unused for a while: remove it (9ca0246)
use -1 to mean EOC (end-of-chunk) (55c0b1b)
use charCodeAt and handle surrogates ourselves (b8ec232)
use isCharAndNotRestricted rather than call two functions (f0b67a4)
use slice rather than substring (c1fed89)
use specialized code for sAttribValueQuoted (6c484f3)
use strings for the general states (3869908)

BREAKING CHANGES

we don't support node 10.
The individually named event handlers no longer exist. You now must use the methods on and off to set handlers. Upcoming features require that saxes know when handlers are added and removed, and it may be necessary in the future to qualify how to add or remove a handler. Getters/setters are too restrictive so we bite the bullet now and move to actual methods.
The fix to column number reporting changes the meaning of the column field. If you need the old behavior of column you can use the new columnIndex field which behaves like the old column and may be useful in some contexts. Ultimately you should decide whether your application needs to know column numbers by Unicode character count or by JavaScript index. (And you need to know the difference between the two. You can see this page for a detailed discussion of the Unicode problem in JavaScript.) Note that the numbers put in the error messages that fail produces are still based on the column field and thus use the new meaning of column. If you want error messages that use columnIndex you may override the fail method.
previous versions of saxes did not consistently convert end of line characters to NL (0xA) in the data reported by event handlers. This has been fixed. If your code relied on the old (incorrect) behavior then you'll have to update it.
previous versions of saxes would parse files with an XML declaration set to 1.1 as 1.0 documents. The support for 1.1 entails that if a document has an XML declaration that specifies version 1.1 it is parsed as a 1.1 document.
when fileName is undefined in the parser options saxes does not show a file name in error messages. Previously it was showing the name undefined. To get the previous behavior, in all cases where you'd leave fileName undefined, you must set it to the string "undefined" instead.
In previous versions the attribute xmlns (as in <foo xmlns="some-uri">) would be reported as having the prefix "xmlns" and the local name "". This behavior was inherited from sax. There was some logic to it, but this behavior was surprising to users of the library. The principle of least surprise favors eliminating that surprising behavior in favor of something less surprising.

This commit makes it so that xmlns is now reported as having a prefix of "" and a local name of "xmlns". This accords with how people interpret attribute names like foo, bar, moo which all have no prefix and a local name.

Code that deals with namespace bindings or cares about xmlns probably needs to be changed.
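For code that needs updating, the reporting convention can be pictured as an ordinary split on the first colon. The function below is an illustrative sketch, not code from saxes, and its name is made up. The handling of xmlns:foo shown here is an assumption that follows from applying the same rule; the text above only discusses the bare xmlns attribute.

```javascript
// Illustrative sketch, not code from saxes: split an attribute name into a
// prefix and a local name at the first colon.
function splitAttributeName(name) {
  const colon = name.indexOf(":");
  // No colon: the whole name is the local name and there is no prefix.
  if (colon === -1) return { prefix: "", local: name };
  return { prefix: name.slice(0, colon), local: name.slice(colon + 1) };
}

// Bare xmlns is treated like any other unprefixed attribute name.
const plainXmlns = splitAttributeName("xmlns"); // { prefix: "", local: "xmlns" }
// Assumption: a prefixed declaration splits like any other prefixed name.
const prefixedXmlns = splitAttributeName("xmlns:foo"); // { prefix: "xmlns", local: "foo" }
const ordinary = splitAttributeName("moo"); // { prefix: "", local: "moo" }
```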

Sax was only passing the tag name. We pass the whole object.
The ns field is no longer using the prototype trick that sax used. The ns field of a tag contains only those namespaces that the tag declares.
We no longer have opennamespace and closenamespace events. The information they provide can be obtained by examining the tags passed to tag events.
attribute is not a particularly useful event for parsing XML. The only thing it adds over looking at attributes on tag objects is that you get the order of the attributes from the source, but attribute order in XML is irrelevant.
The opencdata and closecdata events became redundant once we removed the buffer size limitations. So we remove these events.
The parser function is removed. Just create a new instance with new.

SAXParser is now SaxesParser. So new require("saxes").SaxesParser(...).

The API based on Stream is gone. There were multiple issues with it. It was Node-specific. It used an ancient Node API (the so-called "classic streams"). Its behavior was idiosyncratic.
Sax had no default error handler but if you wanted to continue calling write() after an error you had to call resume(). We do away with resume() and instead install a default onerror which throws. Replace with a no-op handler if you want to continue after errors.
The "processinginstruction" now produces a "target" field instead of a "name" field. The nomenclature "target" is the one used in the XML literature.
Parsers now have a default no-op implementation for each event they support. This would break code that determines whether a custom handler was added by checking whether there's any handler at all. This removes the necessity for the parser implementation to check whether there is a handler before calling it.

In the process of making this change, we've removed support for the on... properties on stream objects. Their existence was not warranted by any standard API provided by Node. (EventEmitter does not have on... properties for events it supports, nor does Stream.) Their existence was also undocumented. And their functioning was awkward. For instance, with sax, this:

const s = sax.createStream();
const handler = () => console.log("moo");
s.on("cdata", handler);
console.log(s.oncdata === handler);

would print false. If you examine s.oncdata you see it is glue code instead of the handler assigned. This is just bizarre, so we removed it.

SGML declaration is not supported by XML. This is an XML parser. So we remove support for SGML declarations. They now cause errors.
We removed support for the code that checked buffer sizes and would raise errors if a buffer was close to an arbitrary limit or emitted multiple text or cdata events in order to avoid passing strings greater than an arbitrary size. So MAX_BUFFER_LENGTH is gone.

The feature always seemed a bit awkward. Client code could limit the size of buffers to 1024K, for instance, and not get a text event with a text payload greater than 1024K... so far so good but if the same document contained a comment with more than 1024K that would result in an error. Hmm.... why? The distinction seems entirely arbitrary.

The upshot is that client code needs to be ready to handle strings of any length supported by the platform.

If there's a clear need to reintroduce it, we'll reassess.

It is no longer possible to load the library as-is through a script element. It needs building.

The library now assumes a modern runtime. It no longer contains any code to polyfill what's missing. It is up to developers using this code to deal with polyfills as needed.

We drop the trim option. It is up to client code to trim text if it needs it.
We no longer support the normalize option. It is up to client code to perform whatever normalization it wants.
The lowercase option makes no sense for XML. It is removed.
Remove support for strictEntities. Entities are now always strict, as required by the XML specification.
The API no longer takes a strict argument anywhere. This also effectively removes support for HTML processing, and for processing without errors anything which is less than full XML. It also removes special processing of script elements.

Package last updated on 05 Jun 2023
