Security News
Research
Data Theft Repackaged: A Case Study in Malicious Wrapper Packages on npm
The Socket Research Team breaks down a malicious wrapper package that uses obfuscation to harvest credentials and exfiltrate sensitive data.
@rubensworks/saxes
Advanced tools
A sax-style non-validating parser for XML.
This is a fork of Saxes, which was on its turned forked from sax.
This fork was created as the Saxes project appeared to be unmaintained, so I have created this project to resolve major blockers in my own projects. I do not aim to add features to this project. This fork will be closed as soon as Saxes maintenance picks up again.
Designed with node in mind, but should work fine in the browser or other CommonJS implementations.
Saxes does not support Node versions older than 10.
Saxes aims to be much stricter than sax with regards to XML well-formedness. Sax, even in its so-called "strict mode", is not strict. It silently accepts structures that are not well-formed XML. Projects that need better compliance with well-formedness constraints cannot use sax as-is.
Consequently, saxes does not support HTML, or pseudo-XML, or bad XML. Saxes will report well-formedness errors in all these cases but it won't try to extract data from malformed documents like sax does.
Saxes is much much faster than sax, mostly because of a substantial redesign of the internal parsing logic. The speed improvement is not merely due to removing features that were supported by sax. That helped a bit, but saxes adds some expensive checks in its aim for conformance with the XML specification. Redesigning the parsing logic is what accounts for most of the performance improvement.
Saxes does not aim to support antiquated platforms. We will not pollute the source or the default build with support for antiquated platforms. If you want support for IE 11, you are welcome to produce a PR that adds a new build transpiled to ES5.
Saxes handles errors differently from sax: it provides a default onerror
handler which throws. You can replace it with your own handler if you want. If
your handler does nothing, there is no resume
method to call.
There's no Stream
API. A revamped API may be introduced later. (It is still
a "streaming parser" in the general sense that you write a character stream to
it.)
Saxes does not have facilities for limiting the size the data chunks passed to event handlers. See the FAQ entry for more details.
Saxes supports:
This is a non-validating parser so it only verifies whether the document is well-formed. We do aim to raise errors for all malformed constructs encountered. However, this parser does not thorougly parse the contents of DTDs. So most malformedness errors caused by errors in DTDs cannot be reported.
<!DOCTYPE
and <!ENTITY
The parser will handle the basic XML entities in text nodes and attribute
values: & < > ' "
. It's possible to define additional
entities in XML by putting them in the DTD. This parser doesn't do anything with
that. If you want to listen to the doctype
event, and then fetch the
doctypes, and read the entities and add them to parser.ENTITIES
, then be my
guest.
The source code contains JSDOC comments. Use them. What follows is a brief summary of what is available. The final authority is the source code.
PAY CLOSE ATTENTION TO WHAT IS PUBLIC AND WHAT IS PRIVATE.
The move to TypeScript makes it so that everything is now formally private, protected, or public.
If you use anything not public, that's at your own peril.
If there's a mistake in the documentation, raise an issue. If you just assume, you may assume incorrectly.
var saxes = require("./lib/saxes"),
parser = new saxes.SaxesParser();
parser.on("error", function (e) {
// an error happened.
});
parser.on("text", function (t) {
// got some text. t is the string of text.
});
parser.on("opentag", function (node) {
// opened a tag. node has "name" and "attributes"
});
parser.on("end", function () {
// parser stream is done, and ready to have more stuff written to it.
});
parser.write('<xml>Hello, <who name="world">world</who>!</xml>').close();
Settings supported:
xmlns
- Boolean. If true
, then namespaces are supported. Default
is false
.
position
- Boolean. If false
, then don't track line/col/position. Unset is
treated as true
. Default is unset. Currently, setting this to false
only
results in a cosmetic change: the errors reported do not contain position
information. sax-js would literally turn off the position-computing logic if
this flag was set to false. The notion was that it would optimize
execution. In saxes at least it turns out that continually testing this flag
causes a cost that offsets the benefits of turning off this logic.
fileName
- String. Set a file name for error reporting. This is useful only
when tracking positions. You may leave it unset.
fragment
- Boolean. If true
, parse the XML as an XML fragment. Default is
false
.
additionalNamespaces
- A plain object whose key, value pairs define
namespaces known before parsing the XML file. It is not legal to pass
bindings for the namespaces "xml"
or "xmlns"
.
defaultXMLVersion
- The default version of the XML specification to use if
the document contains no XML declaration. If the document does contain an XML
declaration, then this setting is ignored. Must be "1.0"
or "1.1"
. The
default is "1.0"
.
forceXMLVersion
- Boolean. A flag indicating whether to force the XML
version used for parsing to the value of defaultXMLVersion
. When this flag
is true
, defaultXMLVersion
must be specified. If unspecified, the
default value of this flag is false
.
Example: suppose you are parsing a document that has an XML declaration specifying XML version 1.1.
If you set defaultXMLVersion
to "1.0"
without setting
forceXMLVersion
then the XML declaration will override the value of
defaultXMLVersion
and the document will be parsed according to XML 1.1.
If you set defaultXMLVersion
to "1.0"
and set forceXMLVersion
to
true
, then the XML declaration will be ignored and the document will be
parsed according to XML 1.0.
write
- Write bytes onto the stream. You don't have to pass the whole document
in one write
call. You can read your source chunk by chunk and call write
with each chunk.
close
- Close the stream. Once closed, no more data may be written until it is
done processing the buffer, which is signaled by the end
event.
The parser has the following properties:
line
, column
, columnIndex
, position
- Indications of the position in the
XML document where the parser currently is looking. The columnIndex
property
counts columns as if indexing into a JavaScript string, whereas the column
property counts Unicode characters.
closed
- Boolean indicating whether or not the parser can be written to. If
it's true
, then wait for the ready
event to write again.
opt
- Any options passed into the constructor.
xmlDecl
- The XML declaration for this document. It contains the fields
version
, encoding
and standalone
. They are all undefined
before
encountering the XML declaration. If they are undefined after the XML
declaration, the corresponding value was not set by the declaration. There is no
event associated with the XML declaration. In a well-formed document, the XML
declaration may be preceded only by an optional BOM. So by the time any event
generated by the parser happens, the declaration has been processed if present
at all. Otherwise, you have a malformed document, and as stated above, you
cannot rely on the parser data!
The parser continues to parse even upon encountering errors, and does its best
to continue reporting errors. You should heed all errors reported. After an
error, however, saxes may interpret your document incorrectly. For instance
<foo a=bc="d"/>
is invalid XML. Did you mean to have <foo a="bc=d"/>
or
<foo a="b" c="d"/>
or some other variation? For the sake of continuing to
provide errors, saxes will continue parsing the document, but the structure it
reports may be incorrect. It is only after the errors are fixed in the document
that saxes can provide a reliable interpretation of the document.
That leaves you with two rules of thumb when using saxes:
Pay attention to the errors that saxes report. The default onerror
handler
throws, so by default, you cannot miss errors.
ONCE AN ERROR HAS BEEN ENCOUNTERED, STOP RELYING ON THE EVENT HANDLERS OTHER
THAN onerror
. As explained above, when saxes runs into a well-formedness
problem, it makes a guess in order to continue reporting more errors. The guess
may be wrong.
To listen to an event, override on<eventname>
. The list of supported events
are also in the exported EVENTS
array.
See the JSDOC comments in the source code for a description of each supported event.
The XML specification does not define any method by which to parse XML fragments. However, there are usage scenarios in which it is desirable to parse fragments. In order to allow this, saxes provides three initialization options.
If you pass the option fragment: true
to the parser constructor, the parser
will expect an XML fragment. It essentially starts with a parsing state
equivalent to the one it would be in if parser.write("<foo">)
had been called
right after initialization. In other words, it expects content which is
acceptable inside an element. This also turns off well-formedness checks that
are inappropriate when parsing a fragment.
The option additionalNamespaces
allows you to define additional prefix-to-URI
bindings known before parsing starts. You would use this over resolvePrefix
if
you have at the ready a series of namespaces bindings to use.
The option resolvePrefix
allows you to pass a function which saxes will use if
it is unable to resolve a namespace prefix by itself. You would use this over
additionalNamespaces
in a context where getting a complete list of defined
namespaces is onerous.
Note that you can use additionalNamespaces
and resolvePrefix
together if you
want. additionalNamespaces
applies before resolvePrefix
.
The options additionalNamespaces
and resolvePrefix
are really meant to be
used for parsing fragments. However, saxes won't prevent you from using them
with fragment: false
. Note that if you do this, your document may parse
without errors and yet be malformed because the document can refer to namespaces
which are not defined in the document.
Of course, additionalNamespaces
and resolvePrefix
are used only if xmlns
is true
. If you are parsing a fragment that does not use namespaces, there's
no point in setting these options.
saxes works faster on files that use newlines (\u000A
) as end of line
markers than files that use other end of line markers (like \r
or
\r\n
). The XML specification requires that conformant applications behave
as if all characters that are to be treated as end of line characters are
converted to \u000A
prior to parsing. The optimal code path for saxes is a
file in which all end of line characters are already \u000A
.
Don't split Unicode strings you feed to saxes across surrogates. When you
naively split a string in JavaScript, you run the risk of splitting a Unicode
character into two surrogates. e.g. In the following example a
and b
each contain half of a single Unicode character: const a = "\u{1F4A9}"[0]; const b = "\u{1F4A9}"[1]
If you feed such split surrogates to versions of
saxes prior to 4, you'd get errors. Saxes version 4 and over are able to
detect when a chunk of data ends with a surrogate and carry over the surrogate
to the next chunk. However this operation entails slicing and concatenating
strings. If you can feed your data in a way that does not split surrogates,
you should do it. (Obviously, feeding all the data at once with a single write
is fastest.)
Don't set event handlers you don't need. Saxes has always aimed to avoid doing work that will just be tossed away but future improvements hope to do this more aggressively. One way saxes knows whether or not some data is needed is by checking whether a handler has been set for a specific event.
Q. Why has saxes dropped support for limiting the size of data chunks passed to event handlers?
A. With sax you could set MAX_BUFFER_LENGTH
to cause the parser to limit the
size of data chunks passed to event handlers. So if you ran into a span of text
above the limit, multiple text
events with smaller data chunks were fired
instead of a single event with a large chunk.
However, that functionality had some problematic characteristics. It had an
arbitrary default value. It was library-wide so all parsers created from a
single instance of the sax
library shared it. This could potentially cause
conflicts among libraries running in the same VM but using sax for different
purposes.
These issues could have been easily fixed, but there were larger issues. The
buffer limit arbitrarily applied to some events but not others. It would split
text
, cdata
and script
events. However, if a comment
,
doctype
, attribute
or processing instruction
were more than the
limit, the parser would generate an error and you were left picking up the
pieces.
It was not intuitive to use. You'd think setting the limit to 1K would prevent
chunks bigger than 1K to be passed to event handlers. But that was not the
case. A comment in the source code told you that you might go over the limit if
you passed large chunks to write
. So if you want a 1K limit, don't pass 64K
chunks to write
. Fair enough. You know what limit you want so you can
control the size of the data you pass to write
. So you limit the chunks to
write
to 1K at a time. Even if you do this, your event handlers may get data
chunks that are 2K in size. Suppose on the previous write
the parser has
just finished processing an open tag, so it is ready for text. Your write
passes 1K of text. You are not above the limit yet, so no event is generated
yet. The next write
passes another 1K of text. It so happens that sax checks
buffer limits only once per write
, after the chunk of data has been
processed. Now you've hit the limit and you get a text
event with 2K of
data. So even if you limit your write
calls to the buffer limit you've set,
you may still get events with chunks at twice the buffer size limit you've
specified.
We may consider reinstating an equivalent functionality, provided that it addresses the issues above and does not cause a huge performance drop for use-case scenarios that don't need it.
6.0.1 (2023-06-05)
isNameChar
for later chars in PI target (83d2b61)parser
function, rename SAXParser (0878a6c)resolvePrefix
option (90301fb)name
field (c7dffd5)on
and off
to set handlers. Upcoming features require
that saxes know when handlers are added and removed, and it may be necessary in
the future to qualify how to add or remove a handler. Getters/setters are too
restrictives so we bite the bullet now and move to actual methods.column
field. If you need the old behavior of column
you can use the new
columnIndex
field which behaves like the old column
and may be useful in
some contexts. Ultimately you should decide whether your application needs to
know column numbers by Unicode character count or by JavaScript index. (And you
need to know the difference between the two. You can see this
page for a detailed
discussion of the Unicode problem in JavaScript. Note that the numbers put in
the error messages that fail
produce are still based on the column
field
and thus use the new meaning of column
. If you want error message that use
columnIndex
you may override the fail
method.fileName
is undefined in the parser options saxes does
not show a file name in error messages. Previously it was showing the name
undefined
. To get the previous behavior, in all cases where you'd leave
fileName
undefined, you must set it to the string "undefined"
instead.xmlns
(as in <foo xmlns="some-uri">
would
be reported as having the prefix "xmlns"
and the local name ""
. This
behavior was inherited from sax. There was some logic to it, but this behavior
was surprising to users of the library. The principle of least surprise favors
eliminating that surprising behavior in favor of something less surprising.This commit makes it so that xmlns
is not reported as having a prefix of ""
and a local name of "xmlns"
. This accords with how people interpret attribute
names like foo
, bar
, moo
which all have no prefix and a local name.
Code that deals with namespace bindings or cares about xmlns
probably needs to
be changed.
Sax was only passing the tag name. We pass the whole object.
ns
field is no longer using the prototype trick that sax used. The
ns
field of a tag contains only those namespaces that the tag declares.We no longer have opennamespace
and closenamespace
events. The
information they provide can be obtained by examining the tags passed to tag
events.
attribute
is not a particularly useful event for parsing XML. The only thing
it adds over looking at attributes on tag objects is that you get the order of
the attributes from the source, but attribute order in XML is irrelevant.
The opencdata and closecdata events became redundant once we removed the buffer size limitations. So we remove these events.
The parser
function is removed. Just create a new instance with
new
.
SAXParser
is now SaxesParser.
So new require("saxes").SaxesParser(...)
.
write()
after an error you had to call resume()
. We do away with
resume()
and instead install a default onerror
which throws. Replace
with a no-op handler if you want to continue after errors.In the process of making this change, we've removed support for the
on...
properties on streams objects. Their existence was not
warranted by any standard API provided by Node. (EventEmitter
does
not have on...
properties for events it supports, nor does
Stream
.) Their existence was also undocumented. And their
functioning was awkward. For instance, with sax, this:
const s = sax.createStream();
const handler = () => console.log("moo");
s.on("cdata", handler);
console.log(s.oncdata === handler);
would print false
. If you examine s.oncdata
you see it is glue
code instead of the handler assigned. This is just bizarre, so we
removed it.
text
or cdata
events in order avoid passing strings
greater than an arbitrary size. So MAX_BUFFER_LENGTH
is gone.The feature always seemed a bit awkward. Client code could limit the
size of buffers to 1024K, for instance, and not get a text
event
with a text payload greater than 1024K... so far so good but if the
same document contained a comment with more than 1024K that would
result in an error. Hmm.... why? The distinction seems entirely
arbitrary.
The upshot is that client code needs to be ready to handle strings of any length supported by the platform.
If there's a clear need to reintroduce it, we'll reassess.
script
element. It needs building.The library now assumes a modern runtime. It no longer contains any code to polyfill what's missing. It is up to developers using this code to deal with polyfills as needed.
trim
option. It is up to client code to trip text if
it needs it.normalize
option. It is up to client code
to perform whatever normalization it wants.lowercase
option makes no sense for XML. It is removed.strict
argument anywhere. This also
effectively removes support for HTML processing, or allow processing
without errors anything which is less than full XML. It also removes
special processing of script
elements.FAQs
An evented streaming XML parser in JavaScript
The npm package @rubensworks/saxes receives a total of 12,769 weekly downloads. As such, @rubensworks/saxes popularity was classified as popular.
We found that @rubensworks/saxes demonstrated a not healthy version release cadence and project activity because the last version was released a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Security News
Research
The Socket Research Team breaks down a malicious wrapper package that uses obfuscation to harvest credentials and exfiltrate sensitive data.
Research
Security News
Attackers used a malicious npm package typosquatting a popular ESLint plugin to steal sensitive data, execute commands, and exploit developer systems.
Security News
The Ultralytics' PyPI Package was compromised four times in one weekend through GitHub Actions cache poisoning and failure to rotate previously compromised API tokens.