saxes
Comparing version 3.1.11 to 4.0.0-rc.1
@@ -0,1 +1,48 @@ | ||
<a name="4.0.0-rc.1"></a> | ||
# [4.0.0-rc.1](https://github.com/lddubeau/saxes/compare/v3.1.11...v4.0.0-rc.1) (2019-10-02) | ||
### Bug Fixes | ||
* don't serialize the fileName as undefined: when not present ([4ff2365](https://github.com/lddubeau/saxes/commit/4ff2365)) | ||
* fix bug with initial eol characters ([7b3db75](https://github.com/lddubeau/saxes/commit/7b3db75)) | ||
* handling of end of line characters ([f13247a](https://github.com/lddubeau/saxes/commit/f13247a)) | ||
### Features | ||
* add forceXMLVersion ([1eedbf8](https://github.com/lddubeau/saxes/commit/1eedbf8)) | ||
* saxes handles chunks that "break" unicode ([1272448](https://github.com/lddubeau/saxes/commit/1272448)) | ||
* support for XML 1.1 ([36704fb](https://github.com/lddubeau/saxes/commit/36704fb)) | ||
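
The "chunks that break unicode" feature can be pictured with a small standalone sketch. It assumes, based on the `write()` hunk in this diff, that a trailing CR or high surrogate is held back and prepended to the next chunk. The names `splitCarry` and `feed` are hypothetical and are not part of the saxes API.

```javascript
const CR = 0xD;

// Hypothetical helper: split off a trailing code unit that cannot be
// interpreted until the next chunk (or the end of the stream) is known.
function splitCarry(chunk, end) {
  if (chunk.length === 0) {
    return { ready: chunk, carried: undefined };
  }
  const lastCode = chunk.charCodeAt(chunk.length - 1);
  // A trailing CR may be the first half of CR NL; a trailing high surrogate
  // is the first half of a surrogate pair.
  const mustCarry = !end &&
    (lastCode === CR || (lastCode >= 0xD800 && lastCode <= 0xDBFF));
  if (!mustCarry) {
    return { ready: chunk, carried: undefined };
  }
  return { ready: chunk.slice(0, -1), carried: chunk[chunk.length - 1] };
}

// Hypothetical driver: feed chunks in order, prepending whatever was carried
// over from the previous chunk. The last chunk is treated as end of stream.
function feed(chunks) {
  const out = [];
  let carried;
  chunks.forEach((chunk, index) => {
    const end = index === chunks.length - 1;
    const full = carried === undefined ? chunk : `${carried}${chunk}`;
    const result = splitCarry(full, end);
    carried = result.carried;
    out.push(result.ready);
  });
  return out;
}
```

With this sketch, `feed(["a\uD83D", "\uDE00b"])` yields `["a", "\u{1F600}b"]`: the high surrogate is never processed on its own, so the pair reaches the scanner intact.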
### Performance Improvements | ||
* don't depend on limit to know when we hit the end of buffer ([ad4ab53](https://github.com/lddubeau/saxes/commit/ad4ab53)) | ||
* don't increment a column number ([490fc24](https://github.com/lddubeau/saxes/commit/490fc24)) | ||
* don't repeatedly read this.i in the getCode methods ([d3f196c](https://github.com/lddubeau/saxes/commit/d3f196c)) | ||
* improve performance of text handling ([9c13099](https://github.com/lddubeau/saxes/commit/9c13099)) | ||
* make the most common path of getCode functions the shortest ([4d66bbb](https://github.com/lddubeau/saxes/commit/4d66bbb)) | ||
* minimize concatenation by adding the capability to unget codes ([27fa8b9](https://github.com/lddubeau/saxes/commit/27fa8b9)) | |
* use isCharAndNotRestricted rather than call two functions ([f0b67a4](https://github.com/lddubeau/saxes/commit/f0b67a4)) | ||
* use slice rather than substring ([c1fed89](https://github.com/lddubeau/saxes/commit/c1fed89)) | ||
### BREAKING CHANGES | ||
* previous versions of saxes did not consistently convert end of | ||
line characters to NL (0xA) in the data reported by event handlers. This has | ||
been fixed. If your code relied on the old (incorrect) behavior then you'll have | ||
to update it. | ||
* previous versions of saxes would parse files with an XML | ||
declaration set to 1.1 as 1.0 documents. The support for 1.1 entails that if a | ||
document has an XML declaration that specifies version 1.1 it is parsed as a 1.1 | ||
document. | ||
* when ``fileName`` is undefined in the parser options saxes does | ||
not show a file name in error messages. Previously it was showing the name | ||
``undefined``. To get the previous behavior, in all cases where you'd leave | ||
``fileName`` undefined, you must set it to the string ``"undefined"`` instead. | ||
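
The `fileName` change can be illustrated with a standalone function that mirrors the message-building logic of `fail()` shown in the `lib/saxes.js` hunk of this diff. This is only a sketch: `formatErrorMessage` and its argument shape are hypothetical, and the real logic lives inside the parser.

```javascript
// Hypothetical standalone version of the message assembly done by fail().
// `state` stands in for the parser fields that fail() reads.
function formatErrorMessage(state, er) {
  // An undefined fileName now contributes nothing to the message.
  let message = state.fileName || "";
  if (state.trackPosition) {
    if (message.length > 0) {
      message += ":";
    }
    message += `${state.line}:${state.column}`;
  }
  if (message.length > 0) {
    message += ": ";
  }
  message += er;
  return message;
}
```

Under these assumptions, a state of `{ fileName: undefined, trackPosition: true, line: 1, column: 5 }` produces `1:5: disallowed character.`, while passing the string `"undefined"` as `fileName` reproduces the older `undefined:1:5: disallowed character.` form.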
<a name="3.1.11"></a> | ||
@@ -2,0 +49,0 @@ ## [3.1.11](https://github.com/lddubeau/saxes/compare/v3.1.10...v3.1.11) (2019-06-25) |
declare namespace saxes { | ||
export const EVENTS: ReadonlyArray<string>; | ||
export interface SaxesOptions { | ||
export interface CommonSaxesOptions { | ||
xmlns?: boolean; | ||
@@ -10,4 +10,16 @@ position?: boolean; | ||
additionalNamespaces?: Record<string, string>; | ||
defaultXMLVersion?: "1.0" | "1.1"; | ||
} | ||
export interface NotForced extends CommonSaxesOptions { | ||
forceXMLVersion?: false; | ||
} | ||
export interface Forced extends CommonSaxesOptions { | ||
defaultXMLVersion: CommonSaxesOptions["defaultXMLVersion"]; | ||
forceXMLVersion: true; | ||
} | ||
export type SaxesOptions = NotForced | Forced; | ||
export interface XMLDecl { | ||
@@ -14,0 +26,0 @@ version?: string; |
590
lib/saxes.js
"use strict"; | ||
const { isS, isChar, isNameStartChar, isNameChar, S_LIST, NAME_RE } = | ||
require("xmlchars/xml/1.0/ed5"); | ||
const { isNCNameStartChar, isNCNameChar, NC_NAME_RE } = require("xmlchars/xmlns/1.0/ed3"); | ||
const { | ||
isS, isChar: isChar10, isNameStartChar, isNameChar, S_LIST, NAME_RE, | ||
} = require("xmlchars/xml/1.0/ed5"); | ||
const { isChar: isChar11 } = require("xmlchars/xml/1.1/ed2"); | ||
const { isNCNameStartChar, isNCNameChar, NC_NAME_RE } = | ||
require("xmlchars/xmlns/1.0/ed3"); | ||
@@ -88,2 +91,3 @@ const XML_NAMESPACE = "http://www.w3.org/XML/1998/namespace"; | ||
const TAB = 9; | ||
const NL = 0xA; | ||
@@ -105,2 +109,4 @@ const CR = 0xD; | ||
const CLOSE_BRACKET = 0x5D; | ||
const NEL = 0x85; | ||
const LS = 0x2028; // Line Separator | ||
@@ -261,6 +267,14 @@ function isQuote(c) { | ||
* | ||
* @property {string} [fileName] A file name to use for error reporting. Leaving | ||
* this unset will report a file name of "undefined". "File name" is a loose | ||
* concept. You could use a URL to some resource, or any descriptive name you | ||
* like. | ||
* @property {string} [fileName] A file name to use for error reporting. "File | ||
* name" is a loose concept. You could use a URL to some resource, or any | ||
* descriptive name you like. | ||
* | ||
* @property {"1.0" | "1.1"} [defaultXMLVersion] The default XML version to | ||
* use. If unspecified, and there is no XML encoding declaration, the default | ||
* version is "1.0". | ||
* | ||
* @property {boolean} [forceXMLVersion] A flag indicating whether to force the | ||
* XML version used for parsing to the value of ``defaultXMLVersion``. When this | ||
* flag is ``true``, ``defaultXMLVersion`` must be specified. If unspecified, | ||
* the default value of this flag is ``false``. | ||
*/ | ||
@@ -326,3 +340,14 @@ | ||
this.i = 0; | ||
this.trailingCR = false; | ||
// | ||
// We use prevI to allow "ungetting" the previously read code point. Note | ||
// however, that it is not safe to unget everything and anything. In | ||
// particular ungetting EOL characters will screw positioning up. | ||
// | ||
// Practically, you must not unget a code which has any side effect beyond | ||
// updating ``this.i`` and ``this.prevI``. Only EOL codes have such side | ||
// effects. | ||
// | ||
this.prevI = 0; | ||
this.carriedFromPrevious = undefined; | ||
this.originalNL = true; | ||
this.forbiddenState = FORBIDDEN_START; | ||
@@ -368,5 +393,4 @@ /** | ||
this.processAttribs = this.processAttribsNS; | ||
this.pushAttrib = this.pushAttribNS; | ||
this.ns = Object.assign({ __proto__: null }, rootNS); | ||
this.ns = { __proto__: null, ...rootNS }; | ||
const additional = this.opt.additionalNamespaces; | ||
@@ -383,5 +407,14 @@ if (additional) { | ||
this.processAttribs = this.processAttribsPlain; | ||
this.pushAttrib = this.pushAttribPlain; | ||
} | ||
let { defaultXMLVersion } = this.opt; | ||
const { forceXMLVersion } = this.opt; | ||
if (defaultXMLVersion === undefined) { | ||
if (forceXMLVersion) { | ||
throw new Error("forceXMLVersion set but defaultXMLVersion is not set"); | ||
} | ||
defaultXMLVersion = "1.0"; | ||
} | ||
this.setXMLVersion(defaultXMLVersion); | ||
this.trackPosition = this.opt.position !== false; | ||
@@ -392,3 +425,3 @@ /** The line number the parser is currently looking at. */ | ||
/** The column the parser is currently looking at. */ | ||
this.column = 0; | ||
this.positionAtNewLine = 0; | ||
@@ -404,2 +437,6 @@ this.fileName = this.opt.fileName; | ||
get column() { | ||
return this.position - this.positionAtNewLine; | ||
} | ||
/* eslint-disable class-methods-use-this */ | ||
@@ -499,3 +536,3 @@ /** | ||
* | ||
* @param {Error} er The error to report. | ||
* @param {string} er The error to report. | ||
* | ||
@@ -505,5 +542,13 @@ * @returns this | ||
fail(er) { | ||
const message = (this.trackPosition) ? | ||
`${this.fileName}:${this.line}:${this.column}: ${er}` : er; | ||
let message = this.fileName || ""; | ||
if (this.trackPosition) { | ||
if (message.length > 0) { | ||
message += ":"; | ||
} | ||
message += `${this.line}:${this.column}`; | ||
} | ||
if (message.length > 0) { | ||
message += ": "; | ||
} | ||
message += er; | ||
this.onerror(new Error(message)); | ||
@@ -537,21 +582,25 @@ return this; | ||
// of single complete characters (``Array.from(chunk)``) would be faster | ||
// than the current repeated calls to ``codePointAt``. As of August 2018, it | ||
// than the current repeated calls to ``charCodeAt``. As of August 2018, it | ||
// isn't. (There may be Node-specific code that would perform faster than | ||
// ``Array.from`` but don't want to be dependent on Node.) | ||
let limit = chunk.length; | ||
if (this.trailingCR) { | ||
// The previous chunk had a trailing cr. We need to handle it now. | ||
chunk = `\r${chunk}`; | ||
if (this.carriedFromPrevious !== undefined) { | ||
// The previous chunk had a character we must carry over. | |
chunk = `${this.carriedFromPrevious}${chunk}`; | ||
this.carriedFromPrevious = undefined; | ||
} | ||
if (!end && chunk[limit - 1] === CR) { | ||
// The chunk ends with a trailing CR. We cannot know how to handle it | ||
// until we get the next chunk or the end of the stream. So save it for | ||
// later. | ||
let limit = chunk.length; | ||
const lastCode = chunk.charCodeAt(limit - 1); | ||
if (!end && | ||
// A trailing CR or surrogate must be carried over to the next | ||
// chunk. | ||
(lastCode === CR || (lastCode >= 0xD800 && lastCode <= 0xDBFF))) { | ||
// The chunk ends with a character that must be carried over. We cannot | ||
// know how to handle it until we get the next chunk or the end of the | ||
// stream. So save it for later. | ||
this.carriedFromPrevious = chunk[limit - 1]; | ||
limit--; | ||
this.trailingCR = true; | ||
chunk = chunk.slice(0, limit); | ||
} | ||
this.limit = limit; | ||
@@ -578,2 +627,9 @@ this.chunk = chunk; | ||
/** @private */ | ||
newline(originalNL) { | ||
this.originalNL = originalNL; | ||
this.line++; | ||
this.positionAtNewLine = this.position; | ||
} | ||
/** | ||
@@ -583,2 +639,4 @@ * Get a single code point out of the current chunk. This updates the current | ||
* | ||
* This is the algorithm to use for XML 1.0. | ||
* | ||
* @private | ||
@@ -588,45 +646,150 @@ * | ||
*/ | ||
getCode() { | ||
getCode10() { | ||
const { chunk, i } = this; | ||
this.prevI = i; | ||
// Using charCodeAt and handling the surrogates ourselves is faster | ||
// than using codePointAt. | ||
let code = chunk.charCodeAt(i); | ||
const code = chunk.charCodeAt(i); | ||
let skip = 1; | ||
switch (code) { | ||
case CR: | ||
// We may get NaN if we read past the end of the chunk, which is | ||
// fine. | ||
if (chunk.charCodeAt(i + 1) === NL) { | ||
// A \r\n sequence is converted to \n so we have to skip over the next | ||
// character. We already know it has a size of 1 so ++ is fine here. | ||
skip++; | ||
// Yes, we do this instead of doing this.i++. Doing it this way, we do not | ||
// read this.i again, which is a bit faster. | ||
this.i = i + 1; | ||
if (code < 0xD800) { | ||
if (code >= SPACE || code === TAB) { | ||
return code; | ||
} | ||
// Otherwise, a \r is just converted to \n, so we don't have to skip | ||
// ahead. | ||
// In either case, \r becomes \n. | ||
code = NL; | ||
/* yes, fall through */ | ||
case NL: | ||
this.line++; | ||
this.column = 0; | ||
break; | ||
default: | ||
this.column++; | ||
if (code >= 0xD800 && code <= 0xDBFF) { | ||
code = 0x10000 + ((code - 0xD800) * 0x400) + | ||
switch (code) { | ||
case NL: | ||
this.newline(true); | ||
return NL; | ||
case CR: | ||
// We may get NaN if we read past the end of the chunk, which is fine. | ||
if (chunk.charCodeAt(i + 1) === NL) { | ||
// A \r\n sequence is converted to \n so we have to skip over the next | ||
// character. We already know it has a size of 1 so ++ is fine here. | ||
this.i = i + 2; | ||
} | ||
// Otherwise, a \r is just converted to \n, so we don't have to skip | ||
// ahead. | ||
// In either case, \r becomes \n. | ||
this.newline(false); | ||
return NL; | ||
default: | ||
// If we get here, then code < SPACE and it is not NL CR or TAB. | ||
this.fail("disallowed character."); | ||
return code; | ||
} | ||
} | ||
if (code > 0xDBFF) { | ||
// This is a specialized version of isChar10 that takes into account | ||
// that in this context code > 0xDBFF and code <= 0xFFFF. So it does not | ||
// test cases that don't need testing. | ||
if (!(code >= 0xE000 && code <= 0xFFFD)) { | ||
this.fail("disallowed character."); | ||
} | ||
return code; | ||
} | ||
// eslint-disable-next-line no-restricted-globals | ||
if (isNaN(code)) { | ||
return undefined; | ||
} | ||
const final = 0x10000 + ((code - 0xD800) * 0x400) + | ||
(chunk.charCodeAt(i + 1) - 0xDC00); | ||
this.column++; | ||
skip++; | ||
this.i = i + 2; | ||
// This is a specialized version of isChar10 that takes into account that in | ||
// this context necessarily final >= 0x10000. | ||
if (final > 0x10FFFF) { | ||
this.fail("disallowed character."); | ||
} | ||
return final; | ||
} | ||
/** | ||
* Get a single code point out of the current chunk. This updates the current | ||
* position if we do position tracking. | ||
* | ||
* This is the algorithm to use for XML 1.1. | ||
* | ||
* @private | ||
* | ||
* @returns {number} The character read. | ||
*/ | ||
getCode11() { | ||
const { chunk, i } = this; | ||
this.prevI = i; | ||
// Using charCodeAt and handling the surrogates ourselves is faster | ||
// than using codePointAt. | ||
const code = chunk.charCodeAt(i); | ||
// Yes, we do this instead of doing this.i++. Doing it this way, we do not | ||
// read this.i again, which is a bit faster. | ||
this.i = i + 1; | ||
if (code < 0xD800) { | ||
if ((code > 0x1F && code < 0x7F) || (code > 0x9F && code !== LS) || | ||
code === TAB) { | ||
return code; | ||
} | ||
if (!isChar(code)) { | ||
switch (code) { | ||
case NL: // 0xA | ||
this.newline(true); | ||
return NL; | ||
case CR: { // 0xD | ||
// We may get NaN if we read past the end of the chunk, which is | ||
// fine. | ||
const next = chunk.charCodeAt(i + 1); | ||
if (next === NL || next === NEL) { | ||
// A CR NL or CR NEL sequence is converted to NL so we have to skip over | ||
// the next character. We already know it has a size of 1. | ||
this.i = i + 2; | ||
} | ||
// Otherwise, a CR is just converted to NL, no skip. | ||
} | ||
/* yes, fall through */ | ||
case NEL: // 0x85 | ||
case LS: // 0x2028 | |
this.newline(false); | ||
return NL; | ||
default: | ||
this.fail("disallowed character."); | ||
return code; | ||
} | ||
} | ||
this.i += skip; | ||
if (code > 0xDBFF) { | ||
// This is a specialized version of isCharAndNotRestricted that takes into | ||
// account that in this context code > 0xDBFF and code <= 0xFFFF. So it | ||
// does not test cases that don't need testing. | ||
if (!(code >= 0xE000 && code <= 0xFFFD)) { | ||
this.fail("disallowed character."); | ||
} | ||
return code; | ||
return code; | ||
} | ||
// eslint-disable-next-line no-restricted-globals | ||
if (isNaN(code)) { | ||
return undefined; | ||
} | ||
const final = 0x10000 + ((code - 0xD800) * 0x400) + | ||
(chunk.charCodeAt(i + 1) - 0xDC00); | ||
this.i = i + 2; | ||
// This is a specialized version of isCharAndNotRestricted that takes into | ||
// account that in this context necessarily final >= 0x10000. | ||
if (final > 0x10FFFF) { | ||
this.fail("disallowed character."); | ||
} | ||
return final; | ||
} | ||
@@ -646,2 +809,14 @@ | ||
/** | ||
* @private | ||
*/ | ||
handleEOL(buffer, chunk, start) { | ||
if (this.originalNL) { | ||
return start; | ||
} | ||
this[buffer] += `${chunk.slice(start, this.prevI)}\n`; | ||
return this.i; | ||
} | ||
/** | ||
* Capture characters into a buffer until encountering one of a set of | ||
@@ -661,16 +836,19 @@ * characters. | ||
captureTo(chars, buffer) { | ||
const { chunk, limit, i: start } = this; | ||
while (this.i < limit) { | ||
let { i: start } = this; | ||
const { chunk } = this; | ||
while (true) { | ||
const c = this.getCode(); | ||
if (c === NL) { | ||
start = this.handleEOL(buffer, chunk, start); | ||
} | ||
else if (c === undefined) { | ||
this[buffer] += chunk.slice(start); | ||
return undefined; | ||
} | ||
if (chars.includes(c)) { | ||
// This is faster than adding codepoints one by one. | ||
this[buffer] += chunk.substring(start, | ||
this.i - (c <= 0xFFFF ? 1 : 2)); | ||
this[buffer] += chunk.slice(start, this.prevI); | ||
return c; | ||
} | ||
} | ||
// This is faster than adding codepoints one by one. | ||
this[buffer] += chunk.substring(start); | ||
return undefined; | ||
} | ||
@@ -691,16 +869,19 @@ | ||
captureToChar(char, buffer) { | ||
const { chunk, limit, i: start } = this; | ||
while (this.i < limit) { | ||
let { i: start } = this; | ||
const { chunk } = this; | ||
while (true) { | ||
const c = this.getCode(); | ||
if (c === NL) { | ||
start = this.handleEOL(buffer, chunk, start); | ||
} | ||
else if (c === undefined) { | ||
this[buffer] += chunk.slice(start); | ||
return false; | ||
} | ||
if (c === char) { | ||
// This is faster than adding codepoints one by one. | ||
this[buffer] += chunk.substring(start, | ||
this.i - (c <= 0xFFFF ? 1 : 2)); | ||
this[buffer] += chunk.slice(start, this.prevI); | ||
return true; | ||
} | ||
} | ||
// This is faster than adding codepoints one by one. | ||
this[buffer] += chunk.substring(start); | ||
return false; | ||
} | ||
@@ -718,16 +899,16 @@ | ||
captureNameChars() { | ||
const { chunk, limit, i: start } = this; | ||
while (this.i < limit) { | ||
const { chunk, i: start } = this; | ||
while (true) { | ||
const c = this.getCode(); | ||
if (c === undefined) { | ||
this.name += chunk.slice(start); | ||
return undefined; | ||
} | ||
// NL is not a name char so we don't have to test specifically for it. | ||
if (!isNameChar(c)) { | ||
// This is faster than adding codepoints one by one. | ||
this.name += chunk.substring(start, | ||
this.i - (c <= 0xFFFF ? 1 : 2)); | ||
this.name += chunk.slice(start, this.prevI); | ||
return c; | ||
} | ||
} | ||
// This is faster than adding codepoints one by one. | ||
this.name += chunk.substring(start); | ||
return undefined; | ||
} | ||
@@ -747,16 +928,17 @@ | ||
captureWhileNameCheck(buffer) { | ||
const { chunk, limit, i: start } = this; | ||
while (this.i < limit) { | ||
const { chunk, i: start } = this; | ||
while (true) { | ||
const c = this.getCode(); | ||
if (c === undefined) { | ||
this[buffer] += chunk.slice(start); | ||
return undefined; | ||
} | ||
// NL cannot satisfy this.nameCheck so we don't have to test | ||
// specifically for it. | ||
if (!this.nameCheck(c)) { | ||
// This is faster than adding codepoints one by one. | ||
this[buffer] += chunk.substring(start, | ||
this.i - (c <= 0xFFFF ? 1 : 2)); | ||
this[buffer] += chunk.slice(start, this.prevI); | ||
return c; | ||
} | ||
} | ||
// This is faster than adding codepoints one by one. | ||
this[buffer] += chunk.substring(start); | ||
return undefined; | ||
} | ||
@@ -773,11 +955,24 @@ | ||
skipSpaces() { | ||
const { limit } = this; | ||
while (this.i < limit) { | ||
while (true) { | ||
const c = this.getCode(); | ||
if (!isS(c)) { | ||
if (c === undefined || !isS(c)) { | ||
return c; | ||
} | ||
} | ||
} | ||
return undefined; | ||
/** @private */ | ||
setXMLVersion(version) { | ||
if (version === "1.0") { | ||
this.isChar = isChar10; | ||
this.getCode = this.getCode10; | ||
this.pushAttrib = | ||
this.xmlnsOpt ? this.pushAttribNS10 : this.pushAttribPlain; | ||
} | ||
else { | ||
this.isChar = isChar11; | ||
this.getCode = this.getCode11; | ||
this.pushAttrib = | ||
this.xmlnsOpt ? this.pushAttribNS11 : this.pushAttribPlain; | ||
} | ||
} | ||
@@ -797,10 +992,3 @@ | ||
this.i++; | ||
this.column++; | ||
} | ||
else if (isS(c)) { | ||
this.i++; | ||
this.column++; | ||
// An XML declaration cannot appear after initial spaces. | ||
this.xmlDeclPossible = false; | ||
} | ||
@@ -812,7 +1000,26 @@ this.state = S_BEGIN_WHITESPACE; | ||
sBeginWhitespace() { | ||
const c = this.skipSpaces(); | ||
// This initial loop is a specialized version of skipSpaces. We need to know | ||
// whether we've encountered spaces or not because as soon as we run into a | ||
// space, an XML declaration is no longer possible. Rather than slow down | ||
// skipSpaces even in places where we don't care whether it skipped anything | ||
// or not, we use a specialized loop here. | ||
let c; | ||
let sawSpace = false; | ||
while (true) { | ||
c = this.getCode(); | ||
if (c === undefined || !isS(c)) { | ||
break; | ||
} | ||
sawSpace = true; | ||
} | ||
if (sawSpace) { | ||
this.xmlDeclPossible = false; | ||
} | ||
if (c === LESS) { | ||
this.state = S_OPEN_WAKA; | ||
} | ||
else if (c) { | ||
else if (c !== undefined) { | ||
// have to process this as a text node. | ||
@@ -824,3 +1031,3 @@ // weird, but happens. | ||
} | ||
this.text = String.fromCodePoint(c); | ||
this.i = this.prevI; | ||
this.state = S_TEXT; | ||
@@ -864,13 +1071,11 @@ this.xmlDeclPossible = false; | ||
// | ||
const { chunk, limit, i: start } = this; | ||
let { forbiddenState } = this; | ||
let c; | ||
let { i: start, forbiddenState } = this; | ||
const { chunk } = this; | ||
// eslint-disable-next-line no-labels, no-restricted-syntax | ||
scanLoop: | ||
while (this.i < limit) { | ||
const code = this.getCode(); | ||
switch (code) { | ||
while (true) { | ||
switch (this.getCode()) { | ||
case LESS: | ||
this.state = S_OPEN_WAKA; | ||
c = code; | ||
this.text += chunk.slice(start, this.prevI); | ||
forbiddenState = FORBIDDEN_START; | ||
@@ -882,3 +1087,3 @@ // eslint-disable-next-line no-labels | ||
this.entityReturnState = S_TEXT; | ||
c = code; | ||
this.text += chunk.slice(start, this.prevI); | ||
forbiddenState = FORBIDDEN_START; | ||
@@ -907,2 +1112,10 @@ // eslint-disable-next-line no-labels | ||
break; | ||
case NL: | ||
start = this.handleEOL("text", chunk, start); | ||
forbiddenState = FORBIDDEN_START; | ||
break; | ||
case undefined: | ||
this.text += chunk.slice(start); | ||
// eslint-disable-next-line no-labels | ||
break scanLoop; | ||
default: | ||
@@ -913,7 +1126,2 @@ forbiddenState = FORBIDDEN_START; | ||
this.forbiddenState = forbiddenState; | ||
// This is faster than adding codepoints one by one. | ||
this.text += chunk.substring(start, | ||
c === undefined ? undefined : | ||
(this.i - (c <= 0xFFFF ? 1 : 2))); | ||
} | ||
@@ -924,16 +1132,11 @@ | ||
// This is essentially a specialized version of captureTo which is optimized | ||
// for performing the ]]> check. A previous version of this code, checked | ||
// ``this.text`` for the presence of ]]>. It simplified the code but was | ||
// very costly when character data contained a lot of entities to be parsed. | ||
// | ||
// Since we are using a specialized loop, we also keep track of the presence | ||
// of non-space characters in the text since these are errors when appearing | ||
// outside the document root element. | ||
// | ||
const { chunk, limit, i: start } = this; | ||
// for a specialized task. We keep track of the presence of non-space | ||
// characters in the text since these are errors when appearing outside the | ||
// document root element. | ||
let { i: start } = this; | ||
const { chunk } = this; | ||
let nonSpace = false; | ||
let c; | ||
// eslint-disable-next-line no-labels, no-restricted-syntax | ||
outRootLoop: | ||
while (this.i < limit) { | ||
while (true) { | ||
const code = this.getCode(); | ||
@@ -943,3 +1146,3 @@ switch (code) { | ||
this.state = S_OPEN_WAKA; | ||
c = code; | ||
this.text += chunk.slice(start, this.prevI); | ||
// eslint-disable-next-line no-labels | ||
@@ -950,6 +1153,14 @@ break outRootLoop; | ||
this.entityReturnState = S_TEXT; | ||
c = code; | ||
this.text += chunk.slice(start, this.prevI); | ||
nonSpace = true; | ||
// eslint-disable-next-line no-labels | ||
break outRootLoop; | ||
case NL: | ||
start = this.handleEOL("text", chunk, start); | ||
// eslint-disable-next-line no-labels | ||
break; | ||
case undefined: | ||
this.text += chunk.slice(start); | ||
// eslint-disable-next-line no-labels | ||
break outRootLoop; | ||
default: | ||
@@ -962,7 +1173,2 @@ if (!isS(code)) { | ||
// This is faster than adding codepoints one by one. | ||
this.text += chunk.substring(start, | ||
c === undefined ? undefined : | ||
(this.i - (c <= 0xFFFF ? 1 : 2))); | ||
if (!nonSpace) { | ||
@@ -988,2 +1194,6 @@ return; | ||
sOpenWaka() { | ||
// Reminder: a state handler is called with at least one character | ||
// available in the current chunk. So the first call to get code inside of | ||
// a state handler cannot return ``undefined``. That's why we don't test | ||
// for it. | ||
const c = this.getCode(); | ||
@@ -993,3 +1203,3 @@ // either a /, ?, !, or text is coming next. | ||
this.state = S_OPEN_TAG; | ||
this.name = String.fromCodePoint(c); | ||
this.i = this.prevI; | ||
this.xmlDeclPossible = false; | ||
@@ -1012,3 +1222,3 @@ } | ||
default: | ||
this.fail("disallowed character in tag name."); | ||
this.fail("disallowed character in tag name"); | ||
this.state = S_TEXT; | ||
@@ -1068,3 +1278,3 @@ this.xmlDeclPossible = false; | ||
} | ||
else if (c) { | ||
else if (c !== undefined) { | ||
this.doctype += String.fromCodePoint(c); | ||
@@ -1094,3 +1304,3 @@ if (c === OPEN_BRACKET) { | ||
const c = this.captureTo(DTD_TERMINATOR, "doctype"); | ||
if (!c) { | ||
if (c === undefined) { | ||
return; | ||
@@ -1304,3 +1514,3 @@ } | ||
} | ||
else if (c) { | ||
else if (c !== undefined) { | ||
this.fail("disallowed character in processing instruction name."); | ||
@@ -1411,11 +1621,18 @@ this.piTarget += String.fromCodePoint(c); | ||
if (c) { | ||
if (c !== undefined) { | ||
switch (this.xmlDeclName) { | ||
case "version": | ||
if (!/^1\.[0-9]+$/.test(this.xmlDeclValue)) { | ||
case "version": { | ||
this.xmlDeclExpects = ["encoding", "standalone"]; | ||
const version = this.xmlDeclValue; | ||
this.xmlDecl.version = version; | ||
// This is the test specified by XML 1.0 but it is fine for XML 1.1. | ||
if (!/^1\.[0-9]+$/.test(version)) { | ||
this.fail("version number must match /^1\\.[0-9]+$/."); | ||
} | ||
this.xmlDeclExpects = ["encoding", "standalone"]; | ||
this.xmlDecl.version = this.xmlDeclValue; | ||
// When forceXMLVersion is set, the XML declaration is ignored. | ||
else if (!this.opt.forceXMLVersion) { | ||
this.setXMLVersion(version); | ||
} | ||
break; | ||
} | ||
case "encoding": | ||
@@ -1524,3 +1741,3 @@ if (!/^[A-Za-z][A-Za-z0-9._-]*$/.test(this.xmlDeclValue)) { | ||
const c = this.captureNameChars(); | ||
if (!c) { | ||
if (c === undefined) { | ||
return; | ||
@@ -1533,2 +1750,3 @@ } | ||
}; | ||
this.name = ""; | ||
@@ -1578,7 +1796,7 @@ if (this.xmlnsOpt) { | ||
const c = this.skipSpaces(); | ||
if (!c) { | ||
if (c === undefined) { | ||
return; | ||
} | ||
if (isNameStartChar(c)) { | ||
this.name = String.fromCodePoint(c); | ||
this.i = this.prevI; | ||
this.state = S_ATTRIB_NAME; | ||
@@ -1598,3 +1816,3 @@ } | ||
/** @private */ | ||
pushAttribNS(name, value) { | ||
pushAttribNS10(name, value) { | ||
const { prefix, local } = this.qname(name); | ||
@@ -1604,2 +1822,5 @@ this.attribList.push({ name, prefix, local, value, uri: undefined }); | ||
const trimmed = value.trim(); | ||
if (trimmed === "") { | ||
this.fail("invalid attempt to undefine prefix in XML 1.0"); | ||
} | ||
this.tag.ns[local] = trimmed; | ||
@@ -1615,2 +1836,17 @@ nsPairCheck(this, local, trimmed); | ||
pushAttribNS11(name, value) { | ||
const { prefix, local } = this.qname(name); | ||
this.attribList.push({ name, prefix, local, value, uri: undefined }); | ||
if (prefix === "xmlns") { | ||
const trimmed = value.trim(); | ||
this.tag.ns[local] = trimmed; | ||
nsPairCheck(this, local, trimmed); | ||
} | ||
else if (name === "xmlns") { | ||
const trimmed = value.trim(); | ||
this.tag.ns[""] = trimmed; | ||
nsPairCheck(this, "", trimmed); | ||
} | ||
} | ||
/** @private */ | ||
@@ -1636,3 +1872,3 @@ pushAttribPlain(name, value) { | ||
} | ||
else if (c) { | ||
else if (c !== undefined) { | ||
this.fail("disallowed character in attribute name."); | ||
@@ -1645,3 +1881,3 @@ } | ||
const c = this.skipSpaces(); | ||
if (!c) { | ||
if (c === undefined) { | ||
return; | ||
@@ -1662,3 +1898,3 @@ } | ||
else if (isNameStartChar(c)) { | ||
this.name = String.fromCodePoint(c); | ||
this.i = this.prevI; | ||
this.state = S_ATTRIB_NAME; | ||
@@ -1683,3 +1919,3 @@ } | ||
this.state = S_ATTRIB_VALUE_UNQUOTED; | ||
this.text = String.fromCodePoint(c); | ||
this.i = this.prevI; | ||
} | ||
@@ -1693,15 +1929,16 @@ } | ||
const { q } = this; | ||
const { chunk, limit, i: start } = this; | ||
// eslint-disable-next-line no-constant-condition | ||
let { i: start } = this; | ||
const { chunk } = this; | ||
while (true) { | ||
if (this.i >= limit) { | ||
// This is faster than adding codepoints one by one. | ||
this.text += chunk.substring(start); | ||
const code = this.getCode(); | ||
if (code === undefined) { | ||
this.text += chunk.slice(start); | ||
return; | ||
} | ||
const code = this.getCode(); | ||
if (code === q || code === AMP || code === LESS) { | ||
// This is faster than adding codepoints one by one. | ||
const slice = chunk.substring(start, | ||
this.i - (code <= 0xFFFF ? 1 : 2)); | ||
if (code === NL) { | ||
start = this.handleEOL("text", chunk, start); | ||
} | ||
else if (code === q || code === AMP || code === LESS) { | ||
const slice = chunk.slice(start, this.prevI); | ||
switch (code) { | ||
@@ -1742,3 +1979,3 @@ case q: | ||
this.fail("no whitespace between attributes."); | ||
this.name = String.fromCodePoint(c); | ||
this.i = this.prevI; | ||
this.state = S_ATTRIB_NAME; | ||
@@ -1761,3 +1998,3 @@ } | ||
} | ||
else if (c) { | ||
else if (c !== undefined) { | ||
if (this.text.includes("]]>")) { | ||
@@ -1786,3 +2023,3 @@ this.fail("the string \"]]>\" is disallowed in char data."); | ||
} | ||
else if (c) { | ||
else if (c !== undefined) { | ||
this.fail("disallowed character in closing tag."); | ||
@@ -1798,3 +2035,3 @@ } | ||
} | ||
else if (c) { | ||
else if (c !== undefined) { | ||
this.fail("disallowed character in closing tag."); | ||
@@ -1901,2 +2138,3 @@ } | ||
qname(name) { | ||
// This is faster than using name.split(":"). | ||
const colon = name.indexOf(":"); | ||
@@ -1907,4 +2145,4 @@ if (colon === -1) { | ||
const local = name.substring(colon + 1); | ||
const prefix = name.substring(0, colon); | ||
const local = name.slice(colon + 1); | ||
const prefix = name.slice(0, colon); | ||
if (prefix === "" || local === "" || local.includes(":")) { | ||
@@ -1920,7 +2158,6 @@ this.fail(`malformed name: ${name}.`); | ||
const { tag, attribList } = this; | ||
const { name: tagName, attributes } = tag; | ||
{ | ||
// add namespace info to tag | ||
const { prefix, local } = this.qname(tagName); | ||
const { prefix, local } = this.qname(tag.name); | ||
tag.prefix = prefix; | ||
@@ -1946,2 +2183,3 @@ tag.local = local; | ||
const { attributes } = tag; | ||
const seen = new Set(); | ||
@@ -1955,3 +2193,3 @@ // Note: do not apply default ns to attributes: | ||
if (prefix === "") { | ||
uri = (name === "xmlns") ? XMLNS_NAMESPACE : ""; | ||
uri = name === "xmlns" ? XMLNS_NAMESPACE : ""; | ||
eqname = name; | ||
@@ -2114,3 +2352,3 @@ } | ||
// The character reference is required to match the CHAR production. | ||
if (!isChar(num)) { | ||
if (!this.isChar(num)) { | ||
this.fail("malformed character entity."); | ||
@@ -2117,0 +2355,0 @@ return `&${entity};`; |
@@ -5,3 +5,3 @@ { | ||
"author": "Louis-Dominique Dubeau <ldd@lddubeau.com>", | ||
"version": "3.1.11", | ||
"version": "4.0.0-rc.1", | ||
"main": "lib/saxes.js", | ||
@@ -30,10 +30,10 @@ "types": "lib/saxes.d.ts", | ||
"devDependencies": { | ||
"@commitlint/cli": "^8.0.0", | ||
"@commitlint/config-angular": "^8.0.0", | ||
"@commitlint/cli": "^8.2.0", | ||
"@commitlint/config-angular": "^8.2.0", | ||
"chai": "^4.2.0", | ||
"conventional-changelog-cli": "^2.0.21", | ||
"eslint": "^5.16.0", | ||
"eslint-config-lddubeau-base": "^3.0.5", | ||
"husky": "^2.5.0", | ||
"mocha": "^6.1.4", | ||
"conventional-changelog-cli": "^2.0.23", | ||
"eslint": "^6.5.1", | ||
"eslint-config-lddubeau-base": "^4.0.2", | ||
"husky": "^3.0.8", | ||
"mocha": "^6.2.1", | ||
"renovate-config-lddubeau": "^1.0.0", | ||
@@ -43,3 +43,3 @@ "xml-conformance-suite": "^1.2.0" | ||
"dependencies": { | ||
"xmlchars": "^2.1.1" | ||
"xmlchars": "^2.2.0" | ||
}, | ||
@@ -46,0 +46,0 @@ "husky": { |
104
README.md
@@ -19,7 +19,6 @@ # saxes | ||
better compliance with well-formedness constraints cannot use sax as-is. | ||
Saxes aims for conformance with [XML 1.0 fifth | ||
edition](https://www.w3.org/TR/2008/REC-xml-20081126/) and [XML Namespaces 1.0 | ||
third edition](http://www.w3.org/TR/2009/REC-xml-names-20091208/). | ||
Consequently, saxes does not support HTML, or pseudo-XML, or bad XML. | ||
Consequently, saxes does not support HTML, or pseudo-XML, or bad XML. Saxes | ||
will report well-formedness errors in all these cases but it won't try to | ||
extract data from malformed documents like sax does. | ||
@@ -49,25 +48,20 @@ * Saxes is much much faster than sax, mostly because of a substantial redesign | ||
## Limitations | ||
## Conformance | ||
This is a non-validating parser so it only verifies whether the document is | ||
well-formed. We do aim to raise errors for all malformed constructs encountered. | ||
Saxes supports: | ||
However, this parser does not parse the contents of DTDs. So malformedness | ||
errors caused by errors in DTDs cannot be reported. | ||
* [XML 1.0 fifth edition](https://www.w3.org/TR/2008/REC-xml-20081126/) | ||
* [XML 1.1 second edition](https://www.w3.org/TR/2006/REC-xml11-20060816/) | ||
* [Namespaces in XML 1.0 (Third Edition)](https://www.w3.org/TR/2009/REC-xml-names-20091208/). | ||
* [Namespaces in XML 1.1 (Second Edition)](https://www.w3.org/TR/2006/REC-xml-names11-20060816/). | ||
Also, the parser continues to parse even upon encountering errors, and does its | ||
best to continue reporting errors. You should heed all errors | ||
reported. | ||
## Limitations | ||
**HOWEVER, ONCE AN ERROR HAS BEEN ENCOUNTERED YOU CANNOT RELY ON THE DATA | ||
PROVIDED THROUGH THE OTHER EVENT HANDLERS.** | ||
This is a non-validating parser so it only verifies whether the document is | ||
well-formed. We do aim to raise errors for all malformed constructs | ||
encountered. However, this parser does not thoroughly parse the contents of | ||
DTDs. So most malformedness errors caused by errors in DTDs cannot be reported. | ||
After an error, saxes tries to make sense of your document, but it may interpret | ||
it incorrectly. For instance ``<foo a=bc="d"/>`` is invalid XML. Did you mean to | ||
have ``<foo a="bc=d"/>`` or ``<foo a="b" c="d"/>`` or some other variation? | ||
Saxes takes an honest stab at figuring out your mangled XML. That's as good as | ||
it gets. | ||
## Regarding `<!DOCTYPE` and `<!ENTITY` | ||
## Regarding `<!DOCTYPE`s and `<!ENTITY`s | ||
The parser will handle the basic XML entities in text nodes and attribute | ||
@@ -143,6 +137,28 @@ values: `&amp; &lt; &gt; &apos; &quot;`. It's possible to define additional | ||
* `defaultXMLVersion` - The default version of the XML specification to use if | ||
the document contains no XML declaration. If the document does contain an XML | ||
declaration, then this setting is ignored. Must be `"1.0"` or `"1.1"`. The | ||
default is `"1.0"`. | ||
* `forceXMLVersion` - Boolean. A flag indicating whether to force the XML | ||
version used for parsing to the value of ``defaultXMLVersion``. When this flag | ||
is ``true``, ``defaultXMLVersion`` must be specified. If unspecified, the | ||
default value of this flag is ``false``. | ||
Example: suppose you are parsing a document that has an XML declaration | ||
specifying XML version 1.1. | ||
If you set ``defaultXMLVersion`` to ``"1.0"`` without setting | ||
``forceXMLVersion`` then the XML declaration will override the value of | ||
``defaultXMLVersion`` and the document will be parsed according to XML 1.1. | ||
If you set ``defaultXMLVersion`` to ``"1.0"`` and set ``forceXMLVersion`` to | ||
``true``, then the XML declaration will be ignored and the document will be | ||
parsed according to XML 1.0. | ||
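
The interaction between the two options can be summarized as a small decision function. This is only a sketch of the rule described above, not code taken from saxes: the name `effectiveXMLVersion` and its `declaredVersion` parameter (the version found in the XML declaration, or `undefined` when there is no declaration) are assumptions made for illustration.

```javascript
// Hypothetical helper: mirrors the documented rule, not saxes internals.
function effectiveXMLVersion(options, declaredVersion) {
  const { defaultXMLVersion, forceXMLVersion = false } = options;
  if (forceXMLVersion) {
    // The README states that defaultXMLVersion must be given when forcing.
    if (defaultXMLVersion === undefined) {
      throw new Error("forceXMLVersion requires defaultXMLVersion");
    }
    return defaultXMLVersion;
  }
  // Without forcing, an XML declaration overrides the default.
  if (declaredVersion !== undefined) {
    return declaredVersion;
  }
  // No declaration: fall back to the option, whose default is "1.0".
  return defaultXMLVersion !== undefined ? defaultXMLVersion : "1.0";
}
```

With a document that declares version 1.1, `effectiveXMLVersion({ defaultXMLVersion: "1.0" }, "1.1")` selects 1.1, while adding `forceXMLVersion: true` selects 1.0.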
### Methods | ||
`write` - Write bytes onto the stream. You don't have to do this all at | ||
once. You can keep writing as much as you want. | ||
`write` - Write bytes onto the stream. You don't have to pass the whole document | ||
in one `write` call. You can read your source chunk by chunk and call `write` | ||
with each chunk. | ||
@@ -174,2 +190,23 @@ `close` - Close the stream. Once closed, no more data may be written until it is | ||
### Error Handling | ||
The parser continues to parse even upon encountering errors, and does its best | ||
to continue reporting errors. You should heed all errors reported. After an | ||
error, however, saxes may interpret your document incorrectly. For instance | ||
``<foo a=bc="d"/>`` is invalid XML. Did you mean to have ``<foo a="bc=d"/>`` or | ||
``<foo a="b" c="d"/>`` or some other variation? For the sake of continuing to | ||
provide errors, saxes will continue parsing the document, but the structure it | ||
reports may be incorrect. It is only after the errors are fixed in the document | ||
that saxes can provide a reliable interpretation of the document. | ||
That leaves you with two rules of thumb when using saxes: | ||
* Pay attention to the errors that saxes reports. The default `onerror` handler | ||
throws, so by default, you cannot miss errors. | ||
* **ONCE AN ERROR HAS BEEN ENCOUNTERED, STOP RELYING ON THE EVENT HANDLERS OTHER | ||
THAN `onerror`.** As explained above, when saxes runs into a well-formedness | ||
problem, it makes a guess in order to continue reporting more errors. The guess | ||
may be wrong. | ||
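
One way to apply the second rule is to wrap the other handlers so that they stop firing once an error has been reported. The sketch below assumes a parser-like object whose handlers are plain properties such as `onerror` and `onopentag`. The helper `stopTrustingAfterError` and the `fakeParser` stand-in are hypothetical and exist only to make the example runnable without saxes.

```javascript
// Hypothetical helper: installs handlers on a parser-like object so that,
// after the first error, only the error list keeps being updated.
function stopTrustingAfterError(parser, handlers) {
  const errors = [];
  // Record errors instead of throwing, so that every error gets reported.
  parser.onerror = (err) => {
    errors.push(err);
  };
  for (const [name, handler] of Object.entries(handlers)) {
    parser[name] = (...args) => {
      // Data reported after an error may be a wrong guess: drop it.
      if (errors.length === 0) {
        handler(...args);
      }
    };
  }
  return errors;
}

// Stand-in for a real parser, used only to exercise the helper.
const fakeParser = {};
const seen = [];
const errors = stopTrustingAfterError(fakeParser, {
  onopentag: (tag) => seen.push(tag.name),
});
fakeParser.onopentag({ name: "a" });
fakeParser.onerror(new Error("malformed"));
fakeParser.onopentag({ name: "b" }); // ignored: an error was already reported
```

Data collected before the first error is still only trustworthy if the whole document turns out to be error-free, so a caller would typically check that `errors` is empty before using `seen`.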
### Events | ||
@@ -208,2 +245,23 @@ | ||
### Performance Tips | ||
* saxes works faster on files that use newlines (``\u000A``) as end of line | ||
markers than on files that use other end of line markers (like ``\r`` or | ||
``\r\n``). The XML specification requires that conformant applications behave | ||
as if all characters that are to be treated as end of line characters are | ||
converted to ``\u000A`` prior to parsing. The optimal code path for saxes is a | ||
file in which all end of line characters are already ``\u000A``. | ||
* Don't split Unicode strings you feed to saxes across surrogates. When you | ||
naively split a string in JavaScript, you run the risk of splitting a Unicode | ||
character into two surrogates. For example, in the following code ``a`` and ``b`` | ||
each contain half of a single Unicode character: ``const a = "\u{1F4A9}"[0]; | ||
const b = "\u{1F4A9}"[1];``. If you feed such split surrogates to versions of | ||
saxes prior to 4, you get errors. Saxes versions 4 and later are able to | ||
detect when a chunk of data ends with a surrogate and carry over the surrogate | ||
to the next chunk. However this operation entails slicing and concatenating | ||
strings. If you can feed your data in a way that does not split surrogates, | ||
you should do it. (Obviously, feeding all the data at once with a single write | ||
is fastest.) | ||
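
If the data has to be fed in pieces, the chunk boundaries can be chosen so that no surrogate pair is split. The generator below is a minimal sketch of that idea. `chunksKeepingSurrogates` is a hypothetical helper, not part of saxes, and `size` counts UTF-16 code units.

```javascript
// Hypothetical helper, not part of saxes: cut `text` into chunks of at most
// `size` UTF-16 code units without separating a surrogate pair.
function* chunksKeepingSurrogates(text, size) {
  if (size < 2) {
    throw new RangeError("size must be at least 2");
  }
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + size, text.length);
    const last = text.charCodeAt(end - 1);
    // A high surrogate at the end of a chunk would be cut off from its low
    // surrogate, so leave it for the next chunk.
    if (end < text.length && last >= 0xd800 && last <= 0xdbff) {
      end -= 1;
    }
    yield text.slice(start, end);
    start = end;
  }
}
```

Each yielded chunk could then be passed to `write` in turn. Requiring `size` to be at least 2 guarantees that every iteration makes progress even when a chunk is shortened by one code unit.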
## FAQ | ||
@@ -210,0 +268,0 @@ |