RDF Dataset Canonicalization in TypeScript
This is an implementation of the RDF Dataset Canonicalization algorithm, also referred to as RDFC-1.0. The algorithm has been published by the W3C RDF Dataset Canonicalization and Hash Working Group.
Requirements
RDF packages and references
The implementation depends on the interfaces defined by the RDF/JS Data model specification for RDF terms, named and blank nodes, or quads. It also depends on an instance of an RDF Data Factory, specified by the same document. For TypeScript, the necessary type specifications are available through the @rdfjs/types package; an implementation of the RDF Data Factory is provided by, for example, the n3 package, which also provides a Turtle/TriG parser and serializer.
By default (i.e., if not explicitly specified) the Data Factory of the n3 package is used.
Crypto
The implementation relies on the Web Cryptography API as implemented by modern browsers, Deno (version 1.3.82 or higher), or Node.js (version 21 or higher). A side effect of using Web Crypto is that the canonicalization and hashing entry points are asynchronous: they return Promises and must be consumed, for example, through the await idiom of JavaScript/TypeScript.
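To illustrate why the interface is asynchronous, here is a minimal sketch of hashing a string directly with the Web Cryptography API. The helper name sha256Hex is hypothetical and is not part of this package; only the standard crypto.subtle.digest call is used.

```typescript
// Hash a string with SHA-256 through the Web Cryptography API and
// return the digest as a lowercase hexadecimal string.
// "sha256Hex" is an illustrative helper, not part of this package.
async function sha256Hex(data: string): Promise<string> {
    const bytes = new TextEncoder().encode(data);
    // crypto.subtle.digest is asynchronous: it returns a Promise<ArrayBuffer>,
    // which is why callers further up the chain must use await as well.
    const digest = await crypto.subtle.digest("SHA-256", bytes);
    return Array.from(new Uint8Array(digest))
        .map((b) => b.toString(16).padStart(2, "0"))
        .join("");
}
```

Any function that calls such a helper has to be async itself (or handle the returned Promise), which is the same constraint that applies to the canonicalization and hashing methods of this package.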
Usage
An input RDF Dataset may be represented by any object that may be iterated through quad instances (e.g., arrays of quads, a set of quads, or any specialized objects storing quads like RDF DatasetCore implementations), or a string representing an N-Quads, Turtle, or TriG document. Formally, the input type is:
Iterable<rdf.Quad> | string
The canonicalization process can be invoked by
- the canonicalize method, which returns an N-Quads document containing the (sorted) quads of the dataset, using the canonical blank node ids;
- the canonicalizeDetailed method, which returns an Object of the form:
  - canonicalized_dataset: an RDF DatasetCore instance using the canonical blank node ids
  - canonical_form: an N-Quads document containing the (sorted) quads of the dataset, using the canonical blank node ids
  - issued_identifier_map: a Map object, mapping the original blank node ids (as used in the input) to their canonical equivalents
  - bnode_identifier_map: a Map object, mapping a blank node to its (canonical) blank node id
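The shape of the detailed result can be sketched as a TypeScript interface. The names DetailedResultSketch, BlankNodeLike, and describeRelabelling are hypothetical and only model the field list above; the package's own type definitions (index.d.ts) are the authoritative source, and the exact key and value types of the maps are assumptions here.

```typescript
// Stand-in for an RDF/JS blank node term; real code would use the
// corresponding type from @rdfjs/types.
interface BlankNodeLike {
    termType: "BlankNode";
    value: string;
}

// Hypothetical shape of the detailed result, modelled on the field list above.
interface DetailedResultSketch {
    canonicalized_dataset: unknown;                   // an RDF DatasetCore instance
    canonical_form: string;                           // sorted N-Quads document
    issued_identifier_map: Map<string, string>;       // original id -> canonical id (assumed string keys)
    bnode_identifier_map: Map<BlankNodeLike, string>; // blank node -> canonical id
}

// List the relabelling as "original -> canonical" lines, sorted by original id.
function describeRelabelling(result: DetailedResultSketch): string[] {
    return Array.from(result.issued_identifier_map.entries())
        .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
        .map(([original, canonical]) => `${original} -> ${canonical}`);
}
```

A helper of this kind can be handy when debugging, since it shows at a glance which input blank node ended up with which canonical label.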
Copying the input quads
The Iterable<rdf.Quad> input instance is expected to be a set of quads, i.e., it should not include repeated entries. This is not checked by the process. Usually, the input quads are copied into an internal store, thereby de-duplicating them. Because this can be a costly operation for large datasets, it can be controlled through an additional, optional, boolean parameter copy. The effects are as follows:
- If copy is set to true, the input quads are copied to an internal store. If it is set to false, the quads are used directly.
- If copy is not set, the input is copied to an internal store unless the object implements the RDF DatasetCore interface.
If the input is a string serializing a dataset in Turtle/TriG format, the input is parsed, and duplicate quads are filtered out automatically.
Note that copy must not be set to false if the input is a generator function (even if the generator function avoids duplicate quads).
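The decision rule for copy can be sketched as follows. This is not the package's code: the function names are hypothetical, and the duck-typed DatasetCore check is an assumption about how such a test might be done. The small generator at the end shows one plausible reason (not stated above) why generators need copying: a generator can be iterated only once.

```typescript
// Duck-typed check for the RDF/JS DatasetCore interface. This is an
// assumption of the sketch; the package may detect the interface differently.
function looksLikeDatasetCore(input: unknown): boolean {
    if (typeof input !== "object" || input === null) return false;
    const candidate = input as Record<string, unknown>;
    const hasMethods = ["add", "delete", "has", "match"].every(
        (name) => typeof candidate[name] === "function"
    );
    return hasMethods && typeof candidate["size"] === "number";
}

// Decide whether the input quads are copied into an internal store.
function shouldCopy(input: Iterable<unknown>, copy?: boolean): boolean {
    if (copy !== undefined) return copy;   // an explicit setting wins
    return !looksLikeDatasetCore(input);   // default: copy unless DatasetCore
}

// A generator is exhausted after one pass, so using it "directly"
// would leave nothing for any later iteration.
function* twoItems(): Generator<string> {
    yield "q1";
    yield "q2";
}
```

Under this rule a plain array or a Set is copied by default (a Set has add, delete, and has, but no match method), while an object that offers the full DatasetCore interface is used as-is.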
The separate testing folder includes a tiny application that runs some local tests, and can be used as an example for the additional packages that are required. See also the separate tester repository that runs the official test suite set up by the W3C Working Group.
All the examples below ignore the copy argument.
Installation
For node.js, the usual npm installation can be used:
npm install rdfjs-c14n
The package has been written in TypeScript but is distributed in JavaScript; the type definition (i.e., index.d.ts) is included in the distribution.
Using appropriate tools (e.g., esbuild) the package can be included into a module to be loaded into a browser.
For deno a simple
import { RDFC10, Quads, InputQuads } from "npm:rdfjs-c14n"
will do.
Usage Examples
More detailed documentation of the classes and types is available on GitHub. The basic usage may be as follows:
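import * as n3 from 'n3';
import * as rdf from '@rdfjs/types';
import { RDFC10, Quads, InputQuads } from 'rdfjs-c14n';

async function main(): Promise<void> {
    // Any RDF/JS Data Factory can be passed; the n3 one is also the default.
    const rdfc10 = new RDFC10(n3.DataFactory);
    // createYourQuads() stands for application code producing the input.
    const input: InputQuads = createYourQuads();

    // Dataset of quads using the canonical blank node ids.
    const normalized: Quads = (await rdfc10.c14n(input)).canonicalized_dataset;
    // Alternatively, if only the N-Quads result is needed:
    const normalized_N_Quads: string = (await rdfc10.c14n(input)).canonical_form;
    // The same result through the shortcut method:
    const normalized_N_Quads_bis: string = await rdfc10.canonicalize(input);
    // Hash value of the canonical dataset.
    const hash: string = await rdfc10.hash(normalized);
}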
Additional features
Choice of hash
The RDFC 1.0 algorithm relies extensively on hashing. By default, as required by the specification, the hash function is sha256.
This default hash function can be changed via the
rdfc10.hash_algorithm = algorithm;
attribute, where algorithm can be any hash function identifier, for example sha256 or sha512. The list of available hash algorithms can be retrieved as:
rdfc10.available_hash_algorithms;
which corresponds to the values defined by the Web Cryptography API specification as of December 2013, namely sha1, sha256, sha384, and sha512. Future revisions of the specification may add more.
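As an illustration of how the short names above relate to Web Cryptography, the sketch below maps them to the algorithm identifiers that crypto.subtle.digest accepts. The table WEB_CRYPTO_NAMES and the function digestWith are hypothetical; how the package performs this mapping internally is not described here.

```typescript
// Assumed mapping from the short names used above to the identifiers
// understood by the Web Cryptography API.
const WEB_CRYPTO_NAMES: Record<string, string> = {
    sha1: "SHA-1",
    sha256: "SHA-256",
    sha384: "SHA-384",
    sha512: "SHA-512",
};

// Compute the raw digest of a string with the named algorithm;
// the returned Promise rejects if the name is not in the table.
async function digestWith(algorithm: string, data: string): Promise<Uint8Array> {
    const webCryptoName = WEB_CRYPTO_NAMES[algorithm.toLowerCase()];
    if (webCryptoName === undefined) {
        throw new Error(`Unsupported hash algorithm: ${algorithm}`);
    }
    const bytes = new TextEncoder().encode(data);
    return new Uint8Array(await crypto.subtle.digest(webCryptoName, bytes));
}
```

Rejecting unknown names early keeps a typo in the algorithm name from surfacing later as an obscure Web Crypto error.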
Controlling the complexity level
On rare occasions, the RDFC 1.0 algorithm has to go through complex cycles that may also involve recursive steps. In even more extreme situations, this could result in an unreasonably long canonicalization process. Although this almost never occurs in practice, attackers may use some "poison graphs" to create such situations (see the security consideration section in the specification).
As specified by the standard, this implementation sets a maximum complexity level (usually set to 50); this level can be queried through the
rdfc10.maximum_allowed_complexity_number;
(read-only) attribute. This number can be lowered by setting the
rdfc10.maximum_complexity_number
attribute. The value of this attribute cannot exceed the system-wide maximum level.
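The relationship between the two attributes can be sketched with a small class. This is an illustration only: the class ComplexityLimitSketch is hypothetical, and the choice to silently ignore out-of-range values is an assumption, since the text above does not say how the package reacts to them.

```typescript
// Illustrative model of a read-only system-wide ceiling plus a
// user-adjustable limit that may only be lowered below that ceiling.
class ComplexityLimitSketch {
    // System-wide ceiling; 50 is the usual value mentioned above.
    static readonly SYSTEM_MAXIMUM = 50;
    private current: number = ComplexityLimitSketch.SYSTEM_MAXIMUM;

    // Read-only: there is deliberately no setter for the ceiling.
    get maximum_allowed_complexity_number(): number {
        return ComplexityLimitSketch.SYSTEM_MAXIMUM;
    }

    get maximum_complexity_number(): number {
        return this.current;
    }

    set maximum_complexity_number(level: number) {
        // Assumption of this sketch: values that are not positive integers,
        // or that exceed the ceiling, are ignored and the old value is kept.
        if (Number.isInteger(level) && level > 0 && level <= ComplexityLimitSketch.SYSTEM_MAXIMUM) {
            this.current = level;
        }
    }
}
```

Keeping the ceiling read-only means that application code can tighten the limit for untrusted input but cannot loosen it beyond the system-wide setting.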
Maintainer: @iherman