
spaCy JS


JavaScript interface for accessing linguistic annotations provided by spaCy. This project is mostly experimental and was developed for fun to play around with different ways of mimicking spaCy's Python API.

The results will still be computed in Python and made available via a REST API. The JavaScript API resembles spaCy's Python API as closely as possible (with a few exceptions, as the values are all pre-computed and it's tricky to express complex recursive relationships).

const spacy = require('spacy-js');

(async function() {
    const nlp = spacy.load('en_core_web_sm');
    const doc = await nlp('This is a text about Facebook.');
    for (let ent of doc.ents) {
        console.log(ent.text, ent.label);
    }
    for (let token of doc) {
        console.log(token.text, token.pos, token.head.text);
    }
})();

⌛️ Installation

Installing the JavaScript library

You can install the JavaScript package via npm:

npm install spacy-js

Setting up the Python server

First, clone this repo and install the requirements. If you've installed the package via npm, you can also use the api/server.py and requirements.txt in your ./node_modules/spacy-js directory. It's recommended to use a virtual environment.

python -m pip install -r requirements.txt

You can then run the REST API. By default, this will serve the API via 0.0.0.0:8080:

python api/server.py

If you like, you can install more models and specify a comma-separated list of models to load as the first argument when you run the server. All models need to be installed in the same environment.

python api/server.py en_core_web_sm,de_core_news_sm
Argument    | Type             | Description                                                 | Default
models      | positional (str) | Comma-separated list of models to load and make available. | en_core_web_sm
--host, -ho | option (str)     | Host to serve the API.                                      | 0.0.0.0
--port, -p  | option (int)     | Port to serve the API.                                      | 8080
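
For example, after starting the server with both models as shown above, the German pipeline can be used from JavaScript just like the English one. The following is a minimal sketch; it assumes de_core_news_sm is installed in the server's environment, and the exact entities printed depend on the model:

const spacy = require('spacy-js');

(async function() {
    // 'de_core_news_sm' must be one of the models the server was started with
    const nlp = spacy.load('de_core_news_sm');
    const doc = await nlp('Berlin ist eine Stadt in Deutschland.');
    for (let ent of doc.ents) {
        console.log(ent.text, ent.label);
    }
})();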

🎛 API

spacy.load

"Load" a spaCy model. This method mostly exists for consistency with the Python API. It sets up the REST API and nlp object, but doesn't actually load anything, since the models are already available via the REST API.

const nlp = spacy.load('en_core_web_sm');
Argument | Type     | Description
model    | String   | Name of model to load, e.g. 'en_core_web_sm'. Needs to be available via the REST API.
api      | String   | Alternative URL of REST API. Defaults to http://0.0.0.0:8080.
RETURNS  | Language | The nlp object.
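
If the server was started with a custom --host or --port, the URL can be passed via the api argument. A small sketch, assuming the URL is passed as the second argument in the order listed above (http://localhost:9000 is just a placeholder):

const spacy = require('spacy-js');

// Point the client at a non-default server address instead of http://0.0.0.0:8080
const nlp = spacy.load('en_core_web_sm', 'http://localhost:9000');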

nlp async

The nlp object created by spacy.load can be called on a string of text and makes a request to the REST API. The easiest way to use it is to wrap the call in an async function and use await:

(async function() {
    const nlp = spacy.load('en_core_web_sm');
    const doc = await nlp('This is a text.');
})();
Argument | Type   | Description
text     | String | The text to process.
RETURNS  | Doc    | The processed Doc.

Doc

Just like in the original API, the Doc object can be constructed with an array of words and spaces. It also takes an additional attrs object, which corresponds to the JSON-serialized linguistic annotations created in doc2json in api/server.py.

The Doc behaves just like the regular spaCy Doc – you can iterate over its tokens, index into individual tokens, access the Doc attributes and properties and also use native JavaScript methods like map and slice (since there's no real way to make Python's slice notation like doc[2:4] work).
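
For example, map and slice can stand in for Python-style list comprehensions and slicing. A short sketch of what this could look like (the exact return types depend on the implementation):

const spacy = require('spacy-js');

(async function() {
    const nlp = spacy.load('en_core_web_sm');
    const doc = await nlp('This is a text about Facebook.');

    // Roughly [token.text for token in doc] in Python
    const texts = doc.map(token => token.text);

    // Roughly doc[2:4] in Python
    const tokens = doc.slice(2, 4);

    console.log(texts);
    console.log(tokens.map(token => token.text));
})();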

Construction
import { Doc } from 'spacy-js';

const words = ['Hello', 'world', '!'];
const spaces = [true, false, false];
const doc = Doc(words, spaces)
console.log(doc.text) // 'Hello world!'
Argument | Type   | Description
words    | Array  | The individual token texts.
spaces   | Array  | Whether the token at this position is followed by a space or not.
attrs    | Object | JSON-serialized attributes, see doc2json.
RETURNS  | Doc    | The newly constructed Doc.
Symbol iterator and token indexing
(async function() {
    const nlp = spacy.load('en_core_web_sm');
    const doc = await nlp('Hello world');

    for (let token of doc) {
        console.log(token.text);
    }
    // Hello
    // world

    const token1 = doc[0];
    console.log(token1.text);
    // Hello
})();
Properties and Attributes
Name        | Type    | Description
text        | String  | The Doc text.
length      | Number  | The number of tokens in the Doc.
ents        | Array   | A list of Span objects, describing the named entities in the Doc.
sents       | Array   | A list of Span objects, describing the sentences in the Doc.
nounChunks  | Array   | A list of Span objects, describing the base noun phrases in the Doc.
cats        | Object  | The document categories predicted by the text classifier, if available in the model.
isTagged    | Boolean | Whether the part-of-speech tagger has been applied to the Doc.
isParsed    | Boolean | Whether the dependency parser has been applied to the Doc.
isSentenced | Boolean | Whether the sentence boundary detector has been applied to the Doc.
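
For instance, the sents and nounChunks properties can be iterated just like ents in the first example. A short sketch (the exact spans depend on the model):

const spacy = require('spacy-js');

(async function() {
    const nlp = spacy.load('en_core_web_sm');
    const doc = await nlp('Apple is looking at buying a U.K. startup. The deal is not public yet.');

    for (let sent of doc.sents) {
        console.log('Sentence:', sent.text);
    }
    for (let chunk of doc.nounChunks) {
        console.log('Noun chunk:', chunk.text);
    }
})();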

Span

A Span object is a slice of a Doc and consists of one or more tokens. Just like in the original API, it can be constructed from a Doc, a start and end index and an optional label, or by slicing a Doc.

Construction
import { Doc, Span } from 'spacy-js';

const doc = Doc(['Hello', 'world', '!'], [true, false, false]);
const span = Span(doc, 1, 3);
console.log(span.text) // 'world!'
Argument | Type   | Description
doc      | Doc    | The reference document.
start    | Number | The start token index.
end      | Number | The end token index. This is exclusive, i.e. "up to token X".
label    | String | Optional label.
RETURNS  | Span   | The newly constructed Span.
Properties and Attributes
Name   | Type   | Description
text   | String | The Span text.
length | Number | The number of tokens in the Span.
doc    | Doc    | The parent Doc.
start  | Number | The Span's start index in the parent document.
end    | Number | The Span's end index in the parent document.
label  | String | The Span's label, if available.
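
Since doc.ents is a list of Span objects, these properties are also what you get when inspecting named entities. A small sketch (the entities and labels depend on the model):

const spacy = require('spacy-js');

(async function() {
    const nlp = spacy.load('en_core_web_sm');
    const doc = await nlp('Facebook was founded in 2004.');

    for (let ent of doc.ents) {
        // start and end are token indices into the parent Doc; end is exclusive
        console.log(ent.text, ent.label, ent.start, ent.end);
    }
})();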

Token

For token attributes that exist as string and ID versions (e.g. Token.pos vs. Token.pos_), only the string versions are exposed.

Usage Examples
(async function() {
    const nlp = spacy.load('en_core_web_sm');
    const doc = await nlp('Hello world');

    for (let token of doc) {
        console.log(token.text, token.pos, token.isLower);
    }
    // Hello INTJ false
    // world NOUN true
})();
Properties and Attributes
Name         | Type    | Description
text         | String  | The token text.
whitespace   | String  | Whitespace character following the token, if available.
textWithWs   | String  | Token text with trailing whitespace.
length       | Number  | The length of the token text.
orth         | Number  | ID of the token text.
doc          | Doc     | The parent Doc.
head         | Token   | The syntactic parent, or "governor", of this token.
i            | Number  | Index of the token in the parent document.
entType      | String  | The token's named entity type.
entIob       | String  | IOB code of the token's named entity tag.
lemma        | String  | The token's lemma, i.e. the base form.
norm         | String  | The normalised form of the token.
lower        | String  | The lowercase form of the token.
shape        | String  | Transform of the token's string, to show orthographic features. For example, "Xxxx" or "dd".
prefix       | String  | A length-N substring from the start of the token. Defaults to N=1.
suffix       | String  | A length-N substring from the end of the token. Defaults to N=3.
pos          | String  | The token's coarse-grained part-of-speech tag.
tag          | String  | The token's fine-grained part-of-speech tag.
isAlpha      | Boolean | Does the token consist of alphabetic characters?
isAscii      | Boolean | Does the token consist of ASCII characters?
isDigit      | Boolean | Does the token consist of digits?
isLower      | Boolean | Is the token lowercase?
isUpper      | Boolean | Is the token uppercase?
isTitle      | Boolean | Is the token titlecase?
isPunct      | Boolean | Is the token punctuation?
isLeftPunct  | Boolean | Is the token left punctuation?
isRightPunct | Boolean | Is the token right punctuation?
isSpace      | Boolean | Is the token a whitespace character?
isBracket    | Boolean | Is the token a bracket?
isCurrency   | Boolean | Is the token a currency symbol?
likeUrl      | Boolean | Does the token resemble a URL?
likeNum      | Boolean | Does the token resemble a number?
likeEmail    | Boolean | Does the token resemble an email address?
isOov        | Boolean | Is the token out-of-vocabulary?
isStop       | Boolean | Is the token a stop word?
isSentStart  | Boolean | Does the token start a sentence?
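
A short sketch combining a few of the attributes listed above (the values printed depend on the model):

const spacy = require('spacy-js');

(async function() {
    const nlp = spacy.load('en_core_web_sm');
    const doc = await nlp('Apple is looking at buying a U.K. startup for $1 billion.');

    for (let token of doc) {
        // lemma, tag and entType are the string versions, as described above
        console.log(token.text, token.lemma, token.tag, token.entType, token.isStop);
    }
})();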

🔔 Run Tests

Python

First, make sure you have pytest and all dependencies installed. You can then run the tests by pointing pytest to the tests directory:

python -m pytest tests

JavaScript

This project uses Jest for testing. Make sure you have all dependencies and development dependencies installed. You can then run:

npm run test

To allow testing the code without a REST API providing the data, the test suite currently uses a mock of the Language class, which returns static data located in tests/util.js.

✅ Ideas and Todos

  • Improve JavaScript tests.
  • Experiment with NodeJS bindings to make Python integration easier. To be fair, running a separate API in an environment controlled by the user and not hiding it a few levels deep is often much easier. But maybe there are some modern Node tricks that this project could benefit from.
