Created for ONS. Proof-of-concept mmap'd Rust word2vec implementation linked with category matching
NLP Category-Matching tools
A Rust microservice to match queries on the ONS Website to groupings in the ONS taxonomy
The taxonomy file should be adapted from taxonomy.json.example and placed in the root directory.
Word embedding models are most simply sourced as pretrained finalfusion (fifu) models, but they can also be generated dynamically using the embedded finalfusion libraries.
To build wheels for distribution, use:

```shell
make
```
Environment variable | Default | Description |
---|---|---|
CATEGORY_API_HOST | 0.0.0.0 | Host that the API binds to |
CATEGORY_API_PORT | 28800 | Port that the API listens on |
CATEGORY_API_DUMMY_RUN | false | Return an empty list, for testing purposes |
CATEGORY_API_DEBUG_LEVEL_FOR_DYNACONF | "DEBUG" | Verbosity of Dynaconf's internal logging |
CATEGORY_API_ENVVAR_PREFIX_FOR_DYNACONF | "CATEGORY_API" | Prefix identifying which variables are read into the Dynaconf configuration |
CATEGORY_API_FIFU_FILE | "test_data/wiki.en.fifu" | Location of the finalfusion model file |
CATEGORY_API_THRESHOLD | 0.4 | Threshold below which a category is considered low-scoring |
CATEGORY_API_CACHE_S3_BUCKET | | S3 bucket for cache files, in "s3://" format |
**Core variables** | | |
BONN_CACHE_TARGET | "cache.json" | Cache target file |
BONN_ELASTICSEARCH_HOST | "http://localhost:9200" | Elasticsearch host |
BONN_REBUILD_CACHE | true | Whether the cache should be rebuilt |
BONN_TAXONOMY_LOCATION | "test_data/taxonomy.json" | Location of the taxonomy file |
BONN_ELASTICSEARCH_INDEX | "ons1639492069322" | Elasticsearch index to search |
BONN_WEIGHTING__C | 1 | Weight for word vectors based on the words in the category name |
BONN_WEIGHTING__SC | 2 | Weight for word vectors based on the words in the sub-category name |
BONN_WEIGHTING__SSC | 2 | Weight for word vectors based on the words in the sub-sub-category name |
BONN_WEIGHTING__WC | 6 | Weight for the bag of words found in the metadata of the datasets in the category |
BONN_WEIGHTING__WSSC | 8 | Weight for the bag of words found in the metadata of the datasets in the sub-sub-category |
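As a rough illustration of how these settings are consumed, the sketch below reads a few of the CATEGORY_API_* variables with plain `os.environ`. This is an assumption-laden stand-in: the project itself uses Dynaconf with the CATEGORY_API prefix, and only the variable names and defaults are taken from the table above.

```python
import os

# Minimal sketch of reading the API settings from the environment.
# The project itself uses Dynaconf; these names and defaults simply
# mirror the table above, and the function name is illustrative.
def load_settings() -> dict:
    return {
        "host": os.environ.get("CATEGORY_API_HOST", "0.0.0.0"),
        "port": int(os.environ.get("CATEGORY_API_PORT", "28800")),
        "dummy_run": os.environ.get("CATEGORY_API_DUMMY_RUN", "false").lower() == "true",
        "fifu_file": os.environ.get("CATEGORY_API_FIFU_FILE", "test_data/wiki.en.fifu"),
        "threshold": float(os.environ.get("CATEGORY_API_THRESHOLD", "0.4")),
    }

settings = load_settings()
```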
To run the service:

1. Set up the .env file:

   ```shell
   cp .env.local .env
   ```

2. Build the wheels:

   ```shell
   make wheels
   ```

3. Make sure you have placed taxonomy.json in the root folder (this should be obtained from ONS).
4. [TODO: genericize] You need an Elasticsearch container forwarded to port 9200 (you can customize the port in .env) with a dump matching the appropriate schema: https://gitlab.com/flaxandteal/onyx/dp-search-api (that project's README describes how to set up Elasticsearch).
To quantize a fastText model into a finalfusion (fifu) file, install finalfusion-utils and run its quantize subcommand:

```shell
cd core
RUSTFLAGS="-C link-args=-lcblas -llapack" cargo install finalfusion-utils --features=opq
finalfusion quantize -f fasttext -q opq <fasttext.bin> fasttext.fifu.opq
```

Note: if you try to use the full wiki bin, you will need about 128GB of RAM.
Install the dependencies and check that the model loads:

```shell
poetry shell
cd core
poetry install
cd ../api
poetry install
exit
poetry run python -c "from bonn import FfModel; FfModel('test_data/wiki.en.fifu').eval('Hello')"
```
You can create a cache with the following command:
```shell
poetry run python -m bonn.extract
```
This assumes that the correct environment variables for the NLP model, taxonomy and Elasticsearch are set.
The following requirements were identified:
We found that the most effective approach was to use the standard Wikipedia unstructured word2vec model as the ML basis.
This has the additional advantage that we have been able to prototype incorporating category matching for other languages into the algorithm. Further work is required here, including manual review by native speakers, and initial results suggest that a larger language corpus would be needed for training.
Using finalfusion libraries in Rust enables mmapping for memory efficiency.
A bag of words is formed to make a vector for each category: a weighted average of the term vectors, weighted according to the attribute contributing each term:
Grouping | Score basis |
---|---|
Category (top-level) | Literal words within title |
Subcategory (second-level) | Literal words within title |
Subsubcategory (third-level) | Literal words within title |
Related words across whole category | Common thematic words across all datasets within the category |
Related words across subsubcategory | Common thematic words across all datasets within the subsubcategory |
To build a weighted bag of words, the system finds thematically-distinctive words occurring in dataset titles and descriptions present in the categories, according to the taxonomy. The "thematic distinctiveness" of words in a dataset description is defined by exceeding a similarity threshold to terms in the category title.
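The selection of "thematically distinctive" words can be sketched as follows. This is illustrative only: `similarity` here is a toy lookup standing in for embedding cosine similarity, and the threshold value and function names are assumptions, not the project's actual code.

```python
# Toy similarity table standing in for word-embedding cosine similarity,
# so the example is self-contained. Values are invented for illustration.
TOY_SIMILARITY = {
    ("school", "education"): 0.8,
    ("pupil", "education"): 0.7,
    ("price", "education"): 0.1,
}

def similarity(word: str, title_term: str) -> float:
    return TOY_SIMILARITY.get((word, title_term), 0.0)

def distinctive_words(dataset_words, title_terms, threshold=0.5):
    # A dataset word joins the category's bag if its similarity to any
    # category-title term exceeds the threshold.
    return [
        w for w in dataset_words
        if any(similarity(w, t) > threshold for t in title_terms)
    ]
```

For example, `distinctive_words(["school", "pupil", "price"], ["education"])` keeps "school" and "pupil" but drops "price".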
These can then be compared to search queries word-by-word, obtaining a score for each taxonomy entry, for a given phrase.
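The scoring step above can be sketched in plain Python. The real implementation lives in the Rust core; the function names, the averaging of per-word scores, and the toy two-dimensional vectors below are illustrative assumptions, with only the weighted-average construction and cosine comparison taken from the description.

```python
import math

def weighted_average(term_vectors, weights):
    # Category vector: weighted average of term vectors, using weights
    # like the BONN_WEIGHTING__* values above.
    total = sum(weights)
    dim = len(term_vectors[0])
    return [
        sum(w * v[i] for w, v in zip(weights, term_vectors)) / total
        for i in range(dim)
    ]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def score_query(query_vectors, category_vector):
    # Compare the query word-by-word and average the per-word scores.
    return sum(cosine(q, category_vector) for q in query_vectors) / len(query_vectors)

# Toy 2-d vectors: one category-name term (weight 1), one bag-of-words
# term from dataset metadata (weight 6).
category = weighted_average([[1.0, 0.0], [0.0, 1.0]], [1, 6])
score = score_query([[0.0, 1.0]], category)
```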
In addition to taking the direct cosine similarity of these vectors, we apply further adjustments, most notably deprioritizing terms that are common across many categories.
To do this, a global count of (lemmatized) words appearing in dataset descriptions and titles across all categories is made, and common terms are down-weighted within the bag according to an exponential decay function. This allows us to rely more heavily on words that strongly signpost a category (such as "education" or "school") without being confounded by words that many categories contain (such as "price" or "economic").
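A minimal sketch of this down-weighting, assuming a simple `exp(-decay * count)` form: the decay constant and the exact functional shape are illustrative, not the project's actual values.

```python
import math

def deprioritize(bag_weights, global_counts, decay=0.5):
    """Scale each word's weight by exp(-decay * global_count).

    Words seen in many categories decay towards zero; distinctive
    words keep most of their weight. Illustrative sketch only.
    """
    return {
        word: weight * math.exp(-decay * global_counts.get(word, 0))
        for word, weight in bag_weights.items()
    }

bag = {"education": 1.0, "price": 1.0}
counts = {"education": 1, "price": 8}  # "price" appears across many categories
adjusted = deprioritize(bag, counts)
```

After adjustment, "education" retains far more weight than the ubiquitous "price".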
Once per-category scores for a search phrase are obtained, we filter out low-scoring categories against a configurable threshold (CATEGORY_API_THRESHOLD) before returning results.
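The final filtering step might look like the sketch below, using the default threshold of 0.4 from the configuration table; the function name and the descending sort are illustrative assumptions.

```python
def filter_categories(scores, threshold=0.4):
    # Drop categories scoring below the configured threshold
    # (CATEGORY_API_THRESHOLD, default 0.4) and return the rest
    # sorted best-first. Illustrative sketch only.
    return sorted(
        ((cat, s) for cat, s in scores.items() if s >= threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )

scores = {"economy": 0.72, "education": 0.55, "health": 0.21}
kept = filter_categories(scores)  # "health" falls below the threshold
```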
Prepared by Flax & Teal Limited for ONS Alpha project. Copyright © 2022, Office for National Statistics (https://www.ons.gov.uk)
Released under MIT license, see LICENSE for details.