
@chattylabs/language-detection
Package to detect the language of a given text (focusing on short sms type text used on tweets, facebook, WhatsApp, etc)
This package helps detect the language of a given text.
The end goal is to detect the language of any text, no matter how short or obscure (think messages from Twitter, WhatsApp, Instagram, SMS, etc.), and return an object describing the language that best matches it.
{
language: 'en',
country: 'gb'
}
This is obtained with a combination of "reducing" and "matching". Given a piece of text, we can reduce it to a set of potential languages by checking for common patterns (see src/utils/reducers.js). Additionally, we can match the n-grams of said text against a set of pre-compiled language profiles generated through "learning" (processing known samples).
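The "matching" half can be pictured with a small sketch. This is not the package's internal implementation: the helper names (ngrams, score, bestMatch) and the toy profiles are assumptions made for illustration, and a real profile would be built from far more sample text.

```javascript
// Illustrative only: these helpers and toy profiles are assumptions,
// not the package's internal implementation.

// Split text into overlapping character n-grams (trigrams by default).
function ngrams(text, n = 3) {
  const padded = ` ${text.toLowerCase().trim()} `
  const grams = []
  for (let i = 0; i + n <= padded.length; i++) {
    grams.push(padded.slice(i, i + n))
  }
  return grams
}

// Count how many n-grams of the text appear in a profile.
function score(text, profile) {
  return ngrams(text).filter((gram) => profile.has(gram)).length
}

// Toy "profiles" learned from one short sample per language.
const toyProfiles = {
  en: new Set(ngrams('the quick brown fox jumps over the lazy dog')),
  es: new Set(ngrams('el rápido zorro marrón salta sobre el perro perezoso'))
}

// Pick the candidate language whose profile shares the most n-grams.
function bestMatch(text, candidates) {
  let best = null
  let bestScore = -1
  for (const language of candidates) {
    const s = score(text, toyProfiles[language])
    if (s > bestScore) {
      best = language
      bestScore = s
    }
  }
  return best
}

console.log(bestMatch('the dog', ['en', 'es']))
```

The candidates argument is where a reducing step would plug in: the fewer languages it leaves, the fewer profiles need to be scored.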
Usage:
const detect = require('@chattylabs/language-detection')
const result = detect('some text to detect')
const language = result.language
// Using your own language profiles
const detect = require('@chattylabs/language-detection')
const customLanguageProfiles = require('../path/to/data/languageProfiles.json')
const result = detect('some text to detect', {
  languageProfiles: customLanguageProfiles
})
const language = result.language
NOTE: only the languages you provide will be used. You can additionally merge them with our base profiles:
const combinedProfiles = {
...require('@chattylabs/language-detection').languageProfiles,
...customLanguageProfiles
}
You will need to build a "training" script, which analyses all your sample data and generates the language profiles object.
Your sample data should be a set of txt files containing as much text as possible, similar to the text you will be detecting. Create one file per locale or language, e.g. data/samples/en.txt, data/samples/fr.txt, data/samples/cn.txt or data/samples/en_GB.txt (to include a country identifier, the locale code must use the underscore _ separator).
// bin/train.js
const train = require('@chattylabs/language-detection').train
train('./path/to/custom/samples/*.txt', './path/to/custom/export/languageProfiles.json')
Then execute it via the CLI with node bin/train.js or via an npm script.
NOTE: filenames determine the language. A filename such as en_GB results in the response being split into language and country.
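As a rough picture of that filename convention, the sketch below splits a sample filename into the two response fields. parseLocale is a hypothetical helper, not part of the package, and lower-casing the country is an assumption based on the country: 'gb' example response shown earlier.

```javascript
const path = require('path')

// Hypothetical helper (not part of the package): derive the response
// fields from a sample filename such as data/samples/en_GB.txt.
function parseLocale(filename) {
  const [language, country] = path.basename(filename, '.txt').split('_')
  // Assumption: the country is lower-cased, as in the example response.
  return country
    ? { language, country: country.toLowerCase() }
    : { language }
}

console.log(parseLocale('data/samples/en_GB.txt')) // { language: 'en', country: 'gb' }
console.log(parseLocale('data/samples/fr.txt')) // { language: 'fr' }
```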
const detect = require('@chattylabs/language-detection')
const customLanguageProfiles = require('../path/to/data/languageProfiles.json')
const customReducers = require('../path/to/your/reducers')
const result = detect(text, {
languageProfiles: customLanguageProfiles,
reducers: customReducers
})
const language = result.language
Reducers are a collection of objects which map a regex to an array of languages. They help reduce the number of languages we need to run the n-gram matching on, by finding intersections of known patterns.
So for example, imagine we provide the following reducers:
// path/to/your/reducers.js
module.exports = [
  {
    // matches text containing ñ
    regex: /[ñ]+/i,
    languages: ['es', 'gn', 'gl']
  },
  {
    // matches text containing an acute-accented vowel
    regex: /[áéíóú]+/i,
    languages: ['fr', 'es', 'it', 'cn', 'nl', 'fo', 'is', 'pt', 'vi', 'cy', 'el', 'gl']
  }
]
From the above, both reducers match the words "Alimentación de niño", so we would reduce them to the intersection of the two lists, ['es', 'gl'], and only run n-gram matching on those. If the reducers left just one language, that would be our result.
NOTE: providing your own reducers will override the base ones. If you choose not to use them, but do use your own language profiles, languages not in your profiles will not be taken into account.
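The intersection step can be sketched as follows. reduceLanguages is a hypothetical helper written for illustration and is not the package's own code. It assumes each reducer is an object with a regex and a languages array, as in the example above.

```javascript
// Hypothetical helper, not the package's own code: intersect the language
// lists of every reducer whose regex matches the text.
function reduceLanguages(text, reducers) {
  let candidates = null
  for (const { regex, languages } of reducers) {
    if (!regex.test(text)) continue
    candidates = candidates === null
      ? [...languages]
      : candidates.filter((language) => languages.includes(language))
  }
  return candidates // null means no reducer matched
}

const exampleReducers = [
  { regex: /[ñ]+/i, languages: ['es', 'gn', 'gl'] },
  {
    regex: /[áéíóú]+/i,
    languages: ['fr', 'es', 'it', 'cn', 'nl', 'fo', 'is', 'pt', 'vi', 'cy', 'el', 'gl']
  }
]

console.log(reduceLanguages('Alimentación de niño', exampleReducers)) // [ 'es', 'gl' ]
```

Returning null when nothing matches keeps "no reducer applied" distinct from "the reducers disagreed", which would yield an empty array.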
You can also combine your own reducers with the base ones:
// reducers are arrays (see the example above), so combine them with array spread
const combinedReducers = [
  ...require('@chattylabs/language-detection').reducers,
  ...customReducers
]