langdet - Language Detection for Go
Overview
Package langdet detects natural languages in text using a straightforward implementation of trigram-based text categorization. The most commonly used languages worldwide are supported out of the box, but the code is flexible enough to accept any set of languages.
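To make the idea of a trigram profile concrete, here is a minimal sketch that counts three-rune sequences in a string and ranks them by frequency. It uses only the standard library. The names trigramCounts and topTrigrams are hypothetical helpers for illustration and are not part of langdet, whose internal normalization and scoring may differ.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// trigramCounts returns how often each three-rune sequence occurs in text.
// Assumption: the text is lowercased and runs of whitespace are collapsed
// to single spaces first; langdet's own preprocessing may differ.
func trigramCounts(text string) map[string]int {
	runes := []rune(strings.Join(strings.Fields(strings.ToLower(text)), " "))
	counts := make(map[string]int)
	for i := 0; i+3 <= len(runes); i++ {
		counts[string(runes[i:i+3])]++
	}
	return counts
}

// topTrigrams returns up to n trigrams, most frequent first.
// Ties are broken alphabetically so the result is deterministic.
func topTrigrams(counts map[string]int, n int) []string {
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if counts[keys[i]] != counts[keys[j]] {
			return counts[keys[i]] > counts[keys[j]]
		}
		return keys[i] < keys[j]
	})
	if len(keys) > n {
		keys = keys[:n]
	}
	return keys
}

func main() {
	counts := trigramCounts("the cat and the hat")
	for _, t := range topTrigrams(counts, 3) {
		fmt.Printf("%q: %d\n", t, counts[t])
	}
}
```

A ranked list like this, built from a large training text, is the kind of profile that trigram-based categorization compares against the trigrams of the input.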
Langdet first detects the writing script in order to narrow down the number of languages to test against. Some writing scripts are used by only a single language (Korean, Greek, etc.). In that case the language is returned directly, without trigram analysis. Otherwise, it matches each language profile under the detected writing script against the input text and returns a result set listing the languages ordered by confidence.
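The script-detection step can be sketched with the standard unicode package, which exposes one range table per script. The example below is an assumption about how such a step might look, not langdet's actual code: the candidates list and the dominantScript helper are hypothetical, and the real package may weigh runes differently.

```go
package main

import (
	"fmt"
	"unicode"
)

// script pairs a human-readable name with a Unicode range table.
type script struct {
	name  string
	table *unicode.RangeTable
}

// candidates is an illustrative set of scripts, not the package's defaults.
var candidates = []script{
	{"Latin", unicode.Latin},
	{"Greek", unicode.Greek},
	{"Hangul", unicode.Hangul},
	{"Cyrillic", unicode.Cyrillic},
}

// dominantScript returns the name of the candidate script that covers the
// most runes in text, or "" if no rune belongs to any candidate.
// On a tie, the script listed first in candidates wins.
func dominantScript(text string) string {
	counts := make([]int, len(candidates))
	for _, r := range text {
		for i, s := range candidates {
			if unicode.Is(s.table, r) {
				counts[i]++
				break
			}
		}
	}
	best, bestCount := "", 0
	for i, c := range counts {
		if c > bestCount {
			best, bestCount = candidates[i].name, c
		}
	}
	return best
}

func main() {
	fmt.Println(dominantScript("Καλημέρα κόσμε"))
	fmt.Println(dominantScript("안녕하세요"))
	fmt.Println(dominantScript("Goedemorgen"))
}
```

Once the script is known, a script such as Hangul maps to a single language and needs no further work, while Latin text still has to be matched against every Latin-script profile.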
Install
go get -u github.com/askeladdk/langdet
Quickstart
Use DetectLanguage to detect the language of a string. It returns the BCP 47 language tag of the language with the highest probability. If no language was detected, the function returns language.Und.
detectedLanguage := langdet.DetectLanguage(s)
Use DetectLanguageWithOptions if you need more control. DetectLanguage is a shorthand for this function using DefaultOptions. Unlike DetectLanguage, DetectLanguageWithOptions returns a slice of Result values listing the probabilities of all languages that use the detected writing script, ordered by probability.
results := langdet.DetectLanguageWithOptions(s, langdet.DefaultOptions)
Use Options to configure the detector. Any number of writing scripts and languages can be detected by setting the Scripts and Languages fields. Use the Train function to build language profiles. Use MinConfidence and MinRelConfidence to filter languages by confidence.
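// Assumes the "unicode" and "golang.org/x/text/language" imports.

// Build a custom language profile from training text.
myLang := langdet.Language{
    Tag:      language.Make("zz"),
    Trigrams: langdet.Train(trainingSet),
}

// Detect Latin-script text only, matching against Dutch, French, and myLang.
options := langdet.Options{
    Scripts: []*unicode.RangeTable{
        unicode.Latin,
    },
    Languages: map[*unicode.RangeTable]langdet.Languages{
        unicode.Latin: {
            Languages: []langdet.Language{
                langdet.Dutch,
                langdet.French,
                myLang,
            },
        },
    },
}

results := langdet.DetectLanguageWithOptions(s, options)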
Read the rest of the documentation on pkg.go.dev. It's easy-peasy!
License
Package langdet is released under the terms of the ISC license.