mygithub.libinneed.workers.dev/sugarme/tokenizer


Overview

tokenizer is a pure Go package that facilitates training, testing, and inference of Natural Language Processing (NLP) models in Go.

It is heavily inspired by and based on the popular HuggingFace Tokenizers.

tokenizer is part of an ambitious goal (together with transformer and gotch) to bring more AI/deep-learning tools to Gophers so that they can stick to the language they love and build faster software in production.

Features

tokenizer is built from modules located in sub-packages; a toy sketch of how the stages compose follows the list.

  • Normalizer
  • Pretokenizer
  • Tokenizer
  • Post-processing
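
These stages run in sequence: the normalizer cleans raw text, the pretokenizer splits it into word-level pieces, the tokenizer model breaks pieces into tokens, and post-processing adds special tokens. The toy sketch below illustrates that flow only; the function names and signatures are illustrative, not the package's actual API:

package main

import (
	"fmt"
	"strings"
)

// Toy stand-ins for the four pipeline stages. The real package exposes
// each stage as its own sub-package with richer interfaces.

// normalize cleans raw text (here, just lowercasing).
func normalize(s string) string { return strings.ToLower(s) }

// pretokenize splits normalized text into word-level pieces.
func pretokenize(s string) []string { return strings.Fields(s) }

// model maps each piece to tokens (a trivial word-level model here).
func model(pieces []string) []string { return pieces }

// postProcess wraps the token sequence in special tokens.
func postProcess(tokens []string) []string {
	return append(append([]string{"[CLS]"}, tokens...), "[SEP]")
}

func main() {
	tokens := postProcess(model(pretokenize(normalize("The Gophers craft code"))))
	fmt.Println(tokens) // [[CLS] the gophers craft code [SEP]]
}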

It implements various tokenizer models:

  • Word level model
  • Wordpiece model
  • Byte Pair Encoding (BPE)

It can be used both for training new models from scratch and for fine-tuning existing models. See the examples for details.
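To see how the Wordpiece model arrives at sub-word tokens like "go", "##pher", "##s" in the example further below, here is a minimal sketch of its greedy longest-match loop. The vocabulary is a made-up toy and the code is illustrative (ASCII input assumed), not the package's implementation:

package main

import "fmt"

// wordpiece splits a single word by greedy longest-match against vocab.
// Word-internal continuations are looked up with a "##" prefix.
func wordpiece(word string, vocab map[string]bool) []string {
	var pieces []string
	start := 0
	for start < len(word) {
		match := ""
		end := len(word)
		for end > start {
			sub := word[start:end]
			if start > 0 {
				sub = "##" + sub
			}
			if vocab[sub] {
				match = sub
				break
			}
			end--
		}
		if match == "" {
			return []string{"[UNK]"} // no vocabulary piece matched
		}
		pieces = append(pieces, match)
		start = end
	}
	return pieces
}

func main() {
	vocab := map[string]bool{"go": true, "##pher": true, "##s": true}
	fmt.Println(wordpiece("gophers", vocab)) // [go ##pher ##s]
}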

Basic example

This tokenizer package can load pretrained models from Hugging Face. Some of them can be loaded via the pretrained subpackage.

package main

import (
	"fmt"
	"log"

	"mygithub.libinneed.workers.dev/sugarme/tokenizer"
	"mygithub.libinneed.workers.dev/sugarme/tokenizer/pretrained"
)

func main() {
	// Download and cache a pretrained tokenizer. Here it is `bert-base-uncased`
	// from Hugging Face; it can be any model with a `tokenizer.json` available,
	// e.g. `tiiuae/falcon-7b`.
	configFile, err := tokenizer.CachedPath("bert-base-uncased", "tokenizer.json")
	if err != nil {
		log.Fatal(err)
	}

	tk, err := pretrained.FromFile(configFile)
	if err != nil {
		log.Fatal(err)
	}

	sentence := `The Gophers craft code using [MASK] language.`
	en, err := tk.EncodeSingle(sentence)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("tokens: %q\n", en.Tokens)
	fmt.Printf("offsets: %v\n", en.Offsets)

	// Output:
	// tokens: ["the" "go" "##pher" "##s" "craft" "code" "using" "[MASK]" "language" "."]
	// offsets: [[0 3] [4 6] [6 10] [10 11] [12 17] [18 22] [23 28] [29 35] [36 44] [44 45]]
}

All models can also be loaded manually from files. See pkg.go.dev for detailed API documentation.
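For instance, a tokenizer.json that is already on disk (exported from Hugging Face or saved from a previous run) can be passed to pretrained.FromFile directly, skipping the caching step. A short sketch; the file path is a placeholder:

package main

import (
	"fmt"
	"log"

	"mygithub.libinneed.workers.dev/sugarme/tokenizer/pretrained"
)

func main() {
	// Load a tokenizer config straight from a local file.
	// "./tokenizer.json" is a placeholder path.
	tk, err := pretrained.FromFile("./tokenizer.json")
	if err != nil {
		log.Fatal(err)
	}

	en, err := tk.EncodeSingle("Hello, Gophers!")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("tokens: %q\n", en.Tokens)
}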

Getting Started

License

tokenizer is Apache 2.0 licensed.

Acknowledgement
