DashText Python Library
DashText is a Python package for DashVector's sparse-dense (hybrid) semantic search. It contains a set of text utilities and an integrated tool named SparseVectorEncoder.
Installation
To install the DashText Client, simply run:
pip install dashtext
QuickStart
SparseVector Encoding
It is easy to convert a text corpus to sparse vectors with DashText's default models.
from dashtext import SparseVectorEncoder
encoder = SparseVectorEncoder.default('zh')
# Encode a document (for indexing):
document = "向量检索服务DashVector基于达摩院自研的高效向量引擎Proxima内核,提供具备水平拓展能力的云原生、全托管的向量检索服务。"
print(encoder.encode_documents(document))

# Encode a query (for searching):
query = "什么是向量检索服务?"
print(encoder.encode_queries(query))
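Both methods return the sparse representation as a Python dict mapping token hash IDs to float weights; per the reference below, a list input yields a list of such dicts. A quick shape check, continuing the snippet above as a sketch:
doc_vector = encoder.encode_documents(document)
assert isinstance(doc_vector, dict)    # Dict[int, float]: token hash -> weight
batch = encoder.encode_documents([document, query])
assert isinstance(batch, list)         # List[Dict] for multiple inputs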
SparseVector Parameters
The SparseVectorEncoder class is based on the BM25 algorithm, so it exposes the parameters required by BM25 as well as several text-processing parameters.
- b: document length normalization, required by BM25 (default: 0.75).
- k1: term frequency saturation, required by BM25 (default: 1.2).
- tokenize_function: tokenization function, such as SentencePiece or a GPT tokenizer from Transformers; its output may be a string array or an integer array (default: Jieba, type: Callable[[str], List[str]]).
- hash_function: hash function used to convert tokens to numbers after tokenization (default: mmh3 hash, type: Callable[[Union[str, int]], int]).
- hash_bucket_function: function used to divide hash values into a finite number of buckets (default: None, type: Callable[[int], int]).
from dashtext import SparseVectorEncoder
from dashtext import TextTokenizer
tokenizer = TextTokenizer.from_pretrained("Jieba", stop_words=True)
encoder = SparseVectorEncoder(b=0.75, k1=1.2, tokenize_function=tokenizer.tokenize)
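For illustration, the sketch below wires all three callable hooks together. The helper names (tokenize, hash_token, bucketize) and the bucket count are hypothetical, not part of DashText; mmh3 is the library's documented default hash:
import mmh3
from typing import List, Union
from dashtext import SparseVectorEncoder

NUM_BUCKETS = 1 << 20  # hypothetical bucket count, for illustration only

def tokenize(text: str) -> List[str]:
    # Hypothetical whitespace tokenizer; any Callable[[str], List[str]] works.
    return text.split()

def hash_token(token: Union[str, int]) -> int:
    # mmh3 hash, matching the documented default hash_function.
    return mmh3.hash(str(token), signed=False)

def bucketize(value: int) -> int:
    # Fold hash values into a finite number of buckets.
    return value % NUM_BUCKETS

encoder = SparseVectorEncoder(
    b=0.75,
    k1=1.2,
    tokenize_function=tokenize,
    hash_function=hash_token,
    hash_bucket_function=bucketize,
)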
Reference
Encode Documents
encode_documents(texts: Union[str, List[str], List[int], List[List[int]]]) -> Union[Dict, List[Dict]]
Parameters | Type | Required | Description
--- | --- | --- | ---
texts | str / List[str] / List[int] / List[List[int]] | Yes | str: a single text; List[str]: multiple texts; List[int]: hash representation of a single text; List[List[int]]: hash representations of multiple texts
Example:
# 1. A single text (str):
texts1 = "DashVector将其强大的向量管理、向量查询等多样化能力,通过简洁易用的SDK/API接口透出,方便被上层AI应用迅速集成"
result = encoder.encode_documents(texts1)

# 2. Multiple texts (List[str]):
texts2 = ["DashVector将其强大的向量管理、向量查询等多样化能力,通过简洁易用的SDK/API接口透出,方便被上层AI应用迅速集成",
          "从而为包括大模型生态、多模态AI搜索、分子结构分析在内的多种应用场景,提供所需的高效向量检索能力"]
result = encoder.encode_documents(texts2)

# 3. Hash representation of a single text (List[int]):
texts3 = [1218191817, 2673099881, 2982218203, 3422996809]
result = encoder.encode_documents(texts3)

# 4. Hash representations of multiple texts (List[List[int]]):
texts4 = [[1218191817, 2673099881, 2982218203, 3422996809], [2673099881, 2982218203, 3422996809, 771291085, 741580288]]
result = encoder.encode_documents(texts4)
Encode Queries
encode_queries(texts: Union[str, List[str], List[int], List[List[int]]]) -> Union[Dict, List[Dict]]
The input format is the same as for the encode_documents method.
Example:
texts = "什么是向量检索服务?"
result = encoder.encode_queries(texts)
Train / Dump / Load DashText Model
Train
train(corpus: Union[str, List[str], List[int], List[List[int]]]) -> None
Parameters | Type | Required | Description
--- | --- | --- | ---
corpus | str / List[str] / List[int] / List[List[int]] | Yes | str: a single text; List[str]: multiple texts; List[int]: hash representation of a single text; List[List[int]]: hash representations of multiple texts
Example:
corpus = [
"向量检索服务DashVector基于达摩院自研的高效向量引擎Proxima内核,提供具备水平拓展能力的云原生、全托管的向量检索服务",
"DashVector将其强大的向量管理、向量查询等多样化能力,通过简洁易用的SDK/API接口透出,方便被上层AI应用迅速集成",
"从而为包括大模型生态、多模态AI搜索、分子结构分析在内的多种应用场景,提供所需的高效向量检索能力",
"简单灵活、开箱即用的SDK,使用极简代码即可实现向量管理",
"自研向量相似性比对算法,快速高效稳定服务",
"Schema-free设计,通过Schema实现任意条件下的组合过滤查询"
]
encoder.train(corpus)
# Persist the trained parameters (see Dump and Load below):
encoder.dump("./dump_paras.json")
Dump and Load
dump(path: str) -> None
load(path: str) -> None
Parameters | Type | Required | Description
--- | --- | --- | ---
path | str | Yes | The dump method writes the model parameters as a JSON file to the specified path; the load method reads model parameters from a JSON file path or URL
The input path can be either relative or absolute, but it must point to a specific file, e.g. "./test_dump.json". A URL must start with "http://" or "https://".
Example:
encoder.dump("./model.json")
encoder.load("./model.json")
encoder.load("https://example.com/model.json")
Default DashText Models
If you want to use the default BM25 model of SparseVectorEncoder, call the default method.
default(name: str = 'zh') -> "SparseVectorEncoder"
Parameters | Type | Required | Description
--- | --- | --- | ---
name | str | No | Currently supports Chinese and English default models; the Chinese model name is 'zh' (default), the English model name is 'en'
Example:
import dashtext

encoder = dashtext.SparseVectorEncoder.default()
encoder.encode_documents("DashVector将其强大的向量管理、向量查询等多样化能力,通过简洁易用的SDK/API接口透出,方便被上层AI应用迅速集成")
encoder.encode_queries("什么是向量检索服务?")
Extend Tokenizer
DashText comes with a built-in Jieba tokenizer that users can readily use (the default SparseVectorEncoder is trained with this Jieba tokenizer). However, when a proprietary corpus is involved, a customized tokenizer is needed. To solve this problem, DashText offers two flexible options:
- Option 1: Use the TextTokenizer.from_pretrained() method to create a customized built-in Jieba tokenizer. Users can specify an original dictionary, a user-defined dictionary, and stopwords for a quick start (see the first sketch after this list). If the Jieba tokenizer meets the requirements, this option is the more suitable one.
TextTokenizer.from_pretrained(cls, model_name: str = 'Jieba', *inputs, **kwargs) -> "BaseTokenizer"
Parameters | Type | Required | Description
--- | --- | --- | ---
model_name | str | Yes | Currently only supports Jieba
dict | str | No | Dict path. Defaults to dict.txt.big
user_dict | str | No | Extra user dict path. Defaults to data/jieba/user_dict.txt (an empty file)
stop_words | Union[bool, Dict[str, Any], List[str], Set[str]] | No | Stop words. Defaults to False. True/False: True means using the pre-defined stopwords, False means not using any stopwords. Dict/List/Set: user-defined stopwords; a Dict or List will be converted to a Set
- Option 2: Use any customized tokenizer by providing a callable function with the signature Callable[[str], List[str]] (see the second sketch after this list). This alternative grants users more freedom to tailor the tokenizer to specific needs. If a preferred tokenizer already fits particular requirements, this option allows users to integrate it directly into the workflow.
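First, a sketch of Option 1 with user-supplied resources (the dictionary path and the stopword set are illustrative placeholders):
from dashtext import SparseVectorEncoder, TextTokenizer

tokenizer = TextTokenizer.from_pretrained(
    "Jieba",
    user_dict="./my_user_dict.txt",  # illustrative path to an extra user dict
    stop_words={"的", "了", "是"},    # user-defined stopwords (Set[str])
)
encoder = SparseVectorEncoder(tokenize_function=tokenizer.tokenize)
Second, a sketch of Option 2 that plugs a Hugging Face tokenizer into SparseVectorEncoder (this assumes the transformers package; the model name is illustrative, and the wrapper simply satisfies Callable[[str], List[str]]):
from typing import List
from transformers import AutoTokenizer
from dashtext import SparseVectorEncoder

hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

def tokenize(text: str) -> List[str]:
    # AutoTokenizer.tokenize returns a list of subword strings,
    # which matches the required Callable[[str], List[str]] signature.
    return hf_tokenizer.tokenize(text)

encoder = SparseVectorEncoder(tokenize_function=tokenize)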
Combining Sparse and Dense Encodings for Hybrid Search in DashVector
combine_dense_and_sparse(dense_vector: Union[List[float], np.ndarray], sparse_vector: Dict[int, float], alpha: float) -> Tuple[Union[List[float], np.ndarray], Dict[int, float]]
Parameters | Type | Required | Description
--- | --- | --- | ---
dense_vector | Union[List[float], np.ndarray] | Yes | Dense vector
sparse_vector | Dict[int, float] | Yes | Sparse vector generated by the encode_documents or encode_queries method
alpha | float | Yes | alpha controls the weights of the sparse and dense vectors: alpha=0.0 means sparse vector only, alpha=1.0 means dense vector only
Example:
from dashtext import combine_dense_and_sparse
dense_vector = [0.02428389742874429,0.02036450577918233,0.00758973862139133,-0.060652585776971274,0.03321684423003758,-0.019009049500375488,0.015808212986566556,0.0037662904132509424,-0.0178332320055069]
sparse_vector = encoder.encode_documents("DashVector将其强大的向量管理、向量查询等多样化能力,通过简洁易用的SDK/API接口透出,方便被上层AI应用迅速集成")
scaled_dense_vector, scaled_sparse_vector = combine_dense_and_sparse(dense_vector, sparse_vector, 0.8)
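Given the alpha convention in the table (alpha=1.0 keeps only the dense part), the combination behaves like a convex weighting of the two vectors. A plain-Python sketch of that convention follows; this illustrates the weighting idea, not the library's exact implementation:
from typing import Dict, List, Tuple

def combine_sketch(dense: List[float], sparse: Dict[int, float], alpha: float) -> Tuple[List[float], Dict[int, float]]:
    # Scale the dense vector by alpha and the sparse weights by (1 - alpha).
    scaled_dense = [alpha * x for x in dense]
    scaled_sparse = {token: (1.0 - alpha) * weight for token, weight in sparse.items()}
    return scaled_dense, scaled_sparse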
License
This project is licensed under the Apache License (Version 2.0).