DashText is a Python package for DashVector's sparse-dense (hybrid) semantic search. It contains a series of text utilities and an integrated tool named SparseVectorEncoder.
To install the DashText client, simply run:

```shell
pip install dashtext
```
It's easy to convert a text corpus to sparse vectors in DashText with the default models.

```python
from dashtext import SparseVectorEncoder

# Initialize an encoder instance and load a default model in DashText
encoder = SparseVectorEncoder.default('zh')

# Encode a new document (for upsert to DashVector)
document = "向量检索服务DashVector基于达摩院自研的高效向量引擎Proxima内核,提供具备水平拓展能力的云原生、全托管的向量检索服务。"
print(encoder.encode_documents(document))
# {380823393: 0.7262431704356519, 414191989: 0.7262431704356519, 565176162: 0.7262431704356519, 904594806: 0.7262431704356519, 1005505802: 0.7262431704356519, 1169440797: 0.8883757984694465, 1240922502: 0.7262431704356519, 1313971048: 0.7262431704356519, 1317077351: 0.7262431704356519, 1490140460: 0.7262431704356519, 1574737055: 0.7262431704356519, 1760434515: 0.7262431704356519, 2045788977: 0.8414146776926797, 2141666983: 0.7262431704356519, 2509543087: 0.7262431704356519, 3180265193: 0.7262431704356519, 3845702398: 0.7262431704356519, 4106887295: 0.7262431704356519}

# Encode a query (for search in DashVector)
query = "什么是向量检索服务?"
print(encoder.encode_queries(query))
```
The `SparseVectorEncoder` class is based on the BM25 algorithm, so it exposes the parameters required by BM25 along with parameters for text processing:

- `b`: document length normalization required by BM25 (default: 0.75).
- `k1`: term frequency saturation required by BM25 (default: 1.2).
- `tokenize_function`: tokenization function, such as SentencePiece or a GPT tokenizer from Transformers; its output may be a string array or an integer array (default: Jieba, type: `Callable[[str], List[str]]`).
- `hash_function`: hash function used to convert tokens to numbers after tokenizing (default: mmh3 hash, type: `Callable[[Union[str, int]], int]`).
- `hash_bucket_function`: function used to divide hash values into a finite number of buckets (default: None, type: `Callable[[int], int]`).

```python
from dashtext import SparseVectorEncoder
from dashtext import TextTokenizer

tokenizer = TextTokenizer.from_pretrained("Jieba", stop_words=True)
encoder = SparseVectorEncoder(b=0.75, k1=1.2, tokenize_function=tokenizer.tokenize)
```
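As a reminder of where `b` and `k1` enter the score, the textbook BM25 term weight (a generic sketch, not necessarily DashText's exact implementation) looks like this:

```python
import math

def bm25_term_weight(tf: int, doc_len: int, avg_doc_len: float,
                     n_docs: int, doc_freq: int,
                     k1: float = 1.2, b: float = 0.75) -> float:
    # idf: rarer terms score higher
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b controls document length normalization,
    # k1 controls term frequency saturation
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)
```

With `b = 0` document length is ignored entirely; larger `k1` makes repeated occurrences of a term keep adding to the score for longer before saturating.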
`encode_documents(texts: Union[str, List[str], List[int], List[List[int]]]) -> Union[Dict, List[Dict]]`

| Parameters | Type | Required | Description |
| --- | --- | --- | --- |
| texts | `str` / `List[str]` / `List[int]` / `List[List[int]]` | Yes | `str`: single text; `List[str]`: multiple texts; `List[int]`: hash representation of a single text; `List[List[int]]`: hash representation of multiple texts |
Example:

```python
# single text
texts1 = "DashVector将其强大的向量管理、向量查询等多样化能力,通过简洁易用的SDK/API接口透出,方便被上层AI应用迅速集成"
result = encoder.encode_documents(texts1)

# multiple texts
texts2 = ["DashVector将其强大的向量管理、向量查询等多样化能力,通过简洁易用的SDK/API接口透出,方便被上层AI应用迅速集成",
          "从而为包括大模型生态、多模态AI搜索、分子结构分析在内的多种应用场景,提供所需的高效向量检索能力"]
result = encoder.encode_documents(texts2)

# hash representation of a single text
texts3 = [1218191817, 2673099881, 2982218203, 3422996809]
result = encoder.encode_documents(texts3)

# hash representation of multiple texts
texts4 = [[1218191817, 2673099881, 2982218203, 3422996809], [2673099881, 2982218203, 3422996809, 771291085, 741580288]]
result = encoder.encode_documents(texts4)

# result example
# {59256732: 0.7340568742689919, 863271227: 0.7340568742689919, 904594806: 0.7340568742689919, 942054413: 0.7340568742689919, 1169440797: 0.8466352922575744, 1314384716: 0.7340568742689919, 1554119115: 0.7340568742689919, 1736403260: 0.7340568742689919, 2029341792: 0.7340568742689919, 2141666983: 0.7340568742689919, 2367386033: 0.7340568742689919, 2549501804: 0.7340568742689919, 3869223639: 0.7340568742689919, 4130523965: 0.7340568742689919, 4162843804: 0.7340568742689919, 4202556960: 0.7340568742689919}
```
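The hash-representation inputs (`List[int]` / `List[List[int]]`) are token hashes produced by a `Callable[[str], int]`. The default is an mmh3 hash; purely for illustration, the sketch below uses `zlib.crc32` as a stand-in to show how tokens become the integer form (these crc32 values will not match the mmh3-based defaults, so in practice the same `hash_function` the encoder was configured with must be used):

```python
import zlib

def toy_hash(token: str) -> int:
    # Illustrative stand-in for the default mmh3 hash_function:
    # any Callable[[str], int] works as long as it is used consistently
    return zlib.crc32(token.encode("utf-8"))

tokens = ["向量", "检索", "服务"]          # e.g. output of a tokenize_function
hashed = [toy_hash(t) for t in tokens]    # the List[int] form accepted above
```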
`encode_queries(texts: Union[str, List[str], List[int], List[List[int]]]) -> Union[Dict, List[Dict]]`

The input format is the same as for the encode_documents method.

Example:

```python
# single text
texts = "什么是向量检索服务?"
result = encoder.encode_queries(texts)
```
`train(corpus: Union[str, List[str], List[int], List[List[int]]]) -> None`

| Parameters | Type | Required | Description |
| --- | --- | --- | --- |
| corpus | `str` / `List[str]` / `List[int]` / `List[List[int]]` | Yes | `str`: single text; `List[str]`: multiple texts; `List[int]`: hash representation of a single text; `List[List[int]]`: hash representation of multiple texts |

Example:

```python
corpus = [
    "向量检索服务DashVector基于达摩院自研的高效向量引擎Proxima内核,提供具备水平拓展能力的云原生、全托管的向量检索服务",
    "DashVector将其强大的向量管理、向量查询等多样化能力,通过简洁易用的SDK/API接口透出,方便被上层AI应用迅速集成",
    "从而为包括大模型生态、多模态AI搜索、分子结构分析在内的多种应用场景,提供所需的高效向量检索能力",
    "简单灵活、开箱即用的SDK,使用极简代码即可实现向量管理",
    "自研向量相似性比对算法,快速高效稳定服务",
    "Schema-free设计,通过Schema实现任意条件下的组合过滤查询"
]
encoder.train(corpus)

# use the dump method to check the trained parameters
encoder.dump("./dump_paras.json")
```
`dump(path: str) -> None`

`load(path: str) -> None`

| Parameters | Type | Required | Description |
| --- | --- | --- | --- |
| path | str | Yes | For `dump`, the path of the JSON file the model parameters are written to; for `load`, the JSON file path or URL the model parameters are loaded from |

The path can be either relative or absolute, but it must point to a specific file, e.g. `"./test_dump.json"`; a URL must start with `"http://"` or `"https://"`.
Example:

```python
# dump model
encoder.dump("./model.json")

# load model from path
encoder.load("./model.json")

# load model from url
encoder.load("https://example.com/model.json")
```
If you want to use the default BM25 model of `SparseVectorEncoder`, you can call the `default` method.

`default(name: str = 'zh') -> "SparseVectorEncoder"`

| Parameters | Type | Required | Description |
| --- | --- | --- | --- |
| name | str | No | Currently supports Chinese and English default models: the Chinese model name is 'zh' (default), the English model name is 'en'. |
Example:

```python
import dashtext

# default method
encoder = dashtext.SparseVectorEncoder.default()

# using the default model, you can directly encode documents and queries
encoder.encode_documents("DashVector将其强大的向量管理、向量查询等多样化能力,通过简洁易用的SDK/API接口透出,方便被上层AI应用迅速集成")
encoder.encode_queries("什么是向量检索服务?")
```
DashText comes with a built-in Jieba tokenizer that users can readily use (the default SparseVectorEncoder is trained with this Jieba tokenizer). However, cases that require a proprietary corpus need a customized tokenizer. To solve this problem, DashText offers two flexible options:
`TextTokenizer.from_pretrained(cls, model_name: str = 'Jieba', *inputs, **kwargs) -> "BaseTokenizer"`

| Parameters | Type | Required | Description |
| --- | --- | --- | --- |
| model_name | str | Yes | Currently only supports `Jieba`. |
| dict | str | No | Dict path. Defaults to dict.txt.big. |
| user_dict | str | No | Extra user dict path. Defaults to data/jieba/user_dict.txt (an empty file). |
| stop_words | `Union[bool, Dict[str, Any], List[str], Set[str]]` | No | Stop words. Defaults to False. True means using pre-defined stop words; False means not using any stop words. A user-defined Dict/List/Set may also be passed; Dict and List values are converted to a Set. |
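The stop-word coercion described in the last row can be sketched as follows (assumed behavior inferred from the table above, not DashText's actual code):

```python
from typing import Any, Dict, List, Set, Union

def normalize_stop_words(stop_words: Union[bool, Dict[str, Any],
                                           List[str], Set[str]]) -> Set[str]:
    # Containers are reduced to a set of words; booleans select
    # between the pre-defined stop-word list (True) and none (False),
    # which is modeled here as an empty set.
    if isinstance(stop_words, dict):
        return set(stop_words.keys())
    if isinstance(stop_words, (list, set)):
        return set(stop_words)
    return set()
```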
Alternatively, a custom `tokenize_function` of type `Callable[[str], List[str]]` can be supplied. This alternative grants users more freedom to tailor the tokenizer to specific needs: if a preferred tokenizer already fits particular requirements, this option allows it to be integrated directly into the workflow.

`combine_dense_and_sparse(dense_vector: Union[List[float], np.ndarray], sparse_vector: Dict[int, float], alpha: float) -> Tuple[Union[List[float], np.ndarray], Dict[int, float]]`
| Parameters | Type | Required | Description |
| --- | --- | --- | --- |
| dense_vector | `Union[List[float], np.ndarray]` | Yes | dense vector |
| sparse_vector | `Dict[int, float]` | Yes | sparse vector generated by the encode_documents or encode_queries method |
| alpha | float | Yes | alpha controls the computational weights of the sparse and dense vectors: alpha=0.0 means sparse vector only, alpha=1.0 means dense vector only. |
Example:

```python
from dashtext import SparseVectorEncoder, combine_dense_and_sparse

# encoder from the earlier sections, shown here so the example is self-contained
encoder = SparseVectorEncoder.default('zh')

dense_vector = [0.02428389742874429, 0.02036450577918233, 0.00758973862139133, -0.060652585776971274, 0.03321684423003758, -0.019009049500375488, 0.015808212986566556, 0.0037662904132509424, -0.0178332320055069]
sparse_vector = encoder.encode_documents("DashVector将其强大的向量管理、向量查询等多样化能力,通过简洁易用的SDK/API接口透出,方便被上层AI应用迅速集成")

# using a convex combination to generate the hybrid vector
scaled_dense_vector, scaled_sparse_vector = combine_dense_and_sparse(dense_vector, sparse_vector, 0.8)

# result example
# scaled_dense_vector: [0.019427117942995432, 0.016291604623345866, 0.006071790897113065, -0.04852206862157702, 0.026573475384030067, -0.01520723960030039, 0.012646570389253245, 0.003013032330600754, -0.014266585604405522]
# scaled_sparse_vector: {59256732: 0.14681137485379836, 863271227: 0.14681137485379836, 904594806: 0.14681137485379836, 942054413: 0.14681137485379836, 1169440797: 0.16932705845151483, 1314384716: 0.14681137485379836, 1554119115: 0.14681137485379836, 1736403260: 0.14681137485379836, 2029341792: 0.14681137485379836, 2141666983: 0.14681137485379836, 2367386033: 0.14681137485379836, 2549501804: 0.14681137485379836, 3869223639: 0.14681137485379836, 4130523965: 0.14681137485379836, 4162843804: 0.14681137485379836, 4202556960: 0.14681137485379836}
```
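The scaling in the result example matches a plain convex combination: each dense component is multiplied by alpha and each sparse weight by (1 - alpha). A minimal pure-Python sketch of that semantics (an assumed re-implementation consistent with the numbers above, not DashText's actual code):

```python
from typing import Dict, List, Tuple

def convex_combine(dense: List[float], sparse: Dict[int, float],
                   alpha: float) -> Tuple[List[float], Dict[int, float]]:
    # alpha = 1.0 keeps only the dense vector, alpha = 0.0 only the sparse one
    scaled_dense = [alpha * v for v in dense]
    scaled_sparse = {k: (1.0 - alpha) * v for k, v in sparse.items()}
    return scaled_dense, scaled_sparse

d, s = convex_combine([0.02428389742874429], {1169440797: 0.8466352922575744}, 0.8)
# d[0] equals 0.8 * 0.02428389742874429, matching the result example above
```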
This project is licensed under the Apache License (Version 2.0).