A regular-expression constraint for transformers language models. With this module, you can force LLMs to generate text that follows your regex. Extracting tokens or tensors with regular expressions is also implemented in this project.
pip install transformers-re
A regex constraint logits processor for transformers.
Usage example:
>>> with RegexLogitsProcessor(tokenizer, prompt, pattern) as regex_logits_processor:
...     model.generate(logits_processor=[regex_logits_processor])
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers_re import RegexLogitsProcessor
if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-14B-Chat-Int4')
    model = AutoModelForCausalLM.from_pretrained('Qwen/Qwen-14B-Chat-Int4').eval().to('cuda')  # load your own model
    # Prompt: "Please write me a poem expressing homesickness"
    prompt = "<|im_start|>user\n请帮我写一首表达思乡之情的诗<|im_end|>\n<|im_start|>assistant\n"
    # Constrain the output to four five-character lines beginning with 一, 键, 三, 连
    pattern = r"一[\u4e00-\u9fa5]{4},键[\u4e00-\u9fa5]{4}。三[\u4e00-\u9fa5]{4},连[\u4e00-\u9fa5]{4}。"
    with RegexLogitsProcessor(tokenizer, prompt, pattern, num_proc=16, debug=True,
                              fail_strategy=tokenizer.encode("<|im_end|><|endoftext|>")) as regex_logits_processor:
        input_ids = tokenizer(prompt, return_tensors="pt").to('cuda')["input_ids"]
        outputs = model.generate(input_ids, max_new_tokens=40, logits_processor=[regex_logits_processor])
        print(tokenizer.decode(outputs[0]))
# <|im_start|>user
# 请帮我写一首表达思乡之情的诗<|im_end|>
# <|im_start|>assistant
# 一叶扁舟下,键桥夜月凉。三秋雁南去,连年心自伤。<|im_end|><|endoftext|>
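For intuition, here is a minimal, unoptimized sketch of the idea behind such a processor. This is an illustration, not the library's actual implementation: it assumes the third-party regex module (which supports partial matching) and batch size 1. Every candidate token is checked for whether the text generated so far, plus that token, can still grow into a full match; tokens that cannot are masked to -inf.

import regex  # third-party "regex" module: supports partial (prefix) matching
import torch
from transformers import LogitsProcessor

class NaiveRegexLogitsProcessor(LogitsProcessor):
    def __init__(self, tokenizer, prompt_len, pattern):
        self.tokenizer = tokenizer
        self.prompt_len = prompt_len  # number of prompt tokens to skip when decoding
        self.pattern = regex.compile(pattern)

    def __call__(self, input_ids, scores):
        # Assumes batch size 1: decode only the freshly generated text
        generated = self.tokenizer.decode(input_ids[0, self.prompt_len:])
        mask = torch.full_like(scores, float("-inf"))
        for token_id in range(scores.shape[-1]):
            candidate = generated + self.tokenizer.decode([token_id])
            # partial=True also accepts strings that are a prefix of a full match
            if self.pattern.fullmatch(candidate, partial=True):
                mask[0, token_id] = 0.0
        return scores + mask

Scanning the entire vocabulary with a fresh match at every step is exactly the cost that num_proc parallelizes; the performance note at the end of this README returns to it.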
A regex prefix constraint for transformers.
Usage example:
>>> with RegexPrefixProcessor(tokenizer, prompt, pattern) as regex_prefix_processor:
...     model.generate(prefix_allowed_tokens_fn=regex_prefix_processor)
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers_re import RegexPrefixProcessor
if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained('YeungNLP/firefly-1b4')
    model = AutoModelForCausalLM.from_pretrained('YeungNLP/firefly-1b4').eval().to('cuda')  # load your own model
    # Prompt: "Please write me a poem expressing homesickness"
    prompt = "<s>请帮我写一首表达思乡之情的诗</s></s>"
    pattern = r"一[\u4e00-\u9fa5]{4},键[\u4e00-\u9fa5]{4}。三[\u4e00-\u9fa5]{4},连[\u4e00-\u9fa5]{4}。"
    with RegexPrefixProcessor(tokenizer, prompt, pattern, num_proc=16) as regex_prefix_processor:
        input_ids = tokenizer(prompt, return_tensors="pt").to('cuda')["input_ids"]
        outputs = model.generate(input_ids, max_new_tokens=20, prefix_allowed_tokens_fn=regex_prefix_processor)
        print(tokenizer.decode(outputs[0]))
# <s>请帮我写一首表达思乡之情的诗</s></s>一叶落叶归,键盘敲击声。三更梦回乡,连日思乡情。</s>
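RegexPrefixProcessor plugs into transformers' built-in constrained-decoding hook instead of the logits-processor list: generate calls the function with the batch index and the 1-D tensor of tokens so far, and expects back the list of token ids allowed next. A sketch of that contract (a hypothetical helper, not the library's code; tokenizer, prompt_len, and a regex-module pattern compiled as in the earlier sketch are assumed):

import torch

def allowed_tokens_fn(batch_id: int, input_ids: torch.Tensor) -> list:
    # Decode only the freshly generated part of this sequence
    generated = tokenizer.decode(input_ids[prompt_len:])
    # Keep every token whose addition can still extend to a full match
    return [
        token_id
        for token_id in range(len(tokenizer))
        if pattern.fullmatch(generated + tokenizer.decode([token_id]), partial=True)
    ]

# outputs = model.generate(input_ids, prefix_allowed_tokens_fn=allowed_tokens_fn)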
A regex pattern compiled with a tokenizer that can be applied to tokens or tensors.
The regex match result corresponding to a TokenizedPattern and a string.
Usage example:
# `tokenizer` is any loaded tokenizer, e.g. from AutoTokenizer.from_pretrained(...)
s = "This is an example text"
token_ids = tokenizer(s, return_tensors="pt").input_ids
tokens = [tokenizer.decode(t) for t in tokenizer(s).input_ids]
print(token_ids, tokens, sep="\n")
"""
tensor([[3180, 579, 593, 3392, 2895]])
['This', ' is', ' an', ' example', ' text']
"""
from transformers_re import TokenizedPattern
a, b = TokenizedPattern(tokenizer, "ample(.*)", "expand").search(s).span()  # Get the text span, expanded to token boundaries
print(s[a:b])
# " example text"
a, b = TokenizedPattern(tokenizer, "ample(.*)", "shrink").search(s).span()  # The "shrink" strategy shrinks to token boundaries
print(s[a:b])
# " text"
a, b = TokenizedPattern(tokenizer, "ample(.*)", "expand").search(s).span(of_token=True) # Get token span
print(token_ids[0, a:b])
# tensor([3392, 2895])
a, b = TokenizedPattern(tokenizer, "ample(.*)", "expand").search(s).span(index=1, of_token=True) # Select group 1 by index
print(token_ids[0, a:b])
# tensor([2895])
mask = TokenizedPattern(tokenizer, "ample(.*)", "expand").search(s).mask() # Get mask tensor
print(mask)
# tensor([0, 0, 0, 1, 1])
import torch
torch.manual_seed(42)
x = torch.randint(0, 10, (5, 5))
"""
tensor([[2, 7, 6, 4, 6],
[5, 0, 4, 0, 3],
[8, 4, 0, 4, 1],
[2, 5, 5, 7, 6],
[9, 6, 3, 1, 9]])
"""
print(TokenizedPattern(tokenizer, "ample(.*)", "expand").search(s).masked_select(x, dim=0))
"""
tensor([[2, 5, 5, 7, 6],
[9, 6, 3, 1, 9]])
"""
print(TokenizedPattern(tokenizer, "ample(.*)", "expand").search(s).masked_select(x, dim=1))
"""
tensor([[4, 6],
[0, 3],
[4, 1],
[7, 6],
[1, 9]])
"""
The main bottleneck of regex-constrained generation is the regex matching itself. Although this project can use multiprocessing to accelerate the procedure, regex matching still accounts for most of the runtime. The cost could be reduced further if incremental regex matching or a GPU regex engine were implemented in this project.
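For illustration, the incremental-matching idea is to compile the regex into a DFA once and carry one DFA state per sequence across decoding steps, so each new token costs time proportional to its own length instead of re-matching the whole generated string. Below is a toy sketch with a hand-written transition table (hypothetical: Python's re/regex modules do not expose their compiled automata, so the table is hard-coded for the pattern "ab*c"):

# DFA for the toy pattern "ab*c": 0 --a--> 1, 1 --b--> 1, 1 --c--> 2 (accepting)
DFA = {(0, "a"): 1, (1, "b"): 1, (1, "c"): 2}
ACCEPTING = {2}

def step(state, text):
    """Advance the DFA over the new characters only."""
    for ch in text:
        state = DFA.get((state, ch))
        if state is None:
            return None  # dead state: no continuation can ever match
    return state

state = step(0, "a")        # after generating "a"  -> state 1
state = step(state, "bb")   # after generating "bb" -> state 1
print(state is not None)    # True: "abb" is still a viable prefix
print(state in ACCEPTING)   # False: "abb" is not yet a full match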