nlp-synt-data
Synthetic data tools for Natural Language Processing (NLP) and Large Language Model (LLM) tasks:
- generate prompts (and prompt ids)
- generate synthetic data (and data ids)
- retrieve prompts and data from ids (to reduce generated dataset size)
Installation
pip install nlp-synt-data
Quickstart
An example of using this library with Ollama:
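from nlp_synt_data import PromptGenerator, DataGenerator, ResponseGenerator
import ollama

# Prompt fragments, grouped under short keys.
prompts_dict = {
    "a": ["promptA0", "promptA1"],
    "b": ["promptB0", "promptB1"],
    "c": ["promptC0", "promptC1"],
    "d": ["promptD0", "promptD1"],
    "e": ["promptE0", "promptE1"],
}
# Each inner list names the keys whose prompts are combined
# (the first group yields prompt ids such as c#0_e#0).
prompts = PromptGenerator.generate(prompts_dict, [["c", "e"], ["a", "b", "d"]])

# Text templates with [KEY] placeholders, each paired with a label.
texts_with_keys = [
    ("[PERSON]", "label0"),
    ("[PERSON] is working as a [JOB] in [POS]", "label1"),
]
# Candidate (value, label) pairs for each placeholder key.
substitutions = {
    "JOB": [("job0", "labeljob0"), ("job1", "labeljob1")],
    "PERSON": [("person0", "labelperson0"), ("person1", "labelperson1")],
    "POS": [("pos0", "labelpos0"), ("pos1", "labelpos1")],
}
texts = DataGenerator.generate(texts_with_keys, substitutions)

def model_func(prompt, text):
    # The prompt is sent as the system message, the text as the user message.
    response = ollama.chat(
        model='llama3:instruct',
        messages=[
            {'role': 'system', 'content': prompt},
            {'role': 'user', 'content': text},
        ],
    )
    return response['message']['content']

# Query the model for each prompt/text pair and write the results to a CSV file.
ResponseGenerator.generate("results.csv", texts, prompts, model_func)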
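To make the text ids in the output easier to read, here is a minimal sketch of how a template could be expanded into (id, text) pairs. It is not the library's implementation: `expand_template` is a hypothetical helper, and the id layout (`t#<template index>` followed by `KEY#<index>` parts in alphabetical key order) is only inferred from the example output of this README.

```python
import itertools
import re

# Same substitutions as in the Quickstart example.
substitutions = {
    "JOB": [("job0", "labeljob0"), ("job1", "labeljob1")],
    "PERSON": [("person0", "labelperson0"), ("person1", "labelperson1")],
    "POS": [("pos0", "labelpos0"), ("pos1", "labelpos1")],
}

def expand_template(index, template, substitutions):
    """Hypothetical helper: return (text_id, text) for every substitution combination."""
    # Placeholder keys found in the template, sorted alphabetically (assumed id order).
    keys = sorted(set(re.findall(r"\[([A-Z]+)\]", template)))
    rows = []
    # Cartesian product over the candidate indices of each key.
    for combo in itertools.product(*(range(len(substitutions[k])) for k in keys)):
        text = template
        parts = [f"t#{index}"]
        for key, i in zip(keys, combo):
            value, _label = substitutions[key][i]
            text = text.replace(f"[{key}]", value)
            parts.append(f"{key}#{i}")
        rows.append(("_".join(parts), text))
    return rows

for text_id, text in expand_template(1, "[PERSON] is working as a [JOB] in [POS]", substitutions):
    print(text_id, "->", text)
```

With two candidates per key, the three-placeholder template expands to 2 x 2 x 2 = 8 texts, and the one-placeholder template to 2.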
results.csv
prompt_id | text_id | text_labels | response | text_PERSON_value | text_JOB_value | text_POS_value | text_PERSON_label | text_JOB_label | text_POS_label |
---|---|---|---|---|---|---|---|---|---|
c#0_e#0 | t#0_PERSON#0 | label0 | response | person0 | | | labelperson0 | | |
c#0_e#0 | t#0_PERSON#1 | label0 | response | person1 | | | labelperson1 | | |
c#0_e#0 | t#1_JOB#0_PERSON#0_POS#0 | label1 | response | person0 | job0 | pos0 | labelperson0 | labeljob0 | labelpos0 |
c#0_e#0 | t#1_JOB#0_PERSON#0_POS#1 | label1 | response | person0 | job0 | pos1 | labelperson0 | labeljob0 | labelpos1 |
c#0_e#0 | t#1_JOB#0_PERSON#1_POS#0 | label1 | response | person1 | job0 | pos0 | labelperson1 | labeljob0 | labelpos0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
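Because the ids encode which prompt fragments, template and substitutions were used, the full prompt and text do not have to be stored in the results. The sketch below shows one way to decode the ids by hand. The helpers `parse_id`, `text_from_id` and `prompt_parts_from_id` are hypothetical and not part of the library's API; the `KEY#index` parts joined by `_` are inferred from the table above, and keys are assumed to contain neither `_` nor `#`.

```python
# Same inputs as in the Quickstart example.
prompts_dict = {
    "a": ["promptA0", "promptA1"],
    "b": ["promptB0", "promptB1"],
    "c": ["promptC0", "promptC1"],
    "d": ["promptD0", "promptD1"],
    "e": ["promptE0", "promptE1"],
}
texts_with_keys = [
    ("[PERSON]", "label0"),
    ("[PERSON] is working as a [JOB] in [POS]", "label1"),
]
substitutions = {
    "JOB": [("job0", "labeljob0"), ("job1", "labeljob1")],
    "PERSON": [("person0", "labelperson0"), ("person1", "labelperson1")],
    "POS": [("pos0", "labelpos0"), ("pos1", "labelpos1")],
}

def parse_id(item_id):
    """Split an id such as 'c#0_e#0' into [('c', 0), ('e', 0)]."""
    return [(key, int(index)) for key, index in (part.split("#") for part in item_id.split("_"))]

def text_from_id(text_id, templates, substitutions):
    """Rebuild the substituted text that a text id refers to."""
    (_, template_index), *subs = parse_id(text_id)
    text, _label = templates[template_index]
    for key, i in subs:
        text = text.replace(f"[{key}]", substitutions[key][i][0])
    return text

def prompt_parts_from_id(prompt_id, prompts_dict):
    """Return the prompt fragments a prompt id refers to.
    How the library joins them into one prompt is not assumed here."""
    return [prompts_dict[key][i] for key, i in parse_id(prompt_id)]

print(text_from_id("t#1_JOB#0_PERSON#1_POS#0", texts_with_keys, substitutions))
print(prompt_parts_from_id("c#0_e#0", prompts_dict))
```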