
A framework that enables efficient extraction of structured data from unstructured text using large language models (LLMs).
⚠️ This tool is a prototype in active development and may change significantly. Always verify results!
LLM Extractinator enables efficient extraction of structured data from unstructured text using large language models (LLMs). It supports configurable task definitions, CLI or Python usage, a point‑and‑click GUI Studio, and flexible data input/output formats.
📘 Full documentation: https://DIAGNijmegen.github.io/llm_extractinator/
Install Ollama. On Linux:
curl -fsSL https://ollama.com/install.sh | sh
On Windows or macOS, download the installer from: https://ollama.com/download
Create a fresh conda environment:
conda create -n llm_extractinator python=3.11
conda activate llm_extractinator
Install the package via pip:
pip install llm_extractinator
Or from source:
git clone https://github.com/DIAGNijmegen/llm_extractinator.git
cd llm_extractinator
pip install -e .
Tip: to be able to run the latest models, update the Ollama client regularly:
pip install --upgrade ollama langchain-ollama
Starting with v0.4, Extractinator ships with a Streamlit‑based Studio for designing, running, and monitoring extraction tasks without writing code:
launch-extractinator # opens http://localhost:8501 in your browser
Features
| Feature | Description |
| --- | --- |
| 🗂️ Project Manager | Create / select datasets, parsers and tasks with file previews |
| 🔧 Parser Builder | Visual Pydantic schema designer (nested models supported) |
| 🚀 One‑click Runs | Configure model, sampling & advanced flags, then watch live logs |
| 🛠️ Task JSON Wizard | Step‑by‑step helper to generate valid TaskXXX.json files |
| 🆘 Help bubbles everywhere | Inline docs so you never lose context |
The Studio is fully optional: anything you configure here can still be executed from the CLI or Python API.
Studio:
launch-extractinator # recommended for new users
CLI:
extractinate --task_id 001 --model_name "phi4"
Python:
from llm_extractinator import extractinate
extractinate(task_id=1, model_name="phi4")
Each task is defined by a JSON file stored in tasks/.
Filename format:
TaskXXX_name.json
Example:
{
  "Description": "Extract product data from text.",
  "Data_Path": "products.csv",
  "Input_Field": "text",
  "Parser_Format": "product_parser.py"
}
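For those who prefer to generate task files from a script, the following is a minimal sketch using only the Python standard library. The helper name write_task_file and the EXAMPLE_KEYS tuple are hypothetical and not part of llm_extractinator. The key list is copied from the example above, so the tool may accept or require other keys.

```python
import json
from pathlib import Path

# Keys copied from the example task above. Assumption: the tool may
# accept or require additional keys that are not shown in this README.
EXAMPLE_KEYS = ("Description", "Data_Path", "Input_Field", "Parser_Format")

task = {
    "Description": "Extract product data from text.",
    "Data_Path": "products.csv",
    "Input_Field": "text",
    "Parser_Format": "product_parser.py",
}


def write_task_file(tasks_dir, task_id, name, task):
    """Write a task dict to <tasks_dir>/TaskXXX_name.json and return the path.

    Hypothetical helper, not part of llm_extractinator.
    """
    missing = [key for key in EXAMPLE_KEYS if key not in task]
    if missing:
        raise ValueError(f"task is missing keys: {missing}")
    tasks_dir = Path(tasks_dir)
    tasks_dir.mkdir(parents=True, exist_ok=True)
    # Zero-pad the ID to three digits to match the TaskXXX_name.json pattern.
    path = tasks_dir / f"Task{task_id:03d}_{name}.json"
    path.write_text(json.dumps(task, indent=2), encoding="utf-8")
    return path


if __name__ == "__main__":
    print(write_task_file("tasks", 1, "products", task))
```

Run as a script, this writes tasks/Task001_products.json, which corresponds to task ID 1 in the usage examples.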
Parser_Format points to a .py file in tasks/parsers/ that implements a Pydantic OutputParser model used to structure the LLM output.
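As an illustration, a parser file for the product example might look like the sketch below, saved as tasks/parsers/product_parser.py. The fields (name, price, products) and the nested Product model are hypothetical and chosen only for this example. The sketch assumes the top-level model must be named OutputParser, as the description above suggests, so check the parser docs for the exact requirements.

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class Product(BaseModel):
    """One product mentioned in the input text (hypothetical fields)."""

    name: str = Field(description="Product name as written in the text")
    price: Optional[float] = Field(
        default=None, description="Price, if one is stated in the text"
    )


class OutputParser(BaseModel):
    """Top-level model describing the structure the LLM output should take."""

    products: List[Product] = Field(
        default_factory=list, description="All products found in the text"
    )
```

Pydantic validates nested dictionaries against the nested model, so OutputParser(products=[{"name": "Widget", "price": 9.99}]) yields a Product instance in the products list.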
If you prefer a graphical approach to designing parsers, run:
build-parser
This starts the same builder embedded in the Studio, letting you assemble nested Pydantic models visually. Save the resulting .py file in tasks/parsers/ and reference it via Parser_Format.
👉 Read the parser docs for full details.
If you use this tool, please cite: https://doi.org/10.5281/zenodo.15089764
We welcome pull requests! See the contributing guide for details.