
DocVec CLI is a powerful command-line tool designed to transform your unstructured local documents (PDF, DOCX, TXT) into query-ready vector embeddings, making them instantly usable for Large Language Models (LLMs) and bolstering Retrieval Augmented Generation (RAG) workflows.
Reads .pdf, .docx, and .txt files.
Splits text with langchain's RecursiveCharacterTextSplitter, with customizable chunk_size and chunk_overlap.
Uses sentence-transformers models (default: all-MiniLM-L6-v2) to create high-quality vector embeddings directly on your machine, ensuring privacy and offline capabilities.
git clone https://github.com/onurbaran/docvec-cli.git
cd docvec-cli
It’s highly recommended to use a virtual environment to manage dependencies.
python -m venv .venv
# On Windows:
.\.venv\Scripts\activate
# On macOS/Linux:
source ./.venv/bin/activate
Ensure your requirements.txt contains:
pypdf
python-docx
sentence-transformers
langchain-text-splitters
tqdm
numpy
Then run:
pip install -r requirements.txt
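After installing, it can help to confirm that every dependency is importable from the active virtual environment. The following is a minimal sketch, not part of docvec-cli: the `missing_packages` helper and the `REQUIRED_MODULES` mapping are hypothetical names introduced here. The mapping pairs each package from requirements.txt with the module name it is imported under.

```python
from importlib.util import find_spec

# Package name in requirements.txt -> module name used in "import ...".
# Note that python-docx is imported as "docx".
REQUIRED_MODULES = {
    "pypdf": "pypdf",
    "python-docx": "docx",
    "sentence-transformers": "sentence_transformers",
    "langchain-text-splitters": "langchain_text_splitters",
    "tqdm": "tqdm",
    "numpy": "numpy",
}


def missing_packages(modules=REQUIRED_MODULES):
    """Return the package names whose module cannot be found in this environment."""
    return [pkg for pkg, mod in modules.items() if find_spec(mod) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```

If anything is reported missing, the usual cause is running pip outside the activated virtual environment.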
Once installed, you can use docvec-cli from your terminal.
python src/main.py --input-path <path_to_document_or_directory> --output-path <path_to_output_directory> [OPTIONS]
--input-path <path>: Path to a document file (e.g., report.pdf) or a directory (directory processing is planned for future updates).
--output-path <path>: Path to the directory where the generated vector and metadata files will be saved.
--chunk-size <int>: Max size of each text chunk in characters (default: 1000).
--chunk-overlap <int>: Number of characters to overlap between chunks (default: 200).
--model-name <str>: Sentence-transformers model name (default: all-MiniLM-L6-v2).
--output-format <str>: Format for output files (default: json, the only format currently supported).
python src/main.py --input-path "docs/my_report.pdf" --output-path "vectors/"
python src/main.py --input-path "articles/research.docx" --output-path "embeddings/" --chunk-size 500 --chunk-overlap 100
python src/main.py --input-path "notes/daily_journal.txt" --output-path "processed_data/" --model-name "all-MiniLM-L12-v2"
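To build intuition for how --chunk-size and --chunk-overlap interact, here is a simplified sketch of overlapping character windows. It is not the tool's implementation: docvec-cli uses langchain's RecursiveCharacterTextSplitter, which prefers to break on separators such as paragraph and line breaks, so its chunks can be shorter than the limit. The `chunk_text` function below is a hypothetical stand-in that only shows the sliding-window arithmetic.

```python
from __future__ import annotations


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap.

    Simplified illustration only; a separator-aware splitter behaves differently.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2):
        print(piece)
```

Each window advances by chunk_size minus chunk_overlap characters, so with a fixed chunk size a larger overlap produces more chunks, and more stored embeddings, for the same document.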
For each processed document (e.g., my_report.pdf), a JSON file (my_report_vectors.json) will be created in the specified --output-path.
Example content:
[
  {
    "id": "my_report-0",
    "document": "This is the text content of the first chunk...",
    "embedding": [0.123, -0.456, ..., 0.789],
    "metadata": {
      "source_file": "my_report.pdf",
      "chunk_index": 0,
      "chunk_size": 250
    }
  }
]
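A file in this shape can be searched without a vector database. The sketch below uses only the Python standard library and assumes each record has the "embedding" and "document" fields shown in the example content. The helper names (`load_records`, `cosine_similarity`, `top_k`) are hypothetical and not part of docvec-cli.

```python
import json
import math
from pathlib import Path


def load_records(path):
    """Read a *_vectors.json file, assumed to hold a JSON array of chunk records."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(records, query_embedding, k=3):
    """Return the k (score, record) pairs most similar to the query, best first."""
    scored = [(cosine_similarity(r["embedding"], query_embedding), r) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The query embedding has to come from the same model that produced the stored vectors, for example via `SentenceTransformer("all-MiniLM-L6-v2").encode(question)` from sentence-transformers. The "document" text of the top records can then be passed to an LLM as retrieved context.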
We welcome contributions from the community! To contribute:
git checkout -b feature/your-feature-name
git push origin feature/your-feature-name
Please ensure:
This project is licensed under the MIT License.
For questions, feedback, or issues, please open an issue.
FAQs
We found that docvec-cli demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.