
DocVec CLI is a powerful command-line tool designed to transform your unstructured local documents (PDF, DOCX, TXT) into query-ready vector embeddings, making them instantly usable for Large Language Models (LLMs) and bolstering Retrieval Augmented Generation (RAG) workflows.
Supports .pdf, .docx, and .txt files.
Splits text with langchain's RecursiveCharacterTextSplitter, with customizable chunk_size and chunk_overlap.
Uses sentence-transformers models (default: all-MiniLM-L6-v2) to create vector embeddings directly on your machine, so documents stay private and processing can run offline.
git clone https://github.com/onurbaran/docvec-cli.git
cd docvec-cli
It's highly recommended to use a virtual environment to manage dependencies.
python -m venv .venv
# On Windows:
.\.venv\Scripts\activate
# On macOS/Linux:
source ./.venv/bin/activate
Ensure your requirements.txt contains:
pypdf
python-docx
sentence-transformers
langchain-text-splitters
tqdm
numpy
Then run:
pip install -r requirements.txt
Once installed, you can use docvec-cli from your terminal.
python src/main.py --input-path <path_to_document_or_directory> --output-path <path_to_output_directory> [OPTIONS]
--input-path <path>: Path to a document file (e.g., report.pdf) or a directory (directory processing is planned for future updates).
--output-path <path>: Path to the directory where the generated vector and metadata files will be saved.
--chunk-size <int>: Max size of each text chunk in characters (default: 1000).
--chunk-overlap <int>: Number of characters to overlap between chunks (default: 200).
--model-name <str>: Sentence-transformers model name (default: all-MiniLM-L6-v2).
--output-format <str>: Format for output files (default: json, the only format currently supported).
python src/main.py --input-path "docs/my_report.pdf" --output-path "vectors/"
python src/main.py --input-path "articles/research.docx" --output-path "embeddings/" --chunk-size 500 --chunk-overlap 100
python src/main.py --input-path "notes/daily_journal.txt" --output-path "processed_data/" --model-name "all-MiniLM-L12-v2"
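To make the effect of --chunk-size and --chunk-overlap concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification for illustration only: the tool itself is described as using langchain's RecursiveCharacterTextSplitter, which also tries to split on natural separators, so its chunk boundaries will generally differ. The function name chunk_text is hypothetical and not part of docvec-cli.

```python
from __future__ import annotations


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. This is a plain
    sliding window, NOT the recursive separator-aware splitting that
    RecursiveCharacterTextSplitter performs.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for i, chunk in enumerate(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)):
        print(i, repr(chunk))
```

A larger overlap repeats more context across neighbouring chunks at the cost of producing more chunks, and therefore more embeddings, per document.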
For each processed document (e.g., my_report.pdf), a JSON file (my_report_vectors.json) will be created in the specified --output-path.
Example content:
[
{
"id": "my_report-0",
"document": "This is the text content of the first chunk...",
"embedding": [0.123, -0.456, ..., 0.789],
"metadata": {
"source_file": "my_report.pdf",
"chunk_index": 0,
"chunk_size": 250
}
}
]
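As a rough illustration of how a file in this format could be consumed in a retrieval step, the sketch below loads the records and ranks them by cosine similarity against a query vector. It assumes only the fields shown in the example above ("id", "document", "embedding"). The helper names load_records, cosine_similarity, and top_k are hypothetical and not part of docvec-cli, and the query embedding is assumed to come from the same sentence-transformers model that produced the file.

```python
from __future__ import annotations

import json
import math
from pathlib import Path


def load_records(path: str) -> list[dict]:
    """Read the list of chunk records from a *_vectors.json file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(records: list[dict], query_embedding: list[float], k: int = 3) -> list[tuple[float, dict]]:
    """Return the k records most similar to query_embedding, best first."""
    scored = [(cosine_similarity(r["embedding"], query_embedding), r) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Tiny in-memory stand-in for the contents of a *_vectors.json file.
    demo = [
        {"id": "demo-0", "document": "first chunk", "embedding": [1.0, 0.0]},
        {"id": "demo-1", "document": "second chunk", "embedding": [0.0, 1.0]},
    ]
    for score, record in top_k(demo, [1.0, 0.0], k=1):
        print(record["id"], round(score, 3))
```

For more than a few thousand chunks, a vectorized implementation with numpy or a dedicated vector index would be a more practical choice than this pure-Python loop.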
We welcome contributions from the community! To contribute:
git checkout -b feature/your-feature-name
git push origin feature/your-feature-name
Please ensure:
This project is licensed under the MIT License.
For questions, feedback, or issues, please open an issue.
We found that docvec-cli demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.