embedd-all is a Python package that converts various document formats into text suitable for generating embedding vectors with embedding models. The package extracts text from PDFs, summarizes data from Excel files, and can build a RAG (Retrieval-Augmented Generation) pipeline for documents using Voyage AI embedding models and the Pinecone vector database. Supported file formats are xlsx, csv, pdf, doc, and docx.
Excel summaries are stored in df["summarized"]. If the Excel file contains multiple sheets, the package processes each sheet and returns all summaries.

Install the package via pip:
pip install embedd-all
from embedd_all.embedd.index import modify_excel_for_embedding, process_pdf, pinecone_embeddings_with_voyage_ai, rag_query
The modify_excel_for_embedding function processes an Excel file, summarizes each row, and returns the summaries.
from embedd_all.embedd.index import modify_excel_for_embedding

if __name__ == '__main__':
    # Path to the Excel file
    file_path = '/path/to/your/data.xlsx'
    # Context string added to each row summary
    context = "data"

    # Process the Excel file; the result holds one DataFrame per sheet
    dfs = modify_excel_for_embedding(file_path=file_path, context=context)

    # Display the summarized data from the second sheet (if it exists)
    if len(dfs) > 1:
        print(dfs[1].head(3))
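To make the idea of row summarization concrete, here is a minimal sketch of how a spreadsheet row could be flattened into one sentence suitable for an embedding model. It is not the package's implementation: `summarize_row` and `summarize_rows` are hypothetical helpers, rows are modeled as plain dictionaries rather than DataFrames, and the "column is value" wording is an assumption.

```python
def summarize_row(row: dict, context: str) -> str:
    """Flatten one row into a single sentence, prefixed with the context.

    Hypothetical helper for illustration; empty cells (None) are skipped.
    """
    parts = [f"{column} is {value}" for column, value in row.items() if value is not None]
    return f"{context}: " + ", ".join(parts) + "."


def summarize_rows(rows: list, context: str) -> list:
    """Summarize every row of a sheet, one sentence per row."""
    return [summarize_row(row, context) for row in rows]


# Made-up sample rows, standing in for one sheet of an Excel file
sample_rows = [
    {"Name": "Swift", "Fuel_Type": "Petrol"},
    {"Name": "City", "Fuel_Type": None},
]
for sentence in summarize_rows(sample_rows, context="car data"):
    print(sentence)
```

Sentences like these tend to embed better than raw cell values because each one carries its column names and the surrounding context with it.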
The process_pdf function extracts text from each page of a PDF file and returns it as a list of strings, one per page.
from embedd_all.embedd.index import process_pdf

if __name__ == '__main__':
    # Path to the PDF file
    file_path = '/path/to/your/document.pdf'

    # Process the PDF file; the result holds one string per page
    texts = process_pdf(file_path)

    # Display the processed text
    print("Number of pages processed:", len(texts))
    if texts:
        print("Sample text from the first page:", texts[0])
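Indexing `texts[0]` directly fails when no pages were extracted, and some pages may yield no text at all (for example scanned images, which is an assumption about typical PDFs rather than documented package behavior). The sketch below shows one way to inspect the returned list defensively; `preview_pages` is a hypothetical helper, not part of embedd-all.

```python
def preview_pages(texts, max_chars=80):
    """Return (page_number, snippet) pairs for pages that contain text.

    Hypothetical helper: page numbers start at 1, blank or None pages are skipped.
    """
    previews = []
    for page_number, text in enumerate(texts, start=1):
        stripped = (text or "").strip()
        if not stripped:
            continue  # nothing extractable on this page
        previews.append((page_number, stripped[:max_chars]))
    return previews


# Stand-in for the list returned by process_pdf
sample_texts = ["  Chapter one begins here. ", "", "Chapter two."]
for page_number, snippet in preview_pages(sample_texts, max_chars=20):
    print(f"page {page_number}: {snippet}")
```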
The pinecone_embeddings_with_voyage_ai function builds a RAG index for documents: it embeds them with Voyage AI embedding models and stores the vectors in a Pinecone vector database. It supports the file formats xlsx, csv, pdf, doc, and docx.
import os

from embedd_all.embedd.index import pinecone_embeddings_with_voyage_ai

# Assumption: the API keys are supplied via environment variables with these names
PINECONE_KEY = os.environ["PINECONE_KEY"]
VOYAGE_API_KEY = os.environ["VOYAGE_API_KEY"]

def create_rag_for_documents():
    paths = [
        '/Users/arnabbhattachargya/Desktop/flamingo_english_book.pdf',
        '/Users/arnabbhattachargya/Desktop/Data_Train.xlsx'
    ]
    vector_db_name = 'arnab-test'
    voyage_embed_model = 'voyage-2'
    embed_dimension = 1024
    pinecone_embeddings_with_voyage_ai(paths, PINECONE_KEY, VOYAGE_API_KEY, vector_db_name, voyage_embed_model, embed_dimension)

if __name__ == '__main__':
    create_rag_for_documents()
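Because only certain file formats are supported, it can help to check the paths before starting an indexing run. The following sketch separates supported from unsupported paths by file extension. `split_supported` is a hypothetical helper, and how the package itself reacts to an unsupported file is not stated here, so treat this as an optional pre-check.

```python
from pathlib import Path

# Extensions listed as supported in this README
SUPPORTED_EXTENSIONS = {".xlsx", ".csv", ".pdf", ".doc", ".docx"}


def split_supported(paths):
    """Split paths into (supported, unsupported) by case-insensitive extension."""
    supported, unsupported = [], []
    for path in paths:
        if Path(path).suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            unsupported.append(path)
    return supported, unsupported


# Made-up example paths
ok, skipped = split_supported(["report.PDF", "notes.txt", "data.xlsx"])
print("will index:", ok)
print("skipping:", skipped)
```

Passing only the supported list on to pinecone_embeddings_with_voyage_ai keeps a stray file from interrupting a longer run.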
The rag_query function performs context-based querying using RAG (Retrieval-Augmented Generation).
import os

from embedd_all.embedd.index import rag_query

# Assumption: the API keys are supplied via environment variables with these names
ANTHROPIC_API_KEY = os.environ["ANTHROPIC_API_KEY"]
PINECONE_KEY = os.environ["PINECONE_KEY"]
VOYAGE_API_KEY = os.environ["VOYAGE_API_KEY"]

def execute_rag_query():
    CLAUDE_MODEL = "claude-3-5-sonnet-20240620"
    INDEX_NAME = 'arnab-test'
    TEMPERATURE = 0
    MAX_TOKENS = 4000
    QUERY = 'what all fuel types are there in cars?'
    SYSTEM_PROMPT = "You are a world-class document writer. Respond only with detailed descriptions and implementations. Use bullet points if necessary."
    VOYAGE_EMBED_MODEL = 'voyage-2'

    resp = rag_query(
        temperature=TEMPERATURE,
        max_tokens=MAX_TOKENS,
        anthropic_api_key=ANTHROPIC_API_KEY,
        claude_model=CLAUDE_MODEL,
        index_name=INDEX_NAME,
        pinecone_key=PINECONE_KEY,
        query=QUERY,
        system_prompt=SYSTEM_PROMPT,
        voyage_api_key=VOYAGE_API_KEY,
        voyage_embed_model=VOYAGE_EMBED_MODEL
    )

    # Print the text of each block in the response
    for text_block in resp:
        print(text_block.text)

if __name__ == '__main__':
    execute_rag_query()
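For readers new to RAG, the sketch below illustrates the retrieval step in general terms: embed the query, rank stored chunks by similarity, and pass the best matches to the model as context. It uses toy two-dimensional vectors and plain Python in place of Voyage AI embeddings and a Pinecone index. The helper names and the prompt layout are assumptions for illustration and do not describe how rag_query is implemented.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_chunks, k=2):
    """Return the k chunk texts whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vector, vector), text) for text, vector in indexed_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


def build_prompt(query, context_chunks):
    """Combine retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Toy index: (chunk text, made-up embedding vector)
index = [
    ("Fuel_Type is Petrol", [1.0, 0.0]),
    ("Fuel_Type is Diesel", [0.9, 0.1]),
    ("The weather was mild", [0.0, 1.0]),
]
chunks = top_k([1.0, 0.0], index, k=2)
print(build_prompt("what all fuel types are there in cars?", chunks))
```

In a real pipeline the vector database performs the similarity search, so the ranking code is never written by hand; the sketch only shows what that search computes.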
modify_excel_for_embedding(file_path: str, context: str) -> list
Processes an Excel file and summarizes the data in each sheet.
Parameters:
file_path (str): Path to the Excel file.
context (str): Additional context to be added to each summary.
Returns:
list: A list of DataFrames, each containing the summarized data for one sheet.

process_pdf(file_path: str) -> list
Extracts text from each page of a PDF file.
Parameters:
file_path (str): Path to the PDF file.
Returns:
list: A list of strings, each representing the text extracted from a page.

pinecone_embeddings_with_voyage_ai(paths: list, PINECONE_KEY: str, VOYAGE_API_KEY: str, vector_db_name: str, voyage_embed_model: str, embed_dimension: int)
Builds a RAG index for documents: embeds them with Voyage AI embedding models and stores the vectors in a Pinecone vector database. Supports the document formats xlsx, csv, pdf, doc, and docx.
Parameters:
paths (list): List of paths to documents.
PINECONE_KEY (str): Pinecone API key.
VOYAGE_API_KEY (str): Voyage AI API key.
vector_db_name (str): Name of the Pinecone vector database.
voyage_embed_model (str): Name of the Voyage AI embedding model to use.
embed_dimension (int): Dimension of the embedding vectors.

rag_query()
Performs context-based querying using RAG (Retrieval-Augmented Generation).
Parameters:
temperature (float): Sampling temperature.
max_tokens (int): Maximum number of tokens in the response.
anthropic_api_key (str): Anthropic API key.
claude_model (str): Name of the Claude model to use.
index_name (str): Name of the Pinecone index.
pinecone_key (str): Pinecone API key.
query (str): The query to perform.
system_prompt (str): The system prompt for guiding the model's response.
voyage_api_key (str): Voyage AI API key.
voyage_embed_model (str): Name of the Voyage AI embedding model to use.

This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
If you have any questions or suggestions, please open an issue or contact the maintainer.
Happy embedding with embedd-all!
Embedd (docs, pdfs, excels, csv etc) -> RAG -> Query with LLMs