AzureML Retrieval Augmented Generation Utilities
This package is currently in an alpha stage; use it at the risk of breaking changes and unstable behavior.
It contains utilities for:
- Processing text documents into chunks appropriate for use in LLM prompts, with metadata such as source URL.
- Embedding chunks with OpenAI or HuggingFace embeddings models, including the ability to update a set of embeddings over time.
- Creating MLIndex artifacts from embeddings: a YAML file capturing the metadata needed to deserialize different kinds of Vector Indexes for use in langchain. Supported Index types:
- FAISS index (via langchain)
- Azure Cognitive Search index
- Pinecone index
- Milvus index
- Azure Cosmos Mongo vCore index
- MongoDB
Getting started
You can install AzureML's RAG package using pip.
pip install azureml-rag
There are various extra installs you probably want to include based on intended use (see the example install command after this list):
- `faiss`: When using FAISS-based Vector Indexes
- `cognitive_search`: When using Azure Cognitive Search Indexes
- `pinecone`: When using Pinecone Indexes
- `azure_cosmos_mongo_vcore`: When using Azure Cosmos Mongo vCore Indexes
- `hugging_face`: When using Sentence Transformer embedding models from HuggingFace (local inference)
- `document_parsing`: When cracking and chunking documents locally to put in an Index
- `mongodb`: When using native MongoDB indexes
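For example, to use FAISS-based indexes with local HuggingFace embeddings you might combine extras like this (an illustrative combination; pick the extras matching your index and embedding choices):
pip install 'azureml-rag[faiss,hugging_face]'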
MLIndex
MLIndex files describe, in YAML, an index of data + embeddings and the embeddings model used.
Azure Cognitive Search Index:
embeddings:
  dimension: 768
  kind: hugging_face
  model: sentence-transformers/all-mpnet-base-v2
  schema_version: '2'
index:
  api_version: 2021-04-30-Preview
  connection:
    id: /subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/connections/<acs_connection_name>
  connection_type: workspace_connection
  endpoint: https://<acs_name>.search.windows.net
  engine: azure-sdk
  field_mapping:
    content: content
    filename: filepath
    metadata: meta_json_string
    title: title
    url: url
    embedding: contentVector
  index: azureml-rag-test-206e03b6-3880-407b-9bc4-c0a1162d6c70
  kind: acs
Pinecone Index:
embeddings:
  dimension: 768
  kind: hugging_face
  model: sentence-transformers/all-mpnet-base-v2
  schema_version: '2'
index:
  connection:
    id: /subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/connections/<pinecone_connection_name>
  connection_type: workspace_connection
  engine: pinecone-sdk
  field_mapping:
    content: content
    filename: filepath
    metadata: metadata_json_string
    title: title
    url: url
  index: azureml-rag-test-206e03b6-3880-407b-9bc4-c0a1162d6c70
  kind: pinecone
Azure Cosmos Mongo vCore Index:
embeddings:
  dimension: 768
  kind: hugging_face
  model: sentence-transformers/all-mpnet-base-v2
  schema_version: '2'
index:
  connection:
    id: /subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/connections/<cosmos_connection_name>
  connection_type: workspace_connection
  engine: pymongo-sdk
  field_mapping:
    content: content
    filename: filepath
    metadata: metadata_json_string
    title: title
    url: url
    embedding: contentVector
  database: azureml-rag-test-db
  collection: azureml-rag-test-collection
  index: azureml-rag-test-206e03b6-3880-407b-9bc4-c0a1162d6c70
  kind: azure_cosmos_mongo_vcore
Create MLIndex
Examples using MLIndex remotely with AzureML and locally with langchain live here: https://github.com/Azure/azureml-examples/tree/main/sdk/python/generative-ai/rag
Consume MLIndex
from azureml.rag.mlindex import MLIndex
retriever = MLIndex(uri_to_folder_with_mlindex).as_langchain_retriever()
retriever.get_relevant_documents('What is an AzureML Compute Instance?')
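Beyond the retriever interface, an MLIndex can also be loaded as a langchain VectorStore for direct similarity search. Below is a minimal sketch; the folder path is a placeholder, the printed metadata key is an assumption (actual keys follow the index's field_mapping), and method availability may vary across package versions.
from azureml.rag.mlindex import MLIndex

# Folder containing an MLIndex yaml file (placeholder path)
mlindex = MLIndex("path/to/mlindex_folder")

# Load the index as a langchain VectorStore and query it directly
vectorstore = mlindex.as_langchain_vectorstore()
docs = vectorstore.similarity_search("What is an AzureML Compute Instance?", k=3)
for doc in docs:
    # 'source' is an assumed metadata key; see field_mapping in the MLIndex yaml
    print(doc.metadata.get("source"), doc.page_content[:200])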
Changelog
Please insert change log into "Next Release" ONLY.
Next release
0.2.36
- Implement MongoDB vector store and MLIndex support
- Detect OBO credential with AZUREML_OBO_ENABLED environment variable
- ACS update on changed (new or deleted) documents
- Drop azure-search-documents 11.4.0 beta version support
0.2.35
- Implement Cosmos DB for NoSQL vector store and MLIndex support
- Relax langchain version constraint
- Upgraded langchain-pinecone version to 0.1.1 and pinecone-client version
0.2.34
- Update azure-ai-ml version to 1.16.1 by introducing noneCredentialConfigure and adding authType for AadCredentialConfigure
- Use set of exceptions as retry_exceptions in backoff_retry_on_exceptions
0.2.33
- Support existing qdrant indices
- Mitigate PF failure when more than 3 lookup tools are used in a flow
- Add retry for the embedder if there was a successful embedding
0.2.32
- Implement langchain Weaviate vector store in MLIndex
- Get connection in `get_connection_by_id_v2` with caller specified credential
- Set upper bound for `azure-ai-ml` to 1.15.0
0.2.31.1
- Update search index with azure-search-documents 11.4.0
- Add azureml-core in the dependency list
0.2.31
- Categorize user error and system error, and update RH accordingly to show in logs
- Bugfix using obo credential for AAD connections.
- Prevention fix to support AadCredentialConfig in Connection object
- Update Pinecone legacy API
- Creating image embedding index with azure-search-documents 11.4.0
0.2.30.2
- Bugfix remove azure_ad_token_provider from EmbeddingContainer metadata
- Set embeddings_model as optional argument
0.2.30.1
- Introduce `elasticsearch` extra to declare transitive dependency on the elasticsearch package when using Elasticsearch indices.
0.2.30
- Bugfix in models.py to handle empty deployment name.
- Supporting existing elasticsearch indices
- Bug fix in `crack_and_chunk_and_embed_and_index`
- Fixing bug in using AAD auth type ACS connections.
0.2.29.2
- Fixing ACS index creation failure with azure-search-documents 11.4.0
0.2.29.1
- Fixing FAISS, dependable_faiss_import import failure with Langchain 0.1.x
0.2.29
- Support AAD and MSI auth type in AOAI, ACS connection
0.2.28
- Ensure compatibility with newer versions of azure-ai-ml.
- Upgrade langchain to support up to 0.1
0.2.27
- Support Cohere serverless endpoint
- Support multiple ACS lookups in the same process, eliminating field mapping conflicts
- Support pass-in credential in get_connection_by_name_v2 to unblock managed vNet setup
- Update validate_deployments in crack_chunk_embed_index_and_register.py
0.2.26
- Support for .csv and .json file extensions in pipeline
- Ignore mlflow.exceptions.RestException in safe_mlflow_log_metric
- validate_deployments supports openai v1.0+
- Removing unexpected keyword argument 'engine'
- Checking ACS account has enough index quota
- infer_deployment supports openai v1.0+
- Create missing fields for existing index
0.2.25
- Using local cached encodings.
- Adding convert_to_dict() for openai v1.0+
- Check index_config before passing in validate_deployments.py
- Limit size of documents upload to ACS in one batch to solve RequestEntityTooLargeError
0.2.24.2
- Supporting `*.cognitiveservices.*` endpoint
- Adding azureml-rag specific user_agent when using DocumentIntelligence
- Refactored update index tasks
- Supporting uppercase file extensions name in crack_and_chunk
- Fixing Deployment importing bug in utils
- Adding the playgroundType tag in MLIndex Asset used for Azure AI studio playground
- Remove mandatory module-level imports of optional extra packages
0.2.24.1
- Fixing is_florence key detection
- Using 'embedding_connection_id' instead of 'florence_connection_id' as parameter name
0.2.24
- Introducing image ingestion with florence embedding API
- Adding dummy output to validate_deployments for holding the right order
- Fixing DeploymentNotFound bug
0.2.23.5
0.2.23.4
- Make the `api_type` parameter case-insensitive in OpenAIEmbedder
- Bug fix in embeddings container path
0.2.23.3
- Set upper bound for `langchain` to 0.0.348
0.2.23.2
- Make tiktoken pull from a cache instead of making the outgoing network call to get encodings files
- Add support for Azure Cosmos Mongo vCore
0.2.23.1
- Fixing exception handling in validate_deployments to support OpenAI v1.0+
0.2.23
- Support OpenAI v1.0 +
- Handle FAISS.load_local() change since Langchain 0.0.318
- Handle mailto links in url crawling component.
- Add support for Milvus vector store
0.2.22
- Update pypdf's version to 3.17.1 in document-parsing.
0.2.21
- Use workspace connection tags instead of metadata since metadata is deprecated.
- Fix bug handling single files in `files_to_document_sources`
0.2.20
- Initial introduction of validate_deployments.
- Asset registration in *_and_register attempts to infer target workspace from asset_uri and handle multiple auth options
- activity_logger moved out as first arg; this is an intermediate step, as logger also shouldn't be the first arg and should instead be handled by get_logger, with activity_logger being truly optional.
- validate_deployments itself was modified to make its interface closer to what existing tasks expect as input, and callable from other tasks as a function.
0.2.19
- Introduce a new `path` parameter in the `index` section of MLIndex documents over FAISS indices, to allow the path to FAISS index files to be different from the MLIndex document path.
- Ensure `MLIndex.base_uri` is never undefined for a valid MLIndex object.
0.2.18.1
- Only save out metadata before embedding in crack_and_chunk_and_embed_and_index
- Update create_embeddings to return num_embedded value.
- This enables crack_and_chunk_and_embed to skip loading EmbeddedDocument partitions when no documents were embedded (all reused).
0.2.18
- Add new task to crack, chunk, embed, index to ACS, and register MLIndex in one step.
- Handle `openai.api_type` being None
0.2.17
- Fix loading MLIndex failure. No need to get the `endpoint` from connection when it is already provided.
- Try using `langchain` VectorStore and fall back to vendored implementation
- Support `azure-search-documents==11.4.0b11`
- Add support for Pinecone in DataIndex
0.2.16
- Use Retry-After when aoai embedding endpoint throws RateLimitError
0.2.15.1
- Fix vendored FAISS langchain VectorStore to only error when a doc is `None` (rather than when a Document isn't exactly the right class)
0.2.15
- Support PDF cracking with Azure Document Intelligence service
- `crack_and_chunk_and_embed` now pulls documents through to embedding (streaming) and embeds documents in parallel batches
- Update default field names.
- Fix long file name bug when writing to output during crack and chunk
0.2.14
- Fix git_clone to handle WorkspaceConnections, again.
0.2.13
- Fix git_clone to handle WorkspaceConnection objects and urls with usernames already in them.
0.2.12
- Only process `.jsonl` and `.csv` files when reading chunks for embedding.
0.2.11
- Check casing for model kind and api_type
- Ensure api_version not being set is supported and defaults make sense.
- Add support for Pinecone indexes
0.2.10
- Fix QA generator and connections check for ApiType metadata
0.2.9
- QA data generation accepts connection as input
0.2.8
- Remove `allowed_special="all"` from tiktoken usage as it encodes special tokens like `<|endoftext|>` as their special token rather than as plain text (which is the case when only `disallowed_special=()` is set on its own)
- Stop truncating texts to embed (to model context length) as the new `azureml.rag.embeddings.OpenAIEmbedder` handles batching and splitting long texts pre-embed, then averaging the results into a single final embedding.
- Loosen tiktoken version range from `~=0.3.0` to `<1`
0.2.7
- Don't try and use MLClient for connections if azure-ai-ml<1.10.0
- Handle Custom Connections which azure-ai-ml can't deserialize today.
- Allow passing faiss index engine to MLIndex local
- Pass chunks directly into write_chunks_to_jsonl
0.2.6
- Fix jsonl output mode of crack_and_chunk writing csv internally.
0.2.5
- Ensure EmbeddingsContainer.mount_and_load sets `create_destination=True` when mounting to create the embeddings_cache location if it's not already created.
- Fix `safe_mlflow_start_run` to yield `None` when mlflow is not available
- Handle custom `field_mappings` passed to the `update_acs` task.
0.2.4
- Introduce `crack_and_chunk_and_embed` task which tracks deletions and reused sources + documents to enable full sync with indexes, leveraging EmbeddingsContainer for storage of this information across Snapshots.
- Restore `workspace_connection_to_credential` function.
0.2.3
- Fix git clone url format bug
0.2.2
- Fix all langchain splitters to use tiktoken in an airgap-friendly way.
0.2.1
- Introduce DataIndex interface for scheduling Vector Index Pipeline in AzureML and creating MLIndex Assets
- Vendor various langchain components to avoid breaking changes to MLIndex internal logic
0.1.24.2
- Fix all langchain splitters to use tiktoken in an airgap-friendly way.
0.1.24.1
- Fix subsplitter init bug in MarkdownHeaderSplitter
- Support getting langchain retriever for ACS based MLIndex with embeddings.kind: none.
0.1.24
- Don't mlflow log unless there's an active mlflow run.
- Support `langchain.vectorstores.azuresearch` after `langchain>=0.0.273` upgraded to `azure-search-documents==11.4.0b8`
- Use tiktoken encodings from package for other splitter types
0.1.23.2
- Handle `Path` objects passed into `MLIndex` init.
0.1.23.1
- Handle .api.cognitive style aoai endpoints correctly
0.1.23
- Ensure tiktoken encodings are packaged in wheel
0.1.22
- Set environment variables to pull encodings files from directory with cache key to avoid tiktoken external network call
- Fix mlflow log error when there are no input files
0.1.21
- Fix top level imports in `update_acs` task failing without helpful reason when old `azure-search-documents` is installed.
0.1.20
- Fix Crack'n'Chunk race-condition where same named files would overwrite each other.
0.1.19
- Various bug fixes:
- Handle some malformed git urls in `git_clone` task
- Try falling back when parsing csv with pandas fails
- Allow chunking special tokens
- Ensure logging with mlflow can't fail a task
- Update to support latest `azure-search-documents==11.4.0b8`
0.1.18
- Add FaissAndDocStore and FileBasedDocStore which closely mirror langchain's FAISS and InMemoryDocStore without the langchain or pickle dependency. These are not used by default until PromptFlow support has been added.
- Pin `azure-documents-search==11.4.0b6` as there are breaking changes in `11.4.0b7` and `11.4.0b8`
0.1.17
- Update interactions with Azure Cognitive Search to use latest azure-documents-search SDK
0.1.16
- Convert api_type from Workspace Connections to lower case to appease langchain's case-sensitive checking.
0.1.15
- Add support for custom loaders
- Added logging for MLIndex.init to understand usage of MLIndex
0.1.14
- Add Support for CustomKeys connections
- Add OpenAI support for QA Gen and Embeddings
0.1.13 (2023-07-12)
- Implement single node non-PRS embed task to enable clearer logs for users.
0.1.12 (2023-06-29)
- Fix casing check of ApiVersion, ApiType in infer_deployment util
0.1.11 (2023-06-28)
- Update casing check for workspace connection ApiVersion, ApiType
- int casting for temperature, max_tokens
0.1.10 (2023-06-26)
- Update data asset registering to have adjustable output_type
- Remove asset registering from generate_qa.py
0.1.9 (2023-06-22)
- Add `azureml.rag.data_generation` module.
- Fixed bug that would cause crack_and_chunk to fail for documents that contain non-utf-8 characters. Currently these characters will be ignored.
- Improved heading extraction from Markdown files. When `use_rcts=False`, Markdown files will be split on headings and each chunk will have the heading context up to the root as a prefix (e.g. `# Heading 1\n## Heading 2\n# Heading 3\n{content}`)
0.1.8 (2023-06-21)
- Add deployment inferring util for use in azureml-insider notebooks.
0.1.7 (2023-06-08)
- Improved telemetry for tasks (used in RAG Pipeline Components)
0.1.6 (2023-05-31)
- Fail crack_and_chunk task when no files were processed (usually because of a malformed `input_glob`)
- Change `update_acs.py` to default `push_embeddings=True` instead of `False`.
0.1.5 (2023-05-19)
- Add api_base back to MLIndex embeddings config for back-compat (until all clients start getting it from Workspace Connection).
- Add telemetry for tasks used in pipeline components, not enabled by default for SDK usage.
0.1.4 (2023-05-17)
- Fix bug where enabling rcts option on split_documents used nltk splitter instead.
0.1.3 (2023-05-12)
- Support Workspace Connection based auth for Git, Azure OpenAI and Azure Cognitive Search usage.
0.1.2 (2023-05-05)
- Refactored document chunking to allow insertion of custom processing logic
0.0.1 (2023-04-25)
Features Added
- Introduced package
- langchain Retriever for Azure Cognitive Search