# AzureML Retrieval Augmented Generation Utilities
This package is currently in alpha; expect breaking changes and unstable behavior.
It contains utilities for:
- Processing text documents into chunks appropriate for use in LLM prompts, with metadata such as the source URL.
- Embedding chunks with OpenAI or HuggingFace embeddings models, including the ability to update a set of embeddings over time.
- Creating MLIndex artifacts from embeddings. An MLIndex is a YAML file capturing the metadata needed to deserialize different kinds of vector indexes for use in langchain. Supported index types:
  - FAISS index (via langchain)
  - Azure Cognitive Search index
  - Pinecone index
  - Milvus index
  - Azure Cosmos Mongo vCore index
  - MongoDB
## Getting started
You can install AzureML's RAG package using pip.
```bash
pip install azureml-rag
```
Depending on your intended use, you will probably want to install one or more of these extras:
- `faiss`: When using FAISS based Vector Indexes
- `cognitive_search`: When using Azure Cognitive Search Indexes
- `pinecone`: When using Pinecone Indexes
- `azure_cosmos_mongo_vcore`: When using Azure Cosmos Mongo vCore Indexes
- `hugging_face`: When using Sentence Transformer embedding models from HuggingFace (local inference)
- `document_parsing`: When cracking and chunking documents locally to put in an Index
- `mongodb`: When using native MongoDB indexes
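Extras are requested with pip's standard bracket syntax, for example `pip install "azureml-rag[faiss,hugging_face]"`. The sketch below builds such a requirement string and rejects names that are not in the list above. The helper `requirement_with_extras` and the constant `KNOWN_EXTRAS` are hypothetical names used for illustration only; they are not part of azureml-rag.

```python
# Extras copied from the list above. KNOWN_EXTRAS and requirement_with_extras
# are hypothetical helpers for illustration, not part of azureml-rag.
KNOWN_EXTRAS = {
    "faiss",
    "cognitive_search",
    "pinecone",
    "azure_cosmos_mongo_vcore",
    "hugging_face",
    "document_parsing",
    "mongodb",
}


def requirement_with_extras(extras):
    """Build a pip requirement string such as 'azureml-rag[faiss,pinecone]'."""
    requested = sorted(set(extras))
    unknown = [name for name in requested if name not in KNOWN_EXTRAS]
    if unknown:
        raise ValueError(f"Unknown extras: {', '.join(unknown)}")
    if not requested:
        return "azureml-rag"
    return f"azureml-rag[{','.join(requested)}]"


# Pass the resulting string to `pip install`, quoted so the shell
# does not interpret the square brackets.
print(requirement_with_extras(["hugging_face", "faiss"]))
```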
## MLIndex
An MLIndex file is a YAML document that describes an index of data and embeddings, together with the embeddings model used to produce them.
Azure Cognitive Search Index:
```yaml
embeddings:
  dimension: 768
  kind: hugging_face
  model: sentence-transformers/all-mpnet-base-v2
  schema_version: '2'
index:
  api_version: 2021-04-30-Preview
  connection:
    id: /subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/connections/<acs_connection_name>
  connection_type: workspace_connection
  endpoint: https://<acs_name>.search.windows.net
  engine: azure-sdk
  field_mapping:
    content: content
    filename: filepath
    metadata: meta_json_string
    title: title
    url: url
    embedding: contentVector
  index: azureml-rag-test-206e03b6-3880-407b-9bc4-c0a1162d6c70
  kind: acs
```
Pinecone Index:
```yaml
embeddings:
  dimension: 768
  kind: hugging_face
  model: sentence-transformers/all-mpnet-base-v2
  schema_version: '2'
index:
  connection:
    id: /subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/connections/<pinecone_connection_name>
  connection_type: workspace_connection
  engine: pinecone-sdk
  field_mapping:
    content: content
    filename: filepath
    metadata: metadata_json_string
    title: title
    url: url
  index: azureml-rag-test-206e03b6-3880-407b-9bc4-c0a1162d6c70
  kind: pinecone
```
Azure Cosmos Mongo vCore Index:
```yaml
embeddings:
  dimension: 768
  kind: hugging_face
  model: sentence-transformers/all-mpnet-base-v2
  schema_version: '2'
index:
  connection:
    id: /subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/connections/<cosmos_connection_name>
  connection_type: workspace_connection
  engine: pymongo-sdk
  field_mapping:
    content: content
    filename: filepath
    metadata: metadata_json_string
    title: title
    url: url
    embedding: contentVector
  database: azureml-rag-test-db
  collection: azureml-rag-test-collection
  index: azureml-rag-test-206e03b6-3880-407b-9bc4-c0a1162d6c70
  kind: azure_cosmos_mongo_vcore
```
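In the examples above, `field_mapping` appears to pair a logical field name (such as `filename` or `metadata`) with the name of the field that actually stores it in the index (such as `filepath` or `meta_json_string`). The sketch below illustrates that reading by translating a raw index record into logical names. It is not the library's implementation: `normalize_hit` is a hypothetical helper, and treating the metadata field as a JSON-encoded string is an assumption based only on its name.

```python
import json

# field_mapping copied from the Azure Cognitive Search example above.
FIELD_MAPPING = {
    "content": "content",
    "filename": "filepath",
    "metadata": "meta_json_string",
    "title": "title",
    "url": "url",
    "embedding": "contentVector",
}


def normalize_hit(hit, field_mapping):
    """Translate a raw index record into logical field names.

    Hypothetical helper for illustration; not part of azureml-rag.
    """
    doc = {}
    for logical_name, index_field in field_mapping.items():
        if index_field in hit:
            doc[logical_name] = hit[index_field]
    # Assumption: the metadata field holds a JSON-encoded string.
    if isinstance(doc.get("metadata"), str):
        doc["metadata"] = json.loads(doc["metadata"])
    return doc


# A made-up record shaped like the mapping above.
raw_hit = {
    "content": "A compute instance is a managed workstation.",
    "filepath": "compute-instance.md",
    "meta_json_string": '{"source": {"filename": "compute-instance.md"}}',
    "title": "Compute instance",
    "url": "https://example.com/compute-instance",
}
print(normalize_hit(raw_hit, FIELD_MAPPING))
```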
### Create MLIndex
Examples using MLIndex remotely with AzureML and locally with langchain live here: https://github.com/Azure/azureml-examples/tree/main/sdk/python/generative-ai/rag
### Consume MLIndex
```python
from azureml.rag.mlindex import MLIndex

# uri_to_folder_with_mlindex: path or URI of the folder that contains the MLIndex file
retriever = MLIndex(uri_to_folder_with_mlindex).as_langchain_retriever()
retriever.get_relevant_documents('What is an AzureML Compute Instance?')
```
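A langchain retriever returns a list of `Document` objects, each with a `page_content` string and a `metadata` dict. The sketch below shows one way to turn such a list into short, readable lines. `summarize_documents` is a hypothetical helper, the `source` metadata key is an assumption about what your index stores, and the example uses stand-in objects so it runs without an index.

```python
from types import SimpleNamespace


def summarize_documents(documents, max_chars=80):
    """Return one short line per retrieved document.

    Hypothetical helper: expects objects with `page_content` and `metadata`
    attributes, as langchain Document objects have.
    """
    lines = []
    for rank, doc in enumerate(documents, start=1):
        # "source" is an assumed metadata key; adjust it to your index.
        source = doc.metadata.get("source", "<unknown source>")
        # Collapse whitespace so multi-line chunks fit on one line.
        snippet = " ".join(doc.page_content.split())[:max_chars]
        lines.append(f"{rank}. {source}: {snippet}")
    return lines


# Stand-in objects used instead of real retriever output.
fake_results = [
    SimpleNamespace(
        page_content="A compute instance is a managed\ncloud workstation.",
        metadata={"source": "compute.md"},
    ),
]
for line in summarize_documents(fake_results):
    print(line)
```

With a real index, pass the list returned by `retriever.get_relevant_documents(...)` in place of `fake_results`.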
# Changelog
Please insert change log into "Next Release" ONLY.
## Next release
## 0.2.37.2
- Fix PydanticUndefinedAnnotation: name 'AzureSearch' is not defined
- Remove Python 3.8 support
## 0.2.37.1
- Upgrade nltk to >=3.9.1, <4.0
## 0.2.37
- Upgrade langchain to 0.3.x
- Upgrade langchain-community to 0.3.x
- Upgrade langchain-pinecone to 0.2.x
- Upgrade pinecone-client to 5.0.x
## 0.2.36
- Implement MongoDB vector store and MLIndex support
- Detect OBO credential with AZUREML_OBO_ENABLED environment variable
- ACS update on changed (new or deleted) documents
- Drop azure-search-documents 11.4.0 beta version support
## 0.2.35
- Implement Cosmos DB for NoSQL vector store and MLIndex support
- Relax langchain version constraint
- Upgraded langchain-pinecone version to 0.1.1 and pinecone-client version
## 0.2.34
- Update azure-ai-ml version to 1.16.1 by introducing noneCredentialConfigure and add authType for AadCredentialConfigure
- Use set of exceptions as retry_exceptions in backoff_retry_on_exceptions
## 0.2.33
- Support existing qdrant indices
- Mitigate PF failure when more than 3 lookup tools are used in a flow
- Add retry for the embedder if there was a successful embedding
## 0.2.32
- Implement langchain weaviate vectorstore in mlindex
- Get connection in `get_connection_by_id_v2` with caller specified credential
- Set upper bound for `azure-ai-ml` to 1.15.0
## 0.2.31.1
- Update search index with azure-search-documents 11.4.0
- Add azureml-core in the dependency list
## 0.2.31
- Categorize user error and system error, and update RH accordingly to show in logs
- Bugfix using obo credential for AAD connections.
- Prevention fix to support AadCredentialConfig in Connection object
- Update Pinecone legacy API
- Creating image embedding index with azure-search-documents 11.4.0
## 0.2.30.2
- Bugfix remove azure_ad_token_provider from EmbeddingContainer metadata
- Set embeddings_model as optional argument
## 0.2.30.1
- Introduce `elasticsearch` extra to declare transitive dependency on the `elasticsearch` package when using Elasticsearch indices.
## 0.2.30
- Bugfix in models.py to handle empty deployment name.
- Supporting existing elasticsearch indices
- Bug fix in `crack_and_chunk_and_embed_and_index`
- Fixing bug in using AAD auth type ACS connections.
## 0.2.29.2
- Fixing ACS index creation failure with azure-search-documents 11.4.0
## 0.2.29.1
- Fixing FAISS, dependable_faiss_import import failure with Langchain 0.1.x
## 0.2.29
- Support AAD and MSI auth type in AOAI, ACS connection
## 0.2.28
- Ensure compatibility with newer versions of azure-ai-ml.
- Upgrade langchain to support up to 0.1
## 0.2.27
- Support Cohere serverless endpoint
- Support multiple ACS lookups in the same process, eliminating field mapping conflicts
- Support pass-in credential in get_connection_by_name_v2 to unblock managed vNet setup
- Update validate_deployments in crack_chunk_embed_index_and_register.py
## 0.2.26
- Support for .csv and .json file extensions in pipeline
- Ignore mlflow.exceptions.RestException in safe_mlflow_log_metric
- validate_deployments supports openai v1.0+
- Removing unexpected keyword argument 'engine'
- Checking ACS account has enough index quota
- infer_deployment supports openai v1.0+
- Create missing fields for existing index
## 0.2.25
- Using local cached encodings.
- Adding convert_to_dict() for openai v1.0+
- Check index_config before passing in validate_deployments.py
- Limit size of documents upload to ACS in one batch to solve RequestEntityTooLargeError
## 0.2.24.2
- Supporting `*.cognitiveservices.*` endpoint
- Adding azureml-rag specific user_agent when using DocumentIntelligence
- Refactored update index tasks
- Supporting uppercase file extensions name in crack_and_chunk
- Fixing Deployment importing bug in utils
- Adding the playgroundType tag in MLIndex Asset used for Azure AI studio playground
- Remove mandatory module-level imports of optional extra packages
## 0.2.24.1
- Fixing is_florence key detection
- Using 'embedding_connection_id' instead of 'florence_connection_id' as parameter name
## 0.2.24
- Introducing image ingestion with florence embedding API
- Adding dummy output to validate_deployments for holding the right order
- Fixing DeploymentNotFound bug
## 0.2.23.5
- Deprecate pkg_resources in logging.py (https://setuptools.pypa.io/en/latest/pkg_resources.html)
## 0.2.23.4
- Make the `api_type` parameter case-insensitive in OpenAIEmbedder
- Bug fix in embeddings container path
## 0.2.23.3
- Set upper bound for `langchain` to 0.0.348
## 0.2.23.2
- Make tiktoken pull from a cache instead of making the outgoing network call to get encodings files
- Add support for Azure Cosmos Mongo vCore
## 0.2.23.1
- Fixing exception handling in validate_deployments to support OpenAI v1.0+
## 0.2.23
- Support OpenAI v1.0+
- Handle FAISS.load_local() change since Langchain 0.0.318
- Handle mailto links in url crawling component.
- Add support for Milvus vector store
## 0.2.22
- Update pypdf version to 3.17.1 in document-parsing.
## 0.2.21
- Use workspace connection tags instead of metadata since it's deprecated.
- Fix bug handling single files in `files_to_document_sources`
## 0.2.20
- Initial introduction of validate_deployments.
- Asset registration in `*_and_register` attempts to infer the target workspace from asset_uri and handles multiple auth options
- Move activity_logger out as the first arg. This is an intermediate step: logger also shouldn't be the first arg and should instead be handled by get_logger, and activity_logger should be truly optional.
- validate_deployments itself was modified to make its interface closer to what existing tasks expect as input, and callable from other tasks as a function.
## 0.2.19
- Introduce a new `path` parameter in the `index` section of MLIndex documents over FAISS indices, to allow the path to FAISS index files to be different from the MLIndex document path.
- Ensure `MLIndex.base_uri` is never undefined for a valid MLIndex object.
## 0.2.18.1
- Only save out metadata before embedding in crack_and_chunk_and_embed_and_index
- Update create_embeddings to return num_embedded value.
  - This enables crack_and_chunk_and_embed to skip loading EmbeddedDocument partitions when no documents were embedded (all reused).
## 0.2.18
- Add new task to crack, chunk, embed, index to ACS, and register MLIndex in one step.
- Handle `openai.api_type` being `None`
## 0.2.17
- Fix loading MLIndex failure. Don't need to get the `endpoint` from connection when it is already provided.
- Try to use the `langchain` VectorStore and fall back to the vendored one
- Support `azure-search-documents==11.4.0b11`
- Add support for Pinecone in DataIndex
## 0.2.16
- Use Retry-After when aoai embedding endpoint throws RateLimitError
## 0.2.15.1
- Fix vendored FAISS langchain VectorStore to only error when a doc is `None` (rather than when a Document isn't exactly the right class)
## 0.2.15
- Support PDF cracking with Azure Document Intelligence service
- `crack_and_chunk_and_embed` now pulls documents through to embedding (streaming) and embeds documents in parallel batches
- Update default field names.
- Fix long file name bug when writing to output during crack and chunk
## 0.2.14
- Fix git_clone to handle WorkspaceConnections, again.
## 0.2.13
- Fix git_clone to handle WorkspaceConnection objects and urls with usernames already in them.
## 0.2.12
- Only process `.jsonl` and `.csv` files when reading chunks for embedding.
## 0.2.11
- Check casing for model kind and api_type
- Ensure api_version not being set is supported and the defaults make sense.
- Add support for Pinecone indexes
## 0.2.10
- Fix QA generator and connections check for ApiType metadata
## 0.2.9
- QA data generation accepts connection as input
## 0.2.8
- Remove `allowed_special="all"` from tiktoken usage as it encodes special tokens like `<|endoftext|>` as their special token rather than as plain text (which is the case when only `disallowed_special=()` is set on its own)
- Stop truncating texts to embed (to model ctx length) as new `azureml.rag.embeddings.OpenAIEmbedder` handles batching and splitting long texts pre-embed then averaging the results into a single final embedding.
- Loosen tiktoken version range from `~=0.3.0` to `<1`
## 0.2.7
- Don't try and use MLClient for connections if azure-ai-ml<1.10.0
- Handle Custom Connections which azure-ai-ml can't deserialize today.
- Allow passing faiss index engine to MLIndex local
- Pass chunks directly into write_chunks_to_jsonl
## 0.2.6
- Fix jsonl output mode of crack_and_chunk writing csv internally.
## 0.2.5
- Ensure EmbeddingsContainer.mount_and_load sets `create_destination=True` when mounting to create embeddings_cache location if it's not already created.
- Fix `safe_mlflow_start_run` to `yield None` when mlflow not available
- Handle custom `field_mappings` passed to `update_acs` task.
## 0.2.4
- Introduce `crack_and_chunk_and_embed` task which tracks deletions and reused source + documents to enable full sync with indexes, leveraging EmbeddingsContainer for storage of this information across Snapshots.
- Restore `workspace_connection_to_credential` function.
## 0.2.3
- Fix git clone url format bug
## 0.2.2
- Fix all langchain splitters to use tiktoken in an airgap friendly way.
## 0.2.1
- Introduce DataIndex interface for scheduling Vector Index Pipeline in AzureML and creating MLIndex Assets
- Vendor various langchain components to avoid breaking changes to MLIndex internal logic
## 0.1.24.2
- Fix all langchain splitters to use tiktoken in an airgap friendly way.
## 0.1.24.1
- Fix subsplitter init bug in MarkdownHeaderSplitter
- Support getting langchain retriever for ACS based MLIndex with embeddings.kind: none.
## 0.1.24
- Don't mlflow log unless there's an active mlflow run.
- Support `langchain.vectorstores.azuresearch` after `langchain>=0.0.273` upgraded to `azure-search-documents==11.4.0b8`
- Use tiktoken encodings from package for other splitter types
## 0.1.23.2
- Handle `Path` objects passed into `MLIndex` init.
## 0.1.23.1
- Handle `<region>.api.cognitive` style aoai endpoints correctly
## 0.1.23
- Ensure tiktoken encodings are packaged in wheel
## 0.1.22
- Set environment variables to pull encodings files from directory with cache key to avoid tiktoken external network call
- Fix mlflow log error when there's no files input
## 0.1.21
- Fix top level imports in `update_acs` task failing without helpful reason when old `azure-search-documents` is installed.
## 0.1.20
- Fix Crack'n'Chunk race-condition where same named files would overwrite each other.
## 0.1.19
- Various bug fixes:
  - Handle some malformed git urls in `git_clone` task
  - Try fall back when parsing csv with pandas fails
  - Allow chunking special tokens
  - Ensure logging with mlflow can't fail a task
- Update to support latest `azure-search-documents==11.4.0b8`
## 0.1.18
- Add FaissAndDocStore and FileBasedDocStore, which closely mirror langchain's FAISS and InMemoryDocStore without the langchain or pickle dependency. These are not used by default until PromptFlow support has been added.
- Pin `azure-search-documents==11.4.0b6` as there are breaking changes in `11.4.0b7` and `11.4.0b8`
## 0.1.17
- Update interactions with Azure Cognitive Search to use the latest azure-search-documents SDK
## 0.1.16
- Convert api_type from Workspace Connections to lower case to appease langchain's case-sensitive checking.
## 0.1.15
- Add support for custom loaders
- Added logging for `MLIndex.__init__` to understand usage of MLIndex
## 0.1.14
- Add Support for CustomKeys connections
- Add OpenAI support for QA Gen and Embeddings
## 0.1.13 (2023-07-12)
- Implement single node non-PRS embed task to enable clearer logs for users.
## 0.1.12 (2023-06-29)
- Fix casing check of ApiVersion, ApiType in infer_deployment util
## 0.1.11 (2023-06-28)
- Update casing check for workspace connection ApiVersion, ApiType
- int casting for temperature, max_tokens
## 0.1.10 (2023-06-26)
- Update data asset registering to have adjustable output_type
- Remove asset registering from generate_qa.py
## 0.1.9 (2023-06-22)
- Add `azureml.rag.data_generation` module.
- Fixed bug that would cause crack_and_chunk to fail for documents that contain non-utf-8 characters. Currently these characters will be ignored.
- Improved heading extraction from Markdown files. When `use_rcts=False` Markdown files will be split on headings and each chunk will have the heading context up to the root as a prefix (e.g. `# Heading 1\n## Heading 2\n# Heading 3\n{content}`)
## 0.1.8 (2023-06-21)
- Add deployment inferring util for use in azureml-insider notebooks.
## 0.1.7 (2023-06-08)
- Improved telemetry for tasks (used in RAG Pipeline Components)
## 0.1.6 (2023-05-31)
- Fail crack_and_chunk task when no files were processed (usually because of a malformed `input_glob`)
- Change `update_acs.py` to default `push_embeddings=True` instead of `False`.
## 0.1.5 (2023-05-19)
- Add api_base back to MLIndex embeddings config for back-compat (until all clients start getting it from Workspace Connection).
- Add telemetry for tasks used in pipeline components, not enabled by default for SDK usage.
## 0.1.4 (2023-05-17)
- Fix bug where enabling rcts option on split_documents used nltk splitter instead.
## 0.1.3 (2023-05-12)
- Support Workspace Connection based auth for Git, Azure OpenAI and Azure Cognitive Search usage.
## 0.1.2 (2023-05-05)
- Refactored document chunking to allow insertion of custom processing logic
## 0.0.1 (2023-04-25)
### Features Added
- Introduced package
- langchain Retriever for Azure Cognitive Search
Raw data
{
"_id": null,
"home_page": "https://docs.microsoft.com/python/api/overview/azure/ml/?view=azure-ml-py",
"name": "azureml-rag",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0,>=3.9",
"maintainer_email": null,
"keywords": null,
"author": "Microsoft Corporation",
"author_email": null,
"download_url": null,
"platform": null,
"description": "# AzureML Retrieval Augmented Generation Utilities\r\n\r\nThis package is in alpha stage at the moment, use at risk of breaking changes and unstable behavior.\r\n\r\nIt contains utilities for:\r\n\r\n- Processing text documents into chunks appropriate for use in LLM prompts, with metadata such is source url.\r\n- Embedding chunks with OpenAI or HuggingFace embeddings models, including the ability to update a set of embeddings over time.\r\n- Create MLIndex artifacts from embeddings, a yaml file capturing metadata needed to deserialize different kinds of Vector Indexes for use in langchain. Supported Index types:\r\n - FAISS index (via langchain)\r\n - Azure Cognitive Search index\r\n - Pinecone index\r\n - Milvus index\r\n - Azure Cosmos Mongo vCore index\r\n - MongoDB\r\n\r\n## Getting started\r\n\r\nYou can install AzureMLs RAG package using pip.\r\n\r\n```bash\r\npip install azureml-rag\r\n```\r\n\r\nThere are various extra installs you probably want to include based on intended use:\r\n- `faiss`: When using FAISS based Vector Indexes\r\n- `cognitive_search`: When using Azure Cognitive Search Indexes\r\n- `pinecone`: When using Pinecone Indexes\r\n- `azure_cosmos_mongo_vcore`: When using Azure Cosmos Mongo vCore Indexes\r\n- `hugging_face`: When using Sentence Transformer embedding models from HuggingFace (local inference)\r\n- `document_parsing`: When cracking and chunking documents locally to put in an Index\r\n- `mongodb`: When using native mongo db indexes\r\n\r\n## MLIndex\r\n\r\nMLIndex files describe an index of data + embeddings and the embeddings model used in yaml.\r\n\r\nAzure Cognitive Search Index:\r\n\r\n```yaml\r\nembeddings:\r\n dimension: 768\r\n kind: hugging_face\r\n model: sentence-transformers/all-mpnet-base-v2\r\n schema_version: '2'\r\nindex:\r\n api_version: 2021-04-30-Preview\r\n connection:\r\n id: 
/subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/connections/<acs_connection_name>\r\n connection_type: workspace_connection\r\n endpoint: https://<acs_name>.search.windows.net\r\n engine: azure-sdk\r\n field_mapping:\r\n content: content\r\n filename: filepath\r\n metadata: meta_json_string\r\n title: title\r\n url: url\r\n embedding: contentVector\r\n index: azureml-rag-test-206e03b6-3880-407b-9bc4-c0a1162d6c70\r\n kind: acs\r\n```\r\n\r\nPinecone Index:\r\n\r\n```yaml\r\nembeddings:\r\n dimension: 768\r\n kind: hugging_face\r\n model: sentence-transformers/all-mpnet-base-v2\r\n schema_version: '2'\r\nindex:\r\n connection:\r\n id: /subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/connections/<pinecone_connection_name>\r\n connection_type: workspace_connection\r\n engine: pinecone-sdk\r\n field_mapping:\r\n content: content\r\n filename: filepath\r\n metadata: metadata_json_string\r\n title: title\r\n url: url\r\n index: azureml-rag-test-206e03b6-3880-407b-9bc4-c0a1162d6c70\r\n kind: pinecone\r\n```\r\n\r\nAzure Cosmos Mongo vCore Index:\r\n\r\n```yaml\r\nembeddings:\r\n dimension: 768\r\n kind: hugging_face\r\n model: sentence-transformers/all-mpnet-base-v2\r\n schema_version: '2'\r\nindex:\r\n connection:\r\n id: /subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/connections/<cosmos_connection_name>\r\n connection_type: workspace_connection\r\n engine: pymongo-sdk\r\n field_mapping:\r\n content: content\r\n filename: filepath\r\n metadata: metadata_json_string\r\n title: title\r\n url: url\r\n embedding: contentVector\r\n database: azureml-rag-test-db\r\n collection: azureml-rag-test-collection\r\n index: azureml-rag-test-206e03b6-3880-407b-9bc4-c0a1162d6c70\r\n kind: azure_cosmos_mongo_vcore\r\n```\r\n\r\n### Create 
MLIndex\r\n\r\nExamples using MLIndex remotely with AzureML and locally with langchain live here: https://github.com/Azure/azureml-examples/tree/main/sdk/python/generative-ai/rag\r\n\r\n### Consume MLIndex\r\n\r\n```python\r\nfrom azureml.rag.mlindex import MLIndex\r\n\r\nretriever = MLIndex(uri_to_folder_with_mlindex).as_langchain_retriever()\r\nretriever.get_relevant_documents('What is an AzureML Compute Instance?')\r\n```\r\n\r\n\r\n# Changelog\r\n\r\nPlease insert change log into \"Next Release\" ONLY.\r\n\r\n## Next release\r\n\r\n## 0.2.37.2\r\n\r\n- Fix PydanticUndefinedAnnotation: name 'AzureSearch' is not defined\r\n- Remove Python 3.8 support\r\n\r\n## 0.2.37.1\r\n\r\n- Upgrade nltk to >=3.9.1, <4.0\r\n\r\n## 0.2.37\r\n\r\n- Upgrade langchain to 0.3.x\r\n- Upgrade langchain-community to 0.3.x\r\n- Upgrade langchain-pinecone to 0.2.x\r\n- Upgrade pinecone-client to 5.0.x\r\n\r\n## 0.2.36\r\n\r\n- Implement mongodb vector store and ml index supports\r\n- Detect OBO credential with AZUREML_OBO_ENABLED environment variable\r\n- ACS update on changed (new or deleted) documents\r\n- Drop azure-search-documents 11.4.0 beta version support\r\n\r\n## 0.2.35\r\n\r\n- Implement cosmosdb for nosql vector store and ml index supports\r\n- Relax langchain version constraint\r\n- Upgraded langchain-pinecone version to 0.1.1 and pinecone-client version\r\n\r\n## 0.2.34\r\n\r\n- Update azure-ai-ml version to 1.16.1 by introducing noneCredentialConfigure and add authType for AadCredentialConfigure\r\n- Use set of exceptions as retry_exceptions in backoff_retry_on_exceptions\r\n\r\n## 0.2.33\r\n\r\n- Support existing qdrant indices\r\n- Mitigate PF failure while more than 3 lookup tools used in a flow\r\n- Add the retry for the embedder if there was a successfully embedding\r\n\r\n## 0.2.32\r\n\r\n- Implement langchain weaviate vectorstore in mlindex\r\n- Get connection in `get_connection_by_id_v2` with caller specified credential\r\n- Set upper bound for `azure-ai-ml` to 
1.15.0\r\n\r\n## 0.2.31.1\r\n\r\n- Update search index with azure-search-documents 11.4.0\r\n- Add azureml-core in the dependency list\r\n\r\n## 0.2.31\r\n\r\n- Categorize user error and system error, and update RH accordingly to show in logs\r\n- Bugfix using obo credential for AAD connections.\r\n- Prevention fix to support AadCredentialConfig in Connection object\r\n- Update Pinecone legacy API\r\n- Creating image embedding index with azure-search-documents 11.4.0\r\n\r\n## 0.2.30.2\r\n\r\n- Bugfix remove azure_ad_token_provider from EmbeddingContainer metadata\r\n- Set embeddings_model as optional argument\r\n\r\n## 0.2.30.1\r\n\r\n- Introduce `elasticsearch` extra to declare transitive dependency on the `elasticsearch` package when using Elasticsearch indices.\r\n\r\n## 0.2.30\r\n\r\n- Bugfix in models.py to handle empty deployment name.\r\n- Supporting existing elasticsearch indices\r\n- Bug fix in `crack_and_chunk_and_embed_and_index`\r\n- Fixing bug in using AAD auth type ACS connections.\r\n\r\n## 0.2.29.2\r\n\r\n- Fixing ACS index creation failure with azure-search-documents 11.4.0\r\n\r\n## 0.2.29.1\r\n\r\n- Fixing FAISS, dependable_faiss_import import failure with Langchain 0.1.x\r\n\r\n## 0.2.29\r\n\r\n- Support AAD and MSI auth type in AOAI, ACS connection\r\n\r\n## 0.2.28\r\n\r\n- Ensure compatibility with newer versions of azure-ai-ml.\r\n- Upgrade langchain to support up to 0.1\r\n\r\n## 0.2.27\r\n\r\n- Support Cohere serverless endpoint\r\n- Support multiple ACS lookups in the same process, eliminating field mapping conflicts\r\n- Support pass-in credential in get_connection_by_name_v2 to unblock managed vNet setup\r\n- Update validate_deployments in crack_chunk_embed_index_and_register.py\r\n\r\n## 0.2.26\r\n\r\n- Support for .csv and .json file extensions in pipeline\r\n- Ignore mlflow.exceptions.RestException in safe_mlflow_log_metric\r\n- validate_deployments supports openai v1.0+\r\n- Removing unexpected keyword argument 'engine'\r\n- 
Checking ACS account has enough index quota\r\n- infer_deployment supports openai v1.0+\r\n- Create missing fields for existing index\r\n\r\n## 0.2.25\r\n\r\n- Using local cached encodings.\r\n- Adding convert_to_dict() for openai v1.0+\r\n- Check index_config before passing in validate_deployments.py\r\n- Limit size of documents upload to ACS in one batch to solve RequestEntityTooLargeError\r\n\r\n## 0.2.24.2\r\n\r\n- Supporting `*.cognitiveservices.*` endpoint\r\n- Adding azureml-rag specific user_agent when using DocumentIntelligence\r\n- Refactored update index tasks\r\n- Supporting uppercase file extensions name in crack_and_chunk\r\n- Fixing Deployment importing bug in utils\r\n- Adding the playgroundType tag in MLIndex Asset used for Azure AI studio playground\r\n- Remove mandatory module-level imports of optional extra packages\r\n\r\n## 0.2.24.1\r\n\r\n- Fixing is_florence key detection\r\n- Using 'embedding_connection_id' instead of 'florence_connection_id' as parameter name\r\n\r\n## 0.2.24\r\n\r\n- Introducing image ingestion with florence embedding API\r\n- Adding dummy output to validate_deployments for holding the right order\r\n- Fixing DeploymentNotFound bug\r\n\r\n## 0.2.23.5\r\n\r\n- Deprecate pkg_resources in logging.py (https://setuptools.pypa.io/en/latest/pkg_resources.html)\r\n\r\n## 0.2.23.4\r\n\r\n- Make the `api_type` parameter non-case sensitive in OpenAIEmbedder\r\n- Bug fix in embeddings container path\r\n\r\n## 0.2.23.3\r\n\r\n- Set upper bound for `langchain` to 0.0.348\r\n\r\n## 0.2.23.2\r\n\r\n- Make tiktoken pull from a cache instead of making the outgoing network call to get encodings files\r\n- Add support for Azure Cosmos Mongo vCore\r\n\r\n## 0.2.23.1\r\n\r\n- Fixing exception handling in validate_deployments to support OpenAI v1.0+\r\n\r\n## 0.2.23\r\n\r\n- Support OpenAI v1.0 +\r\n- Handle FAISS.load_local() change since Langchain 0.0.318\r\n- Handle mailto links in url crawling component.\r\n- Add support for Milvus vector 
store\r\n\r\n## 0.2.22\r\n\r\n- update pypdf's version to 3.17.1 in document-parsing.\r\n\r\n## 0.2.21\r\n\r\n- Use workspace connection tags instead of metadata since it's deprecated.\r\n- Fix bug handling single files in `files_to_document_sources`\r\n\r\n## 0.2.20\r\n\r\n- Initial introduction of validate_deployments.\r\n- Asset registration in \\*\\_and_register attempts to infer target workspace from asset_uri and handle multiple auth options\r\n- activity_logger moved out as first arg, this is an intermediate step as logger also shouldn't be first arg and instead handled by get_logger, activity_logger should be truly optional.\r\n- validate_deployments itself was modified to make its interface closer to what existing tasks expect as input, and callable from other tasks as a function.\r\n\r\n## 0.2.19\r\n\r\n- Introduce a new `path` parameter in the `index` section of MLIndex documents over FAISS indices, to allow the path to FAISS index files to be different from the MLIndex document path.\r\n- Ensure `MLIndex.base_uri` is never undefined for a valid MLIndex object.\r\n\r\n## 0.2.18.1\r\n\r\n- Only save out metadata before embedding in crack_and_chunk_and_embed_and_index\r\n- Update create_embeddings to return num_embedded value.\r\n - This enables crack_and_chunk_and_embed to skip loading EmbeddedDocument partitions when no documents were embedded (all reused).\r\n\r\n## 0.2.18\r\n\r\n- Add new task to crack, chunk, embed, index to ACS, and register MLIndex in one step.\r\n- Handle `openai.api_type` being `None`\r\n\r\n## 0.2.17\r\n\r\n- Fix loading MLIndex failure. 
Don't need to get the `endpoint` from connection when it is already provided.\r\n- Try use `langchain` VectorStore and fallback to vendor\r\n- Support `azure-search-documents==11.4.0b11``\r\n- Add support for Pinecone in DataIndex\r\n\r\n## 0.2.16\r\n\r\n- Use Retry-After when aoai embedding endpoint throws RateLimitError\r\n\r\n## 0.2.15.1\r\n\r\n- Fix vendored FAISS langchain VectorStore to only error when a doc is `None` (rather than when a Document isn't exactly the right class)\r\n\r\n## 0.2.15\r\n\r\n- Support PDF cracking with Azure Document Intelligence service\r\n- `crack_and_chunk_and_embed` now pulls documents through to embedding (streaming) and embeds documents in parallel batches\r\n- Update default field names.\r\n- Fix long file name bug when writing to output during crack and chunk\r\n\r\n## 0.2.14\r\n\r\n- Fix git_clone to handle WorkspaceConnections, again.\r\n\r\n## 0.2.13\r\n\r\n- Fix git_clone to handle WorkspaceConnection objects and urls with usernames already in them.\r\n\r\n## 0.2.12\r\n\r\n- Only process `.jsonl` and `.csv` files when reading chunks for embedding.\r\n\r\n## 0.2.11\r\n\r\n- Check casing for model kind and api_type\r\n- Ensure api_version not being set is supported and default make sense.\r\n- Add support for Pinecone indexes\r\n\r\n## 0.2.10\r\n\r\n- Fix QA generator and connections check for ApiType metadata\r\n\r\n## 0.2.9\r\n\r\n- QA data generation accepts connection as input\r\n\r\n## 0.2.8\r\n\r\n- Remove `allowed_special=\"all\"` from tiktoken usage as it encodes special tokens like `<|endoftext|>` as their special token rather then as plain text (which is the case when only `disallowed_special=()` is set on its own)\r\n- Stop truncating texts to embed (to model ctx length) as new `azureml.rag.embeddings.OpenAIEmbedder` handles batching and splitting long texts pre-embed then averaging the results into a single final embedding.\r\n- Loosen tiktoken version range from `~=0.3.0` to `<1`\r\n\r\n## 0.2.7\r\n\r\n- Don't 
try and use MLClient for connections if azure-ai-ml<1.10.0\r\n- Handle Custom Conenctions which azure-ai-ml can't deserialize today.\r\n- Allow passing faiss index engine to MLIndex local\r\n- Pass chunks directly into write_chunks_to_jsonl\r\n\r\n## 0.2.6\r\n\r\n- Fix jsonl output mode of crack_and_chunk writing csv internally.\r\n\r\n## 0.2.5\r\n\r\n- Ensure EmbeddingsContainer.mount_and_load sets `create_destination=True` when mounting to create embeddings_cache location if it's not already created.\r\n- Fix `safe_mlflow_start_run` to `yield None` when mlflow not available\r\n- Handle custom `field_mappings` passed to `update_acs` task.\r\n\r\n## 0.2.4\r\n\r\n- Introduce `crack_and_chunk_and_embed` task which tracks deletions and reused source + documents to enable full sync with indexes, levering EmbeddingsContainer for storage of this information across Snapshots.\r\n- Restore `workspace_connection_to_credential` function.\r\n\r\n## 0.2.3\r\n\r\n- Fix git clone url format bug\r\n\r\n## 0.2.2\r\n\r\n- Fix all langchain splitter to use tiktoken in an airgap friendly way.\r\n\r\n## 0.2.1\r\n\r\n- Introduce DataIndex interface for scheduling Vector Index Pipeline in AzureML and creating MLIndex Assets\r\n- Vendor various langchain components to avoid breaking changes to MLIndex internal logic\r\n\r\n## 0.1.24.2\r\n\r\n- Fix all langchain splitter to use tiktoken in an airgap friendly way.\r\n\r\n## 0.1.24.1\r\n\r\n- Fix subsplitter init bug in MarkdownHeaderSplitter\r\n- Support getting langchain retriever for ACS based MLIndex with embeddings.kind: none.\r\n\r\n## 0.1.24\r\n\r\n- Don't mlflow log unless there's an active mlflow run.\r\n- Support `langchain.vectorstores.azuresearch` after `langchain>=0.0.273` upgraded to `azure-search-documents==11.4.0b8`\r\n- Use tiktoken encodings from package for other splitter types\r\n\r\n## 0.1.23.2\r\n\r\n- Handle `Path` objects passed into `MLIndex` init.\r\n\r\n## 0.1.23.1\r\n\r\n- Handle <region>.api.cognitive style aoai 
endpoints correctly\r\n\r\n## 0.1.23\r\n\r\n- Ensure tiktoken encodings are packaged in wheel\r\n\r\n## 0.1.22\r\n\r\n- Set environment variables to pull encodings files from directory with cache key to avoid tiktoken external network call\r\n- Fix mlflow log error when there are no input files\r\n\r\n## 0.1.21\r\n\r\n- Fix top level imports in `update_acs` task failing without helpful reason when old `azure-search-documents` is installed.\r\n\r\n## 0.1.20\r\n\r\n- Fix Crack'n'Chunk race-condition where same named files would overwrite each other.\r\n\r\n## 0.1.19\r\n\r\n- Various bug fixes:\r\n - Handle some malformed git urls in `git_clone` task\r\n - Try to fall back when parsing csv with pandas fails\r\n - Allow chunking special tokens\r\n - Ensure logging with mlflow can't fail a task\r\n- Update to support latest `azure-search-documents==11.4.0b8`\r\n\r\n## 0.1.18\r\n\r\n- Add FaissAndDocStore and FileBasedDocStore which closely mirror langchain's FAISS and InMemoryDocStore without the langchain or pickle dependency. 
These are not used by default until PromptFlow support has been added.\r\n- Pin `azure-search-documents==11.4.0b6` as there are breaking changes in `11.4.0b7` and `11.4.0b8`\r\n\r\n## 0.1.17\r\n\r\n- Update interactions with Azure Cognitive Search to use latest azure-search-documents SDK\r\n\r\n## 0.1.16\r\n\r\n- Convert api_type from Workspace Connections to lower case to appease langchain's case-sensitive checking.\r\n\r\n## 0.1.15\r\n\r\n- Add support for custom loaders\r\n- Added logging for `MLIndex.__init__` to understand usage of MLIndex\r\n\r\n## 0.1.14\r\n\r\n- Add Support for CustomKeys connections\r\n- Add OpenAI support for QA Gen and Embeddings\r\n\r\n## 0.1.13 (2023-07-12)\r\n\r\n- Implement single node non-PRS embed task to enable clearer logs for users.\r\n\r\n## 0.1.12 (2023-06-29)\r\n\r\n- Fix casing check of ApiVersion, ApiType in infer_deployment util\r\n\r\n## 0.1.11 (2023-06-28)\r\n\r\n- Update casing check for workspace connection ApiVersion, ApiType\r\n- int casting for temperature, max_tokens\r\n\r\n## 0.1.10 (2023-06-26)\r\n\r\n- Update data asset registering to have adjustable output_type\r\n- Remove asset registering from generate_qa.py\r\n\r\n## 0.1.9 (2023-06-22)\r\n\r\n- Add `azureml.rag.data_generation` module.\r\n- Fixed bug that would cause crack_and_chunk to fail for documents that contain non-utf-8 characters. Currently these characters will be ignored.\r\n- Improved heading extraction from Markdown files. When `use_rcts=False` Markdown files will be split on headings and each chunk will have the heading context up to the root as a prefix (e.g. 
`# Heading 1\\n## Heading 2\\n# Heading 3\\n{content}`)\r\n\r\n## 0.1.8 (2023-06-21)\r\n\r\n- Add deployment inferring util for use in azureml-insider notebooks.\r\n\r\n## 0.1.7 (2023-06-08)\r\n\r\n- Improved telemetry for tasks (used in RAG Pipeline Components)\r\n\r\n## 0.1.6 (2023-05-31)\r\n\r\n- Fail crack_and_chunk task when no files were processed (usually because of a malformed `input_glob`)\r\n- Change `update_acs.py` to default `push_embeddings=True` instead of `False`.\r\n\r\n## 0.1.5 (2023-05-19)\r\n\r\n- Add api_base back to MLIndex embeddings config for back-compat (until all clients start getting it from Workspace Connection).\r\n- Add telemetry for tasks used in pipeline components, not enabled by default for SDK usage.\r\n\r\n## 0.1.4 (2023-05-17)\r\n\r\n- Fix bug where enabling rcts option on split_documents used nltk splitter instead.\r\n\r\n## 0.1.3 (2023-05-12)\r\n\r\n- Support Workspace Connection based auth for Git, Azure OpenAI and Azure Cognitive Search usage.\r\n\r\n## 0.1.2 (2023-05-05)\r\n\r\n- Refactored document chunking to allow insertion of custom processing logic\r\n\r\n## 0.0.1 (2023-04-25)\r\n\r\n### Features Added\r\n\r\n- Introduced package\r\n- langchain Retriever for Azure Cognitive Search\r\n",
"bugtrack_url": null,
"license": "Proprietary https://aka.ms/azureml-preview-sdk-license",
"summary": "Contains Retrieval Augmented Generation related utilities for Azure Machine Learning and OSS interoperability.",
"version": "0.2.37.2",
"project_urls": {
"Homepage": "https://docs.microsoft.com/python/api/overview/azure/ml/?view=azure-ml-py"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "33d70882f9375cbb9ea6997ba18c4ff153e647f479e40b638e836da08eda0510",
"md5": "0e055666e4043f1578c0cd15de8a45da",
"sha256": "eac352f58e29f01b76f447bf357003592370a40f30b74d910d7371da59d112be"
},
"downloads": -1,
"filename": "azureml_rag-0.2.37.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "0e055666e4043f1578c0cd15de8a45da",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.9",
"size": 1697400,
"upload_time": "2025-01-08T01:21:22",
"upload_time_iso_8601": "2025-01-08T01:21:22.324782Z",
"url": "https://files.pythonhosted.org/packages/33/d7/0882f9375cbb9ea6997ba18c4ff153e647f479e40b638e836da08eda0510/azureml_rag-0.2.37.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-01-08 01:21:22",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "azureml-rag"
}