azureml-rag


- Name: azureml-rag
- Version: 0.2.31
- Home page: https://docs.microsoft.com/python/api/overview/azure/ml/?view=azure-ml-py
- Summary: Contains Retrieval Augmented Generation related utilities for Azure Machine Learning and OSS interoperability.
- Upload time: 2024-05-07 05:39:25
- Author: Microsoft Corporation
- Requires Python: `<4.0,>=3.8`
- License: Proprietary (https://aka.ms/azureml-preview-sdk-license)

# AzureML Retrieval Augmented Generation Utilities

This package is currently in alpha. Use it at your own risk: breaking changes and unstable behavior are possible.

It contains utilities for:

- Processing text documents into chunks appropriate for use in LLM prompts, with metadata such as the source URL.
- Embedding chunks with OpenAI or HuggingFace embedding models, including the ability to update a set of embeddings over time.
- Creating MLIndex artifacts from embeddings. An MLIndex is a YAML file capturing the metadata needed to deserialize different kinds of vector indexes for use in langchain. Supported index types:
  - FAISS index (via langchain)
  - Azure Cognitive Search index
  - Pinecone index
  - Milvus index
  - Azure Cosmos Mongo vCore index

## Getting started

You can install AzureML's RAG package using pip.

```bash
pip install azureml-rag
```

Depending on your intended use, you will probably want to include one or more of the following extras:

- `faiss`: When using FAISS based Vector Indexes
- `cognitive_search`: When using Azure Cognitive Search Indexes
- `pinecone`: When using Pinecone Indexes
- `azure_cosmos_mongo_vcore`: When using Azure Cosmos Mongo vCore Indexes
- `hugging_face`: When using Sentence Transformer embedding models from HuggingFace (local inference)
- `document_parsing`: When cracking and chunking documents locally to put in an Index
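
Extras are requested with pip's bracket syntax, for example `pip install "azureml-rag[faiss,hugging_face]"`. As a minimal sketch, the helper below assembles that requirement string from the extras listed above and rejects names that are not in the list. `requirement_for` and `KNOWN_EXTRAS` are hypothetical names used for illustration; they are not part of azureml-rag.

```python
# Extras named in the list above. `requirement_for` is a hypothetical helper,
# not something shipped with azureml-rag.
KNOWN_EXTRAS = {
    "faiss",
    "cognitive_search",
    "pinecone",
    "azure_cosmos_mongo_vcore",
    "hugging_face",
    "document_parsing",
}


def requirement_for(extras, package="azureml-rag"):
    """Build a pip requirement string such as 'azureml-rag[faiss,pinecone]'."""
    unknown = sorted(set(extras) - KNOWN_EXTRAS)
    if unknown:
        raise ValueError(f"Unknown extras: {unknown}")
    if not extras:
        return package
    # Sort and de-duplicate so the output is deterministic.
    return f"{package}[{','.join(sorted(set(extras)))}]"


print(requirement_for(["faiss", "hugging_face"]))  # azureml-rag[faiss,hugging_face]
```

The resulting string can be passed to `pip install` as-is (quoted, so the shell does not interpret the brackets).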

## MLIndex

An MLIndex file is a YAML document that describes an index of data and embeddings, together with the embedding model used to produce them.

Azure Cognitive Search Index:

```yaml
embeddings:
  dimension: 768
  kind: hugging_face
  model: sentence-transformers/all-mpnet-base-v2
  schema_version: '2'
index:
  api_version: 2021-04-30-Preview
  connection:
    id: /subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/connections/<acs_connection_name>
  connection_type: workspace_connection
  endpoint: https://<acs_name>.search.windows.net
  engine: azure-sdk
  field_mapping:
    content: content
    filename: filepath
    metadata: meta_json_string
    title: title
    url: url
    embedding: contentVector
  index: azureml-rag-test-206e03b6-3880-407b-9bc4-c0a1162d6c70
  kind: acs
```

Pinecone Index:

```yaml
embeddings:
  dimension: 768
  kind: hugging_face
  model: sentence-transformers/all-mpnet-base-v2
  schema_version: '2'
index:
  connection:
    id: /subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/connections/<pinecone_connection_name>
  connection_type: workspace_connection
  engine: pinecone-sdk
  field_mapping:
    content: content
    filename: filepath
    metadata: metadata_json_string
    title: title
    url: url
  index: azureml-rag-test-206e03b6-3880-407b-9bc4-c0a1162d6c70
  kind: pinecone
```

Azure Cosmos Mongo vCore Index:

```yaml
embeddings:
  dimension: 768
  kind: hugging_face
  model: sentence-transformers/all-mpnet-base-v2
  schema_version: '2'
index:
  connection:
    id: /subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/connections/<cosmos_connection_name>
  connection_type: workspace_connection
  engine: pymongo-sdk
  field_mapping:
    content: content
    filename: filepath
    metadata: metadata_json_string
    title: title
    url: url
    embedding: contentVector
  database: azureml-rag-test-db
  collection: azureml-rag-test-collection
  index: azureml-rag-test-206e03b6-3880-407b-9bc4-c0a1162d6c70
  kind: azure_cosmos_mongo_vcore
```
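
Once parsed by a YAML loader, each of the documents above becomes a nested mapping with an `embeddings` section and an `index` section. The following is a minimal sketch of inspecting such a mapping, assuming the file has already been loaded into a dict. `describe_mlindex` is a hypothetical helper for illustration, not part of the azureml-rag API.

```python
def describe_mlindex(config):
    """Summarize a parsed MLIndex mapping (hypothetical helper for illustration)."""
    embeddings = config.get("embeddings", {})
    index = config.get("index", {})
    if "kind" not in index:
        raise ValueError("MLIndex 'index' section is missing 'kind'")
    return {
        "index_kind": index["kind"],
        "index_name": index.get("index"),
        "embedding_model": embeddings.get("model"),
        "embedding_dimension": embeddings.get("dimension"),
        # Logical field names (content, url, ...) that the index maps to store fields.
        "mapped_fields": sorted(index.get("field_mapping", {})),
    }


# Mirrors the Pinecone example above, trimmed to the keys used here.
pinecone_mlindex = {
    "embeddings": {
        "dimension": 768,
        "kind": "hugging_face",
        "model": "sentence-transformers/all-mpnet-base-v2",
        "schema_version": "2",
    },
    "index": {
        "engine": "pinecone-sdk",
        "field_mapping": {"content": "content", "url": "url"},
        "index": "azureml-rag-test-206e03b6-3880-407b-9bc4-c0a1162d6c70",
        "kind": "pinecone",
    },
}

summary = describe_mlindex(pinecone_mlindex)
print(summary["index_kind"], summary["embedding_dimension"])  # pinecone 768
```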

### Create MLIndex

Examples using MLIndex remotely with AzureML and locally with langchain live here: https://github.com/Azure/azureml-examples/tree/main/sdk/python/generative-ai/rag

### Consume MLIndex

```python
from azureml.rag.mlindex import MLIndex

retriever = MLIndex(uri_to_folder_with_mlindex).as_langchain_retriever()
retriever.get_relevant_documents('What is an AzureML Compute Instance?')
```
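
The retriever returns langchain `Document` objects, which expose `page_content` and `metadata`. Below is a minimal sketch of turning such results into prompt context. `FakeDocument` stands in for langchain's `Document` so the snippet runs without a real index, `format_context` is a hypothetical helper, and the `url` metadata key is an assumption based on the field mappings shown earlier; the actual metadata layout may differ.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in with the same two attributes as langchain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_context(docs, max_chars=2000):
    """Join retrieved chunks into one context string, labelling each source."""
    parts = []
    used = 0
    for doc in docs:
        # Assumed metadata key; fall back to a generic label when it is absent.
        source = doc.metadata.get("url", "unknown source")
        entry = f"[{source}]\n{doc.page_content}"
        if used + len(entry) > max_chars:
            break  # Stop before exceeding the character budget.
        parts.append(entry)
        used += len(entry)
    return "\n\n".join(parts)


docs = [
    FakeDocument("Placeholder text of the first chunk.", {"url": "https://example.com/a"}),
    FakeDocument("Placeholder text of the second chunk.", {}),
]
print(format_context(docs))
```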


# Changelog

Please insert change log into "Next Release" ONLY.

## Next release

## 0.2.31

- Categorize user error and system error, and update RH accordingly to show in logs
- Bugfix using obo credential for AAD connections.
- Prevention fix to support AadCredentialConfig in Connection object
- Update Pinecone legacy API
- Creating image embedding index with azure-search-documents 11.4.0

## 0.2.30.2

- Bugfix remove azure_ad_token_provider from EmbeddingContainer metadata
- set embeddings_model as optional argument

## 0.2.30.1

- Introduce `elasticsearch` extra to declare transitive dependency on the `elasticsearch` package when using Elasticsearch indices.

## 0.2.30

- Bugfix in models.py to handle empty deployment name.
- Supporting existing elasticsearch indices
- Bug fix in `crack_and_chunk_and_embed_and_index`
- Fixing bug in using AAD auth type ACS connections.

## 0.2.29.2

- Fixing ACS index creation failure with azure-search-documents 11.4.0

## 0.2.29.1

- Fixing FAISS, dependable_faiss_import import failure with Langchain 0.1.x

## 0.2.29

- Support AAD and MSI auth type in AOAI, ACS connection

## 0.2.28

- Ensure compatibility with newer versions of azure-ai-ml.
- Upgrade langchain to support up to 0.1

## 0.2.27

- Support Cohere serverless endpoint
- Support multiple ACS lookups in the same process, eliminating field mapping conflicts
- Support pass-in credential in get_connection_by_name_v2 to unblock managed vNet setup
- Update validate_deployments in crack_chunk_embed_index_and_register.py

## 0.2.26

- Support for .csv and .json file extensions in pipeline
- Ignore mlflow.exceptions.RestException in safe_mlflow_log_metric
- validate_deployments supports openai v1.0+
- Removing unexpected keyword argument 'engine'
- Checking ACS account has enough index quota
- infer_deployment supports openai v1.0+
- Create missing fields for existing index

## 0.2.25

- Using local cached encodings.
- Adding convert_to_dict() for openai v1.0+
- Check index_config before passing in validate_deployments.py
- Limit size of documents upload to ACS in one batch to solve RequestEntityTooLargeError
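
Size-limited batching of the kind described in the last item can be sketched as follows. This illustrates the general technique rather than the package's implementation; `batch_by_size` and the byte limit used here are hypothetical.

```python
import json


def batch_by_size(documents, max_bytes):
    """Yield lists of documents whose combined JSON size stays within max_bytes.

    A single document larger than max_bytes is still yielded, alone in its batch.
    """
    batch, batch_bytes = [], 0
    for doc in documents:
        doc_bytes = len(json.dumps(doc).encode("utf-8"))
        # Flush the current batch if adding this document would exceed the limit.
        if batch and batch_bytes + doc_bytes > max_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += doc_bytes
    if batch:
        yield batch


docs = [{"id": str(i), "content": "x" * 40} for i in range(5)]
batches = list(batch_by_size(docs, max_bytes=150))
print([len(b) for b in batches])
```

Each batch can then be uploaded in its own request, so no single request body grows beyond the chosen limit.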

## 0.2.24.2

- Supporting `*.cognitiveservices.*` endpoint
- Adding azureml-rag specific user_agent when using DocumentIntelligence
- Refactored update index tasks
- Supporting uppercase file extensions name in crack_and_chunk
- Fixing Deployment importing bug in utils
- Adding the playgroundType tag in MLIndex Asset used for Azure AI studio playground
- Remove mandatory module-level imports of optional extra packages

## 0.2.24.1

- Fixing is_florence key detection
- Using 'embedding_connection_id' instead of 'florence_connection_id' as parameter name

## 0.2.24

- Introducing image ingestion with florence embedding API
- Adding dummy output to validate_deployments for holding the right order
- Fixing DeploymentNotFound bug

## 0.2.23.5

- Deprecate pkg_resources in logging.py (https://setuptools.pypa.io/en/latest/pkg_resources.html)

## 0.2.23.4

- Make the `api_type` parameter non-case sensitive in OpenAIEmbedder
- Bug fix in embeddings container path

## 0.2.23.3

- Set upper bound for `langchain` to 0.0.348

## 0.2.23.2

- Make tiktoken pull from a cache instead of making the outgoing network call to get encodings files
- Add support for Azure Cosmos Mongo vCore

## 0.2.23.1

- Fixing exception handling in validate_deployments to support OpenAI v1.0+

## 0.2.23

- Support OpenAI v1.0+
- Handle FAISS.load_local() change since Langchain 0.0.318
- Handle mailto links in url crawling component.
- Add support for Milvus vector store

## 0.2.22

- Update the pypdf version to 3.17.1 in document-parsing.

## 0.2.21

- Use workspace connection tags instead of metadata, since metadata is deprecated.
- Fix bug handling single files in `files_to_document_sources`

## 0.2.20

- Initial introduction of validate_deployments.
- Asset registration in `*_and_register` attempts to infer the target workspace from asset_uri and handles multiple auth options.
- activity_logger moved out as the first arg. This is an intermediate step: logger also shouldn't be the first arg and should instead be handled by get_logger, and activity_logger should be truly optional.
- validate_deployments itself was modified to make its interface closer to what existing tasks expect as input, and callable from other tasks as a function.

## 0.2.19

- Introduce a new `path` parameter in the `index` section of MLIndex documents over FAISS indices, to allow the path to FAISS index files to be different from the MLIndex document path.
- Ensure `MLIndex.base_uri` is never undefined for a valid MLIndex object.

## 0.2.18.1

- Only save out metadata before embedding in crack_and_chunk_and_embed_and_index
- Update create_embeddings to return num_embedded value.
  - This enables crack_and_chunk_and_embed to skip loading EmbeddedDocument partitions when no documents were embedded (all reused).

## 0.2.18

- Add new task to crack, chunk, embed, index to ACS, and register MLIndex in one step.
- Handle `openai.api_type` being `None`

## 0.2.17

- Fix loading MLIndex failure. Don't need to get the `endpoint` from connection when it is already provided.
- Try to use the `langchain` VectorStore and fall back to the vendored one
- Support `azure-search-documents==11.4.0b11`
- Add support for Pinecone in DataIndex

## 0.2.16

- Use Retry-After when aoai embedding endpoint throws RateLimitError
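
Honoring a server-suggested delay on rate limiting can be sketched as follows. This is an illustration of the pattern, not the package's code: `RateLimitError` here is a local stand-in class, and `call_with_retry` and its parameters are hypothetical.

```python
import time


class RateLimitError(Exception):
    """Local stand-in for a client's rate-limit exception (hypothetical)."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # Seconds suggested by the Retry-After header.


def call_with_retry(func, max_attempts=5, default_delay=1.0, sleep=time.sleep):
    """Call func, waiting for the server-suggested delay after each rate limit."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except RateLimitError as err:
            if attempt == max_attempts:
                raise
            if err.retry_after is not None:
                delay = err.retry_after  # Prefer the server's hint.
            else:
                delay = default_delay * 2 ** (attempt - 1)  # Exponential fallback.
            sleep(delay)


attempts = []


def flaky_embed():
    """Fails twice with a rate limit, then returns a placeholder embedding."""
    attempts.append(1)
    if len(attempts) < 3:
        raise RateLimitError(retry_after=0.01)
    return [0.1, 0.2]


print(call_with_retry(flaky_embed))  # [0.1, 0.2]
```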

## 0.2.15.1

- Fix vendored FAISS langchain VectorStore to only error when a doc is `None` (rather than when a Document isn't exactly the right class)

## 0.2.15

- Support PDF cracking with Azure Document Intelligence service
- `crack_and_chunk_and_embed` now pulls documents through to embedding (streaming) and embeds documents in parallel batches
- Update default field names.
- Fix long file name bug when writing to output during crack and chunk

## 0.2.14

- Fix git_clone to handle WorkspaceConnections, again.

## 0.2.13

- Fix git_clone to handle WorkspaceConnection objects and urls with usernames already in them.

## 0.2.12

- Only process `.jsonl` and `.csv` files when reading chunks for embedding.

## 0.2.11

- Check casing for model kind and api_type
- Ensure api_version not being set is supported and that the default makes sense.
- Add support for Pinecone indexes

## 0.2.10

- Fix QA generator and connections check for ApiType metadata

## 0.2.9

- QA data generation accepts connection as input

## 0.2.8

- Remove `allowed_special="all"` from tiktoken usage, as it encodes special tokens like `<|endoftext|>` as their special token rather than as plain text (which is the case when only `disallowed_special=()` is set on its own)
- Stop truncating texts to embed (to model ctx length) as new `azureml.rag.embeddings.OpenAIEmbedder` handles batching and splitting long texts pre-embed then averaging the results into a single final embedding.
- Loosen tiktoken version range from `~=0.3.0` to `<1`
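
The split-then-average approach mentioned above can be sketched as follows. This is not the `OpenAIEmbedder` implementation: `embed_long_text` and the toy `toy_embed` function are hypothetical, the splitting is by characters rather than tokens, and weighting by chunk length followed by rescaling to unit length is an assumption about how the average might be taken.

```python
import math


def embed_long_text(text, embed_chunk, max_len):
    """Embed text longer than max_len by averaging its chunk embeddings."""
    chunks = [text[i:i + max_len] for i in range(0, len(text), max_len)] or [""]
    vectors = [embed_chunk(chunk) for chunk in chunks]
    weights = [max(len(chunk), 1) for chunk in chunks]
    total = sum(weights)
    dim = len(vectors[0])
    # Length-weighted mean of the chunk vectors.
    mean = [sum(w * v[d] for w, v in zip(weights, vectors)) / total for d in range(dim)]
    # Rescale to unit length (a zero vector is returned unchanged).
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0
    return [x / norm for x in mean]


def toy_embed(chunk):
    """Toy 2-d 'embedding': counts of vowels and of all other characters."""
    vowels = sum(ch in "aeiou" for ch in chunk)
    return [float(vowels), float(len(chunk) - vowels)]


vector = embed_long_text("retrieval augmented generation", toy_embed, max_len=8)
print(len(vector), round(sum(x * x for x in vector), 6))
```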

## 0.2.7

- Don't try and use MLClient for connections if azure-ai-ml<1.10.0
- Handle Custom Connections which azure-ai-ml can't deserialize today.
- Allow passing faiss index engine to MLIndex local
- Pass chunks directly into write_chunks_to_jsonl

## 0.2.6

- Fix jsonl output mode of crack_and_chunk writing csv internally.

## 0.2.5

- Ensure EmbeddingsContainer.mount_and_load sets `create_destination=True` when mounting to create embeddings_cache location if it's not already created.
- Fix `safe_mlflow_start_run` to `yield None` when mlflow not available
- Handle custom `field_mappings` passed to `update_acs` task.

## 0.2.4

- Introduce `crack_and_chunk_and_embed` task, which tracks deletions and reused sources and documents to enable full sync with indexes, leveraging EmbeddingsContainer to store this information across Snapshots.
- Restore `workspace_connection_to_credential` function.
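
The idea of tracking reused and deleted sources between runs can be sketched as follows. This is an illustration only: `plan_sync` and `content_hash` are hypothetical, and comparing content hashes is an assumption; the package may detect changes differently.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a source's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(previous_hashes, current_sources):
    """Compare hashes recorded by the last run with the current sources.

    previous_hashes: {source_name: hash recorded by the previous run}
    current_sources: {source_name: current text}
    Returns the plan and the hashes to record for the next run.
    """
    current_hashes = {name: content_hash(text) for name, text in current_sources.items()}
    # Unchanged sources can reuse their existing embeddings.
    reused = sorted(n for n, h in current_hashes.items() if previous_hashes.get(n) == h)
    # New or modified sources need to be embedded again.
    to_embed = sorted(n for n in current_hashes if n not in reused)
    # Sources that disappeared should be removed from the index.
    deleted = sorted(n for n in previous_hashes if n not in current_hashes)
    return {"reuse": reused, "embed": to_embed, "delete": deleted}, current_hashes


previous = {
    "a.md": content_hash("alpha"),
    "b.md": content_hash("beta"),
    "old.md": content_hash("obsolete"),
}
current = {"a.md": "alpha", "b.md": "beta v2", "c.md": "gamma"}
plan, new_hashes = plan_sync(previous, current)
print(plan)
```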

## 0.2.3

- Fix git clone url format bug

## 0.2.2

- Fix all langchain splitters to use tiktoken in an airgap-friendly way.

## 0.2.1

- Introduce DataIndex interface for scheduling Vector Index Pipeline in AzureML and creating MLIndex Assets
- Vendor various langchain components to avoid breaking changes to MLIndex internal logic

## 0.1.24.2

- Fix all langchain splitters to use tiktoken in an airgap-friendly way.

## 0.1.24.1

- Fix subsplitter init bug in MarkdownHeaderSplitter
- Support getting langchain retriever for ACS based MLIndex with embeddings.kind: none.

## 0.1.24

- Don't mlflow log unless there's an active mlflow run.
- Support `langchain.vectorstores.azuresearch` after `langchain>=0.0.273` upgraded to `azure-search-documents==11.4.0b8`
- Use tiktoken encodings from package for other splitter types

## 0.1.23.2

- Handle `Path` objects passed into `MLIndex` init.

## 0.1.23.1

- Handle `<region>.api.cognitive` style aoai endpoints correctly

## 0.1.23

- Ensure tiktoken encodings are packaged in wheel

## 0.1.22

- Set environment variables to pull encodings files from directory with cache key to avoid tiktoken external network call
- Fix mlflow log error when there's no files input

## 0.1.21

- Fix top level imports in `update_acs` task failing without helpful reason when old `azure-search-documents` is installed.

## 0.1.20

- Fix Crack'n'Chunk race-condition where same named files would overwrite each other.

## 0.1.19

- Various bug fixes:
  - Handle some malformed git urls in `git_clone` task
  - Try a fallback when parsing csv with pandas fails
  - Allow chunking special tokens
  - Ensure logging with mlflow can't fail a task
- Update to support latest `azure-search-documents==11.4.0b8`

## 0.1.18

- Add FaissAndDocStore and FileBasedDocStore, which closely mirror langchain's FAISS and InMemoryDocStore without the langchain or pickle dependency. These are not used by default until PromptFlow support has been added.
- Pin `azure-search-documents==11.4.0b6` as there are breaking changes in `11.4.0b7` and `11.4.0b8`

## 0.1.17

- Update interactions with Azure Cognitive Search to use the latest azure-search-documents SDK

## 0.1.16

- Convert api_type from Workspace Connections to lower case to appease langchain's case-sensitive checking.

## 0.1.15

- Add support for custom loaders
- Added logging for `MLIndex.__init__` to understand usage of MLIndex

## 0.1.14

- Add Support for CustomKeys connections
- Add OpenAI support for QA Gen and Embeddings

## 0.1.13 (2023-07-12)

- Implement single node non-PRS embed task to enable clearer logs for users.

## 0.1.12 (2023-06-29)

- Fix casing check of ApiVersion, ApiType in infer_deployment util

## 0.1.11 (2023-06-28)

- Update casing check for workspace connection ApiVersion, ApiType
- int casting for temperature, max_tokens

## 0.1.10 (2023-06-26)

- Update data asset registering to have adjustable output_type
- Remove asset registering from generate_qa.py

## 0.1.9 (2023-06-22)

- Add `azureml.rag.data_generation` module.
- Fixed bug that would cause crack_and_chunk to fail for documents that contain non-utf-8 characters. Currently these characters will be ignored.
- Improved heading extraction from Markdown files. When `use_rcts=False`, Markdown files will be split on headings and each chunk will have the heading context up to the root as a prefix (e.g. `# Heading 1\n## Heading 2\n### Heading 3\n{content}`)
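
Splitting on headings while carrying the heading path as a prefix can be sketched as follows. This illustrates the behavior described above and is not the package's splitter: `split_markdown_with_heading_context` is a hypothetical function, and it does not treat `#` lines inside fenced code blocks specially.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_with_heading_context(text):
    """Split Markdown on headings; prefix each chunk with its heading path."""
    chunks = []
    path = []  # Stack of (level, heading line) from the root down.
    body = []  # Lines collected under the current heading.

    def flush():
        content = "\n".join(body).strip()
        if content:
            prefix = "\n".join(line for _, line in path)
            chunks.append(f"{prefix}\n{content}" if prefix else content)
        body.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before pushing the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, line.strip()))
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Guide\nintro\n## Setup\nsteps\n### Linux\napt\n## Usage\nrun"
for chunk in split_markdown_with_heading_context(doc):
    print(repr(chunk))
```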

## 0.1.8 (2023-06-21)

- Add deployment inferring util for use in azureml-insider notebooks.

## 0.1.7 (2023-06-08)

- Improved telemetry for tasks (used in RAG Pipeline Components)

## 0.1.6 (2023-05-31)

- Fail crack_and_chunk task when no files were processed (usually because of a malformed `input_glob`)
- Change `update_acs.py` to default `push_embeddings=True` instead of `False`.

## 0.1.5 (2023-05-19)

- Add api_base back to MLIndex embeddings config for back-compat (until all clients start getting it from Workspace Connection).
- Add telemetry for tasks used in pipeline components, not enabled by default for SDK usage.

## 0.1.4 (2023-05-17)

- Fix bug where enabling rcts option on split_documents used nltk splitter instead.

## 0.1.3 (2023-05-12)

- Support Workspace Connection based auth for Git, Azure OpenAI and Azure Cognitive Search usage.

## 0.1.2 (2023-05-05)

- Refactored document chunking to allow insertion of custom processing logic

## 0.0.1 (2023-04-25)

### Features Added

- Introduced package
- langchain Retriever for Azure Cognitive Search

            

Raw data

            {
    "_id": null,
    "home_page": "https://docs.microsoft.com/python/api/overview/azure/ml/?view=azure-ml-py",
    "name": "azureml-rag",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.8",
    "maintainer_email": null,
    "keywords": null,
    "author": "Microsoft Corporation",
    "author_email": null,
    "download_url": null,
    "platform": null,
    "description": "# AzureML Retrieval Augmented Generation Utilities\r\n\r\nThis package is in alpha stage at the moment, use at risk of breaking changes and unstable behavior.\r\n\r\nIt contains utilities for:\r\n\r\n- Processing text documents into chunks appropriate for use in LLM prompts, with metadata such is source url.\r\n- Embedding chunks with OpenAI or HuggingFace embeddings models, including the ability to update a set of embeddings over time.\r\n- Create MLIndex artifacts from embeddings, a yaml file capturing metadata needed to deserialize different kinds of Vector Indexes for use in langchain. Supported Index types:\r\n  - FAISS index (via langchain)\r\n  - Azure Cognitive Search index\r\n  - Pinecone index\r\n  - Milvus index\r\n  - Azure Cosmos Mongo vCore index\r\n\r\n## Getting started\r\n\r\nYou can install AzureMLs RAG package using pip.\r\n\r\n```bash\r\npip install azureml-rag\r\n```\r\n\r\nThere are various extra installs you probably want to include based on intended use:\r\n- `faiss`: When using FAISS based Vector Indexes\r\n- `cognitive_search`: When using Azure Cognitive Search Indexes\r\n- `pinecone`: When using Pinecone Indexes\r\n- `azure_cosmos_mongo_vcore`: When using Azure Cosmos Mongo vCore Indexes\r\n- `hugging_face`: When using Sentence Transformer embedding models from HuggingFace (local inference)\r\n- `document_parsing`: When cracking and chunking documents locally to put in an Index\r\n\r\n## MLIndex\r\n\r\nMLIndex files describe an index of data + embeddings and the embeddings model used in yaml.\r\n\r\nAzure Cognitive Search Index:\r\n\r\n```yaml\r\nembeddings:\r\n  dimension: 768\r\n  kind: hugging_face\r\n  model: sentence-transformers/all-mpnet-base-v2\r\n  schema_version: '2'\r\nindex:\r\n  api_version: 2021-04-30-Preview\r\n  connection:\r\n    id: /subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/connections/<acs_connection_name>\r\n  
connection_type: workspace_connection\r\n  endpoint: https://<acs_name>.search.windows.net\r\n  engine: azure-sdk\r\n  field_mapping:\r\n    content: content\r\n    filename: filepath\r\n    metadata: meta_json_string\r\n    title: title\r\n    url: url\r\n    embedding: contentVector\r\n  index: azureml-rag-test-206e03b6-3880-407b-9bc4-c0a1162d6c70\r\n  kind: acs\r\n```\r\n\r\nPinecone Index:\r\n\r\n```yaml\r\nembeddings:\r\n  dimension: 768\r\n  kind: hugging_face\r\n  model: sentence-transformers/all-mpnet-base-v2\r\n  schema_version: '2'\r\nindex:\r\n  connection:\r\n    id: /subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/connections/<pinecone_connection_name>\r\n  connection_type: workspace_connection\r\n  engine: pinecone-sdk\r\n  field_mapping:\r\n    content: content\r\n    filename: filepath\r\n    metadata: metadata_json_string\r\n    title: title\r\n    url: url\r\n  index: azureml-rag-test-206e03b6-3880-407b-9bc4-c0a1162d6c70\r\n  kind: pinecone\r\n```\r\n\r\nAzure Cosmos Mongo vCore Index:\r\n\r\n```yaml\r\nembeddings:\r\n  dimension: 768\r\n  kind: hugging_face\r\n  model: sentence-transformers/all-mpnet-base-v2\r\n  schema_version: '2'\r\nindex:\r\n  connection:\r\n    id: /subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/connections/<cosmos_connection_name>\r\n  connection_type: workspace_connection\r\n  engine: pymongo-sdk\r\n  field_mapping:\r\n    content: content\r\n    filename: filepath\r\n    metadata: metadata_json_string\r\n    title: title\r\n    url: url\r\n    embedding: contentVector\r\n  database: azureml-rag-test-db\r\n  collection: azureml-rag-test-collection\r\n  index: azureml-rag-test-206e03b6-3880-407b-9bc4-c0a1162d6c70\r\n  kind: azure_cosmos_mongo_vcore\r\n```\r\n\r\n### Create MLIndex\r\n\r\nExamples using MLIndex remotely with AzureML and locally with langchain 
live here: https://github.com/Azure/azureml-examples/tree/main/sdk/python/generative-ai/rag\r\n\r\n### Consume MLIndex\r\n\r\n```python\r\nfrom azureml.rag.mlindex import MLIndex\r\n\r\nretriever = MLIndex(uri_to_folder_with_mlindex).as_langchain_retriever()\r\nretriever.get_relevant_documents('What is an AzureML Compute Instance?')\r\n```\r\n\r\n\r\n# Changelog\r\n\r\nPlease insert change log into \"Next Release\" ONLY.\r\n\r\n## Next release\r\n\r\n## 0.2.31\r\n\r\n- Categorize user error and system error, and update RH accordingly to show in logs\r\n- Bugfix using obo credential for AAD connections.\r\n- Prevention fix to support AadCredentialConfig in Connection object\r\n- Update Pinecone legacy API\r\n- Creating image embedding index with azure-search-documents 11.4.0\r\n\r\n## 0.2.30.2\r\n\r\n- Bugfix remove azure_ad_token_provider from EmbeddingContainer metadata\r\n- set embeddings_model as optional argument\r\n\r\n## 0.2.30.1\r\n\r\n- Introduce `elasticsearch` extra to declare transitive dependency on the `elasticsearch` package when using Elasticsearch indices.\r\n\r\n## 0.2.30\r\n\r\n- Bugfix in models.py to handle empty deployment name.\r\n- Supporting existing elasticsearch indices\r\n- Bug fix in `crack_and_chunk_and_embed_and_index`\r\n- Fixing bug in using AAD auth type ACS connections.\r\n\r\n## 0.2.29.2\r\n\r\n- Fixing ACS index creation failure with azure-search-documents 11.4.0\r\n\r\n## 0.2.29.1\r\n\r\n- Fixing FAISS, dependable_faiss_import import failure with Langchain 0.1.x\r\n\r\n## 0.2.29\r\n\r\n- Support AAD and MSI auth type in AOAI, ACS connection\r\n\r\n## 0.2.28\r\n\r\n- Ensure compatibility with newer versions of azure-ai-ml.\r\n- Upgrade langchain to support up to 0.1\r\n\r\n## 0.2.27\r\n\r\n- Support Cohere serverless endpoint\r\n- Support multiple ACS lookups in the same process, eliminating field mapping conflicts\r\n- Support pass-in credential in get_connection_by_name_v2 to unblock managed vNet setup\r\n- Update 
validate_deployments in crack_chunk_embed_index_and_register.py\r\n\r\n## 0.2.26\r\n\r\n- Support for .csv and .json file extensions in pipeline\r\n- Ignore mlflow.exceptions.RestException in safe_mlflow_log_metric\r\n- validate_deployments supports openai v1.0+\r\n- Removing unexpected keyword argument 'engine'\r\n- Checking ACS account has enough index quota\r\n- infer_deployment supports openai v1.0+\r\n- Create missing fields for existing index\r\n\r\n## 0.2.25\r\n\r\n- Using local cached encodings.\r\n- Adding convert_to_dict() for openai v1.0+\r\n- Check index_config before passing in validate_deployments.py\r\n- Limit size of documents upload to ACS in one batch to solve RequestEntityTooLargeError\r\n\r\n## 0.2.24.2\r\n\r\n- Supporting `*.cognitiveservices.*` endpoint\r\n- Adding azureml-rag specific user_agent when using DocumentIntelligence\r\n- Refactored update index tasks\r\n- Supporting uppercase file extensions name in crack_and_chunk\r\n- Fixing Deployment importing bug in utils\r\n- Adding the playgroundType tag in MLIndex Asset used for Azure AI studio playground\r\n- Remove mandatory module-level imports of optional extra packages\r\n\r\n## 0.2.24.1\r\n\r\n- Fixing is_florence key detection\r\n- Using 'embedding_connection_id' instead of 'florence_connection_id' as parameter name\r\n\r\n## 0.2.24\r\n\r\n- Introducing image ingestion with florence embedding API\r\n- Adding dummy output to validate_deployments for holding the right order\r\n- Fixing DeploymentNotFound bug\r\n\r\n## 0.2.23.5\r\n\r\n- Deprecate pkg_resources in logging.py (https://setuptools.pypa.io/en/latest/pkg_resources.html)\r\n\r\n## 0.2.23.4\r\n\r\n- Make the `api_type` parameter non-case sensitive in OpenAIEmbedder\r\n- Bug fix in embeddings container path\r\n\r\n## 0.2.23.3\r\n\r\n- Set upper bound for `langchain` to 0.0.348\r\n\r\n## 0.2.23.2\r\n\r\n- Make tiktoken pull from a cache instead of making the outgoing network call to get encodings files\r\n- Add support for Azure 
Cosmos Mongo vCore\r\n\r\n## 0.2.23.1\r\n\r\n- Fixing exception handling in validate_deployments to support OpenAI v1.0+\r\n\r\n## 0.2.23\r\n\r\n- Support OpenAI v1.0 +\r\n- Handle FAISS.load_local() change since Langchain 0.0.318\r\n- Handle mailto links in url crawling component.\r\n- Add support for Milvus vector store\r\n\r\n## 0.2.22\r\n\r\n- update pypdf's version to 3.17.1 in document-parsing.\r\n\r\n## 0.2.21\r\n\r\n- Use workspace connection tags instead of metadata since it's deprecated.\r\n- Fix bug handling single files in `files_to_document_sources`\r\n\r\n## 0.2.20\r\n\r\n- Initial introduction of validate_deployments.\r\n- Asset registration in \\*\\_and_register attempts to infer target workspace from asset_uri and handle multiple auth options\r\n- activity_logger moved out as first arg, this is an intermediate step as logger also shouldn't be first arg and instead handled by get_logger, activity_logger should be truly optional.\r\n- validate_deployments itself was modified to make its interface closer to what existing tasks expect as input, and callable from other tasks as a function.\r\n\r\n## 0.2.19\r\n\r\n- Introduce a new `path` parameter in the `index` section of MLIndex documents over FAISS indices, to allow the path to FAISS index files to be different from the MLIndex document path.\r\n- Ensure `MLIndex.base_uri` is never undefined for a valid MLIndex object.\r\n\r\n## 0.2.18.1\r\n\r\n- Only save out metadata before embedding in crack_and_chunk_and_embed_and_index\r\n- Update create_embeddings to return num_embedded value.\r\n  - This enables crack_and_chunk_and_embed to skip loading EmbeddedDocument partitions when no documents were embedded (all reused).\r\n\r\n## 0.2.18\r\n\r\n- Add new task to crack, chunk, embed, index to ACS, and register MLIndex in one step.\r\n- Handle `openai.api_type` being `None`\r\n\r\n## 0.2.17\r\n\r\n- Fix loading MLIndex failure. 
Don't need to get the `endpoint` from connection when it is already provided.\r\n- Try use `langchain` VectorStore and fallback to vendor\r\n- Support `azure-search-documents==11.4.0b11``\r\n- Add support for Pinecone in DataIndex\r\n\r\n## 0.2.16\r\n\r\n- Use Retry-After when aoai embedding endpoint throws RateLimitError\r\n\r\n## 0.2.15.1\r\n\r\n- Fix vendored FAISS langchain VectorStore to only error when a doc is `None` (rather than when a Document isn't exactly the right class)\r\n\r\n## 0.2.15\r\n\r\n- Support PDF cracking with Azure Document Intelligence service\r\n- `crack_and_chunk_and_embed` now pulls documents through to embedding (streaming) and embeds documents in parallel batches\r\n- Update default field names.\r\n- Fix long file name bug when writing to output during crack and chunk\r\n\r\n## 0.2.14\r\n\r\n- Fix git_clone to handle WorkspaceConnections, again.\r\n\r\n## 0.2.13\r\n\r\n- Fix git_clone to handle WorkspaceConnection objects and urls with usernames already in them.\r\n\r\n## 0.2.12\r\n\r\n- Only process `.jsonl` and `.csv` files when reading chunks for embedding.\r\n\r\n## 0.2.11\r\n\r\n- Check casing for model kind and api_type\r\n- Ensure api_version not being set is supported and default make sense.\r\n- Add support for Pinecone indexes\r\n\r\n## 0.2.10\r\n\r\n- Fix QA generator and connections check for ApiType metadata\r\n\r\n## 0.2.9\r\n\r\n- QA data generation accepts connection as input\r\n\r\n## 0.2.8\r\n\r\n- Remove `allowed_special=\"all\"` from tiktoken usage as it encodes special tokens like `<|endoftext|>` as their special token rather then as plain text (which is the case when only `disallowed_special=()` is set on its own)\r\n- Stop truncating texts to embed (to model ctx length) as new `azureml.rag.embeddings.OpenAIEmbedder` handles batching and splitting long texts pre-embed then averaging the results into a single final embedding.\r\n- Loosen tiktoken version range from `~=0.3.0` to `<1`\r\n\r\n## 0.2.7\r\n\r\n- Don't 
try and use MLClient for connections if azure-ai-ml<1.10.0\r\n- Handle Custom Conenctions which azure-ai-ml can't deserialize today.\r\n- Allow passing faiss index engine to MLIndex local\r\n- Pass chunks directly into write_chunks_to_jsonl\r\n\r\n## 0.2.6\r\n\r\n- Fix jsonl output mode of crack_and_chunk writing csv internally.\r\n\r\n## 0.2.5\r\n\r\n- Ensure EmbeddingsContainer.mount_and_load sets `create_destination=True` when mounting to create embeddings_cache location if it's not already created.\r\n- Fix `safe_mlflow_start_run` to `yield None` when mlflow not available\r\n- Handle custom `field_mappings` passed to `update_acs` task.\r\n\r\n## 0.2.4\r\n\r\n- Introduce `crack_and_chunk_and_embed` task which tracks deletions and reused source + documents to enable full sync with indexes, levering EmbeddingsContainer for storage of this information across Snapshots.\r\n- Restore `workspace_connection_to_credential` function.\r\n\r\n## 0.2.3\r\n\r\n- Fix git clone url format bug\r\n\r\n## 0.2.2\r\n\r\n- Fix all langchain splitter to use tiktoken in an airgap friendly way.\r\n\r\n## 0.2.1\r\n\r\n- Introduce DataIndex interface for scheduling Vector Index Pipeline in AzureML and creating MLIndex Assets\r\n- Vendor various langchain components to avoid breaking changes to MLIndex internal logic\r\n\r\n## 0.1.24.2\r\n\r\n- Fix all langchain splitter to use tiktoken in an airgap friendly way.\r\n\r\n## 0.1.24.1\r\n\r\n- Fix subsplitter init bug in MarkdownHeaderSplitter\r\n- Support getting langchain retriever for ACS based MLIndex with embeddings.kind: none.\r\n\r\n## 0.1.24\r\n\r\n- Don't mlflow log unless there's an active mlflow run.\r\n- Support `langchain.vectorstores.azuresearch` after `langchain>=0.0.273` upgraded to `azure-search-documents==11.4.0b8`\r\n- Use tiktoken encodings from package for other splitter types\r\n\r\n## 0.1.23.2\r\n\r\n- Handle `Path` objects passed into `MLIndex` init.\r\n\r\n## 0.1.23.1\r\n\r\n- Handle <region>.api.cognitive style aoai 
endpoints correctly\r\n\r\n## 0.1.23\r\n\r\n- Ensure tiktoken encodings are packaged in wheel\r\n\r\n## 0.1.22\r\n\r\n- Set environment variables to pull encodings files from directory with cache key to avoid tiktoken external network call\r\n- Fix mlflow log error when there's no files input\r\n\r\n## 0.1.21\r\n\r\n- Fix top level imports in `update_acs` task failing without helpful reason when old `azure-search-documents` is installed.\r\n\r\n## 0.1.20\r\n\r\n- Fix Crack'n'Chunk race-condition where same named files would overwrite each other.\r\n\r\n## 0.1.19\r\n\r\n- Various bug fixes:\r\n  - Handle some malformed git urls in `git_clone` task\r\n  - Try fall back when parsing csv with pandas fails\r\n  - Allow chunking special tokens\r\n  - Ensure logging with mlflow can't fail a task\r\n- Update to support latest `azure-search-documents==11.4.0b8`\r\n\r\n## 0.1.18\r\n\r\n- Add FaissAndDocStore and FileBasedDocStore which closely mirror langchains' FAISS and InMemoryDocStore without the langchain or pickle dependency. 
These are not used by default until PromptFlow support has been added.\r\n- Pin `azure-search-documents==11.4.0b6` as there's breaking changes in `11.4.0b7` and `11.4.0b8`\r\n\r\n## 0.1.17\r\n\r\n- Update interactions with Azure Cognitive Search to use latest azure-search-documents SDK\r\n\r\n## 0.1.16\r\n\r\n- Convert api_type from Workspace Connections to lower case to appease langchain's case-sensitive checking.\r\n\r\n## 0.1.15\r\n\r\n- Add support for custom loaders\r\n- Added logging for `MLIndex.__init__` to understand usage of MLIndex\r\n\r\n## 0.1.14\r\n\r\n- Add Support for CustomKeys connections\r\n- Add OpenAI support for QA Gen and Embeddings\r\n\r\n## 0.1.13 (2023-07-12)\r\n\r\n- Implement single node non-PRS embed task to enable clearer logs for users.\r\n\r\n## 0.1.12 (2023-06-29)\r\n\r\n- Fix casing check of ApiVersion, ApiType in infer_deployment util\r\n\r\n## 0.1.11 (2023-06-28)\r\n\r\n- Update casing check for workspace connection ApiVersion, ApiType\r\n- int casting for temperature, max_tokens\r\n\r\n## 0.1.10 (2023-06-26)\r\n\r\n- Update data asset registering to have adjustable output_type\r\n- Remove asset registering from generate_qa.py\r\n\r\n## 0.1.9 (2023-06-22)\r\n\r\n- Add `azureml.rag.data_generation` module.\r\n- Fixed bug that would cause crack_and_chunk to fail for documents that contain non-utf-8 characters. Currently these characters will be ignored.\r\n- Improved heading extraction from Markdown files. When `use_rcts=False` Markdown files will be split on headings and each chunk will have the heading context up to the root as a prefix (e.g. 
`# Heading 1\\n## Heading 2\\n# Heading 3\\n{content}`)\r\n\r\n## 0.1.8 (2023-06-21)\r\n\r\n- Add deployment inferring util for use in azureml-insider notebooks.\r\n\r\n## 0.1.7 (2023-06-08)\r\n\r\n- Improved telemetry for tasks (used in RAG Pipeline Components)\r\n\r\n## 0.1.6 (2023-05-31)\r\n\r\n- Fail crack_and_chunk task when no files were processed (usually because of a malformed `input_glob`)\r\n- Change `update_acs.py` to default `push_embeddings=True` instead of `False`.\r\n\r\n## 0.1.5 (2023-05-19)\r\n\r\n- Add api_base back to MLIndex embeddings config for back-compat (until all clients start getting it from Workspace Connection).\r\n- Add telemetry for tasks used in pipeline components, not enabled by default for SDK usage.\r\n\r\n## 0.1.4 (2023-05-17)\r\n\r\n- Fix bug where enabling rcts option on split_documents used nltk splitter instead.\r\n\r\n## 0.1.3 (2023-05-12)\r\n\r\n- Support Workspace Connection based auth for Git, Azure OpenAI and Azure Cognitive Search usage.\r\n\r\n## 0.1.2 (2023-05-05)\r\n\r\n- Refactored document chunking to allow insertion of custom processing logic\r\n\r\n## 0.0.1 (2023-04-25)\r\n\r\n### Features Added\r\n\r\n- Introduced package\r\n- langchain Retriever for Azure Cognitive Search\r\n",
    "bugtrack_url": null,
    "license": "Proprietary https://aka.ms/azureml-preview-sdk-license",
    "summary": "Contains Retrieval Augmented Generation related utilities for Azure Machine Learning and OSS interoperability.",
    "version": "0.2.31",
    "project_urls": {
        "Homepage": "https://docs.microsoft.com/python/api/overview/azure/ml/?view=azure-ml-py"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "49fd82242c62d6d33b22b16494b165bad2d2e08a92c07e7b039ac02018ecbf77",
                "md5": "9cd65f4e9b2e3f67b5dca1a1df3d109a",
                "sha256": "0c55927d36e4195e93aa70d0792661b8ec3b8c9939dfd34aa2a35b2763f26bf1"
            },
            "downloads": -1,
            "filename": "azureml_rag-0.2.31-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "9cd65f4e9b2e3f67b5dca1a1df3d109a",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.8",
            "size": 1686514,
            "upload_time": "2024-05-07T05:39:25",
            "upload_time_iso_8601": "2024-05-07T05:39:25.086060Z",
            "url": "https://files.pythonhosted.org/packages/49/fd/82242c62d6d33b22b16494b165bad2d2e08a92c07e7b039ac02018ecbf77/azureml_rag-0.2.31-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-05-07 05:39:25",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "azureml-rag"
}
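
The `digests` entry in the metadata above can be used to check a downloaded wheel before installing it. The following is a minimal sketch of that check using only the Python standard library; the helper names `sha256_of_file` and `wheel_matches` and the local file path are assumptions made for illustration and are not part of `azureml-rag`.

```python
import hashlib
from pathlib import Path

# Digest copied from the "sha256" field of the release metadata above.
EXPECTED_SHA256 = "0c55927d36e4195e93aa70d0792661b8ec3b8c9939dfd34aa2a35b2763f26bf1"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size blocks until read() returns an empty bytes object.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def wheel_matches(path, expected=EXPECTED_SHA256):
    """Return True when the file at `path` hashes to the expected digest."""
    return sha256_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Hypothetical location: assumes the wheel was saved to the current directory.
    wheel = Path("azureml_rag-0.2.31-py3-none-any.whl")
    if wheel.exists():
        print("digest OK" if wheel_matches(wheel) else "digest MISMATCH")
    else:
        print(f"{wheel} not found; download it first")
```

Reading in chunks keeps memory use flat regardless of file size. pip can enforce the same check at install time through hash-checking mode with a requirements file.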
        