| Field | Value |
| --- | --- |
| Name | adlfs |
| Version | 2024.7.0 |
| home_page | None |
| Summary | Access Azure Datalake Gen1 with fsspec and dask |
| upload_time | 2024-07-22 12:10:33 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.8 |
| license | BSD |
| keywords | file-system, dask, azure |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
Filesystem interface to Azure-Datalake Gen1 and Gen2 Storage
------------------------------------------------------------
[![PyPI version shields.io](https://img.shields.io/pypi/v/adlfs.svg)](https://pypi.python.org/pypi/adlfs/)
[![Latest conda-forge version](https://img.shields.io/conda/vn/conda-forge/adlfs?logo=conda-forge)](https://anaconda.org/conda-forge/adlfs)
Quickstart
----------
This package can be installed using:
`pip install adlfs`
or
`conda install -c conda-forge adlfs`
The `adl://` and `abfs://` protocols are included in fsspec's `known_implementations` registry
for fsspec > 0.6.1; with older versions, users must explicitly register the adlfs protocols with fsspec.
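With an older fsspec release, a minimal registration sketch might look like the following (this assumes `fsspec.register_implementation` exists in your fsspec version; in very old releases you may need to edit `fsspec.registry.known_implementations` directly instead):

```python
import fsspec

# Map the adl:// and abfs:// protocols to the adlfs filesystem classes.
# clobber=True simply overwrites any existing registration.
fsspec.register_implementation("adl", "adlfs.AzureDatalakeFileSystem", clobber=True)
fsspec.register_implementation("abfs", "adlfs.AzureBlobFileSystem", clobber=True)
```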
To use the Gen1 filesystem:
```python
import dask.dataframe as dd

# TENANT_ID, CLIENT_ID and CLIENT_SECRET are Azure ServicePrincipal credentials;
# STORE_NAME and FOLDER are placeholders for your Datalake store and path.
storage_options = {'tenant_id': TENANT_ID, 'client_id': CLIENT_ID, 'client_secret': CLIENT_SECRET}

dd.read_csv('adl://{STORE_NAME}/{FOLDER}/*.csv', storage_options=storage_options)
```
To use the Gen2 filesystem you can use either the `abfs` or the `az` protocol:

```python
import dask.dataframe as dd

# ACCOUNT_NAME and ACCOUNT_KEY are storage account credentials;
# CONTAINER and FOLDER are placeholders for your container and path.
storage_options = {'account_name': ACCOUNT_NAME, 'account_key': ACCOUNT_KEY}

ddf = dd.read_csv('abfs://{CONTAINER}/{FOLDER}/*.csv', storage_options=storage_options)
ddf = dd.read_parquet('az://{CONTAINER}/folder.parquet', storage_options=storage_options)
```

Accepted protocol / URI formats include:

- `PROTOCOL://container/path-part/file`
- `PROTOCOL://container@account.dfs.core.windows.net/path-part/file`

Alternatively, if `AZURE_STORAGE_ACCOUNT_NAME` and an `AZURE_STORAGE_<CREDENTIAL>` are set as environment variables, `storage_options` will be read from those environment variables.
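Because adlfs plugs into fsspec, the same URLs work outside of Dask as well. As a rough sketch using `fsspec.open` directly (the account name, key, container and blob path below are placeholders):

```python
import fsspec

# Extra keyword arguments to fsspec.open are forwarded to the filesystem,
# so credentials can be passed the same way as through storage_options.
with fsspec.open(
    "abfs://mycontainer/folder/data.csv",
    mode="rt",
    account_name="myaccount",
    account_key="...",
) as f:
    header = f.readline()
```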
To read from a public storage blob you must specify the `account_name`.
For example, you can access [NYC Taxi & Limousine Commission](https://azure.microsoft.com/en-us/services/open-datasets/catalog/nyc-taxi-limousine-commission-green-taxi-trip-records/) as:
```python
storage_options = {'account_name': 'azureopendatastorage'}
ddf = dd.read_parquet('az://nyctlc/green/puYear=2019/puMonth=*/*.parquet', storage_options=storage_options)
```
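The same public data can be browsed with a bare fsspec filesystem instance. A small sketch (the paths mirror the example above):

```python
import fsspec

# With only account_name supplied, adlfs attempts anonymous (public) access.
fs = fsspec.filesystem("az", account_name="azureopendatastorage")

# List the parquet files for one month of the public NYC taxi dataset.
files = fs.glob("nyctlc/green/puYear=2019/puMonth=1/*.parquet")
print(files[:5])
```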
Details
-------
The package includes pythonic filesystem implementations for both
Azure Datalake Gen1 and Azure Datalake Gen2, which facilitate
interactions between both Azure Datalake implementations and Dask. This is done by leveraging the
[intake/filesystem_spec](https://github.com/intake/filesystem_spec/tree/master/fsspec) base class and Azure Python SDKs.

Operations against the Gen1 Datalake currently only work with an Azure ServicePrincipal
with suitable credentials to perform operations on the resources of choice.
Operations against the Gen2 Datalake are implemented by leveraging [Azure Blob Storage Python SDK](https://github.com/Azure/azure-sdk-for-python).
### Setting credentials
`storage_options` can be populated with a variety of keyword arguments depending on the filesystem. The most commonly used arguments are listed below (a short credentials sketch follows the argument reference):
- `connection_string`
- `account_name`
- `account_key`
- `sas_token`
- `tenant_id`, `client_id`, and `client_secret` are combined for an Azure ServicePrincipal, e.g. `storage_options={'account_name': ACCOUNT_NAME, 'tenant_id': TENANT_ID, 'client_id': CLIENT_ID, 'client_secret': CLIENT_SECRET}`
- `anon`: bool, optional.
  Whether to attempt anonymous access if no other credential is passed. By default (`None`), the
  `AZURE_STORAGE_ANON` environment variable is checked. False values (`false`, `0`, `f`) resolve to `False` and
  anonymous access is not attempted; otherwise `anon` resolves to `True`.
- `location_mode`: valid values are "primary" or "secondary" and apply to RA-GRS accounts.
For more argument details see all arguments for [`AzureBlobFileSystem` here](https://github.com/fsspec/adlfs/blob/f15c37a43afd87a04f01b61cd90294dd57181e1d/adlfs/spec.py#L328) and [`AzureDatalakeFileSystem` here](https://github.com/fsspec/adlfs/blob/f15c37a43afd87a04f01b61cd90294dd57181e1d/adlfs/spec.py#L69).
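A minimal sketch of the common credential combinations, instantiating `AzureBlobFileSystem` directly (all account names, keys, tokens and secrets below are placeholders):

```python
from adlfs import AzureBlobFileSystem

# Account key
fs = AzureBlobFileSystem(account_name="myaccount", account_key="...")

# Connection string
fs = AzureBlobFileSystem(connection_string="DefaultEndpointsProtocol=https;...")

# SAS token
fs = AzureBlobFileSystem(account_name="myaccount", sas_token="?sv=...")

# Azure ServicePrincipal
fs = AzureBlobFileSystem(
    account_name="myaccount",
    tenant_id="...",
    client_id="...",
    client_secret="...",
)
```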
The following environment variables can also be set and picked up for authentication; a sketch of environment-based configuration follows the list:
- "AZURE_STORAGE_CONNECTION_STRING"
- "AZURE_STORAGE_ACCOUNT_NAME"
- "AZURE_STORAGE_ACCOUNT_KEY"
- "AZURE_STORAGE_SAS_TOKEN"
- "AZURE_STORAGE_TENANT_ID"
- "AZURE_STORAGE_CLIENT_ID"
- "AZURE_STORAGE_CLIENT_SECRET"
The filesystem can be instantiated for different use cases based on a variety of `storage_options` combinations. The following list describes some common use cases utilizing `AzureBlobFileSystem`, i.e. protocols `abfs` or `az`. Note that all cases require the `account_name` argument to be provided (a sketch of case 2 follows the list):
1. Anonymous connection to public container: `storage_options={'account_name': ACCOUNT_NAME, 'anon': True}` will assume the `ACCOUNT_NAME` points to a public container, and attempt to use an anonymous login. Note that the default value for `anon` is True.
2. Auto credential solving using Azure's DefaultAzureCredential() library: `storage_options={'account_name': ACCOUNT_NAME, 'anon': False}` will use [`DefaultAzureCredential`](https://learn.microsoft.com/en-us/python/api/azure-identity/azure.identity.defaultazurecredential?view=azure-python) to get valid credentials to the container `ACCOUNT_NAME`. `DefaultAzureCredential` attempts to authenticate via the [mechanisms and order visualized here](https://learn.microsoft.com/en-us/python/api/overview/azure/identity-readme?view=azure-python#defaultazurecredential).
3. Auto credential solving without requiring `storage_options`: Set `AZURE_STORAGE_ANON` to `false`, resulting in automatic credential resolution. Useful for compatibility with fsspec.
4. Azure ServicePrincipal: `tenant_id`, `client_id`, and `client_secret` are all used as credentials for an Azure ServicePrincipal: e.g. `storage_options={'account_name': ACCOUNT_NAME, 'tenant_id': TENANT_ID, 'client_id': CLIENT_ID, 'client_secret': CLIENT_SECRET}`.
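For example, a sketch of case 2, relying on `DefaultAzureCredential` (account and container names are placeholders; this assumes some credential source such as `az login`, environment variables or a managed identity is available):

```python
import dask.dataframe as dd

# anon=False triggers credential resolution through DefaultAzureCredential.
storage_options = {'account_name': 'myaccount', 'anon': False}

ddf = dd.read_parquet('abfs://mycontainer/data/*.parquet', storage_options=storage_options)
```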
### Append Blob
The `AzureBlobFileSystem` accepts [all of the Async BlobServiceClient arguments](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python).
By default, write operations create BlockBlobs in Azure, which, once written, cannot be appended to. It is possible to create an AppendBlob by using `mode="ab"` when creating and operating on blobs. Currently, AppendBlobs are not available if hierarchical namespaces are enabled.
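A small sketch of appending to a blob (the container and blob names are placeholders; this assumes a storage account without hierarchical namespaces enabled):

```python
from adlfs import AzureBlobFileSystem

fs = AzureBlobFileSystem(account_name="myaccount", account_key="...")

# mode="ab" creates (or reopens) an AppendBlob rather than a BlockBlob.
with fs.open("mycontainer/logs/events.log", mode="ab") as f:
    f.write(b"new log line\n")
```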
Raw data
--------

```json
{
"_id": null,
"home_page": null,
"name": "adlfs",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "Greg Hayes <hayesgb@gmail.com>",
"keywords": "file-system, dask, azure",
"author": null,
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/b4/1e/6d5146676044247af566fa5843b335b1a647e6446070cec9c8b61c31b369/adlfs-2024.7.0.tar.gz",
"platform": null,
"description": "Filesystem interface to Azure-Datalake Gen1 and Gen2 Storage \n------------------------------------------------------------\n\n\n[![PyPI version shields.io](https://img.shields.io/pypi/v/adlfs.svg)](https://pypi.python.org/pypi/adlfs/)\n[![Latest conda-forge version](https://img.shields.io/conda/vn/conda-forge/adlfs?logo=conda-forge)](https://anaconda.org/conda-forge/aldfs)\n\nQuickstart\n----------\n\nThis package can be installed using:\n\n`pip install adlfs`\n\nor\n\n`conda install -c conda-forge adlfs`\n\nThe `adl://` and `abfs://` protocols are included in fsspec's known_implementations registry \nin fsspec > 0.6.1, otherwise users must explicitly inform fsspec about the supported adlfs protocols.\n\nTo use the Gen1 filesystem:\n\n```python\nimport dask.dataframe as dd\n\nstorage_options={'tenant_id': TENANT_ID, 'client_id': CLIENT_ID, 'client_secret': CLIENT_SECRET}\n\ndd.read_csv('adl://{STORE_NAME}/{FOLDER}/*.csv', storage_options=storage_options)\n```\n\nTo use the Gen2 filesystem you can use the protocol `abfs` or `az`:\n\n```python\nimport dask.dataframe as dd\n\nstorage_options={'account_name': ACCOUNT_NAME, 'account_key': ACCOUNT_KEY}\n\nddf = dd.read_csv('abfs://{CONTAINER}/{FOLDER}/*.csv', storage_options=storage_options)\nddf = dd.read_parquet('az://{CONTAINER}/folder.parquet', storage_options=storage_options)\n\nAccepted protocol / uri formats include:\n'PROTOCOL://container/path-part/file'\n'PROTOCOL://container@account.dfs.core.windows.net/path-part/file'\n\nor optionally, if AZURE_STORAGE_ACCOUNT_NAME and an AZURE_STORAGE_<CREDENTIAL> is \nset as an environmental variable, then storage_options will be read from the environmental\nvariables\n```\n\nTo read from a public storage blob you are required to specify the `'account_name'`.\nFor example, you can access [NYC Taxi & Limousine Commission](https://azure.microsoft.com/en-us/services/open-datasets/catalog/nyc-taxi-limousine-commission-green-taxi-trip-records/) as:\n\n```python\nstorage_options = {'account_name': 'azureopendatastorage'}\nddf = dd.read_parquet('az://nyctlc/green/puYear=2019/puMonth=*/*.parquet', storage_options=storage_options)\n```\n\nDetails\n-------\nThe package includes pythonic filesystem implementations for both \nAzure Datalake Gen1 and Azure Datalake Gen2, that facilitate \ninteractions between both Azure Datalake implementations and Dask. This is done leveraging the \n[intake/filesystem_spec](https://github.com/intake/filesystem_spec/tree/master/fsspec) base class and Azure Python SDKs.\n\nOperations against both Gen1 Datalake currently only work with an Azure ServicePrincipal\nwith suitable credentials to perform operations on the resources of choice.\n\nOperations against the Gen2 Datalake are implemented by leveraging [Azure Blob Storage Python SDK](https://github.com/Azure/azure-sdk-for-python).\n\n### Setting credentials\nThe `storage_options` can be instantiated with a variety of keyword arguments depending on the filesystem. The most commonly used arguments are:\n- `connection_string`\n- `account_name`\n- `account_key`\n- `sas_token`\n- `tenant_id`, `client_id`, and `client_secret` are combined for an Azure ServicePrincipal e.g. `storage_options={'account_name': ACCOUNT_NAME, 'tenant_id': TENANT_ID, 'client_id': CLIENT_ID, 'client_secret': CLIENT_SECRET}`\n- `anon`: boo, optional.\n The value to use for whether to attempt anonymous access if no other credential is passed. By default (`None`), the\n `AZURE_STORAGE_ANON` environment variable is checked. 
False values (`false`, `0`, `f`) will resolve to `False` and\n anonymous access will not be attempted. Otherwise the value for `anon` resolves to True.\n- `location_mode`: valid values are \"primary\" or \"secondary\" and apply to RA-GRS accounts\n\nFor more argument details see all arguments for [`AzureBlobFileSystem` here](https://github.com/fsspec/adlfs/blob/f15c37a43afd87a04f01b61cd90294dd57181e1d/adlfs/spec.py#L328) and [`AzureDatalakeFileSystem` here](https://github.com/fsspec/adlfs/blob/f15c37a43afd87a04f01b61cd90294dd57181e1d/adlfs/spec.py#L69).\n\nThe following environmental variables can also be set and picked up for authentication:\n- \"AZURE_STORAGE_CONNECTION_STRING\"\n- \"AZURE_STORAGE_ACCOUNT_NAME\"\n- \"AZURE_STORAGE_ACCOUNT_KEY\"\n- \"AZURE_STORAGE_SAS_TOKEN\"\n- \"AZURE_STORAGE_TENANT_ID\"\n- \"AZURE_STORAGE_CLIENT_ID\"\n- \"AZURE_STORAGE_CLIENT_SECRET\"\n\nThe filesystem can be instantiated for different use cases based on a variety of `storage_options` combinations. The following list describes some common use cases utilizing `AzureBlobFileSystem`, i.e. protocols `abfs`or `az`. Note that all cases require the `account_name` argument to be provided:\n1. Anonymous connection to public container: `storage_options={'account_name': ACCOUNT_NAME, 'anon': True}` will assume the `ACCOUNT_NAME` points to a public container, and attempt to use an anonymous login. Note, the default value for `anon` is True.\n2. Auto credential solving using Azure's DefaultAzureCredential() library: `storage_options={'account_name': ACCOUNT_NAME, 'anon': False}` will use [`DefaultAzureCredential`](https://learn.microsoft.com/en-us/python/api/azure-identity/azure.identity.defaultazurecredential?view=azure-python) to get valid credentials to the container `ACCOUNT_NAME`. `DefaultAzureCredential` attempts to authenticate via the [mechanisms and order visualized here](https://learn.microsoft.com/en-us/python/api/overview/azure/identity-readme?view=azure-python#defaultazurecredential).\n3. Auto credential solving without requiring `storage_options`: Set `AZURE_STORAGE_ANON` to `false`, resulting in automatic credential resolution. Useful for compatibility with fsspec.\n4. Azure ServicePrincipal: `tenant_id`, `client_id`, and `client_secret` are all used as credentials for an Azure ServicePrincipal: e.g. `storage_options={'account_name': ACCOUNT_NAME, 'tenant_id': TENANT_ID, 'client_id': CLIENT_ID, 'client_secret': CLIENT_SECRET}`.\n\n### Append Blob\nThe `AzureBlobFileSystem` accepts [all of the Async BlobServiceClient arguments](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python).\n\nBy default, write operations create BlockBlobs in Azure, which, once written can not be appended. It is possible to create an AppendBlob using `mode=\"ab\"` when creating and operating on blobs. Currently, AppendBlobs are not available if hierarchical namespaces are enabled.\n",
"bugtrack_url": null,
"license": "BSD",
"summary": "Access Azure Datalake Gen1 with fsspec and dask",
"version": "2024.7.0",
"project_urls": null,
"split_keywords": [
"file-system",
" dask",
" azure"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "6f51a71c457bd0bc8af3e522b6999ff300852c7c446e384fd9904b0794f875df",
"md5": "3939b7b51567e18a1752ff5e7c4ab1e9",
"sha256": "2005c8e124fda3948f2a6abb2dbebb2c936d2d821acaca6afd61932edfa9bc07"
},
"downloads": -1,
"filename": "adlfs-2024.7.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "3939b7b51567e18a1752ff5e7c4ab1e9",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 41349,
"upload_time": "2024-07-22T12:10:32",
"upload_time_iso_8601": "2024-07-22T12:10:32.226150Z",
"url": "https://files.pythonhosted.org/packages/6f/51/a71c457bd0bc8af3e522b6999ff300852c7c446e384fd9904b0794f875df/adlfs-2024.7.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "b41e6d5146676044247af566fa5843b335b1a647e6446070cec9c8b61c31b369",
"md5": "6b2f6d94b8666ee3e62866da8a048033",
"sha256": "106995b91f0eb5e775bcd5957d180d9a14faef3271a063b1f65c66fd5ab05ddf"
},
"downloads": -1,
"filename": "adlfs-2024.7.0.tar.gz",
"has_sig": false,
"md5_digest": "6b2f6d94b8666ee3e62866da8a048033",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 48588,
"upload_time": "2024-07-22T12:10:33",
"upload_time_iso_8601": "2024-07-22T12:10:33.849861Z",
"url": "https://files.pythonhosted.org/packages/b4/1e/6d5146676044247af566fa5843b335b1a647e6446070cec9c8b61c31b369/adlfs-2024.7.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-07-22 12:10:33",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "adlfs"
}
```