pyarrowfs-adlgen2

Name: pyarrowfs-adlgen2
Version: 0.2.5
Home page: https://github.com/kaaveland/pyarrowfs-adlgen2
Summary: Use pyarrow with Azure Data Lake gen2
Author: Robin Kåveland
Requires Python: >=3.6
License: MIT
Keywords: azure, datalake, filesystem, pyarrow, parquet
Upload time: 2024-06-27 12:51:59
pyarrowfs-adlgen2
==

[![Downloads](https://static.pepy.tech/badge/pyarrowfs-adlgen2)](https://pepy.tech/project/pyarrowfs-adlgen2)
[![Downloads](https://static.pepy.tech/badge/pyarrowfs-adlgen2/month)](https://pepy.tech/project/pyarrowfs-adlgen2)

pyarrowfs-adlgen2 is an implementation of a pyarrow filesystem for Azure Data Lake Gen2.

It allows you to use pyarrow and pandas to read parquet datasets directly from Azure without
the need to copy files to local storage first.

Compared with [adlfs](https://github.com/fsspec/adlfs/), you may see better performance when reading datasets
with many files, as pyarrowfs-adlgen2 uses the Data Lake gen2 SDK, which has fast directory listing, unlike
the blob SDK used by adlfs.

pyarrowfs-adlgen2 is stable software with a small API, and no major features are planned.

Installation
--

`pip install pyarrowfs-adlgen2`

Reading datasets
--

Example usage with a pandas DataFrame:

```python
import azure.identity
import pandas as pd
import pyarrow.fs
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    'YOUR_ACCOUNT_NAME', azure.identity.DefaultAzureCredential())
fs = pyarrow.fs.PyFileSystem(handler)
df = pd.read_parquet('container/dataset.parq', filesystem=fs)
```

Example usage with arrow tables:

```python
import azure.identity
import pyarrow.dataset
import pyarrow.fs
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    'YOUR_ACCOUNT_NAME', azure.identity.DefaultAzureCredential())
fs = pyarrow.fs.PyFileSystem(handler)
ds = pyarrow.dataset.dataset('container/dataset.parq', filesystem=fs)
table = ds.to_table()
```
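
Since the dataset API is lazy, you can also prune columns and filter rows before materializing the table. This is plain pyarrow rather than anything specific to pyarrowfs-adlgen2, and the column names below are made up for illustration:

```python
import pyarrow.dataset

# ds is the dataset opened in the previous example; column names are illustrative
table = ds.to_table(
    columns=['passengerCount', 'tripDistance'],
    filter=pyarrow.dataset.field('tripDistance') > 0,
)
```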

Configuring timeouts
--

Timeouts are passed to the azure-storage-file-datalake SDK methods and are given in seconds.

```python
import azure.identity
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    'YOUR_ACCOUNT_NAME',
    azure.identity.DefaultAzureCredential(),
    timeouts=pyarrowfs_adlgen2.Timeouts(file_system_timeout=10)
)
# or mutate it:
handler.timeouts.file_client_timeout = 20
```

Writing datasets
--

With pyarrow version 3 or greater, you can write datasets from arrow tables:

```python
import pyarrow as pa
import pyarrow.dataset
import pyarrow.fs

# table and handler are created as in the earlier examples
pyarrow.dataset.write_dataset(
    table,
    'name.pq',
    format='parquet',
    partitioning=pyarrow.dataset.partitioning(
        schema=pyarrow.schema([('year', pa.int32())]), flavor='hive'
    ),
    filesystem=pyarrow.fs.PyFileSystem(handler)
)
```

With earlier versions, files must be opened and written one at a time. As of pyarrow version 1.0.1, `pyarrow.parquet.ParquetWriter` does not support `pyarrow.fs.PyFileSystem`, but data can be written to open files:

```python
with fs.open_output_stream('container/out.parq') as out:
    df.to_parquet(out)
```

Or with arrow tables:

```python
import pyarrow.parquet

with fs.open_output_stream('container/out.parq') as out:
    pyarrow.parquet.write_table(table, out)
```

Accessing only a single container/file-system
--

If you do not want to, or cannot, access the whole storage account as a single filesystem, you can use `pyarrowfs_adlgen2.FilesystemHandler` to view a single file system within an account:

```python
import azure.identity
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.FilesystemHandler.from_account_name(
   "STORAGE_ACCOUNT", "FS_NAME", azure.identity.DefaultAzureCredential())
```

All access is done through the file system within the storage account.
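
Like the account handler, it can be wrapped in `pyarrow.fs.PyFileSystem`. A minimal sketch, assuming paths are then resolved relative to `FS_NAME` (the dataset name is made up):

```python
import pandas as pd
import pyarrow.fs

fs = pyarrow.fs.PyFileSystem(handler)
# No container prefix here, assuming paths are relative to FS_NAME
df = pd.read_parquet('dataset.parq', filesystem=fs)
```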

Set HTTP headers for files (pyarrow >= 5)
--

You can set headers for any output files by using the `metadata` argument to `handler.open_output_stream`:

```python
import pyarrowfs_adlgen2

fs = pyarrowfs_adlgen2.AccountHandler.from_account_name("theaccount").to_fs()
metadata = {"content_type": "application/json"}
with fs.open_output_stream("container/data.json", metadata=metadata) as out:
    out.write(b"{}")
```

Note that the spelling is different than you might expect! For a list of valid keys, see
[ContentSettings](https://docs.microsoft.com/en-us/python/api/azure-storage-file-datalake/azure.storage.filedatalake.contentsettings?view=azure-python).

You can do this with pyarrow >= 5 when using `pyarrow.fs.PyFileSystem`, and with any pyarrow version when using the handlers
from pyarrowfs_adlgen2 directly.
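
For older pyarrow versions, here is a minimal sketch of going through the handler directly rather than through `pyarrow.fs.PyFileSystem`. It assumes the handler's `open_output_stream` takes the same path and `metadata` arguments as above and that the returned stream can be used as a context manager:

```python
import azure.identity
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    'YOUR_ACCOUNT_NAME', azure.identity.DefaultAzureCredential())
metadata = {'content_type': 'application/json'}
# Use the handler directly instead of wrapping it in pyarrow.fs.PyFileSystem
with handler.open_output_stream('container/data.json', metadata=metadata) as out:
    out.write(b'{}')
```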


Running tests
--

To run the integration tests, you need:

- An Azure Storage Account V2 with hierarchical namespace enabled (a Data Lake gen2 account)
- To configure azure login (e.g. use `az login` or set up environment variables, see `azure.identity.DefaultAzureCredential`)
- To install pytest, e.g. `pip install pytest`

**NB! All data in the storage account is deleted during testing, USE AN EMPTY ACCOUNT**

```
AZUREARROWFS_TEST_ACT=thestorageaccount pytest
```

Performance
==

Here is an informal performance comparison with adlfs, done against a copy of the
[NYC taxi dataset](https://learn.microsoft.com/en-us/azure/open-datasets/dataset-taxi-yellow).

The test setup was as follows:

1. Create an Azure Data Lake Gen2 storage account with a container. I clicked through the portal to do this step. Grant
   yourself the Storage Blob Data Owner role on the account.
2. Upload the NYC taxi dataset to the container. You want to do this with `azcopy` or `az cli`, or it's going to take a
   long time. Here's the command I used; it only took a few seconds:
   `az storage copy -s https://azureopendatastorage.blob.core.windows.net/nyctlc/yellow --recursive -d https://benchpyarrowfs.blob.core.windows.net/taxi/`
3. Set up a venv for the test, and install the dependencies:
   `python -m venv venv && source venv/bin/activate && pip install pyarrowfs-adlgen2 pandas pyarrow adlfs azure-identity`
4. Make sure to log in with `az login` and set the correct subscription using `az account set -s playground-sub`

That's the entire test setup. Now we can run some commands against the dataset and time them. Let's see
how long it takes to read the `passengerCount` and `tripDistance` columns for one month of data (2014/10) using
`pyarrowfs-adlgen2` and the `pyarrow` dataset API.
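
The two timing scripts are small; a minimal sketch of what `adlg2_taxi.py` and `adlfs_taxi.py` could look like follows. The container and partition paths (`taxi/yellow/puYear=2014/puMonth=10`) are assumptions based on the setup above and the public dataset layout, and the adlfs version assumes its `credential` argument is given a `DefaultAzureCredential`.

```python
# adlg2_taxi.py (sketch)
import azure.identity
import pandas as pd
import pyarrow.fs
import pyarrowfs_adlgen2

handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    'benchpyarrowfs', azure.identity.DefaultAzureCredential())
fs = pyarrow.fs.PyFileSystem(handler)
# Read two columns for a single month of the hive-partitioned dataset
df = pd.read_parquet(
    'taxi/yellow/puYear=2014/puMonth=10',
    columns=['passengerCount', 'tripDistance'],
    filesystem=fs,
)
df.info()
```

```python
# adlfs_taxi.py (sketch)
import adlfs
import azure.identity
import pandas as pd

fs = adlfs.AzureBlobFileSystem(
    account_name='benchpyarrowfs',
    credential=azure.identity.DefaultAzureCredential(),
)
df = pd.read_parquet(
    'taxi/yellow/puYear=2014/puMonth=10',
    columns=['passengerCount', 'tripDistance'],
    filesystem=fs,
)
df.info()
```

Timing the `pyarrowfs-adlgen2` version: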

```shell 
$ time python adlg2_taxi.py 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14227692 entries, 0 to 14227691
Data columns (total 2 columns):
 #   Column          Dtype  
---  ------          -----  
 0   passengerCount  int32  
 1   tripDistance    float64
dtypes: float64(1), int32(1)
memory usage: 162.8 MB

real	0m11,000s
user	0m2,018s
sys	0m1,605s
```

Now let's do the same with `adlfs`:

```shell
$ time python adlfs_taxi.py 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14227692 entries, 0 to 14227691
Data columns (total 2 columns):
 #   Column          Dtype  
---  ------          -----  
 0   passengerCount  int32  
 1   tripDistance    float64
dtypes: float64(1), int32(1)
memory usage: 162.8 MB

real	0m31,985s
user	0m3,204s
sys	0m2,110s
```

The `pyarrowfs-adlgen2` implementation is about 3 times faster than `adlfs` for this dataset, and that's not due to
bandwidth or compute limitations. This reflects my own experience using both professionally as well. I believe the
difference here is primarily due to the fact that `adlfs` uses the blob storage SDK, which is slow at listing
directories, and that the NYC taxi dataset has a lot of files and structure. adlfs is forced to parse that
to recover the structure, whereas pyarrowfs-adlgen2 gets it for free from the Data Lake gen2 SDK.