# fsspec-utils

- Version: 0.1.10
- Upload time: 2025-08-21 00:06:09
- Requires Python: >=3.11
- Keywords: azure, cloud-storage, csv, data-io, filesystem, fsspec, gcs, json, parquet, s3

Enhanced utilities and extensions for fsspec filesystems with multi-format I/O support.

## Overview

`fsspec-utils` is a comprehensive toolkit that extends [fsspec](https://filesystem-spec.readthedocs.io/) with:

- **Multi-cloud storage configuration** - Easy setup for AWS S3, Google Cloud Storage, Azure Storage, GitHub, and GitLab
- **Enhanced caching** - Improved caching filesystem with monitoring and path preservation  
- **Extended I/O operations** - Read/write operations for JSON, CSV, Parquet with Polars/PyArrow integration
- **Utility functions** - Type conversion, parallel processing, and data transformation helpers

## Installation

```bash
# Basic installation
pip install fsspec-utils

# With all optional dependencies
# (quotes keep shells such as zsh from expanding the brackets)
pip install "fsspec-utils[full]"

# Specific cloud providers
pip install "fsspec-utils[aws]"     # AWS S3 support
pip install "fsspec-utils[gcp]"     # Google Cloud Storage
pip install "fsspec-utils[azure]"   # Azure Storage
```

## Quick Start

### Basic Filesystem Operations

```python
from fsspec_utils import filesystem

# Local filesystem
fs = filesystem("file")
files = fs.ls("/path/to/data")

# S3 with caching
fs = filesystem("s3://my-bucket/", cached=True)
data = fs.cat("data/file.txt")
```
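
The examples above pass either a bare protocol (`"file"`, `"s3"`) or a full URL (`"s3://my-bucket/"`) as the first argument. How `filesystem` parses that argument internally is not documented here; the following is only a minimal sketch of one way such a string could be split, and the helper name `split_protocol` is hypothetical.

```python
def split_protocol(protocol_or_path: str) -> tuple[str, str]:
    """Split a protocol-or-URL string into (protocol, root path).

    Hypothetical helper for illustration; not part of fsspec-utils.
    """
    if "://" in protocol_or_path:
        protocol, _, path = protocol_or_path.partition("://")
        # Drop the trailing slash so "my-bucket/" and "my-bucket" compare equal.
        return protocol, path.rstrip("/")
    # A bare protocol such as "file" or "s3" carries no root path.
    return protocol_or_path, ""


print(split_protocol("s3://my-bucket/"))  # ('s3', 'my-bucket')
print(split_protocol("file"))             # ('file', '')
```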

### Storage Configuration

```python
from fsspec_utils import filesystem
from fsspec_utils.storage import AwsStorageOptions

# Configure S3 access
options = AwsStorageOptions(
    region="us-west-2",
    access_key_id="YOUR_KEY",
    secret_access_key="YOUR_SECRET"
)

fs = filesystem("s3", storage_options=options, cached=True)
```

### Environment-based Configuration

```python
from fsspec_utils import filesystem
from fsspec_utils.storage import AwsStorageOptions

# Load from environment variables
options = AwsStorageOptions.from_env()
fs = filesystem("s3", storage_options=options)
```
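
Which environment variables `from_env()` reads is not listed above. As a rough sketch of the idea, the snippet below collects the standard AWS variable names (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_DEFAULT_REGION`, `AWS_ENDPOINT_URL`) into a plain dict. The helper `aws_options_from_env` and the field-to-variable mapping are assumptions for illustration, not the library's implementation.

```python
import os


def aws_options_from_env(env=None) -> dict:
    """Collect AWS settings from environment variables into a dict.

    Hypothetical helper; the variable names are the conventional AWS ones,
    assumed here rather than taken from fsspec-utils.
    """
    env = os.environ if env is None else env
    mapping = {
        "access_key_id": "AWS_ACCESS_KEY_ID",
        "secret_access_key": "AWS_SECRET_ACCESS_KEY",
        "region": "AWS_DEFAULT_REGION",
        "endpoint_url": "AWS_ENDPOINT_URL",
    }
    # Only include fields whose variable is actually set.
    return {field: env[var] for field, var in mapping.items() if var in env}


# Passing an explicit dict keeps the example deterministic.
example_env = {"AWS_ACCESS_KEY_ID": "YOUR_KEY", "AWS_DEFAULT_REGION": "us-west-2"}
print(aws_options_from_env(example_env))
```

Leaving unset variables out of the result, rather than filling in `None`, lets a downstream client fall back to its own defaults.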

### Multiple Cloud Providers

```python
from fsspec_utils import filesystem
from fsspec_utils.storage import (
    AwsStorageOptions, 
    GcsStorageOptions,
    GitHubStorageOptions
)

# AWS S3
s3_fs = filesystem("s3", storage_options=AwsStorageOptions.from_env())

# Google Cloud Storage  
gcs_fs = filesystem("gs", storage_options=GcsStorageOptions.from_env())

# GitHub repository
github_fs = filesystem("github", storage_options=GitHubStorageOptions(
    org="microsoft",
    repo="vscode", 
    token="ghp_xxxx"
))
```

## Storage Options

### AWS S3

```python
from fsspec_utils.storage import AwsStorageOptions

# Basic credentials
options = AwsStorageOptions(
    access_key_id="AKIAXXXXXXXX",
    secret_access_key="SECRET",
    region="us-east-1"
)

# From AWS profile
options = AwsStorageOptions.create(profile="dev")

# S3-compatible service (MinIO)
options = AwsStorageOptions(
    endpoint_url="http://localhost:9000",
    access_key_id="minioadmin",
    secret_access_key="minioadmin",
    allow_http=True
)
```

### Google Cloud Storage

```python
from fsspec_utils.storage import GcsStorageOptions

# Service account
options = GcsStorageOptions(
    token="path/to/service-account.json",
    project="my-project-123"
)

# From environment
options = GcsStorageOptions.from_env()
```

### Azure Storage

```python
from fsspec_utils.storage import AzureStorageOptions

# Account key
options = AzureStorageOptions(
    protocol="az",
    account_name="mystorageacct",
    account_key="key123..."
)

# Connection string
options = AzureStorageOptions(
    protocol="az",
    connection_string="DefaultEndpoints..."
)
```

### GitHub

```python
from fsspec_utils.storage import GitHubStorageOptions

# Public repository
options = GitHubStorageOptions(
    org="microsoft",
    repo="vscode",
    ref="main"
)

# Private repository
options = GitHubStorageOptions(
    org="myorg",
    repo="private-repo",
    token="ghp_xxxx",
    ref="develop"
)
```
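
The `org`, `repo`, and `ref` fields together identify one snapshot of a repository. To illustrate how those three values address a single file, the sketch below builds a URL in GitHub's raw-content scheme. Whether fsspec-utils resolves files this way is not stated above, and the `raw_url` helper is hypothetical.

```python
def raw_url(org: str, repo: str, ref: str, path: str) -> str:
    """Build a raw.githubusercontent.com URL for a file at a given ref.

    Hypothetical helper for illustration; not part of fsspec-utils.
    """
    return f"https://raw.githubusercontent.com/{org}/{repo}/{ref}/{path.lstrip('/')}"


print(raw_url("microsoft", "vscode", "main", "README.md"))
# https://raw.githubusercontent.com/microsoft/vscode/main/README.md
```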

### GitLab

```python
from fsspec_utils.storage import GitLabStorageOptions

# Public project
options = GitLabStorageOptions(
    project_name="group/project",
    ref="main"
)

# Private project with token
options = GitLabStorageOptions(
    project_id=12345,
    token="glpat_xxxx",
    ref="develop"
)
```

## Enhanced Caching

```python
from fsspec_utils import filesystem

# Enable caching with monitoring
fs = filesystem(
    "s3://my-bucket/",
    cached=True,
    cache_storage="/tmp/my_cache",
    verbose=True
)

# Cache preserves directory structure
data = fs.cat("deep/nested/path/file.txt")
# Cached at: /tmp/my_cache/deep/nested/path/file.txt
```
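
By default, fsspec's stock caching filesystems name cache files by a hash of the remote path, so the local copy's location says nothing about its origin. The path-preserving behaviour described above can be pictured as a simple join of the cache directory and the remote path. The sketch below is an assumption about the mapping, not the library's code, and `cached_path` is a hypothetical name.

```python
import posixpath


def cached_path(cache_storage: str, remote_path: str) -> str:
    """Map a remote path to a local cache path that keeps its directory layout.

    Hypothetical helper for illustration; not part of fsspec-utils.
    """
    # Strip any protocol prefix and leading slashes so the join stays inside the cache dir.
    relative = remote_path.split("://", 1)[-1].lstrip("/")
    return posixpath.join(cache_storage, relative)


print(cached_path("/tmp/my_cache", "deep/nested/path/file.txt"))
# /tmp/my_cache/deep/nested/path/file.txt
```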

## Utilities

### Parallel Processing

```python
from fsspec_utils.utils import run_parallel

# Run function in parallel
def process_file(path, multiplier=1):
    return len(path) * multiplier

results = run_parallel(
    process_file,
    ["/path1", "/path2", "/path3"],
    multiplier=2,
    n_jobs=4,
    verbose=True
)
```
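
In the call above, `multiplier=2` is applied to every item while the list supplies one `path` per call. As a rough standard-library analogue of that pattern (not the `run_parallel` implementation, which may use `joblib` according to the dependency list), the following sketch binds the fixed keyword with `functools.partial` and maps over the items with a thread pool. The name `map_parallel` is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def process_file(path, multiplier=1):
    return len(path) * multiplier


def map_parallel(func, items, n_jobs=4, **kwargs):
    """Apply func to each item in a thread pool, passing kwargs to every call.

    Hypothetical stand-in for illustration; results keep the input order.
    """
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(partial(func, **kwargs), items))


results = map_parallel(process_file, ["/path1", "/path2", "/path3"], multiplier=2)
print(results)  # [12, 12, 12]
```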

### Type Conversion

```python
from fsspec_utils.utils import dict_to_dataframe, to_pyarrow_table

# Convert dict to DataFrame
data = {"col1": [1, 2, 3], "col2": [4, 5, 6]}
df = dict_to_dataframe(data)

# Convert to PyArrow table
table = to_pyarrow_table(df)
```

### Logging

```python
from fsspec_utils.utils import setup_logging

# Configure logging
setup_logging(level="DEBUG", format_string="{time} | {level} | {message}")
```

## Dependencies

### Core Dependencies
- `fsspec>=2023.1.0` - Filesystem interface
- `msgspec>=0.18.0` - Serialization
- `pyyaml>=6.0` - YAML support
- `requests>=2.25.0` - HTTP requests
- `loguru>=0.7.0` - Logging

### Optional Dependencies
- `orjson>=3.8.0` - Fast JSON processing
- `polars>=0.19.0` - Fast DataFrames
- `pyarrow>=10.0.0` - Columnar data
- `pandas>=1.5.0` - Data analysis
- `joblib>=1.3.0` - Parallel processing
- `rich>=13.0.0` - Progress bars

### Cloud Provider Dependencies
- `boto3>=1.26.0`, `s3fs>=2023.1.0` - AWS S3
- `gcsfs>=2023.1.0` - Google Cloud Storage  
- `adlfs>=2023.1.0` - Azure Storage

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Relationship to FlowerPower

This package was extracted from the [FlowerPower](https://github.com/your-org/flowerpower) workflow framework to provide standalone filesystem utilities that can be used independently or as a dependency in other projects.
            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "fsspec-utils",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.11",
    "maintainer_email": null,
    "keywords": "azure, cloud-storage, csv, data-io, filesystem, fsspec, gcs, json, parquet, s3",
    "author": null,
    "author_email": "legout <ligno.blades@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/80/97/02b8a6aab01beb83fa6a73fca80a4306b6183b4232c89c86db5c518c21fe/fsspec_utils-0.1.10.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Enhanced utilities and extensions for fsspec filesystems with multi-format I/O support",
    "version": "0.1.10",
    "project_urls": {
        "Documentation": "https://legout.github.io/fsspec-utils",
        "Homepage": "https://github.com/legout/fsspec-utils",
        "Issues": "https://github.com/legout/fsspec-utils/issues",
        "Repository": "https://github.com/legout/fsspec-utils.git"
    },
    "split_keywords": [
        "azure",
        " cloud-storage",
        " csv",
        " data-io",
        " filesystem",
        " fsspec",
        " gcs",
        " json",
        " parquet",
        " s3"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "c547937e302a7cb95d3ba6bd5d3306c56b3b305f8e451a548d7ff95596ecfc7b",
                "md5": "689ef71ec5128846366286c3695693b6",
                "sha256": "88928233c28ef0170a4b45b918bc54d761254a7e998d6b2f45ff99b684bb12fc"
            },
            "downloads": -1,
            "filename": "fsspec_utils-0.1.10-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "689ef71ec5128846366286c3695693b6",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.11",
            "size": 56233,
            "upload_time": "2025-08-21T00:06:06",
            "upload_time_iso_8601": "2025-08-21T00:06:06.957599Z",
            "url": "https://files.pythonhosted.org/packages/c5/47/937e302a7cb95d3ba6bd5d3306c56b3b305f8e451a548d7ff95596ecfc7b/fsspec_utils-0.1.10-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "809702b8a6aab01beb83fa6a73fca80a4306b6183b4232c89c86db5c518c21fe",
                "md5": "31ce87f343d22eda56411657a9ca27d6",
                "sha256": "36bb1f5bd272f950631b6e4b98081e0908ac870f988d1c8f91e5c17a16d60b9b"
            },
            "downloads": -1,
            "filename": "fsspec_utils-0.1.10.tar.gz",
            "has_sig": false,
            "md5_digest": "31ce87f343d22eda56411657a9ca27d6",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.11",
            "size": 2486408,
            "upload_time": "2025-08-21T00:06:09",
            "upload_time_iso_8601": "2025-08-21T00:06:09.248042Z",
            "url": "https://files.pythonhosted.org/packages/80/97/02b8a6aab01beb83fa6a73fca80a4306b6183b4232c89c86db5c518c21fe/fsspec_utils-0.1.10.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-21 00:06:09",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "legout",
    "github_project": "fsspec-utils",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "fsspec-utils"
}
        