# fsspec-utils
Enhanced utilities and extensions for fsspec filesystems with multi-format I/O support.
## Overview
`fsspec-utils` is a comprehensive toolkit that extends [fsspec](https://filesystem-spec.readthedocs.io/) with:
- **Multi-cloud storage configuration** - Easy setup for AWS S3, Google Cloud Storage, Azure Storage, GitHub, and GitLab
- **Enhanced caching** - Improved caching filesystem with monitoring and path preservation
- **Extended I/O operations** - Read/write operations for JSON, CSV, Parquet with Polars/PyArrow integration (see the sketch after this list)
- **Utility functions** - Type conversion, parallel processing, and data transformation helpers
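A minimal sketch of what the multi-format I/O looks like in practice, reading through an fsspec file handle with Polars. The file paths are placeholders, and the package may expose its own convenience readers; the handle-based pattern below is simply the generic approach that any fsspec-compatible filesystem supports.

```python
import polars as pl
from fsspec_utils import filesystem

fs = filesystem("file")

# fs.open() returns a file-like object, which Polars readers accept directly.
with fs.open("data/table.parquet", "rb") as f:
    df_parquet = pl.read_parquet(f)

with fs.open("data/table.csv", "rb") as f:
    df_csv = pl.read_csv(f)

with fs.open("data/records.json", "rb") as f:
    df_json = pl.read_json(f)
```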
## Installation
```bash
# Basic installation
pip install fsspec-utils

# With all optional dependencies
pip install fsspec-utils[full]

# Specific cloud providers
pip install fsspec-utils[aws]    # AWS S3 support
pip install fsspec-utils[gcp]    # Google Cloud Storage
pip install fsspec-utils[azure]  # Azure Storage
```
## Quick Start
### Basic Filesystem Operations
```python
from fsspec_utils import filesystem
# Local filesystem
fs = filesystem("file")
files = fs.ls("/path/to/data")
# S3 with caching
fs = filesystem("s3://my-bucket/", cached=True)
data = fs.cat("data/file.txt")
```
### Storage Configuration
```python
from fsspec_utils.storage import AwsStorageOptions
# Configure S3 access
options = AwsStorageOptions(
    region="us-west-2",
    access_key_id="YOUR_KEY",
    secret_access_key="YOUR_SECRET"
)

fs = filesystem("s3", storage_options=options, cached=True)
```
### Environment-based Configuration
```python
from fsspec_utils.storage import AwsStorageOptions
# Load from environment variables
options = AwsStorageOptions.from_env()
fs = filesystem("s3", storage_options=options)
```
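For reference, a short sketch of pairing `from_env()` with explicitly set variables. The variable names below follow standard AWS conventions; whether `from_env()` reads exactly these names is an assumption, so check the package documentation for the definitive list.

```python
import os

from fsspec_utils import filesystem
from fsspec_utils.storage import AwsStorageOptions

# Assumed: from_env() honors the conventional AWS environment variables.
os.environ["AWS_ACCESS_KEY_ID"] = "YOUR_KEY"
os.environ["AWS_SECRET_ACCESS_KEY"] = "YOUR_SECRET"
os.environ["AWS_DEFAULT_REGION"] = "us-west-2"

options = AwsStorageOptions.from_env()
fs = filesystem("s3", storage_options=options)
```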
### Multiple Cloud Providers
```python
from fsspec_utils.storage import (
    AwsStorageOptions,
    GcsStorageOptions,
    GitHubStorageOptions
)

# AWS S3
s3_fs = filesystem("s3", storage_options=AwsStorageOptions.from_env())

# Google Cloud Storage
gcs_fs = filesystem("gs", storage_options=GcsStorageOptions.from_env())

# GitHub repository
github_fs = filesystem("github", storage_options=GitHubStorageOptions(
    org="microsoft",
    repo="vscode",
    token="ghp_xxxx"
))
```
## Storage Options
### AWS S3
```python
from fsspec_utils.storage import AwsStorageOptions
# Basic credentials
options = AwsStorageOptions(
    access_key_id="AKIAXXXXXXXX",
    secret_access_key="SECRET",
    region="us-east-1"
)

# From AWS profile
options = AwsStorageOptions.create(profile="dev")

# S3-compatible service (MinIO)
options = AwsStorageOptions(
    endpoint_url="http://localhost:9000",
    access_key_id="minioadmin",
    secret_access_key="minioadmin",
    allow_http=True
)
```
### Google Cloud Storage
```python
from fsspec_utils.storage import GcsStorageOptions
# Service account
options = GcsStorageOptions(
    token="path/to/service-account.json",
    project="my-project-123"
)

# From environment
options = GcsStorageOptions.from_env()
```
### Azure Storage
```python
from fsspec_utils.storage import AzureStorageOptions
# Account key
options = AzureStorageOptions(
    protocol="az",
    account_name="mystorageacct",
    account_key="key123..."
)

# Connection string
options = AzureStorageOptions(
    protocol="az",
    connection_string="DefaultEndpoints..."
)
```
### GitHub
```python
from fsspec_utils.storage import GitHubStorageOptions
# Public repository
options = GitHubStorageOptions(
    org="microsoft",
    repo="vscode",
    ref="main"
)

# Private repository
options = GitHubStorageOptions(
    org="myorg",
    repo="private-repo",
    token="ghp_xxxx",
    ref="develop"
)
```
### GitLab
```python
from fsspec_utils.storage import GitLabStorageOptions
# Public project
options = GitLabStorageOptions(
    project_name="group/project",
    ref="main"
)

# Private project with token
options = GitLabStorageOptions(
    project_id=12345,
    token="glpat_xxxx",
    ref="develop"
)
```
## Enhanced Caching
```python
from fsspec_utils import filesystem
# Enable caching with monitoring
fs = filesystem(
    "s3://my-bucket/",
    cached=True,
    cache_storage="/tmp/my_cache",
    verbose=True
)

# Cache preserves directory structure
data = fs.cat("deep/nested/path/file.txt")
# Cached at: /tmp/my_cache/deep/nested/path/file.txt
```
## Utilities
### Parallel Processing
```python
from fsspec_utils.utils import run_parallel
# Run function in parallel
def process_file(path, multiplier=1):
    return len(path) * multiplier

results = run_parallel(
    process_file,
    ["/path1", "/path2", "/path3"],
    multiplier=2,
    n_jobs=4,
    verbose=True
)
```
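Assuming `run_parallel` returns results in input order, this call would yield `[12, 12, 12]`: each path string is six characters long and is doubled by `multiplier=2`.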
### Type Conversion
```python
from fsspec_utils.utils import dict_to_dataframe, to_pyarrow_table
# Convert dict to DataFrame
data = {"col1": [1, 2, 3], "col2": [4, 5, 6]}
df = dict_to_dataframe(data)
# Convert to PyArrow table
table = to_pyarrow_table(df)
```
### Logging
```python
from fsspec_utils.utils import setup_logging
# Configure logging
setup_logging(level="DEBUG", format_string="{time} | {level} | {message}")
```
## Dependencies
### Core Dependencies
- `fsspec>=2023.1.0` - Filesystem interface
- `msgspec>=0.18.0` - Serialization
- `pyyaml>=6.0` - YAML support
- `requests>=2.25.0` - HTTP requests
- `loguru>=0.7.0` - Logging
### Optional Dependencies
- `orjson>=3.8.0` - Fast JSON processing
- `polars>=0.19.0` - Fast DataFrames
- `pyarrow>=10.0.0` - Columnar data
- `pandas>=1.5.0` - Data analysis
- `joblib>=1.3.0` - Parallel processing
- `rich>=13.0.0` - Progress bars
### Cloud Provider Dependencies
- `boto3>=1.26.0`, `s3fs>=2023.1.0` - AWS S3
- `gcsfs>=2023.1.0` - Google Cloud Storage
- `adlfs>=2023.1.0` - Azure Storage
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Relationship to FlowerPower
This package was extracted from the [FlowerPower](https://github.com/your-org/flowerpower) workflow framework to provide standalone filesystem utilities that can be used independently or as a dependency in other projects.