| Field | Value |
| --- | --- |
| Name | dask-cudf-cu12 |
| Version | 24.12.0 |
| Summary | Utilities for Dask and cuDF interactions |
| home_page | None |
| upload_time | 2024-12-12 18:20:40 |
| maintainer | None |
| docs_url | None |
| author | NVIDIA Corporation |
| requires_python | >=3.10 |
| license | Apache 2.0 |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# <div align="left"><img src="../../img/rapids_logo.png" width="90px"/> Dask cuDF - A GPU Backend for Dask DataFrame</div>
Dask cuDF (a.k.a. dask-cudf or `dask_cudf`) is an extension library for [Dask DataFrame](https://docs.dask.org/en/stable/dataframe.html) that provides a Pandas-like API for parallel and larger-than-memory DataFrame computing on GPUs. When installed, Dask cuDF is automatically registered as the `"cudf"` [dataframe backend](https://docs.dask.org/en/stable/how-to/selecting-the-collection-backend.html) for Dask DataFrame.
> [!IMPORTANT]
> Dask cuDF does not provide support for multi-GPU or multi-node execution on its own. You must also deploy a distributed cluster (ideally with [Dask-CUDA](https://docs.rapids.ai/api/dask-cuda/stable/)) to leverage multiple GPUs efficiently.
## Using Dask cuDF
Please visit [the official documentation page](https://docs.rapids.ai/api/dask-cudf/stable/) for detailed information about using Dask cuDF.
## Installation
See the [RAPIDS install page](https://docs.rapids.ai/install) for the most up-to-date information and commands for installing Dask cuDF and other RAPIDS packages.
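As a concrete example (hedged; the `-cu12` suffix and extra index URL follow the pattern RAPIDS documents for CUDA 12 wheels, but the install page above is authoritative for current commands):

```shell
# CUDA 12 builds are published under the -cu12 package suffix; NVIDIA's
# extra index hosts companion RAPIDS wheels that pip may also need.
pip install dask-cudf-cu12 --extra-index-url=https://pypi.nvidia.com
```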
## Resources
- [Dask cuDF documentation](https://docs.rapids.ai/api/dask-cudf/stable/)
- [Best practices](https://docs.rapids.ai/api/dask-cudf/stable/best_practices/)
- [cuDF documentation](https://docs.rapids.ai/api/cudf/stable/)
- [10 Minutes to cuDF and Dask cuDF](https://docs.rapids.ai/api/cudf/stable/user_guide/10min/)
- [Dask-CUDA documentation](https://docs.rapids.ai/api/dask-cuda/stable/)
- [Deployment](https://docs.rapids.ai/deployment/stable/)
- [RAPIDS Community](https://rapids.ai/learn-more/#get-involved): Get help, contribute, and collaborate.
### Quick-start example
A very common Dask cuDF use case is single-node multi-GPU data processing. These workflows typically use the following pattern:
```python
import dask
import dask.dataframe as dd
from dask_cuda import LocalCUDACluster
from distributed import Client

if __name__ == "__main__":

    # Define a GPU-aware cluster to leverage multiple GPUs
    client = Client(
        LocalCUDACluster(
            CUDA_VISIBLE_DEVICES="0,1",  # Use two workers (on devices 0 and 1)
            rmm_pool_size=0.9,  # Use 90% of GPU memory as a pool for faster allocations
            enable_cudf_spill=True,  # Improve device memory stability
            local_directory="/fast/scratch/",  # Use fast local storage for spilling
        )
    )

    # Set the default dataframe backend to "cudf"
    dask.config.set({"dataframe.backend": "cudf"})

    # Create your DataFrame collection from on-disk
    # or in-memory data
    df = dd.read_parquet("/my/parquet/dataset/")

    # Use cudf-like syntax to transform and/or query your data
    query = df.groupby('item')['price'].mean()

    # Compute, persist, or write out the result
    query.head()
```
If you do not have multiple GPUs available, using `LocalCUDACluster` is optional. However, it is still a good idea to [enable cuDF spilling](https://docs.rapids.ai/api/cudf/stable/developer_guide/library_design/#spilling-to-host-memory).
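For example, cuDF reads the `CUDF_SPILL` environment variable at import time (per the spilling docs; the in-Python equivalent is `cudf.set_option("spill", True)`), so spilling can be enabled before the library is ever imported:

```python
import os

# Must be set before cudf is first imported; cudf checks CUDF_SPILL during
# import to decide whether device buffers may spill to host memory.
os.environ["CUDF_SPILL"] = "on"

# ...then import cudf / dask_cudf as usual, with spilling active.
```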
If you wish to scale across multiple nodes, you will need to use a different mechanism to deploy your Dask-CUDA workers. Please see [the RAPIDS deployment documentation](https://docs.rapids.ai/deployment/stable/) for more instructions.
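For illustration only (the hostname is a placeholder and the exact CLI entry points can vary by release; the deployment docs above are authoritative), a manual multi-node launch typically pairs one scheduler process with a Dask-CUDA worker command on every GPU node:

```shell
# On the scheduler node:
dask scheduler --port 8786

# On each GPU node (Dask-CUDA starts one worker per visible GPU):
dask-cuda-worker tcp://scheduler-host:8786
```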
Raw data

```json
{
  "_id": null,
  "home_page": null,
  "name": "dask-cudf-cu12",
  "maintainer": null,
  "docs_url": null,
  "requires_python": ">=3.10",
  "maintainer_email": null,
  "keywords": null,
  "author": "NVIDIA Corporation",
  "author_email": null,
  "download_url": null,
  "platform": null,
  "bugtrack_url": null,
  "license": "Apache 2.0",
  "summary": "Utilities for Dask and cuDF interactions",
  "version": "24.12.0",
  "project_urls": {
    "Homepage": "https://github.com/rapidsai/cudf"
  },
  "split_keywords": [],
  "urls": [
    {
      "comment_text": "",
      "digests": {
        "blake2b_256": "1607b593d1830f580d4dfce628688a003715e033e72d91eddc39c2cb24a87127",
        "md5": "49aeff2ce1810ca74181bd8b94ac7dad",
        "sha256": "39adc2cf8f79bd6aba9b92d2b4e7775017603932b26a2e177be18367c16588a0"
      },
      "downloads": -1,
      "filename": "dask_cudf_cu12-24.12.0-py3-none-any.whl",
      "has_sig": false,
      "md5_digest": "49aeff2ce1810ca74181bd8b94ac7dad",
      "packagetype": "bdist_wheel",
      "python_version": "py3",
      "requires_python": ">=3.10",
      "size": 67174,
      "upload_time": "2024-12-12T18:20:40",
      "upload_time_iso_8601": "2024-12-12T18:20:40.559912Z",
      "url": "https://files.pythonhosted.org/packages/16/07/b593d1830f580d4dfce628688a003715e033e72d91eddc39c2cb24a87127/dask_cudf_cu12-24.12.0-py3-none-any.whl",
      "yanked": false,
      "yanked_reason": null
    }
  ],
  "upload_time": "2024-12-12 18:20:40",
  "github": true,
  "gitlab": false,
  "bitbucket": false,
  "codeberg": false,
  "github_user": "rapidsai",
  "github_project": "cudf",
  "travis_ci": false,
  "coveralls": false,
  "github_actions": true,
  "lcname": "dask-cudf-cu12"
}
```