dask-cudf-cu12

Name: dask-cudf-cu12
Version: 24.12.0
Summary: Utilities for Dask and cuDF interactions
Author: NVIDIA Corporation
License: Apache 2.0
Requires-Python: >=3.10
Homepage: https://github.com/rapidsai/cudf
Upload time: 2024-12-12 18:20:40
# <div align="left"><img src="../../img/rapids_logo.png" width="90px"/>&nbsp;Dask cuDF - A GPU Backend for Dask DataFrame</div>

Dask cuDF (a.k.a. dask-cudf or `dask_cudf`) is an extension library for [Dask DataFrame](https://docs.dask.org/en/stable/dataframe.html) that provides a Pandas-like API for parallel and larger-than-memory DataFrame computing on GPUs. When installed, Dask cuDF is automatically registered as the `"cudf"` [dataframe backend](https://docs.dask.org/en/stable/how-to/selecting-the-collection-backend.html) for Dask DataFrame.

> [!IMPORTANT]
> Dask cuDF does not provide support for multi-GPU or multi-node execution on its own. You must also deploy a distributed cluster (ideally with [Dask-CUDA](https://docs.rapids.ai/api/dask-cuda/stable/)) to leverage multiple GPUs efficiently.

## Using Dask cuDF

Please visit [the official documentation page](https://docs.rapids.ai/api/dask-cudf/stable/) for detailed information about using Dask cuDF.

## Installation

See the [RAPIDS install page](https://docs.rapids.ai/install) for the most up-to-date information and commands for installing Dask cuDF and other RAPIDS packages.

## Resources

- [Dask cuDF documentation](https://docs.rapids.ai/api/dask-cudf/stable/)
- [Best practices](https://docs.rapids.ai/api/dask-cudf/stable/best_practices/)
- [cuDF documentation](https://docs.rapids.ai/api/cudf/stable/)
- [10 Minutes to cuDF and Dask cuDF](https://docs.rapids.ai/api/cudf/stable/user_guide/10min/)
- [Dask-CUDA documentation](https://docs.rapids.ai/api/dask-cuda/stable/)
- [Deployment](https://docs.rapids.ai/deployment/stable/)
- [RAPIDS Community](https://rapids.ai/learn-more/#get-involved): Get help, contribute, and collaborate.

### Quick-start example

A very common Dask cuDF use case is single-node multi-GPU data processing. These workflows typically use the following pattern:

```python
import dask
import dask.dataframe as dd
from dask_cuda import LocalCUDACluster
from distributed import Client

if __name__ == "__main__":

  # Define a GPU-aware cluster to leverage multiple GPUs
  client = Client(
    LocalCUDACluster(
      CUDA_VISIBLE_DEVICES="0,1",  # Use two workers (on devices 0 and 1)
      rmm_pool_size=0.9,  # Use 90% of GPU memory as a pool for faster allocations
      enable_cudf_spill=True,  # Improve device memory stability
      local_directory="/fast/scratch/",  # Use fast local storage for spilling
    )
  )

  # Set the default dataframe backend to "cudf"
  dask.config.set({"dataframe.backend": "cudf"})

  # Create your DataFrame collection from on-disk
  # or in-memory data
  df = dd.read_parquet("/my/parquet/dataset/")

  # Use cudf-like syntax to transform and/or query your data
  query = df.groupby('item')['price'].mean()

  # Compute, persist, or write out the result
  query.head()
```

If you do not have multiple GPUs available, using `LocalCUDACluster` is optional. However, it is still a good idea to [enable cuDF spilling](https://docs.rapids.ai/api/cudf/stable/developer_guide/library_design/#spilling-to-host-memory).
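For single-GPU runs without a cluster, one way to enable spilling is shown below as a configuration sketch (it requires cudf and a GPU at import time, so it is not runnable here; confirm the option name against your cuDF version's spilling guide):

```python
# Configuration sketch only -- requires cudf and a GPU.
# Alternatively, export CUDF_SPILL=on before starting Python.
import cudf

cudf.set_option("spill", True)  # enable device-to-host spilling in cuDF
```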

If you wish to scale across multiple nodes, you will need to use a different mechanism to deploy your Dask-CUDA workers. Please see [the RAPIDS deployment documentation](https://docs.rapids.ai/deployment/stable/) for more instructions.

            
