pandas-parallel-apply

- Version: 2.2
- Home page: https://gitlab.com/meehai/pandas-parallel-apply
- Summary: Wrapper for df and df[col].apply parallelized
- Upload time: 2023-05-29 20:52:25
- Requires Python: >=3.8
- License: WTFPL
- Requirements: none recorded
# pandas-parallel-apply

<div align="center">
  <a href="https://gitlab.com/meehai/pandas-parallel-apply/-/blob/main/LICENSE">
    <img src="https://img.shields.io/gitlab/license/meehai/pandas-parallel-apply" alt="License"/>
  </a>
  <a href="https://pypi.org/project/pandas-parallel-apply/">
    <img src="https://img.shields.io/pypi/v/pandas-parallel-apply" alt="PyPi Latest Release"/>
  </a>
</div>

Parallel wrappers for `df.apply(fn)`, `df[col].apply(fn)`, `series.apply(fn)`, and `df.groupby([cols]).apply(fn)`, with tqdm progress bars included.

## Installation

`pip install pandas-parallel-apply`

Import with:
```python
from pandas_parallel_apply import DataFrameParallel, SeriesParallel
```

## Examples
See `examples/` for usage on some dummy dataframes and series.

## Usage

```python
# Apply on each row of a dataframe
df.apply(fn, axis=1)
# ->
DataFrameParallel(df, n_cores=None, pbar=True).apply(fn, axis=1)

# Apply on a column of a dataframe (returns a Series)
df[col].apply(fn)
# ->
DataFrameParallel(df, n_cores=None, pbar=True)[col].apply(fn)

# Apply on a series
series.apply(fn)
# ->
SeriesParallel(series, n_cores=None, pbar=True).apply(fn)

# GroupBy apply
df.groupby([cols]).apply(fn)
# ->
DataFrameParallel(df, n_cores=None, pbar=True).groupby([cols]).apply(fn)
```
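
A minimal end-to-end sketch under the signatures above (`square` and the `n_cores=2` choice are illustrative, not prescribed by the library):

```python
import pandas as pd
from pandas_parallel_apply import DataFrameParallel

# module-level so the multiprocessing backend can pickle it
def square(x: int) -> int:
    return x * x

if __name__ == "__main__":
    df = pd.DataFrame({"a": range(10), "b": range(10, 20)})
    # serial equivalent: df["a"].apply(square)
    result = DataFrameParallel(df, n_cores=2)["a"].apply(square)
    print(result)
```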

## How it works

Let N be the length of your dataframe (or series, or grouper) and K the `n_cores` value passed to the constructor.
The dataframe is then split into K chunks of roughly N/K rows, and K new processes are spawned, each applying the
function to its own chunk.

Only row-wise (perfectly parallelizable) operations are supported, so `df.apply(fn, axis=1)` is okay, but
`df.apply(fn, axis=0)` is not, because a column-wise apply may need rows that live on other workers.

It is assumed that each row takes a similar time to process, so the N/K chunks should finish at more or less the same time.
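
The core mechanism can be sketched in a few lines; this is an illustrative reimplementation, not the library's actual
internals (`parallel_apply_sketch` and `_apply_chunk` are hypothetical names):

```python
import numpy as np
import pandas as pd
from multiprocessing import Pool

def _apply_chunk(args):
    chunk, fn = args
    return chunk.apply(fn, axis=1)

def parallel_apply_sketch(df: pd.DataFrame, fn, k: int) -> pd.Series:
    # split into K roughly equal chunks, fan out to K worker processes,
    # then stitch the partial results back together in order
    chunks = np.array_split(df, k)
    with Pool(k) as pool:
        parts = pool.map(_apply_chunk, [(c, fn) for c in chunks])
    return pd.concat(parts)
```

Note that `fn` must be picklable (e.g., defined at module level) so `multiprocessing` can ship it to the workers.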

### Future Improvement

Not supported yet, but potentially interesting: also define a number of chunks C > K, so the df is actually split into
C chunks of N/C rows each, and these are handed to the K processes in a round-robin fashion. Right now C = K, so
whenever one process finishes early, it is not assigned any more work.
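
A minimal sketch of this idea, assuming a plain `multiprocessing.Pool` (with `chunksize=1`, the pool hands each worker
a fresh chunk as soon as it finishes one, which gives the intended load balancing; `parallel_apply_dynamic` is a
hypothetical name):

```python
import numpy as np
import pandas as pd
from multiprocessing import Pool

def _apply_chunk(args):
    chunk, fn = args
    return chunk.apply(fn, axis=1)

def parallel_apply_dynamic(df: pd.DataFrame, fn, k: int, c: int) -> pd.Series:
    # c > k: more chunks than workers, so an early finisher picks up new work
    chunks = np.array_split(df, c)
    with Pool(k) as pool:
        # chunksize=1 dispatches one chunk at a time instead of pre-batching
        parts = pool.map(_apply_chunk, [(ch, fn) for ch in chunks], chunksize=1)
    return pd.concat(parts)
```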

## n_cores semantics
- `n_cores < -1` -> throws an error
- `n_cores == -1` -> uses `cpu_count() - 1` cores
- `n_cores == 0` -> uses the serial/standard pandas functions
- `n_cores == 1` -> spawns a single process alongside the main one
- `n_cores > 1` -> spawns `n_cores` processes and chunks the df
- `n_cores > cpu_count()` -> issues a warning
- `n_cores > len(df)` -> limits it to `len(df)`

On CPU-bound tasks (calculations), `n_cores = -1` is likely to be fastest. On network-bound operations (e.g., where
threads may invoke network calls), using a very high `n_cores` value may be beneficial.
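
The rules above could be expressed as a small resolver; this is an illustrative sketch, not the library's actual code
(`resolve_n_cores` is a hypothetical name):

```python
import os
import warnings

def resolve_n_cores(n_cores: int, n_rows: int) -> int:
    """Map the documented n_cores semantics to a worker count (sketch)."""
    if n_cores < -1:
        raise ValueError(f"n_cores must be >= -1, got {n_cores}")
    if n_cores == -1:
        n_cores = os.cpu_count() - 1
    if n_cores > os.cpu_count():
        warnings.warn(f"n_cores={n_cores} exceeds cpu_count()={os.cpu_count()}")
    # n_cores == 0 means "fall back to the serial pandas apply" upstream
    return min(n_cores, n_rows)  # never spawn more workers than rows
```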

## Disclaimers

- This is an experimental repository. It may lead to unexpected behaviour.

- Not all of pandas' merging semantics are supported. Pandas has complex rules for converting the return value of an
apply: a series apply function may return a dataframe, a series, a dict, a list, etc., and each of these is converted
in a specific way. Some cases may not be supported.

- Groupby apply functions are currently **much** slower than their serial variant. Still experimenting with how to make
them faster. The results look correct, just 10-100x slower on some small examples. The gap may shrink as dataframes get bigger.

- Using `n_cores = 1` creates a multiprocessing pool with just one worker, so the code runs in parallel (i.e., not on
the main process) but may not yield much of a speedup beyond not blocking the main process. This may be useful
in some GUI apps.

- You can pass `parallelism=multithread` or `parallelism=multiprocess` (the latter is the default) to all constructors.
Using multiprocessing requires the applied functions to be picklable, though (e.g., defined at module level rather than
as a lambda or a local closure).
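
For example, a hedged sketch of the multithread backend on network-bound work (the `parallelism` string value and
`fetch_status` are assumptions for illustration, not confirmed API details):

```python
import requests
import pandas as pd
from pandas_parallel_apply import DataFrameParallel

def fetch_status(row) -> int:
    # network-bound work, a good fit for the multithread backend
    return requests.get(row["url"], timeout=5).status_code

df = pd.DataFrame({"url": ["https://example.com"] * 4})
codes = DataFrameParallel(df, n_cores=4, parallelism="multithread").apply(fetch_status, axis=1)
```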

That's all.

            
