parallel-statistics

Name	parallel-statistics JSON
Version	0.13 JSON
	download
home_page	https://github.com/LSSTDESC/parallel_statistics
Summary	Calculating basic statistics in parallel, incrementally
upload_time	2023-03-24 15:51:18
maintainer
docs_url	None
author	Joe Zuntz
requires_python	>=3.6
license
keywords	mpi statistics
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

            Overview
--------

This package collects tools which compute weighted statistics on parallel, incremental data, i.e. data being read by multiple processors, a chunk at a time.  

The tools available are:
- ``ParallelSum``
- ``ParallelMean``
- ``ParallelMeanVariance``
- ``ParallelHistogram``
- ``SparseArray``

All assume that mpi4py is being used among the processes, and are passed a communicator object (often ``mpi4py.MPI.COMM_WORLD``).

Installation
------------

For now you can install this package using:

```
pip install parallel_statistics
```

Documentation
-------------

Documentation can be found at https://parallel-statistics.readthedocs.io/

Example
-------

The three tools ``ParallelSum``, ``ParallelMean``, and ``ParallelMeanVariance`` compute statistics in bins, and you add data to them per bin.

The usage pattern for them, and ``ParallelHistogram``, is:

- Create a parallel calculator object in each MPI process
- Have each process read in their own chunks of data and add it using the ``add_data`` methods
- Once complete, call the ``collect`` method to get the combined results.

Here's an example of splitting up data from an HDF5 file, using an example from the DESC tomographic challenge.  You can run it either on its own, or under MPI with different numbers of processors, and the results should be the same:

```python
import mpi4py.MPI
import h5py
import parallel_statistics
import numpy as np

# This data file is available at
# https://portal.nersc.gov/project/lsst/txpipe/tomo_challenge_data/ugrizy/mini_training.hdf5
f = h5py.File("mini_training.hdf5", "r")
comm = mpi4py.MPI.COMM_WORLD

# We must divide up the data between the processes
# Choose the chunk sizes to use here
chunk_size = 1000
total_size = f['redshift_true'].size
nchunk = total_size // chunk_size
if nchunk * chunk_size < total_size:
    nchunk += 1

# Choose the binning in which to put values
nbin = 20
dz = 0.2

# Make our calculator
calc = parallel_statistics.ParallelMeanVariance(size=nbin)

# Loop through the data
for i in range(nchunk):
    # Each process only reads its assigned chunks,
    # otherwise, skip this chunk
    if i % comm.size != comm.rank:
        continue
    # work out the data range to read
    start = i * chunk_size
    end = start + chunk_size

    # read in the input data
    z = f['redshift_true'][start:end]
    r = f['r_mag'][start:end]

    # Work out which bins to use for it
    b = (z / dz).astype(int)

    # add add each one
    for j in range(z.size):
        # skip inf, nan, and sentinel values
        if not r[j] < 30:
            continue
        # add each data point
        calc.add_datum(b[j], r[j])

# Finally, collect the results together
weight, mean, variance = calc.collect(comm)

# Print out results - only the root process gets the data, unless you pass
# mode=allreduce to collect.  Will print out NaNs for bins with no objects in.
if comm.rank == 0:
    for i in range(nbin):
        print(f"z = [{ dz * i :.1f} .. { dz * (i+1) :.1f}]    r = { mean[i] :.2f} ± { variance[i] :.2f}")
```

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/LSSTDESC/parallel_statistics",
    "name": "parallel-statistics",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": "",
    "keywords": "MPI,statistics",
    "author": "Joe Zuntz",
    "author_email": "joe.zuntz@ed.ac.uk",
    "download_url": "https://files.pythonhosted.org/packages/8d/c3/09d1fa59a9e81c89edf6561222860a4d56b383fa40ff35b1a3fa0c087cb1/parallel_statistics-0.13.tar.gz",
    "platform": null,
    "description": "Overview\n--------\n\nThis package collects tools which compute weighted statistics on parallel, incremental data, i.e. data being read by multiple processors, a chunk at a time.  \n\nThe tools available are:\n- ``ParallelSum``\n- ``ParallelMean``\n- ``ParallelMeanVariance``\n- ``ParallelHistogram``\n- ``SparseArray``\n\nAll assume that mpi4py is being used among the processes, and are passed a communicator object (often ``mpi4py.MPI.COMM_WORLD``).\n\nInstallation\n------------\n\nFor now you can install this package using:\n\n```\npip install parallel_statistics\n```\n\nDocumentation\n-------------\n\nDocumentation can be found at https://parallel-statistics.readthedocs.io/\n\nExample\n-------\n\nThe three tools ``ParallelSum``, ``ParallelMean``, and ``ParallelMeanVariance`` compute statistics in bins, and you add data to them per bin.\n\nThe usage pattern for them, and ``ParallelHistogram``, is:\n\n- Create a parallel calculator object in each MPI process\n- Have each process read in their own chunks of data and add it using the ``add_data`` methods\n- Once complete, call the ``collect`` method to get the combined results.\n\nHere's an example of splitting up data from an HDF5 file, using an example from the DESC tomographic challenge.  You can run it either on its own, or under MPI with different numbers of processors, and the results should be the same:\n\n```python\nimport mpi4py.MPI\nimport h5py\nimport parallel_statistics\nimport numpy as np\n\n# This data file is available at\n# https://portal.nersc.gov/project/lsst/txpipe/tomo_challenge_data/ugrizy/mini_training.hdf5\nf = h5py.File(\"mini_training.hdf5\", \"r\")\ncomm = mpi4py.MPI.COMM_WORLD\n\n# We must divide up the data between the processes\n# Choose the chunk sizes to use here\nchunk_size = 1000\ntotal_size = f['redshift_true'].size\nnchunk = total_size // chunk_size\nif nchunk * chunk_size < total_size:\n    nchunk += 1\n\n# Choose the binning in which to put values\nnbin = 20\ndz = 0.2\n\n# Make our calculator\ncalc = parallel_statistics.ParallelMeanVariance(size=nbin)\n\n# Loop through the data\nfor i in range(nchunk):\n    # Each process only reads its assigned chunks,\n    # otherwise, skip this chunk\n    if i % comm.size != comm.rank:\n        continue\n    # work out the data range to read\n    start = i * chunk_size\n    end = start + chunk_size\n\n    # read in the input data\n    z = f['redshift_true'][start:end]\n    r = f['r_mag'][start:end]\n\n    # Work out which bins to use for it\n    b = (z / dz).astype(int)\n\n    # add add each one\n    for j in range(z.size):\n        # skip inf, nan, and sentinel values\n        if not r[j] < 30:\n            continue\n        # add each data point\n        calc.add_datum(b[j], r[j])\n\n# Finally, collect the results together\nweight, mean, variance = calc.collect(comm)\n\n# Print out results - only the root process gets the data, unless you pass\n# mode=allreduce to collect.  Will print out NaNs for bins with no objects in.\nif comm.rank == 0:\n    for i in range(nbin):\n        print(f\"z = [{ dz * i :.1f} .. { dz * (i+1) :.1f}]    r = { mean[i] :.2f} \u00b1 { variance[i] :.2f}\")\n```\n\n\n",
    "bugtrack_url": null,
    "license": "",
    "summary": "Calculating basic statistics in parallel, incrementally",
    "version": "0.13",
    "split_keywords": [
        "mpi",
        "statistics"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "8dc309d1fa59a9e81c89edf6561222860a4d56b383fa40ff35b1a3fa0c087cb1",
                "md5": "1b9045a763df1764d622d45c728f10af",
                "sha256": "8b9ac2f35bdbe773295941221b7981f1a0388cd702e4eab9343bcd9d3098a340"
            },
            "downloads": -1,
            "filename": "parallel_statistics-0.13.tar.gz",
            "has_sig": false,
            "md5_digest": "1b9045a763df1764d622d45c728f10af",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 11945,
            "upload_time": "2023-03-24T15:51:18",
            "upload_time_iso_8601": "2023-03-24T15:51:18.677168Z",
            "url": "https://files.pythonhosted.org/packages/8d/c3/09d1fa59a9e81c89edf6561222860a4d56b383fa40ff35b1a3fa0c087cb1/parallel_statistics-0.13.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-03-24 15:51:18",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "LSSTDESC",
    "github_project": "parallel_statistics",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "parallel-statistics"
}

Joe Zuntz