askcarl


:Name: askcarl
:Version: 0.2.1
:Home page: https://github.com/JohannesBuchner/askcarl
:Summary: Gaussian mixture models with support for missing values and upper limits in some features.
:Upload time: 2024-10-08 18:30:57
:Author: Johannes Buchner
:Requires Python: ``!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*,!=3.5.*,!=3.6.*,!=3.7.*,>=2.7``
:License: GNU General Public License v3
:Keywords: multivariate gaussians with support for upper limits and missing data

========
askcarl
========

Gaussian Mixture Model with support for heterogeneous missing and censored (upper limit) data.

Pure Python.

.. image:: https://img.shields.io/pypi/v/askcarl.svg
        :target: https://pypi.python.org/pypi/askcarl

.. image:: https://github.com/JohannesBuchner/askcarl/actions/workflows/tests.yml/badge.svg
        :target: https://github.com/JohannesBuchner/askcarl/actions/workflows/tests.yml

.. image:: https://img.shields.io/badge/docs-published-ok.svg
        :target: https://johannesbuchner.github.io/askcarl/
        :alt: Documentation Status

About
-----

Gaussian mixture models (GMMs) consist of 
weighted sums of Gaussian probability distributions.
They are a flexible tool to describe observations, and can be used
for classification and model density approximation in the context of 
simulation-based inference.
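
The "weighted sum" can be made concrete with a minimal sketch. It uses only the Python standard library and a hypothetical one-dimensional mixture; the weights and component parameters are made up for illustration and are not part of askcarl.

```python
from statistics import NormalDist

# Hypothetical 1D mixture (illustrative values): the weights sum to one,
# and there is one Gaussian per component.
weights = [0.5, 0.3, 0.2]
components = [NormalDist(0.0, 1.0), NormalDist(3.0, 0.5), NormalDist(-2.0, 2.0)]


def mixture_pdf(x):
    """Mixture density at x: the weighted sum of the component densities."""
    return sum(w * c.pdf(x) for w, c in zip(weights, components))


print(mixture_pdf(0.0))
```

Because the weights sum to one and each component is normalised, the mixture density integrates to one as well.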

Missing data can occur when no measurement of a given feature was taken.
In that case, the GMM probability density can be obtained
by marginalising over the missing feature.
askcarl implements this analytically.
This differs from `pygmmis <https://github.com/pmelchior/pygmmis>`_,
which approximates this situation with large measurement uncertainties,
and from `gmm-mcar <https://github.com/avati/gmm-mcar>`_,
which assumes that missing measurements occur uniformly at random.
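
For a single Gaussian, marginalising over a missing feature amounts to dropping that feature from the mean and the covariance. The following is a minimal standard-library sketch for the two-dimensional case, with made-up parameters. It illustrates the mathematics only and is not askcarl's implementation; the function names are hypothetical.

```python
import math
from statistics import NormalDist


def bivariate_pdf(x1, x2, mean, cov):
    """Density of a 2D Gaussian with covariance [[s11, s12], [s12, s22]]."""
    (s11, s12), (_, s22) = cov
    det = s11 * s22 - s12 ** 2
    d1, d2 = x1 - mean[0], x2 - mean[1]
    # Quadratic form d^T cov^{-1} d, written out for the 2x2 case.
    q = (s22 * d1 ** 2 - 2 * s12 * d1 * d2 + s11 * d2 ** 2) / det
    return math.exp(-0.5 * q) / (2 * math.pi * math.sqrt(det))


def marginal_pdf_first(x1, mean, cov):
    """Analytic marginal when the second feature is missing: N(m1, s11)."""
    return NormalDist(mean[0], math.sqrt(cov[0][0])).pdf(x1)


# Illustrative parameters, not taken from any fitted model.
mean = (5.0, 3.4)
cov = ((0.12, 0.10), (0.10, 0.14))
x1 = 5.2

analytic = marginal_pdf_first(x1, mean, cov)
# Brute-force check: integrate the joint density over x2 on a fine grid.
step = 0.001
numeric = sum(bivariate_pdf(x1, -2 + i * step, mean, cov) for i in range(12000)) * step
print(analytic, numeric)
```

The two numbers agree to within the accuracy of the grid, which is why no numerical integration is needed in practice.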

Upper limits can occur when the measurement of a given feature was not
sensitive enough.
In that case, the GMM probability density can be obtained by
integrating over that feature up to the upper limit.
askcarl implements this analytically, and each data point can have
its own individual upper limit (heterogeneous).
This differs from typical censored GMMs, which assume a common
upper limit for all data (homogeneous); see
`this example <https://github.com/tranbahien/Truncated-Censored-EM>`_.
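
For a single two-dimensional Gaussian with one measured and one censored feature, the integral up to the upper limit factorises into the marginal density of the measured feature times the conditional cumulative distribution of the censored one. The sketch below uses only the standard library and made-up parameters; ``density_with_upper_limit`` is a hypothetical helper, not askcarl's code.

```python
import math
from statistics import NormalDist


def density_with_upper_limit(x1, upper2, mean, cov):
    """Density of measuring x1 while the second feature is only known to be < upper2.

    Equals the integral of the 2D Gaussian density over x2 from -inf to upper2.
    """
    (s11, s12), (_, s22) = cov
    # Marginal density of the measured feature.
    marginal = NormalDist(mean[0], math.sqrt(s11)).pdf(x1)
    # Conditional distribution of the censored feature given the measured one.
    cond_mean = mean[1] + s12 / s11 * (x1 - mean[0])
    cond_sd = math.sqrt(s22 - s12 ** 2 / s11)
    return marginal * NormalDist(cond_mean, cond_sd).cdf(upper2)


# Illustrative parameters, not taken from any fitted model.
mean = (3.4, 5.0)
cov = ((0.14, 0.10), (0.10, 0.12))

with_limit = density_with_upper_limit(3.0, 5.5, mean, cov)
no_constraint = density_with_upper_limit(3.0, math.inf, mean, cov)
print(with_limit, no_constraint)
```

An upper limit of infinity carries no information, so ``no_constraint`` equals the marginal density of the measured feature alone. In this sense a missing value is the limiting case of an upper limit.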

For these cases, askcarl implements evaluating the PDF and log-PDF of a mixture.
askcarl does not implement finding the mixture parameters.
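
A log-PDF is useful because densities far from all components underflow to zero in floating point. One common way to evaluate the log of a weighted sum stably is the log-sum-exp trick, sketched below with the standard library. This illustrates the general technique under made-up inputs; whether askcarl uses exactly this approach internally is not stated here.

```python
import math


def mixture_logpdf(log_weights, component_logpdfs):
    """log(sum_k w_k * p_k), computed from log w_k and log p_k via log-sum-exp."""
    terms = [lw + lp for lw, lp in zip(log_weights, component_logpdfs)]
    peak = max(terms)
    if peak == -math.inf:
        # Every component has zero density at this point.
        return -math.inf
    # Factor out the largest term so the remaining exponentials are <= 1.
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Illustrative weights and per-component log-densities.
log_weights = [math.log(w) for w in (0.5, 0.3, 0.2)]
print(mixture_logpdf(log_weights, [-1.0, -2.0, -3.0]))
# Direct evaluation would fail here, since exp(-2000) underflows to 0.0.
print(mixture_logpdf(log_weights, [-2000.0, -2100.0, -2200.0]))
```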

Just ask Carl Friedrich Gauss for the probability.

Example
-------

Let's take the Iris flower data set (dots), and fit a GMM as in
this `scikit-learn example <https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html>`_::

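        import numpy as np
        from sklearn import datasets
        from sklearn.mixture import GaussianMixture
        import askcarl

        # load the Iris data and keep only the first three features
        iris = datasets.load_iris()
        X = iris.data[:,:3]
        y = iris.target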

        gmm = GaussianMixture(n_components=3)
        gmm.fit(X)

This gives us a mixture with three Gaussians (shown as ellipses):

.. image:: iris.png

Let's import the learned mixture into askcarl::

        mix = askcarl.GaussianMixture.from_sklearn(gmm)

Now, we compute the probability of a few points:

1. the lime point in the blue setosa region, with coordinates (5, 3.5, 1.5).
2. the black point not near any cluster, with coordinates (6, 5.0, 5.0).
3. the red point, which has an upper limit on the sepal length <5.5, sepal width=3, and is missing data on petal length.

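We can encode this information as follows.
In ``mask``, True marks a measured value and False marks an entry of ``x``
that is an upper limit; a missing value is written as an upper limit of infinity::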

        x = np.array([
           [5, 3.5, 1.5],
           [6, 5.0, 5.0],
           [5.5, 3, np.inf],
        ])
        mask = np.array([
           [True, True, True],
           [True, True, True],
           [False, True, False],
        ], dtype=bool)
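
When there are many data points, writing the two arrays by hand is error-prone. The sketch below shows one possible way to generate them from a per-feature description. ``encode`` is a hypothetical helper, not part of askcarl, and it assumes the convention visible in the example above: False in the mask marks an upper limit, and a missing value is an upper limit of infinity.

```python
import math


def encode(points):
    """Turn per-feature descriptions into (x, mask) rows.

    Each feature is a number (measured value), a tuple ("<", limit) for an
    upper limit, or None for a missing measurement.
    """
    x, mask = [], []
    for point in points:
        x_row, mask_row = [], []
        for feature in point:
            if feature is None:
                # Missing: no constraint, i.e. an upper limit at +infinity.
                x_row.append(math.inf)
                mask_row.append(False)
            elif isinstance(feature, tuple):
                # Upper limit: store the limit and flag it as not measured.
                _, limit = feature
                x_row.append(limit)
                mask_row.append(False)
            else:
                # Ordinary measured value.
                x_row.append(feature)
                mask_row.append(True)
        x.append(x_row)
        mask.append(mask_row)
    return x, mask


x, mask = encode([
    (5, 3.5, 1.5),
    (6, 5.0, 5.0),
    (("<", 5.5), 3, None),
])
print(x)
print(mask)
```

The resulting nested lists hold the same values as the arrays above and can be converted with ``np.array(x)`` and ``np.array(mask, dtype=bool)``.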

Now we can ask for the probability density of each point under each mixture component::

        resp = np.array([g.pdf(x, mask) for g in mix.components])
        print(resp)  # shown in the top right panel
        #> [[4.04951446e+00 9.35236679e-97 5.54554910e-01]
        #>  [2.81243808e-28 2.16218666e-35 6.52744205e-03]
        #>  [3.34489158e-14 1.69515947e-12 6.53231941e-02]]
        # and the most probable corresponding class:
        print(resp.argmax(axis=0))
        #> [0 2 0]

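Here we see that the first (lime) point is assigned to component 0 (setosa),
with a high probability density.
The second point has a very low density in all components;
its nominal best match is the last component (index 2).
According to the output above, the third point is assigned to component 0,
the same component as the first point.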
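
The class assignment is a column-wise argmax: in ``resp``, rows are components and columns are data points. The following pure-Python sketch reproduces the printed ``[0 2 0]`` from the printed densities, without needing NumPy or a fitted model.

```python
# Per-component densities as printed above: rows are components, columns are points.
resp = [
    [4.04951446e+00, 9.35236679e-97, 5.54554910e-01],
    [2.81243808e-28, 2.16218666e-35, 6.52744205e-03],
    [3.34489158e-14, 1.69515947e-12, 6.53231941e-02],
]

n_components = len(resp)
n_points = len(resp[0])

# For each point (column), pick the component (row) with the largest density.
best = [
    max(range(n_components), key=lambda k: resp[k][j])
    for j in range(n_points)
]
print(best)  # [0, 2, 0]

# The largest density per point helps to spot outliers: the second point's
# best value is still tiny compared to the other two points.
print([resp[k][j] for j, k in enumerate(best)])
```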

Finally, we can compute the probability density of each point under the full mixture::

        p = mix.pdf(x, mask)
        print(p)
        #> [1.34983815e+00 8.13140712e-13 2.17406640e-01]

Here we see again that the second point has very low probability,
indicating it is an outlier.

The third point, despite the missing data and upper limits, could be 
handled without needing to modify the original mixture.

Why
---

askcarl can be used for likelihood-based inference (LBI):
simulation-based inference (SBI) generates samples, an EM algorithm
identifies the GMM parameters from them, and the resulting mixture is then
evaluated on data with missing values or upper limits.

This is a common case for photometric flux measurements in astronomy.

Usage
^^^^^

Read the full documentation at:

https://johannesbuchner.github.io/askcarl/


Licence
^^^^^^^

GPLv3 (see LICENCE file). If you require another licence, please contact me.



==============
Release Notes
==============

0.1.0 (2024-09-28)
------------------

* First version

            
