TwoSampleHC


Name: TwoSampleHC
Version: 0.3.3
Home page: https://github.com/alonkipnis/TwoSampleHC
Summary: Several two-samples tests for contingency tables with counts data
Upload time: 2024-09-10 12:12:52
Maintainer: None
Docs URL: None
Author: Alon Kipnis
Requires Python: >=3.6
License: None
Requirements: No requirements were recorded.

# TwoSampleHC -- Higher Criticism Test between Two Frequency Tables

This package provides an adaptation of the Donoho-Jin-Tukey Higher
Criticism (HC) test to frequency tables. The adaptation uses a binomial
allocation model for the number of occurrences of each feature in the two
samples, each of which is associated with a frequency table. The exact
binomial test associated with each feature yields a P-value. The HC
statistic combines these P-values into a global test against the null
hypothesis that the two tables are two realizations of the same
data-generating mechanism.
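
The two steps described above (a per-feature exact binomial test, then an HC combination of the resulting P-values) can be sketched in dependency-free Python. This is an illustration under stated assumptions, not the package's implementation: the function names are hypothetical, the binomial success probability is taken to be `n1 / (n1 + n2)`, the two-sided P-value sums the probabilities of all outcomes no more likely than the observed one, and the HC statistic uses the variant normalized by `i/N`. The package may make different choices on each point.

```python
from math import comb, sqrt


def binom_pmf(k, n, p):
    """Probability of k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def exact_binom_pvalue(k, n, p):
    """Two-sided exact binomial P-value (assumed convention: sum the
    probabilities of all outcomes that are no more likely than k)."""
    if n == 0:
        return 1.0  # a feature absent from both samples carries no evidence
    observed = binom_pmf(k, n, p)
    probs = [binom_pmf(j, n, p) for j in range(n + 1)]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(q for q in probs if q <= observed * (1 + 1e-9)))


def feature_pvalues(counts1, counts2):
    """One P-value per feature under a binomial allocation model.

    Assumption: given a feature's total count x1 + x2, its count in the
    first sample is Binomial(x1 + x2, n1 / (n1 + n2)) under the null.
    """
    n1, n2 = sum(counts1), sum(counts2)
    p = n1 / (n1 + n2)
    return [exact_binom_pvalue(x1, x1 + x2, p) for x1, x2 in zip(counts1, counts2)]


def hc_statistic(pvals, gamma=0.25):
    """Higher Criticism over the smallest gamma-fraction of P-values.

    Variant assumed here: max over i of
    sqrt(N) * (i/N - p_(i)) / sqrt((i/N) * (1 - i/N)).
    """
    N = len(pvals)
    if N < 2:
        raise ValueError("need at least two P-values")
    sorted_p = sorted(pvals)
    i_max = min(max(1, int(gamma * N)), N - 1)
    scores = []
    for i in range(1, i_max + 1):
        u = i / N
        scores.append(sqrt(N) * (u - sorted_p[i - 1]) / sqrt(u * (1 - u)))
    return max(scores)


# Two small, made-up frequency tables over the same eight features.
counts_a = [30, 12, 7, 5, 3, 2, 1, 0]
counts_b = [28, 11, 8, 4, 15, 2, 1, 1]
pvals = feature_pvalues(counts_a, counts_b)
print("HC score:", hc_statistic(pvals, gamma=0.25))
```

A large HC score indicates that a few features have P-values much smaller than expected under the null, even if no single P-value would be decisive on its own.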

This test is particularly useful for identifying non-null effects under
weak and sparse alternatives, i.e., when the difference between the
tables is due to a few features and the evidence each such feature
provides is relatively weak. See the references below for more details.

[1] Alon Kipnis. (2022). Higher Criticism for Discriminating Word
 Frequency Tables and Testing Authorship. Annals of Applied Statistics.
[2] David L. Donoho and Alon Kipnis. (2022). Higher criticism to compare
 two large frequency tables, with sensitivity to possible rare and weak
 differences. Annals of Statistics.


## Example:
```python
from TwoSampleHC import two_sample_pvals, HC
import numpy as np

N = 1000   # number of features
n = 5 * N  # number of draws per sample

P = 1 / np.arange(1, N + 1)  # Zipf base distribution
P = P / P.sum()

ep = 0.02   # fraction of features to perturb
mu = 0.005  # intensity of perturbation

TH = np.random.rand(N) < ep  # mask of perturbed features
Q = P.copy()
Q[TH] += mu
Q = Q / np.sum(Q)

smp_P = np.random.multinomial(n, P)  # sample from P
smp_Q = np.random.multinomial(n, Q)  # sample from Q

pv = two_sample_pvals(smp_Q, smp_P)  # binomial P-values
hc = HC(pv)
hc_val, p_th = hc.HCstar(gamma=0.25)  # small-sample Higher Criticism test

print("TV distance between P and Q: ", 0.5 * np.sum(np.abs(P - Q)))
print("Higher-Criticism score for testing P == Q: ", hc_val)
```
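
The example unpacks a second return value, `p_th`, without using it. Assuming it is the P-value threshold at which the HC statistic attains its maximum (this README does not confirm that), one natural follow-up is to list the features whose P-values fall at or below it, as candidates for driving the difference between the tables. The helper below is a hypothetical sketch with made-up values and is not part of the package.

```python
def features_below_threshold(pvals, threshold):
    """Return indices of features whose P-value is at or below the threshold."""
    return [i for i, p in enumerate(pvals) if p <= threshold]


# Hypothetical values standing in for `pv` and `p_th` from the example above.
example_pvals = [0.50, 0.001, 0.20, 0.0004, 0.75]
example_threshold = 0.001
print(features_below_threshold(example_pvals, example_threshold))
```

With the arrays from the example, the equivalent NumPy expression would be `np.flatnonzero(pv <= p_th)`.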

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/alonkipnis/TwoSampleHC",
    "name": "TwoSampleHC",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": null,
    "keywords": null,
    "author": "Alon Kipnis",
    "author_email": "alonkipnis@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/f3/c9/35387d4c19bb1b3b1a920407d0113256eb8ba9edae56bd7661179996c147/twosamplehc-0.3.3.tar.gz",
    "platform": null,
    "description": "# TwoSampleHC -- Higher Criticism Test between Two Frequency Tables\n\nThis package provides an adaptation of the Donoho-Jin-Tukey Higher-\nCritisim (HC) test to frequency tables. This adapatation uses a binomial\nallocation model for the number of occurances of each feature in two-\nsamples, each of which is associated with a frequency table. The exact\nbinomial test associated with each feature yields a p-value. The HC\nstatistic combines these P-values to a global test against the null\nhypothesis that the two tables are two realizations of the same data\ngenerating mechanism. \n\nThis test is particularly useful in identifying non-null effects under\nweak and sparse alternatives, i.e., when the difference between the\ntables is due to few features, and the evidence each such feature\nprovide is realtively weak. See references below for more details.\n[1] Alon Kipnis. (2022). Higher Criticism for Discriminating Word\n Frequency Tables and Testing Authorship. Annals of Applied Statistics.\n[2] David L. Donoho and Alon Kipnis. (2022). Higher criticism to compare\n two large frequency tables, with sensitivity to possible rare and weak\n differences. Annals of Statistics. \n\n\n## Example:\n```\nfrom TwoSampleHC import two_sample_pvals, HC\nimport numpy as np\n\nN = 1000 # number of features\nn = 5 * N #number of samples\n\nP = 1 / np.arange(1,N+1) # Zipf base distribution\nP = P / P.sum()\n\nep = 0.02 #fraction of features to perturb\nmu = 0.005 #intensity of perturbation\n\nTH = np.random.rand(N) < ep\nQ = P.copy()\nQ[TH] += mu\nQ = Q / np.sum(Q)\n\nsmp_P = np.random.multinomial(n, P)  # sample form P\nsmp_Q = np.random.multinomial(n, Q)  # sample from Q\n\npv = two_sample_pvals(smp_Q, smp_P) # binomial P-values\nhc = HC(pv)\nhc_val, p_th = hc.HCstar(gamma = 0.25) # Small sample Higher Criticism test\n\nprint(\"TV distance between P and Q: \", 0.5*np.sum(np.abs(P-Q)))\nprint(\"Higher-Criticism score for testing P == Q: \", hc_val)  \n```\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "Several two-samples tests for contingency tables with counts data",
    "version": "0.3.3",
    "project_urls": {
        "Download": "https://github.com/alonkipnis/TwoSampleHC",
        "Homepage": "https://github.com/alonkipnis/TwoSampleHC"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "438fc6a7fc6019855eeed33196acd736534fb61a77ce01999ab70be3baadf833",
                "md5": "b1485b2967e06b2c6ce1f2898b262bbe",
                "sha256": "72cab9c49a09c0adce6acba7f1577cb8c3e339b2c024d692b7bd621de6de7b91"
            },
            "downloads": -1,
            "filename": "TwoSampleHC-0.3.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "b1485b2967e06b2c6ce1f2898b262bbe",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 8243,
            "upload_time": "2024-09-10T12:12:50",
            "upload_time_iso_8601": "2024-09-10T12:12:50.794477Z",
            "url": "https://files.pythonhosted.org/packages/43/8f/c6a7fc6019855eeed33196acd736534fb61a77ce01999ab70be3baadf833/TwoSampleHC-0.3.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f3c935387d4c19bb1b3b1a920407d0113256eb8ba9edae56bd7661179996c147",
                "md5": "46fca47954dfb43163d60b87d41f83c4",
                "sha256": "3c56d7024c3d1dc6d2e35e5211137d508222dbbb16fbe3ab075924ea9ea30a67"
            },
            "downloads": -1,
            "filename": "twosamplehc-0.3.3.tar.gz",
            "has_sig": false,
            "md5_digest": "46fca47954dfb43163d60b87d41f83c4",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 9136,
            "upload_time": "2024-09-10T12:12:52",
            "upload_time_iso_8601": "2024-09-10T12:12:52.309334Z",
            "url": "https://files.pythonhosted.org/packages/f3/c9/35387d4c19bb1b3b1a920407d0113256eb8ba9edae56bd7661179996c147/twosamplehc-0.3.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-09-10 12:12:52",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "alonkipnis",
    "github_project": "TwoSampleHC",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "twosamplehc"
}
        