corpusit


Name: corpusit
Version: 0.1.3
Home page: None
Summary: None
Upload time: 2022-12-07 02:38:09
Maintainer: None
Docs URL: None
Author: Xin Du
Requires Python: >=3.6
License: None
Keywords: natural language modeling, corpus, skipgram
Requirements: No requirements were recorded.
# Corpusit
`corpusit` provides easy-to-use dataset iterators for natural language modeling
tasks, such as SkipGram.

It is written in Rust to enable fast, multi-threaded random sampling with
deterministic results, so you don't have to worry about speed or
reproducibility.

Corpusit does not provide tokenization, so please use `corpusit`
on tokenized corpus files (plain text).
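
As an illustration of the expected input, the sketch below writes a small tokenized corpus file with one document per line and tokens separated by single spaces. The helper name `write_tokenized_corpus` and the naive whitespace tokenizer are assumptions made for this example; they are not part of `corpusit`, and a real pipeline would normally use a proper tokenizer.

```python
import tempfile
from pathlib import Path


def write_tokenized_corpus(documents, path):
    """Write one document per line with tokens joined by single spaces."""
    lines = []
    for doc in documents:
        tokens = doc.split()  # naive whitespace tokenizer, for illustration only
        if tokens:            # skip empty documents
            lines.append(" ".join(tokens))
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


# Hypothetical, already punctuation-separated documents.
docs = ["Anarchism is a political philosophy .", "", "It  rejects\thierarchy ."]
with tempfile.TemporaryDirectory() as tmp:
    corpus_path = Path(tmp) / "corpus.txt"
    n_docs = write_tokenized_corpus(docs, corpus_path)
    corpus_text = corpus_path.read_text(encoding="utf-8")
```

The resulting file can then be passed as the corpus path in the usage examples.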

# Environment

Python >= 3.6

# Installation

```bash
$ pip install corpusit
```

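## On Windows and macOS

The 0.1.3 release ships prebuilt wheels only for Linux (manylinux x86_64), so on
other platforms `pip` builds the package from the source distribution.
Please install the [Rust](https://www.rust-lang.org/tools/install) compiler before
running `pip install corpusit`.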

# Usage

## SkipGram

Each line in the corpus file is a document, and the tokens should be separated by whitespace.

```python
import corpusit

corpus_path = 'corpusit/data/corpus.txt'
vocab = corpusit.Vocab.build(corpus_path, min_count=1, unk='<unk>')

dataset = corpusit.SkipGramDataset(
    path_to_corpus=corpus_path,
    vocab=vocab,
    win_size=10,
    sep=" ",
    mode="onepass",       # onepass | repeat | shuffle
    subsample=1e-3,
    power=0.75,
    n_neg=1,
)

it = dataset.positive_sampler(batch_size=100, seed=0, num_threads=4)

for i, pair in enumerate(it):
    print(f'Iter {i:>4d}, shape={pair.shape}. First pair: '
          f'{pair[0,0]:>5} ({vocab.i2s[pair[0,0]]:>10}), '
          f'{pair[0,1]:>5} ({vocab.i2s[pair[0,1]]:>10})')

# Output:
# Iter    0, shape=(100, 2). First pair:    14 (        is),    10 ( anarchism)
# Iter    1, shape=(100, 2). First pair:     8 (        to),   540 (      and/)
# Iter    2, shape=(100, 2). First pair:   775 (constitutes),    34 (anarchists)
# Iter    3, shape=(100, 2). First pair:    72 (     other),   214 (  criteria)
# Iter    4, shape=(100, 2). First pair:   650 (  defining),   487 ( companion)
# ...
```
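
To make the notion of a positive pair concrete, here is a minimal pure-Python sketch of the textbook skip-gram definition: every token is paired with each neighbour at most `win_size` positions away. The function `skipgram_pairs` is hypothetical and only illustrates the idea; `corpusit` samples pairs randomly in Rust and may additionally apply subsampling, so its output order and contents will differ.

```python
def skipgram_pairs(tokens, win_size):
    """Yield (center, context) pairs for contexts within win_size of the center."""
    for i, center in enumerate(tokens):
        lo = max(0, i - win_size)
        hi = min(len(tokens), i + win_size + 1)
        for j in range(lo, hi):
            if j != i:  # a token is never its own context
                yield center, tokens[j]


tokens = "anarchism is a political philosophy".split()
pairs = list(skipgram_pairs(tokens, win_size=1))
```

With `win_size=1`, interior tokens contribute two pairs and the two boundary tokens contribute one each.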


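## SkipGram with negative sampling

`dataset.sampler` yields `(pair, label)` tuples. In the run below, each batch has
200 rows for a batch size of 100 with `n_neg=1`, which is consistent with one
negative pair being drawn per positive pair.
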
```python
it = dataset.sampler(100, seed=0, num_threads=4)

for i, res in enumerate(it):
    pair, label = res
    print(f'Iter {i:>4d}, shape={pair.shape}. First pair: '
          f'{pair[0,0]:>5} ({vocab.i2s[pair[0,0]]:>10}), '
          f'{pair[0,1]:>5} ({vocab.i2s[pair[0,1]]:>10}), '
          f'label = {label[0]}')

# Output:
# Iter    0, shape=(200, 2). First pair:    15 (        is),    10 ( anarchism), label = True
# Iter    1, shape=(200, 2). First pair:     9 (        to),   722 (      and/), label = True
# Iter    2, shape=(200, 2). First pair:   389 (constitutes),    34 (anarchists), label = True
# Iter    3, shape=(200, 2). First pair:    73 (     other),   212 (  criteria), label = True
# Iter    4, shape=(200, 2). First pair:   445 (  defining),   793 ( companion), label = True
# ...
```
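
The sketch below illustrates the conventional word2vec approach to negative sampling, in which negatives are drawn from the unigram distribution raised to a power (commonly 0.75). It is an assumption that the `power` and `n_neg` arguments of `SkipGramDataset` follow this convention; the helpers `negative_sampling_weights` and `add_negatives` are hypothetical and are not part of `corpusit`.

```python
import random
from collections import Counter


def negative_sampling_weights(counts, power=0.75):
    """Normalize count**power so that frequent words are dampened."""
    total = sum(c ** power for c in counts.values())
    return {w: (c ** power) / total for w, c in counts.items()}


def add_negatives(positive_pairs, weights, n_neg, rng):
    """Follow each positive pair with n_neg sampled negative pairs."""
    words = list(weights)
    probs = [weights[w] for w in words]
    pairs, labels = [], []
    for center, context in positive_pairs:
        pairs.append((center, context))
        labels.append(True)
        for neg in rng.choices(words, weights=probs, k=n_neg):
            pairs.append((center, neg))
            labels.append(False)
    return pairs, labels


counts = Counter("the cat sat on the mat the end".split())
weights = negative_sampling_weights(counts, power=0.75)
positives = [("cat", "sat"), ("sat", "on")]
pairs, labels = add_negatives(positives, weights, n_neg=1, rng=random.Random(0))
```

With `n_neg=1` the number of rows doubles, which matches the jump from 100 to 200 rows seen in the output above. This simple sketch does not exclude the true context word from the negative candidates.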

# Roadmap
- GloVe


# License
MIT

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "corpusit",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": null,
    "keywords": "natural language modeling,corpus,skipgram",
    "author": "Xin Du",
    "author_email": "duxin.ac@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/20/96/99e6ac19c935b3c543f725ed767f03d4b331bdba33cefbca32413201d988/corpusit-0.1.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": null,
    "version": "0.1.3",
    "split_keywords": [
        "natural language modeling",
        "corpus",
        "skipgram"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "md5": "ad2f194675258f651ebac23e79d63696",
                "sha256": "d7ce015590031f15566fd841dce0e82aacc034f1aeba8175d906099147dd4798"
            },
            "downloads": -1,
            "filename": "corpusit-0.1.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "has_sig": false,
            "md5_digest": "ad2f194675258f651ebac23e79d63696",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.6",
            "size": 463753,
            "upload_time": "2022-12-07T02:37:53",
            "upload_time_iso_8601": "2022-12-07T02:37:53.879213Z",
            "url": "https://files.pythonhosted.org/packages/04/de/93c62357e645263ac1b35427330c7893e3c4bc66e4a4c8804ce0d098c021/corpusit-0.1.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "md5": "19b6ed113375ea6b128997e550f80854",
                "sha256": "6673e797a17167990efb1f6bbfbccc7ba60a7634bd55e07cc337fc2d66caa42d"
            },
            "downloads": -1,
            "filename": "corpusit-0.1.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "has_sig": false,
            "md5_digest": "19b6ed113375ea6b128997e550f80854",
            "packagetype": "bdist_wheel",
            "python_version": "cp311",
            "requires_python": ">=3.6",
            "size": 463754,
            "upload_time": "2022-12-07T02:37:57",
            "upload_time_iso_8601": "2022-12-07T02:37:57.088361Z",
            "url": "https://files.pythonhosted.org/packages/0d/9f/b5a2194c8ddd3fee4290b40df4e68a6fca7631de3b7453ad8514366c79c6/corpusit-0.1.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "md5": "06a7a052c8b8e4b4bbc1d56a65562b63",
                "sha256": "372229db993d472ecc4b87de1fb3bc88986e024b2834d21510761759f899d6cd"
            },
            "downloads": -1,
            "filename": "corpusit-0.1.3-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "has_sig": false,
            "md5_digest": "06a7a052c8b8e4b4bbc1d56a65562b63",
            "packagetype": "bdist_wheel",
            "python_version": "cp36",
            "requires_python": ">=3.6",
            "size": 461943,
            "upload_time": "2022-12-07T02:37:40",
            "upload_time_iso_8601": "2022-12-07T02:37:40.169420Z",
            "url": "https://files.pythonhosted.org/packages/d7/2f/b7a6ef75adfb854a3efc8a4414f2138a1958919e9f6942e4e4e14b124967/corpusit-0.1.3-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "md5": "152f0768ba3239465747849a64e2c27a",
                "sha256": "d84dbd2cc16af49748f91e0b75c5bc912730d80be90cfd17d3f68ac3cc209544"
            },
            "downloads": -1,
            "filename": "corpusit-0.1.3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "has_sig": false,
            "md5_digest": "152f0768ba3239465747849a64e2c27a",
            "packagetype": "bdist_wheel",
            "python_version": "cp37",
            "requires_python": ">=3.6",
            "size": 463667,
            "upload_time": "2022-12-07T02:37:43",
            "upload_time_iso_8601": "2022-12-07T02:37:43.987962Z",
            "url": "https://files.pythonhosted.org/packages/bc/50/267cd5972cd99866fb0ca9e054dbe5f5684d91643e72212bf25d789761fa/corpusit-0.1.3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "md5": "838e3af2d9c8525e75c502f968777256",
                "sha256": "be155d45fc342f86e1a771792464012d29a956d576d02c4e77f66d35bccefcc7"
            },
            "downloads": -1,
            "filename": "corpusit-0.1.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "has_sig": false,
            "md5_digest": "838e3af2d9c8525e75c502f968777256",
            "packagetype": "bdist_wheel",
            "python_version": "cp38",
            "requires_python": ">=3.6",
            "size": 463448,
            "upload_time": "2022-12-07T02:37:47",
            "upload_time_iso_8601": "2022-12-07T02:37:47.263292Z",
            "url": "https://files.pythonhosted.org/packages/98/59/cb1f14beaa6ad52e7173ff768a07a2c746aac43457a8dde91ab207fd77bd/corpusit-0.1.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "md5": "f95b58f95379756421d1bfe28a3d6a35",
                "sha256": "63acb216bd759ddede98a9ff8dcb8141ad50125dd89c64ecf35043d33991c70a"
            },
            "downloads": -1,
            "filename": "corpusit-0.1.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "has_sig": false,
            "md5_digest": "f95b58f95379756421d1bfe28a3d6a35",
            "packagetype": "bdist_wheel",
            "python_version": "cp39",
            "requires_python": ">=3.6",
            "size": 463811,
            "upload_time": "2022-12-07T02:37:50",
            "upload_time_iso_8601": "2022-12-07T02:37:50.691319Z",
            "url": "https://files.pythonhosted.org/packages/c8/e4/c889f4fdf049001b325c2d9c7ebd1d3e94be2d5f8aa584361098e3180161/corpusit-0.1.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "md5": "6ccc9f2fc99011b033dc44eb79da9bb7",
                "sha256": "47b9e71848fb9a0a35472f6cb6aa45da190a1017af2d76e88e55ffa93c59b550"
            },
            "downloads": -1,
            "filename": "corpusit-0.1.3-pp37-pypy37_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "has_sig": false,
            "md5_digest": "6ccc9f2fc99011b033dc44eb79da9bb7",
            "packagetype": "bdist_wheel",
            "python_version": "pp37",
            "requires_python": ">=3.6",
            "size": 465552,
            "upload_time": "2022-12-07T02:38:00",
            "upload_time_iso_8601": "2022-12-07T02:38:00.357928Z",
            "url": "https://files.pythonhosted.org/packages/b0/f7/f251db5f98782379be8f830d4ceac0941634cf76aad6f8fc50d3ea51ccf6/corpusit-0.1.3-pp37-pypy37_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "md5": "093e36d75554e89bb966d9a22158e11e",
                "sha256": "eb432ae1d78114d354040dcb199b67ceaf277368d8d7de7931c7372f99e73545"
            },
            "downloads": -1,
            "filename": "corpusit-0.1.3-pp38-pypy38_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "has_sig": false,
            "md5_digest": "093e36d75554e89bb966d9a22158e11e",
            "packagetype": "bdist_wheel",
            "python_version": "pp38",
            "requires_python": ">=3.6",
            "size": 463525,
            "upload_time": "2022-12-07T02:38:03",
            "upload_time_iso_8601": "2022-12-07T02:38:03.353867Z",
            "url": "https://files.pythonhosted.org/packages/cb/21/ac40b83daaea66a5596c3c3468ecbc621757ba575de1871216b784226114/corpusit-0.1.3-pp38-pypy38_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "md5": "106ed07e223c14398b6a9e88041ae43c",
                "sha256": "1735ab179a14e15029a3218e1049ec9e8a29729baf71848b20b9940fd19be48b"
            },
            "downloads": -1,
            "filename": "corpusit-0.1.3-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "has_sig": false,
            "md5_digest": "106ed07e223c14398b6a9e88041ae43c",
            "packagetype": "bdist_wheel",
            "python_version": "pp39",
            "requires_python": ">=3.6",
            "size": 463758,
            "upload_time": "2022-12-07T02:38:06",
            "upload_time_iso_8601": "2022-12-07T02:38:06.684346Z",
            "url": "https://files.pythonhosted.org/packages/6d/3b/23916742d1a66e77decbe38f2f5af8a6174e4bdc41206c830e27ef17440b/corpusit-0.1.3-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "md5": "4ca58f50e3873b3bad381b7f0678a0ef",
                "sha256": "60cc146b8d4045bc75ad29257f352f647aba61933bc4c9caaa1207b90a5e4223"
            },
            "downloads": -1,
            "filename": "corpusit-0.1.3.tar.gz",
            "has_sig": false,
            "md5_digest": "4ca58f50e3873b3bad381b7f0678a0ef",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 71417,
            "upload_time": "2022-12-07T02:38:09",
            "upload_time_iso_8601": "2022-12-07T02:38:09.055483Z",
            "url": "https://files.pythonhosted.org/packages/20/96/99e6ac19c935b3c543f725ed767f03d4b331bdba33cefbca32413201d988/corpusit-0.1.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2022-12-07 02:38:09",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "lcname": "corpusit"
}
        