# Corpusit
`corpusit` provides easy-to-use dataset iterators for natural language modeling
tasks, such as SkipGram.
It is written in Rust to enable fast, multithreaded random sampling with
deterministic results, so you don't have to trade speed for
reproducibility.
`corpusit` does not provide tokenization, so please use it
on tokenized corpus files (plain text).
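
Since tokenization is left to the user, a corpus usually has to be prepared before `corpusit` can read it. The following is a minimal sketch, assuming that lowercasing and a simple word/punctuation split are good enough for your data. The helpers `tokenize` and `write_tokenized_corpus` and the regular expression are illustrative only and are not part of `corpusit`.

```python
import re

# Hypothetical helper: a deliberately simple tokenizer, not part of corpusit.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return _TOKEN_RE.findall(text.lower())


def write_tokenized_corpus(documents, path):
    """Write one document per line, with tokens separated by single spaces."""
    with open(path, "w", encoding="utf-8") as f:
        for doc in documents:
            tokens = tokenize(doc)
            if tokens:  # skip documents that contain no tokens
                f.write(" ".join(tokens) + "\n")
```

For real projects, a dedicated tokenizer is likely a better choice; the only requirement on the output is a plain-text file that the library can split on the separator you pass to it.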
# Environment
Python >= 3.6
# Installation
```bash
$ pip install corpusit
```
## On Windows and macOS
Please install the [Rust](https://www.rust-lang.org/tools/install) compiler before
running `pip install corpusit`.
# Usage
## SkipGram
Each line in the corpus file is a document, and the tokens should be separated by whitespace.
```python
import corpusit
corpus_path = 'corpusit/data/corpus.txt'
vocab = corpusit.Vocab.build(corpus_path, min_count=1, unk='<unk>')
dataset = corpusit.SkipGramDataset(
    path_to_corpus=corpus_path,
    vocab=vocab,
    win_size=10,
    sep=" ",
    mode="onepass",  # onepass | repeat | shuffle
    subsample=1e-3,
    power=0.75,
    n_neg=1,
)
it = dataset.positive_sampler(batch_size=100, seed=0, num_threads=4)
for i, pair in enumerate(it):
    print(f'Iter {i:>4d}, shape={pair.shape}. First pair: '
          f'{pair[0,0]:>5} ({vocab.i2s[pair[0,0]]:>10}), '
          f'{pair[0,1]:>5} ({vocab.i2s[pair[0,1]]:>10})')
# Return:
# Iter 0, shape=(100, 2). First pair: 14 ( is), 10 ( anarchism)
# Iter 1, shape=(100, 2). First pair: 8 ( to), 540 ( and/)
# Iter 2, shape=(100, 2). First pair: 775 (constitutes), 34 (anarchists)
# Iter 3, shape=(100, 2). First pair: 72 ( other), 214 ( criteria)
# Iter 4, shape=(100, 2). First pair: 650 ( defining), 487 ( companion)
# ...
```
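
To make the notion of a positive pair concrete, the pure-Python sketch below enumerates (center, context) pairs for one tokenized line. It only illustrates the general SkipGram idea: `skipgram_pairs` is a hypothetical helper, and treating `win_size` as the maximum distance on each side of the center token is an assumption about how the library interprets that parameter. The real sampler also applies subsampling and random, multithreaded sampling, so it will not yield pairs in this order.

```python
def skipgram_pairs(tokens, win_size):
    """Yield (center, context) pairs whose positions differ by at most win_size."""
    for i, center in enumerate(tokens):
        lo = max(0, i - win_size)
        hi = min(len(tokens), i + win_size + 1)
        for j in range(lo, hi):
            if j != i:  # a token is never its own context
                yield center, tokens[j]


line = "anarchism is a political philosophy"
pairs = list(skipgram_pairs(line.split(), win_size=1))
print(pairs[:3])
# [('anarchism', 'is'), ('is', 'anarchism'), ('is', 'a')]
```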
## SkipGram with negative sampling
```python
it = dataset.sampler(100, seed=0, num_threads=4)
for i, res in enumerate(it):
    pair, label = res
    print(f'Iter {i:>4d}, shape={pair.shape}. First pair: '
          f'{pair[0,0]:>5} ({vocab.i2s[pair[0,0]]:>10}), '
          f'{pair[0,1]:>5} ({vocab.i2s[pair[0,1]]:>10}), '
          f'label = {label[0]}')
# Returns:
# Iter 0, shape=(200, 2). First pair: 15 ( is), 10 ( anarchism), label = True
# Iter 1, shape=(200, 2). First pair: 9 ( to), 722 ( and/), label = True
# Iter 2, shape=(200, 2). First pair: 389 (constitutes), 34 (anarchists), label = True
# Iter 3, shape=(200, 2). First pair: 73 ( other), 212 ( criteria), label = True
# Iter 4, shape=(200, 2). First pair: 445 ( defining), 793 ( companion), label = True
# ...
```
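
The `(200, 2)` shape above is consistent with each batch holding 100 positive pairs plus `n_neg=1` negative pair per positive. The sketch below illustrates the common word2vec-style recipe, in which negative contexts are drawn from the unigram distribution raised to a power (0.75 here, mirroring `power=0.75`). Whether `corpusit` follows exactly this recipe is an assumption; `noise_weights` and `add_negatives` are hypothetical helpers, and the ordering of positives and negatives in the library's real output may differ.

```python
import random
from collections import Counter


def noise_weights(counts, power=0.75):
    """Return token ids and unnormalized weights count ** power for each id."""
    ids = sorted(counts)
    return ids, [counts[i] ** power for i in ids]


def add_negatives(pos_pairs, counts, n_neg=1, power=0.75, seed=0):
    """Return (pairs, labels): positives first, then n_neg negatives per positive."""
    rng = random.Random(seed)  # a fixed seed keeps the sketch reproducible
    ids, weights = noise_weights(counts, power)
    positives = list(pos_pairs)
    pairs = list(positives)
    labels = [True] * len(positives)
    for center, _ in positives:
        # Draw negative contexts from the smoothed unigram distribution.
        for context in rng.choices(ids, weights=weights, k=n_neg):
            pairs.append((center, context))
            labels.append(False)
    return pairs, labels


# Toy token-id counts and two positive (center, context) pairs.
counts = Counter({0: 50, 1: 30, 2: 15, 3: 5})
pairs, labels = add_negatives([(0, 1), (2, 3)], counts, n_neg=1)
print(len(pairs), labels)
# 4 [True, True, False, False]
```

Raising counts to a power below 1 flattens the distribution, so rare tokens are picked as negatives more often than their raw frequency would suggest.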
# Roadmap
- GloVe
# License
MIT
Raw data
{
"_id": null,
"home_page": null,
"name": "corpusit",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": null,
"keywords": "natural language modeling,corpus,skipgram",
"author": "Xin Du",
"author_email": "duxin.ac@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/20/96/99e6ac19c935b3c543f725ed767f03d4b331bdba33cefbca32413201d988/corpusit-0.1.3.tar.gz",
"platform": null,
"description": "# Corpusit\n`corpusit` provides easy-to-use dataset iterators for natural language modeling\ntasks, such as SkipGram.\n\nIt is written in rust to enable fast multi-threading random sampling with\ndeterministic results. So you dont have to worry about the speed /\nreproducibility.\n\nCorpusit does not provide tokenization functionalities. So please use `corpusit`\non tokenized corpus files (plain texts).\n\n# Environment\n\nPython >= 3.6\n\n# Installation\n\n```bash\n$ pip install corpusit\n```\n\n## On Windows and MacOS\n\nPlease install [rust](https://www.rust-lang.org/tools/install) compiler before\nexecuting `pip install corpusit`. \n\n# Usage\n\n## SkipGram\n\nEach line in the corpus file is a document, and the tokens should be separated by whitespace.\n\n```python\nimport corpusit\n\ncorpus_path = 'corpusit/data/corpus.txt'\nvocab = corpusit.Vocab.build(corpus_path, min_count=1, unk='<unk>')\n\ndataset = corpusit.SkipGramDataset(\n path_to_corpus=corpus_path,\n vocab=vocab,\n win_size=10,\n sep=\" \",\n mode=\"onepass\", # onepass | repeat | shuffle\n subsample=1e-3,\n power=0.75,\n n_neg=1,\n)\n\nit = dataset.positive_sampler(batch_size=100, seed=0, num_threads=4)\n\nfor i, pair in enumerate(it):\n print(f'Iter {i:>4d}, shape={pair.shape}. First pair: '\n f'{pair[0,0]:>5} ({vocab.i2s[pair[0,0]]:>10}), '\n f'{pair[0,1]:>5} ({vocab.i2s[pair[0,1]]:>10})')\n\n# Return:\n# Iter 0, shape=(100, 2). First pair: 14 ( is), 10 ( anarchism)\n# Iter 1, shape=(100, 2). First pair: 8 ( to), 540 ( and/)\n# Iter 2, shape=(100, 2). First pair: 775 (constitutes), 34 (anarchists)\n# Iter 3, shape=(100, 2). First pair: 72 ( other), 214 ( criteria)\n# Iter 4, shape=(100, 2). First pair: 650 ( defining), 487 ( companion)\n# ...\n```\n\n\n## SkipGram with negative sampling\n```python\nit = dataset.sampler(100, seed=0, num_threads=4)\n\nfor i, res in enumerate(it):\n pair, label = res\n print(f'Iter {i:>4d}, shape={pair.shape}. 
First pair: '\n f'{pair[0,0]:>5} ({vocab.i2s[pair[0,0]]:>10}), '\n f'{pair[0,1]:>5} ({vocab.i2s[pair[0,1]]:>10}), '\n f'label = {label[0]}')\n\n# Returns:\n# Iter 0, shape=(200, 2). First pair: 15 ( is), 10 ( anarchism), label = True\n# Iter 1, shape=(200, 2). First pair: 9 ( to), 722 ( and/), label = True\n# Iter 2, shape=(200, 2). First pair: 389 (constitutes), 34 (anarchists), label = True\n# Iter 3, shape=(200, 2). First pair: 73 ( other), 212 ( criteria), label = True\n# Iter 4, shape=(200, 2). First pair: 445 ( defining), 793 ( companion), label = True\n# ...\n```\n\n# Roadmap\n- GloVe\n\n\n# License\nMIT\n",
"bugtrack_url": null,
"license": null,
"summary": null,
"version": "0.1.3",
"split_keywords": [
"natural language modeling",
"corpus",
"skipgram"
],
"urls": [
{
"comment_text": null,
"digests": {
"md5": "ad2f194675258f651ebac23e79d63696",
"sha256": "d7ce015590031f15566fd841dce0e82aacc034f1aeba8175d906099147dd4798"
},
"downloads": -1,
"filename": "corpusit-0.1.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"has_sig": false,
"md5_digest": "ad2f194675258f651ebac23e79d63696",
"packagetype": "bdist_wheel",
"python_version": "cp310",
"requires_python": ">=3.6",
"size": 463753,
"upload_time": "2022-12-07T02:37:53",
"upload_time_iso_8601": "2022-12-07T02:37:53.879213Z",
"url": "https://files.pythonhosted.org/packages/04/de/93c62357e645263ac1b35427330c7893e3c4bc66e4a4c8804ce0d098c021/corpusit-0.1.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"md5": "19b6ed113375ea6b128997e550f80854",
"sha256": "6673e797a17167990efb1f6bbfbccc7ba60a7634bd55e07cc337fc2d66caa42d"
},
"downloads": -1,
"filename": "corpusit-0.1.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"has_sig": false,
"md5_digest": "19b6ed113375ea6b128997e550f80854",
"packagetype": "bdist_wheel",
"python_version": "cp311",
"requires_python": ">=3.6",
"size": 463754,
"upload_time": "2022-12-07T02:37:57",
"upload_time_iso_8601": "2022-12-07T02:37:57.088361Z",
"url": "https://files.pythonhosted.org/packages/0d/9f/b5a2194c8ddd3fee4290b40df4e68a6fca7631de3b7453ad8514366c79c6/corpusit-0.1.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"md5": "06a7a052c8b8e4b4bbc1d56a65562b63",
"sha256": "372229db993d472ecc4b87de1fb3bc88986e024b2834d21510761759f899d6cd"
},
"downloads": -1,
"filename": "corpusit-0.1.3-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"has_sig": false,
"md5_digest": "06a7a052c8b8e4b4bbc1d56a65562b63",
"packagetype": "bdist_wheel",
"python_version": "cp36",
"requires_python": ">=3.6",
"size": 461943,
"upload_time": "2022-12-07T02:37:40",
"upload_time_iso_8601": "2022-12-07T02:37:40.169420Z",
"url": "https://files.pythonhosted.org/packages/d7/2f/b7a6ef75adfb854a3efc8a4414f2138a1958919e9f6942e4e4e14b124967/corpusit-0.1.3-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"md5": "152f0768ba3239465747849a64e2c27a",
"sha256": "d84dbd2cc16af49748f91e0b75c5bc912730d80be90cfd17d3f68ac3cc209544"
},
"downloads": -1,
"filename": "corpusit-0.1.3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"has_sig": false,
"md5_digest": "152f0768ba3239465747849a64e2c27a",
"packagetype": "bdist_wheel",
"python_version": "cp37",
"requires_python": ">=3.6",
"size": 463667,
"upload_time": "2022-12-07T02:37:43",
"upload_time_iso_8601": "2022-12-07T02:37:43.987962Z",
"url": "https://files.pythonhosted.org/packages/bc/50/267cd5972cd99866fb0ca9e054dbe5f5684d91643e72212bf25d789761fa/corpusit-0.1.3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"md5": "838e3af2d9c8525e75c502f968777256",
"sha256": "be155d45fc342f86e1a771792464012d29a956d576d02c4e77f66d35bccefcc7"
},
"downloads": -1,
"filename": "corpusit-0.1.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"has_sig": false,
"md5_digest": "838e3af2d9c8525e75c502f968777256",
"packagetype": "bdist_wheel",
"python_version": "cp38",
"requires_python": ">=3.6",
"size": 463448,
"upload_time": "2022-12-07T02:37:47",
"upload_time_iso_8601": "2022-12-07T02:37:47.263292Z",
"url": "https://files.pythonhosted.org/packages/98/59/cb1f14beaa6ad52e7173ff768a07a2c746aac43457a8dde91ab207fd77bd/corpusit-0.1.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"md5": "f95b58f95379756421d1bfe28a3d6a35",
"sha256": "63acb216bd759ddede98a9ff8dcb8141ad50125dd89c64ecf35043d33991c70a"
},
"downloads": -1,
"filename": "corpusit-0.1.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"has_sig": false,
"md5_digest": "f95b58f95379756421d1bfe28a3d6a35",
"packagetype": "bdist_wheel",
"python_version": "cp39",
"requires_python": ">=3.6",
"size": 463811,
"upload_time": "2022-12-07T02:37:50",
"upload_time_iso_8601": "2022-12-07T02:37:50.691319Z",
"url": "https://files.pythonhosted.org/packages/c8/e4/c889f4fdf049001b325c2d9c7ebd1d3e94be2d5f8aa584361098e3180161/corpusit-0.1.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"md5": "6ccc9f2fc99011b033dc44eb79da9bb7",
"sha256": "47b9e71848fb9a0a35472f6cb6aa45da190a1017af2d76e88e55ffa93c59b550"
},
"downloads": -1,
"filename": "corpusit-0.1.3-pp37-pypy37_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"has_sig": false,
"md5_digest": "6ccc9f2fc99011b033dc44eb79da9bb7",
"packagetype": "bdist_wheel",
"python_version": "pp37",
"requires_python": ">=3.6",
"size": 465552,
"upload_time": "2022-12-07T02:38:00",
"upload_time_iso_8601": "2022-12-07T02:38:00.357928Z",
"url": "https://files.pythonhosted.org/packages/b0/f7/f251db5f98782379be8f830d4ceac0941634cf76aad6f8fc50d3ea51ccf6/corpusit-0.1.3-pp37-pypy37_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"md5": "093e36d75554e89bb966d9a22158e11e",
"sha256": "eb432ae1d78114d354040dcb199b67ceaf277368d8d7de7931c7372f99e73545"
},
"downloads": -1,
"filename": "corpusit-0.1.3-pp38-pypy38_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"has_sig": false,
"md5_digest": "093e36d75554e89bb966d9a22158e11e",
"packagetype": "bdist_wheel",
"python_version": "pp38",
"requires_python": ">=3.6",
"size": 463525,
"upload_time": "2022-12-07T02:38:03",
"upload_time_iso_8601": "2022-12-07T02:38:03.353867Z",
"url": "https://files.pythonhosted.org/packages/cb/21/ac40b83daaea66a5596c3c3468ecbc621757ba575de1871216b784226114/corpusit-0.1.3-pp38-pypy38_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"md5": "106ed07e223c14398b6a9e88041ae43c",
"sha256": "1735ab179a14e15029a3218e1049ec9e8a29729baf71848b20b9940fd19be48b"
},
"downloads": -1,
"filename": "corpusit-0.1.3-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"has_sig": false,
"md5_digest": "106ed07e223c14398b6a9e88041ae43c",
"packagetype": "bdist_wheel",
"python_version": "pp39",
"requires_python": ">=3.6",
"size": 463758,
"upload_time": "2022-12-07T02:38:06",
"upload_time_iso_8601": "2022-12-07T02:38:06.684346Z",
"url": "https://files.pythonhosted.org/packages/6d/3b/23916742d1a66e77decbe38f2f5af8a6174e4bdc41206c830e27ef17440b/corpusit-0.1.3-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"md5": "4ca58f50e3873b3bad381b7f0678a0ef",
"sha256": "60cc146b8d4045bc75ad29257f352f647aba61933bc4c9caaa1207b90a5e4223"
},
"downloads": -1,
"filename": "corpusit-0.1.3.tar.gz",
"has_sig": false,
"md5_digest": "4ca58f50e3873b3bad381b7f0678a0ef",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 71417,
"upload_time": "2022-12-07T02:38:09",
"upload_time_iso_8601": "2022-12-07T02:38:09.055483Z",
"url": "https://files.pythonhosted.org/packages/20/96/99e6ac19c935b3c543f725ed767f03d4b331bdba33cefbca32413201d988/corpusit-0.1.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2022-12-07 02:38:09",
"github": false,
"gitlab": false,
"bitbucket": false,
"lcname": "corpusit"
}