| Field | Value |
| --- | --- |
| Name | hist-w2v |
| Version | 0.1.4 |
| home_page | None |
| Summary | Tools for downloading, processing, and training word2vec models on Google Ngrams |
| upload_time | 2025-02-05 16:37:18 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.7 |
| license | MIT |
| keywords | nlp, word2vec, ngrams, history |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# hist_w2v: Tools for downloading, processing, and training word2vec models on Google Ngrams
This Python package is meant to help researchers use Google Ngrams to examine how words' meanings have changed over time. The tools assist with (1) downloading and pre-processing raw ngrams and (2) training `word2vec` models on a specified ngram corpus. After installing the package, the best way to learn these tools is to work through the provided Jupyter Notebook workflows.
## Package Contents
The library consists of the following modules and notebooks:
`src/ngram_tools`
1. `download_ngrams.py`: downloads the desired ngram types (e.g., 3-grams with part-of-speech [POS] tags, 5-grams without POS tags).
2. `convert_to_jsonl.py`: converts the raw-text ngrams from Google into a more flexible JSONL format.
3. `lowercase_ngrams.py`: makes the ngrams all lowercase.
4. `lemmatize_ngrams.py`: lemmatizes the ngrams (i.e., reduces them to their base grammatical forms).
5. `filter_ngrams.py`: screens out undesired tokens (e.g., stop words, numbers, words not in a vocabulary file) from the ngrams.
6. `sort_ngrams.py`: combines multiple ngram files into a single sorted file.
7. `consolidate_ngrams.py`: consolidates duplicate ngrams resulting from the previous steps (see the sketch following this list).
8. `index_and_create_vocabulary.py`: numerically indexes a list of unigrams and creates a "vocabulary file" to screen multigrams.
9. `create_yearly_files.py`: splits the master corpus into yearly sub-corpora.
10. `helpers/file_handler.py`: helper script to simplify reading and writing files in the other modules.
11. `helpers/print_jsonl_lines.py`: helper script to view a snippet of ngrams in a JSONL file.
12. `helpers/verify_sort.py`: helper script to confirm whether an ngram file is properly sorted.
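To make the data flow concrete, here is a minimal sketch of what the consolidation step does conceptually. The JSONL schema shown (`ngram` plus a per-year `freq` map) and the in-memory approach are illustrative assumptions; the package's actual field names and file handling may differ.

```python
import json
from collections import defaultdict

def consolidate_jsonl(in_path, out_path):
    """Merge duplicate ngram records by summing their per-year counts."""
    totals = defaultdict(lambda: defaultdict(int))  # ngram -> {year: count}
    with open(in_path, encoding="utf-8") as fin:
        for line in fin:
            rec = json.loads(line)  # assumed schema: {"ngram": "...", "freq": {"1900": 12, ...}}
            for year, count in rec["freq"].items():
                totals[rec["ngram"]][year] += count
    with open(out_path, "w", encoding="utf-8") as fout:
        for ngram in sorted(totals):
            record = {"ngram": ngram, "freq": dict(totals[ngram])}
            fout.write(json.dumps(record) + "\n")
```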
`src/training_tools`
1. `train_ngrams.py`: trains `word2vec` models on pre-processed multigram corpora (see the sketch following this list).
2. `evaluate_models.py`: evaluates training quality on intrinsic benchmarks (i.e., similarity and analogy tests).
3. `plotting.py`: plots various types of model results.
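As a rough illustration of the training step, the sketch below fits a `word2vec` model on one yearly sub-corpus using gensim. The use of gensim, the file path, and the JSONL field name are assumptions for the example; `train_ngrams.py` may use a different interface and hyperparameters.

```python
import json
from gensim.models import Word2Vec  # gensim assumed to be installed

def load_sentences(jsonl_path):
    """Read each ngram as a token list; assumes a {"ngram": "w1 w2 ..."} schema."""
    sentences = []
    with open(jsonl_path, encoding="utf-8") as fin:
        for line in fin:
            sentences.append(json.loads(line)["ngram"].split())
    return sentences

sentences = load_sentences("corpus_1900.jsonl")  # hypothetical yearly sub-corpus
model = Word2Vec(sentences, vector_size=300, window=4, min_count=5, workers=14)
model.wv.save("vectors_1900.kv")  # keep only the word vectors for later comparison
```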
`notebooks`
1. `workflow_unigrams.ipynb`: Jupyter Notebook showing how to download and preprocess unigrams.
2. `workflow_multigrams.ipynb`: Jupyter Notebook showing how to download and preprocess multigrams.
3. `workflow_training.ipynb`: Jupyter Notebook showing how to train, evaluate, and plot results from `word2vec` models.
Finally, the `training_results` folder stores a file containing evaluation metrics for a set of models.
## System Requirements
Unless you have a very powerful personal computer, the code is likely only practical to run on a high-performance computing (HPC) cluster; efficiently downloading, processing, and training models on ngrams in parallel requires many processors and a great deal of memory. On my university's HPC, I typically request 14 cores and 128 GB of RAM. A development priority is refactoring the code to run on individual systems.
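If you do run the pipeline on a cluster, one practical detail is sizing worker pools from the scheduler's allocation rather than the whole node. The snippet below assumes a SLURM environment (which exposes `SLURM_CPUS_PER_TASK`); other schedulers use different variables, and `process_shard` is a hypothetical stand-in for any single preprocessing step.

```python
import os
from multiprocessing import Pool

# Respect the scheduler's CPU allocation when present; fall back to the
# local CPU count otherwise.
n_workers = int(os.environ.get("SLURM_CPUS_PER_TASK", os.cpu_count() or 1))

def process_shard(path):
    """Placeholder for one preprocessing step applied to one input file."""
    ...

if __name__ == "__main__":
    shards = ["shard_00.jsonl", "shard_01.jsonl"]  # hypothetical input files
    with Pool(n_workers) as pool:
        pool.map(process_shard, shards)
```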
## Citing hist_w2v
If you use `hist_w2v` in your research or other publications, I kindly ask you to cite it. Use the citation information on the GitHub repository to generate citation text.
## License
This project is released under the [MIT License](https://github.com/eric-d-knowles/hist_w2v/blob/main/LICENSE).
Raw data
{
"_id": null,
"home_page": null,
"name": "hist-w2v",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": null,
"keywords": "nlp, word2vec, ngrams, history",
"author": null,
"author_email": "\"Eric D. Knowles\" <edk202@nyu.edu>",
"download_url": "https://files.pythonhosted.org/packages/6e/1a/3f29d376651947220ef30cade1eb4d4f49458b8e5108274c64e74fccd9e5/hist_w2v-0.1.4.tar.gz",
"platform": null,
"description": "# hist_w2v: Tools for downloading, processing, and training word2vec models on Google Ngrams\nThis Python package is meant to help researchers use Google Ngrams to examine how words' meanings have changed over time. The tools assist with (1) downloading and pre-processing raw ngrams and (2) training `word2vec` models on a specified ngram corpus. After installing, the best way to learn how to use these tools is to work through the provided Jupyter Notebook workflows.\n\n## Package Contents\nThe library consists of the following modules and notebooks:\n\n`src/ngram_tools`\n1. `downoad_ngrams.py`: downloads the desired ngram types (e.g., 3-grams with part-of-speech [POS] tags, 5-grams without POS tags).\n2. `convert_to_jsonl.py`: converts the raw-text ngrams from Google into a more flexible JSONL format.\n3. `lowercase_ngrams.py`: makes the ngrams all lowercase.\n4. `lemmatize_ngrams.py`: lemmatizes the ngrams (i.e., reduce them to their base grammatical forms).\n5. `filter_ngrams.py`: screens out undesired tokens (e.g., stop words, numbers, words not in a vocabulary file) from the ngrams.\n6. `sort_ngrams.py`: combines multiple ngrams files into a single sorted file.\n7. `consolidate_ngrams.py`: consolidates duplicate ngrams resulting from the previous steps.\n8. `index_and_create_vocabulary.py`: numerically indexes a list of unigrams and create a \"vocabulary file\" to screen multigrams.\n9. `create_yearly_files.py`: splits the master corpus into yearly sub-corpora.\n10. `helpers/file_handler.py`: helper script to simplify reading and writing files in the other modules.\n11. `helpers/print_jsonl_lines.py`: helper script to view a snippet of ngrams in a JSONL file.\n12. `helpers/verify_sort.py`: helper script to confirm whether an ngram file is properly sorted. \n\n`src/training_tools`\n1. `train_ngrams.py`: train `word2vec` models on pre-processed multigram corpora.\n2. `evaluate_models.py`: evaluate training quality on intrinsic benchmarks (i.e., similarity and analogy tests).\n3. `plotting.py`: plot various types of model results.\n\n`notebooks`\n1. `workflow_unigrams.ipynb`: Jupyter Notebook showing how to download and preprocess unigrams.\n2. `workflow_multigrams.ipynb`: Jupyter Notebook showing how to download and preprocess multigrams.\n3. `workflow_training.ipynb`: Jupyter Notebook showing how to train, evaluate, and plots results from `word2vec` models.\n\nFinally, the `training_results` folder is where a file containing evaluation metrics for a set of models is stored. \n\n## System Requirements\nUnless you have an very powerful personal computer, the code is lilely only suitable to run on a high-performance computing (HPC) cluster; efficiently downloading, processing, and training models on ngrams in parallel takes lots of processors and memory. On my university's HPC, I typically request 14 cores and 128G of RAM. A priority for development is refactoring the code for individual systems.\n\n## Citing hist_w2v\nIf you use `hist_w2v` in your research or other publications, I kindly ask you to cite it. Use the GitHub citation to create citation text.\n\n## License\n\nThis project is released under the [MIT License](https://github.com/eric-d-knowles/hist_w2v/blob/main/LICENSE).\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "Tools for downloading, processing, and training word2vec models on Google Ngrams",
"version": "0.1.4",
"project_urls": {
"Homepage": "https://github.com/eric-d-knowles/hist_w2v"
},
"split_keywords": [
"nlp",
" word2vec",
" ngrams",
" history"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "323dc75ff24a4155745ea0475c95b7bbbc3d220d1f6c9d3461f2a243b9241fc3",
"md5": "c70213e388320280a33c9b6ae0968810",
"sha256": "eb5202deba304bc9bdea4cf5690631a7bceed5f51666afd7ab291f20b2e3dc46"
},
"downloads": -1,
"filename": "hist_w2v-0.1.4-py3-none-any.whl",
"has_sig": false,
"md5_digest": "c70213e388320280a33c9b6ae0968810",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 54144,
"upload_time": "2025-02-05T16:37:16",
"upload_time_iso_8601": "2025-02-05T16:37:16.245719Z",
"url": "https://files.pythonhosted.org/packages/32/3d/c75ff24a4155745ea0475c95b7bbbc3d220d1f6c9d3461f2a243b9241fc3/hist_w2v-0.1.4-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "6e1a3f29d376651947220ef30cade1eb4d4f49458b8e5108274c64e74fccd9e5",
"md5": "189f2f9887db39829cae097b80beb3c3",
"sha256": "4d9fdd722f67d0539c6c2e7630304dd1ccf07361b546341bb75e543a9f02dfec"
},
"downloads": -1,
"filename": "hist_w2v-0.1.4.tar.gz",
"has_sig": false,
"md5_digest": "189f2f9887db39829cae097b80beb3c3",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 39027,
"upload_time": "2025-02-05T16:37:18",
"upload_time_iso_8601": "2025-02-05T16:37:18.436835Z",
"url": "https://files.pythonhosted.org/packages/6e/1a/3f29d376651947220ef30cade1eb4d4f49458b8e5108274c64e74fccd9e5/hist_w2v-0.1.4.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-02-05 16:37:18",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "eric-d-knowles",
"github_project": "hist_w2v",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "hist-w2v"
}