topic-context-model

Name: topic-context-model
Version: 0.1.2
Summary: Topic Context Model (TCM).
Author: J. Nathanael Philipp <nathanael@philipp.land>
Homepage: https://github.com/jnphilipp/tcm
License: GPLv3+
Requires Python: >=3.10
Keywords: topic context model, tcm, lda, lsa
Uploaded: 2024-03-21 13:44:25

# Topic Context Model (TCM)

Calculates the surprisal of a word given a context.
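Here "surprisal" is meant in the information-theoretic sense, roughly `-log2 p(word | context)`, with the probability estimated from an LDA or LSA topic model; see the references below for the exact formulation used.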

![Tests](https://github.com/jnphilipp/tcm/actions/workflows/tests.yml/badge.svg)

## Requirements

* Python >= 3.10
* scipy
* scikit-learn
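
The package is published on PyPI under the name `topic-context-model`, so it can be installed with pip, for example:

```
$ pip install topic-context-model
```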

## Usage

```
$ python tcm.py -h
usage: tcm [-h] [-V] [-m {lda,lsa}] [--model-file MODEL_FILE] [--data DATA [DATA ...]]
           [--fields FIELDS [FIELDS ...]] [--words WORDS] [--n-components N_COMPONENTS]
           [--doc-topic-prior DOC_TOPIC_PRIOR] [--topic-word-prior TOPIC_WORD_PRIOR]
           [--learning-method LEARNING_METHOD] [--learning-decay LEARNING_DECAY] [--learning-offset LEARNING_OFFSET]
           [--max-iter MAX_ITER] [--batch-size BATCH_SIZE] [--evaluate-every EVALUATE_EVERY] [--perp-tol PERP_TOL]
           [--mean-change-tol MEAN_CHANGE_TOL] [--max-doc-update-iter MAX_DOC_UPDATE_ITER] [--n-jobs N_JOBS]
           [--random-state RANDOM_STATE] [-v] [--log-format LOG_FORMAT] [--log-file LOG_FILE]
           [--log-file-format LOG_FILE_FORMAT]
           {train,surprisal} [{train,surprisal} ...]

positional arguments:
  {train,surprisal}     what to do, train lda/lsa or calculate surprisal.

options:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  -m {lda,lsa}, --model {lda,lsa}
                        which model to use. (default: lda)
  --model-file MODEL_FILE
                        file to load model from or save to, if path exists tries to load model. (default: lda.jl.z)
  --data DATA [DATA ...]
                        file(s) to load texts from, either txt or csv optionally gzip compressed. (default: None)
  --fields FIELDS [FIELDS ...]
                        field(s) to load texts when using csv data. (default: None)
  --words WORDS         file to load words from and/or save to, either txt or json optionally gzip compressed. (default: words.txt.gz)
  -v, --verbose         verbosity level; multiple times increases the level, the maximum is 3, for debugging. (default: 0)
  --log-format LOG_FORMAT
                        set logging format. (default: %(message)s)
  --log-file LOG_FILE   log output to a file. (default: None)
  --log-file-format LOG_FILE_FORMAT
                        set logging format for log file. (default: [%(levelname)s] %(message)s)

LDA config:
  --n-components N_COMPONENTS
                        number of topics. (default: 10)
  --doc-topic-prior DOC_TOPIC_PRIOR
                        prior of document topic distribution `theta`. If the value is None, defaults to `1 / n_components`. (default: None)
  --topic-word-prior TOPIC_WORD_PRIOR
                        prior of topic word distribution `beta`. If the value is None, defaults to `1 / n_components`. (default: None)
  --learning-method LEARNING_METHOD
                        method used to update `_component`. (default: batch)
  --learning-decay LEARNING_DECAY
                        it is a parameter that control learning rate in the online learning method. The value should be set between (0.5, 1.0] to guarantee asymptotic convergence. When the value is 0.0 and batch_size is `n_samples`, the update method is same as batch learning. In the literature, this is called kappa. (default: 0.7)
  --learning-offset LEARNING_OFFSET
                        a (positive) parameter that downweights early iterations in online learning.  It should be greater than 1.0. In the literature, this is called tau_0. (default: 10.0)
  --max-iter MAX_ITER   the maximum number of passes over the training data (aka epochs). (default: 10)
  --batch-size BATCH_SIZE
                        number of documents to use in each EM iteration. Only used in online learning. (default: 128)
  --evaluate-every EVALUATE_EVERY
                        how often to evaluate perplexity. Set it to 0 or negative number to not evaluate perplexity in training at all. Evaluating perplexity can help you check convergence in training process, but it will also increase total training time. Evaluating perplexity in every iteration might increase training time up to two-fold. (default: -1)
  --perp-tol PERP_TOL   perplexity tolerance in batch learning. Only used when `evaluate_every` is greater than 0. (default: 0.1)
  --mean-change-tol MEAN_CHANGE_TOL
                        stopping tolerance for updating document topic distribution in E-step. (default: 0.001)
  --max-doc-update-iter MAX_DOC_UPDATE_ITER
                        max number of iterations for updating document topic distribution in the E-step. (default: 100)
  --n-jobs N_JOBS       the number of jobs to use in the E-step. `None` means 1. `-1` means using all processors. (default: None)
  --random-state RANDOM_STATE
                        pass an int for reproducible results across multiple function calls. (default: None)
```
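
A minimal workflow sketch (the corpus file name and the parameter values here are placeholders, not part of the package): first train a model on a corpus, then compute surprisal values with the saved model.

```
$ python tcm.py train -m lda --model-file lda.jl.z --data corpus.txt --n-components 10
$ python tcm.py surprisal -m lda --model-file lda.jl.z --data corpus.txt --words words.txt.gz
```

Since the positional argument accepts multiple actions, the usage line suggests both steps can also be combined into a single `train surprisal` invocation.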

## References
* [Max Kölbl, Yuki Kyogoku, J. Nathanael Philipp, Michael Richter, Tariq Yousef: Keyword extraction in German: Information-theory vs. deep learning. ICAART 2020 Special Session NLPinAI, Volume: Vol. 1: 459 - 464](https://doi.org/10.1007/978-3-030-63787-3_5)
* [Max Kölbl, Yuki Kyogoku, J. Nathanael Philipp, Michael Richter, Clemens Rietdorf, and Tariq Yousef: The Semantic Level of Shannon Information: Are Highly Informative Words Good Keywords? A Study on German. Natural Language Processing in Artificial Intelligence - NLPinAI 2020 939 (2021): 139-161.](https://doi.org/10.1007/978-3-030-63787-3_5)
* [Nathanael Philipp, Max Kölbl, Yuki Kyogoku, Tariq Yousef, Michael Richter (2022) One Step Beyond: Keyword Extraction in German Utilising Surprisal from Topic Contexts. In: Arai, K. (eds) Intelligent Computing. SAI 2022. Lecture Notes in Networks and Systems, vol 507. Springer, Cham. doi: 10.1007/978-3-031-10464-0_53](https://doi.org/10.1007/978-3-031-10464-0_53)

            
