thesaurus-generator


Name: thesaurus-generator
Version: 0.0.7
Home page: https://github.com/pegondo/thesaurus_generator
Summary: An automatic thesaurus generator.
Author: José Alonso González
License: MIT
Keywords: thesaurus, generation, automatic, nlp
Upload time: 2023-08-20 16:59:48
# thesaurus_generator

This module provides a simple tool to automatically generate a Spanish thesaurus from a given txt file.

# Installation

You can install this module from PyPI:

```bash
pip install thesaurus-generator
```

# Usage

Here is a simple example:

```python
from thesaurus_generator import ThesaurusGenerator

# Generate the thesaurus.
t = ThesaurusGenerator()
thesaurus = t.generate('./topics/topic_2.txt')

# Save the thesaurus in JSON format.
t.save_thesaurus('thesaurus.json')
```
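
Since `save_thesaurus` writes plain JSON, you can read the result back with the standard library. This is a minimal sketch; the exact structure of the thesaurus object is not documented above, so it is only loaded and counted here:

```python
import json

# Load the thesaurus saved by save_thesaurus above.
with open('thesaurus.json', encoding='utf-8') as f:
    thesaurus = json.load(f)

# Assumption: the top-level value is a JSON object (mapping).
print(f'Loaded {len(thesaurus)} thesaurus entries')
```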

# Configuration

As in the example above, you can use this module without providing any configuration, as the default configuration works well in most cases. However, if you want to customize the behavior, you can provide your own configuration.
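
How a custom configuration is passed in is not shown above; here is a minimal sketch assuming the `ThesaurusGenerator` constructor accepts a configuration dict through a `config` parameter (the parameter name is an assumption — check the package source):

```python
from thesaurus_generator import ThesaurusGenerator

# Hypothetical: `config=` is an assumed parameter name, not a documented API.
custom_config = {
    'verbose': True,                        # log each pipeline step
    'thesaurus_similarity_threshold': 0.9,  # demand stronger similarity
}

t = ThesaurusGenerator(config=custom_config)
thesaurus = t.generate('./topics/topic_2.txt')
```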

Here is a summary of the supported configuration:

- `verbose` defines whether logs describing the process are printed.
- `use_spacy` defines whether the pipeline uses Spacy's pipeline. It is True by default, and should stay
  True unless you have an incompatibility with Spacy: https://spacy.io/
- `use_stanza` defines whether the pipeline uses Stanza's pipeline. It is True by default, and should stay
  True unless you have an incompatibility with Stanza: https://stanfordnlp.github.io/stanza/
- `key_terms` defines the configuration to extract the most important terms from the text. If its value is `'auto'`,
  then the default configuration will be used. The configuration has the following format:
  - `config` is an array of three objects defining how the most important terms of each length are extracted:
    the object at index `i` configures the extraction of terms formed by `i + 1` words. Each object must contain:
    - `criteria` defines the criteria used to extract the key terms. It can be `tf-idf`, which scores the
      relevance of each term with a TF-IDF index; `text-relevance`, which scores the relevance of each term
      by the similarity between the embedding of the whole text and the embedding of the term; or `both`,
      which averages the two metrics.
    - `ratio` defines the ratio of the highest-scoring terms that will be kept. E.g., a ratio of 0.05
      indicates that the 5% of terms with the highest scores will be kept (a short sketch after this list
      illustrates `ratio` and `remove_stop_words`).
    - `remove_stop_words` defines the criterion used to remove stop words. The possible values are `None`,
      which keeps all terms; `'hard'`, which discards any term that contains a stop word; and `'soft'`,
      which only discards terms formed entirely of stop words.
  - `stop_words` is a list with the stop words to be considered.
- `key_terms_from_models` defines the configuration used to extract key terms using external models. If the value is
  `'auto'`, the default configuration is loaded. The configuration has the following format:
  - `models` defines the models to be used, as a comma-separated string. E.g.: `textrank,keybert,yake`.
    The available models are count, keybert, rake, spacy, textrank and yake.
  - `verbose` defines whether the number of terms extracted by each model is logged.
  - `count.ratio` defines the ratio of terms to be kept when using the count model. This model considers most
    important the terms that are repeated most often in the text.
  - `keybert` defines the configurations to use for this model. As the KeyBERT model
    (https://github.com/MaartenGr/KeyBERT) supports a wide variety of configurations, this piece of configuration
    accepts an array of configurations; the model is run once per configuration and the extracted terms are merged.
    Each element must be an object with the properties `diversity`, `nr_candidates`, `num_terms`, `use_maxsum` and
    `use_mmr`; their meaning is documented at https://github.com/MaartenGr/KeyBERT
  - `rake.ratio` defines the ratio of terms to be kept when using the rake model: https://pypi.org/project/rake-nltk/
  - `spacy.ratio` defines the ratio of terms to be kept when using the spacy model: https://spacy.io/
  - `textrank.ratio` defines the ratio of terms to be kept when using the textrank model:
    https://github.com/davidadamojr/TextRank
  - `yake.num_terms` defines the number of terms to be kept when using the yake model: https://pypi.org/project/yake/
- `special_characters` is a list of characters; any term that contains one of them will be discarded. You can provide
  an empty array to disable this feature.
- `filter_terms` defines the criteria used to discard irrelevant terms. If its value is `'auto'`, then the default
  configuration will be used. The configuration has the following format:
  - `criteria` defines the criteria used to filter terms. If the value is `'included'`, only the terms that match the
    patterns in `included_pos_tagging` will be kept; if the value is `'excluded'`, the terms that match the patterns
    in `excluded_pos_tagging` will be discarded.
  - `pos_tagging_groups` defines a mapping between keywords and Spacy POS tagging terms
    (https://web.archive.org/web/20190206204307/).
  - `included_pos_tagging` is a list of patterns to keep in the extracted terms. Each element is a list of keys
    from `pos_tagging_groups`.
  - `excluded_pos_tagging` is a list of patterns to discard from the extracted terms. Each element is a list of keys
    from `pos_tagging_groups`.
- `similarity` defines the similarity measure between the terms that appear in the generated thesaurus. If its value is
  `'auto'`, then the default configuration will be used. The configuration has the following format (an illustrative
  computation appears after the default configuration below):
  - `metric` is the metric used to calculate the similarity. It can be `'spacy'`, which uses the document similarity
    defined by Spacy (https://spacy.io/api/doc); `'transformers'`, which uses the similarity defined in
    sentence_transformers (https://pypi.org/project/sentence-transformers/); or `'tfhub'`, which uses the similarity
    defined by this TF-hub model: https://tfhub.dev/google/universal-sentence-encoder/4
  - `remove_stop_words` defines whether stop words are removed from the terms before computing the metric.
- `thesaurus_similarity_threshold` defines the minimum similarity score two terms need for the pair to be included in
  the generated thesaurus.
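
To make the `ratio` and `remove_stop_words` options concrete, here is a self-contained illustration of how a ratio cutoff and the `'hard'`/`'soft'` criteria behave. The scores and stop words are made up for the example; this is not the package's internal code:

```python
# Made-up term scores, as a ranking step might produce them.
scored_terms = {
    'energía solar': 0.91, 'panel': 0.84, 'la red': 0.82,
    'de la': 0.80, 'inversor': 0.30, 'célula': 0.25,
}
stop_words = {'de', 'la', 'el'}

# A ratio of 0.5 keeps the top 50% of terms by score.
ratio = 0.5
ranked = sorted(scored_terms, key=scored_terms.get, reverse=True)
kept = ranked[:int(len(ranked) * ratio)]  # ['energía solar', 'panel', 'la red']

def has_stop_word(term):
    return any(word in stop_words for word in term.split())

def only_stop_words(term):
    return all(word in stop_words for word in term.split())

# 'hard' discards any term containing a stop word; 'soft' only discards
# terms formed entirely of stop words.
hard = [t for t in kept if not has_stop_word(t)]
soft = [t for t in kept if not only_stop_words(t)]
print(hard)  # ['energía solar', 'panel']
print(soft)  # ['energía solar', 'panel', 'la red']
```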

## Default configuration

Here is the default configuration the module uses:

```json
{
  "verbose": false,
  "use_spacy": true,
  "use_stanza": true,
  "key_terms": {
    "config": [
      {
        "criteria": "text-relevance",
        "ratio": 0.05,
        "remove_stop_words": "soft"
      },
      {
        "criteria": "text-relevance",
        "ratio": 0.05,
        "remove_stop_words": "soft"
      },
      {
        "criteria": "text-relevance",
        "ratio": 0.02,
        "remove_stop_words": "soft"
      }
    ],
    "stop_words": `nltk.corpus.stopwords.words('spanish')`
  },
  "key_terms_from_models": {
    "models": "textrank,keybert,yake",
    "verbose": false,
    "count": { "ratio": 0.2 },
    "keybert": [
      {
        "diversity": 0.5,
        "nr_candidates": 20,
        "num_terms": 15,
        "use_maxsum": false,
        "use_mmr": false
      },
      {
        "diversity": 0.5,
        "nr_candidates": 20,
        "num_terms": 15,
        "use_maxsum": true,
        "use_mmr": false
      },
      {
        "diversity": 0.7,
        "nr_candidates": 20,
        "num_terms": 15,
        "use_maxsum": false,
        "use_mmr": true
      },
      {
        "diversity": 0.2,
        "nr_candidates": 20,
        "num_terms": 15,
        "use_maxsum": false,
        "use_mmr": true
      }
    ],
    "rake": { "ratio": 0.1 },
    "spacy": { "ratio": 0.1 },
    "textrank": { "ratio": 0.2 },
    "yake": { "num_terms": 125 }
  },
  "special_characters": ["𝒇", "𝑓", "𝒈", "α"],
  "filter_terms": {
    "criteria": "included",
    "pos_tagging_groups": {
      "ADJ": ["ADJ"],
      "ADV": ["ADV"],
      "DET": ["DET", "ADP", "SCONJ", "CCONJ"],
      "NOUN": ["NOUN", "PROPN", "NUM"],
      "OTHER": ["PUNCT", "SPACE", "PART", "SYM", "INTJ", "X"],
      "PRON": ["PRON"],
      "VERB": ["VERB", "AUX"]
    },
    "excluded_pos_tagging": [
      ["PRON"],
      ["ADJ"],
      ["DET"],
      ["ADV"],
      ["OTHER"],
      ["DET", "NOUN"],
      ["*", "DET"],
      ["DET", "VERB"],
      ["DET", "PRON"],
      ["DET", "ADJ"],
      ["DET", "ADV"],
      ["DET", "OTHER"],
      ["DET", "DET", "*"],
      ["*", "DET", "DET"],
      ["*", "*", "DET"]
    ],
    "included_pos_tagging": [
      ["NOUN"],
      ["NOUN", "ADJ"],
      ["NOUN", "ADV", "ADJ"],
      ["NOUN", "DET", "NOUN"]
    ]
  },
  "similarity": {
    "metric": "transformers",
    "remove_stop_words": true
  },
  "thesaurus_similarity_threshold": 0.8
}
```
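
For intuition about the `'transformers'` metric and the 0.8 threshold, here is an illustrative similarity computation using the sentence-transformers package directly. The model name is an example choice; the model this package actually loads is not documented above:

```python
from sentence_transformers import SentenceTransformer, util

# Example multilingual model; not necessarily the one the package uses.
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Embed two candidate terms and compare them with cosine similarity.
embeddings = model.encode(['energía solar', 'panel fotovoltaico'])
score = util.cos_sim(embeddings[0], embeddings[1]).item()

# Under the default configuration, the pair would only enter the
# thesaurus if score >= thesaurus_similarity_threshold (0.8).
print(score, score >= 0.8)
```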

# Other features

Once you run the `generate` method in the `ThesaurusGenerator` class, you will have access to the following attributes:

- `thesaurus` is the generated thesaurus.
- `terms` is the list of extracted terms.
- `filtered_terms` is the list of terms after filtering.
- `token_pair_similarities` is a list of all the term pairs and their similarity.
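
For example (the attribute names are as documented above; the element structure of `token_pair_similarities` is not specified, so it is only counted here):

```python
from thesaurus_generator import ThesaurusGenerator

t = ThesaurusGenerator()
t.generate('./topics/topic_2.txt')

print(len(t.terms))                     # number of extracted terms
print(t.filtered_terms[:10])            # a few terms that survived filtering
print(len(t.token_pair_similarities))   # number of scored term pairs
```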

            
