ndl-aspect

Name	ndl-aspect
Version	0.0.6
home_page	https://github.com/pypa/sampleproject
Summary	A package for training NDL models to predict aspect in Polish verbs
upload_time	2024-01-17 11:34:07
maintainer
docs_url	None
author	Irene Testini
requires_python
license	MIT
keywords	ndl nlp polish aspect
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

# Polish-Aspect

A package for training an NDL model to learn aspect in Polish verbs using pyndl; it can also be used to convert NKJP tags into Tense and Aspect tags.

## installation

To use this package, please install the dependencies via PyPI and then clone the GitHub repository.
You can upgrade to the latest pip by typing `pip install --upgrade pip`.
From a terminal, type `pip install ndl-aspect`.
Then clone the repository found here: https://github.com/ooominds/Polish-Aspect
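
After installing, it can help to confirm what pip actually placed in the environment. The following is a minimal sketch using only the standard library. Checking for pyndl alongside ndl-aspect is an assumption: the README says the package uses pyndl, but the package metadata records no requirements.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report on the package itself and on pyndl (assumed to be needed at run time).
for name in ("ndl-aspect", "pyndl"):
    found = installed_version(name)
    print(f"{name}: {found if found else 'not installed'}")
```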

## pipeline

The pipeline.py file (found in the GitHub repository) acts as a step-by-step guide to run the code from data preparation and annotation, to model simulations.
To train your model, run
```
python pipeline.py --stratification <STRATIFICATION> --path_to_local_corpus <PATH_TO_LOCAL_CORPUS_FILE> --type_of_cues <TYPES_OF_CUES> --size_of_sample <SAMPLE_SIZE>
```
which takes four input arguments, of which `--size_of_sample` may be left unspecified:
- `--stratification <STRATIFICATION>`: the type of stratification of the data (specify 'balanced' for balanced dataset, else 'stratified' for frequency sampling)
- `--path_to_local_corpus <PATH_TO_LOCAL_CORPUS_FILE>`: the path to the downloaded Araneum Polonicum corpus file
- `--type_of_cues <TYPES_OF_CUES>`: the cues you want to include, choose from:
  - 'all'
  - 'superlemmas'
  - 'tenses'
  - 'ngrams_superlemmas'
  - 'superlemmas_tenses'
  - 'ngrams_tenses'
- `--size_of_sample <SAMPLE_SIZE>`: size of Araneum sample, leave unspecified if you wish to use the whole corpus (NOTE: we used a sample of 10,000,000)
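
The argument list above can be expressed as an `argparse` parser. This is a sketch reconstructed from the README only; the parser in the repository's pipeline.py may be defined differently (for example, it may not validate choices or convert the sample size to an integer).

```python
import argparse

# Cue types listed in the README.
CUE_TYPES = [
    "all",
    "superlemmas",
    "tenses",
    "ngrams_superlemmas",
    "superlemmas_tenses",
    "ngrams_tenses",
]


def build_parser():
    """Build a parser mirroring the four documented command-line arguments."""
    parser = argparse.ArgumentParser(description="Sketch of the pipeline.py arguments")
    parser.add_argument("--stratification", choices=["balanced", "stratified"], required=True)
    parser.add_argument("--path_to_local_corpus", required=True)
    parser.add_argument("--type_of_cues", choices=CUE_TYPES, required=True)
    # Leaving the size unspecified means "use the whole corpus".
    parser.add_argument("--size_of_sample", type=int, default=None)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--stratification", "balanced", "--path_to_local_corpus", "araneum.txt", "--type_of_cues", "tenses"]
)
print(args)
```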

To run the script, uncomment each step in pipeline.py in order. Each step relies on the output of the previous one. You can uncomment all steps and run them in a single job, or run each step separately, depending on your set-up and the size of your corpus.

Example call:

```
cd Polish-Aspect/
python pipeline.py --path_to_local_corpus 'araneum_sample.txt' --stratification 'stratified' --type_of_cues 'all' --size_of_sample '18'
```
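
If each step is run as a separate job, it may be convenient to launch the same call from a Python driver script. The helper below is hypothetical (it is not part of the package) and only assembles the command; the actual launch is left as a comment because it needs the cloned repository and a corpus file.

```python
import shlex
import sys


def build_pipeline_command(corpus_path, stratification, type_of_cues, size_of_sample=None):
    """Assemble the pipeline.py call as an argument list (hypothetical helper)."""
    command = [
        sys.executable, "pipeline.py",
        "--path_to_local_corpus", str(corpus_path),
        "--stratification", stratification,
        "--type_of_cues", type_of_cues,
    ]
    # Omit the flag entirely to use the whole corpus.
    if size_of_sample is not None:
        command += ["--size_of_sample", str(size_of_sample)]
    return command


command = build_pipeline_command("araneum_sample.txt", "stratified", "all", 18)
print(shlex.join(command))

# To launch it from inside the cloned repository:
# import subprocess
# subprocess.run(command, cwd="Polish-Aspect", check=True)
```
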
### Data Preparation
#### Step I: Create sentences file from corpus file with verb tags 
- Main file: Step01_extract_sentences.py - requires the path to the corpus (not provided with the package due to licensing)

#### Step II: Convert verb tags into tenses and aspects (tense aspect annotation) 

- Main file: Step02_annotate_sentences.py.  
NOTE: This script contains a set of heuristics for tagging verbs with tense and aspect; these rules exploit the tags extracted from Araneum, which follow the National Corpus of Polish (NKJP) labelling conventions, as well as lexical cues in the sentence, such as the presence of ‘być’ or whether the verb has certain endings. The script only considers the indicative mood and discards any sentences in the corpus which are not in the indicative.
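
To give a feel for this kind of rule, here is a deliberately simplified sketch that maps an NKJP-style tag to a (tense, aspect) pair. It is not the heuristics used in Step02_annotate_sentences.py: the function name, the label names and the `sentence_has_future_byc` flag are all assumptions made for illustration, and real rules would also have to handle conditionals, biaspectual verbs and verb endings.

```python
ASPECTS = {"perf", "imperf"}


def tense_aspect_from_tag(tag, sentence_has_future_byc=False):
    """Map an NKJP-style tag such as 'praet:sg:m1:perf' to (tense, aspect).

    Returns None for forms this sketch does not treat as indicative.
    """
    parts = tag.split(":")
    verb_class = parts[0]
    aspect = next((part for part in parts[1:] if part in ASPECTS), None)
    if aspect is None:
        return None
    compound_future = sentence_has_future_byc and aspect == "imperf"
    if verb_class == "praet":
        # An l-participle is past unless it combines with a future form of 'być'.
        tense = "future" if compound_future else "past"
    elif verb_class == "fin":
        # Non-past forms: imperfective reads as present, perfective as future.
        tense = "present" if aspect == "imperf" else "future"
    elif verb_class == "inf" and compound_future:
        tense = "future"
    else:
        return None  # imperatives, bare infinitives, participles, ...
    return tense, aspect


for example in ("praet:sg:m1:perf", "fin:sg:ter:imperf", "fin:sg:ter:perf", "impt:sg:sec:perf"):
    print(example, "->", tense_aspect_from_tag(example))
```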


#### Step III: Find reflexives
- Main file: Step03_find_reflexives.py  
NOTE: The original lemma provided by Araneum did not take reflexive forms into account; this script contains a set of heuristics to correctly label reflexives.
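
One simple heuristic of this kind is to look for the reflexive marker ‘się’ close to the verb and, if it is found, attach it to the lemma. The sketch below illustrates only that idea; the function name, the window size and the output format are assumptions, and the rules in Step03_find_reflexives.py are likely more careful (a nearby ‘się’ can belong to a different verb).

```python
REFLEXIVE_MARKER = "się"


def reflexive_lemma(tokens, verb_index, lemma, window=2):
    """Append the reflexive marker to the lemma if it occurs near the verb."""
    start = max(0, verb_index - window)
    end = min(len(tokens), verb_index + window + 1)
    # Collect lower-cased neighbours of the verb, excluding the verb itself.
    neighbours = [
        token.lower()
        for index, token in enumerate(tokens[start:end], start)
        if index != verb_index
    ]
    if REFLEXIVE_MARKER in neighbours:
        return f"{lemma} {REFLEXIVE_MARKER}"
    return lemma


print(reflexive_lemma(["Ona", "się", "uczy", "polskiego"], 2, "uczyć"))
print(reflexive_lemma(["Ona", "uczy", "polskiego"], 1, "uczyć"))
```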

#### Step IV: Prepare corpus
- Main file: Step04_prepare_corpus.py - This script extracts ngrams and assigns a superlemma to each verb based on the dictionary file aspectual_pairs.pickle provided in DataPreparation/DownloadedData; moreover, it removes sentences for which a superlemma is not provided and extracts a sample of 10,000, stratified on Tense Aspect tags.
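
The superlemma lookup and the stratified sampling can be sketched as follows. The structure of aspectual_pairs.pickle is not documented here, so the dictionary below (lemma mapped to a shared superlemma string) is a hypothetical stand-in, as are the field names `lemma` and `tense_aspect`. Proportional sampling per tag is one reading of "stratified on Tense Aspect tags"; because quotas are rounded, the result can differ slightly from the requested size.

```python
import random
from collections import defaultdict

# Hypothetical stand-in for the contents of aspectual_pairs.pickle.
aspectual_pairs = {"robić": "robić/zrobić", "zrobić": "robić/zrobić"}


def assign_superlemmas(sentences, pairs):
    """Attach a superlemma to each sentence; drop sentences without one."""
    kept = []
    for sentence in sentences:
        superlemma = pairs.get(sentence["lemma"])
        if superlemma is not None:
            kept.append({**sentence, "superlemma": superlemma})
    return kept


def stratified_sample(sentences, sample_size, key="tense_aspect", seed=0):
    """Sample so that each tag keeps roughly its share of the full data."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for sentence in sentences:
        groups[sentence[key]].append(sentence)
    sample = []
    for members in groups.values():
        quota = round(sample_size * len(members) / len(sentences))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample


corpus = (
    [{"lemma": "zrobić", "tense_aspect": "past.perf"}] * 8
    + [{"lemma": "robić", "tense_aspect": "present.imperf"}] * 4
    + [{"lemma": "nieznany", "tense_aspect": "past.perf"}] * 2
)
prepared = assign_superlemmas(corpus, aspectual_pairs)
sample = stratified_sample(prepared, sample_size=6)
print(len(prepared), len(sample))
```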

#### Step V: Extract superlemmas
- Main file: Step05_extract_lemmas.py

#### Step VI: Extract ngrams
- Main file: Step06_extract_ngrams.py
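
A minimal sketch of n-gram extraction from a tokenised sentence is shown below. The maximum n-gram length and the `#` joiner are assumptions; the joiner only needs to be a character that does not clash with the separator used between cues in the event files.

```python
def extract_ngrams(tokens, max_n=3, joiner="#"):
    """Return all word n-grams of length 1..max_n, shortest first."""
    ngrams = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(joiner.join(tokens[start:start + n]))
    return ngrams


print(extract_ngrams(["ona", "czyta", "książkę"], max_n=2))
```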

#### Step VII: Prepare cues
- Main file: Step07_prepare_cues_to_use.py - This script produces various files containing the cues needed to filter the event files at a later stage, based on input argument 3 (`--type_of_cues`).
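
One way to read the cue-type options is as combinations of three cue families (ngrams, superlemmas, tenses), with 'all' selecting every family. The sketch below encodes that reading; the function names and the `cues_by_family` dictionary are hypothetical and do not reflect the files the script actually writes.

```python
CUE_FAMILIES = ("ngrams", "superlemmas", "tenses")


def cue_families_for(type_of_cues):
    """Translate a cue-type option into the set of cue families it names."""
    if type_of_cues == "all":
        return set(CUE_FAMILIES)
    families = set(type_of_cues.split("_"))
    unknown = families - set(CUE_FAMILIES)
    if unknown:
        raise ValueError(f"unknown cue families: {sorted(unknown)}")
    return families


def select_cues(cues_by_family, type_of_cues):
    """Keep only the cues whose family was requested, in a fixed family order."""
    wanted = cue_families_for(type_of_cues)
    return [
        cue
        for family in CUE_FAMILIES
        if family in wanted
        for cue in cues_by_family.get(family, [])
    ]


available = {"ngrams": ["ona#czyta"], "superlemmas": ["czytać/przeczytać"], "tenses": ["present"]}
print(select_cues(available, "superlemmas_tenses"))
```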

#### Step VIII: Split dataset
- Main file: Step08_split.py - this splits our corpus into training and testing files
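
A random train/test split can be sketched with the standard library alone. The 10% test fraction and the fixed seed are assumptions chosen for illustration; the proportions used by Step08_split.py are not stated here.

```python
import random


def train_test_split(items, test_fraction=0.1, seed=0):
    """Shuffle a copy of the items and split it into (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train, test = train_test_split(range(100), test_fraction=0.1)
print(len(train), len(test))
```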

#### Step IX: Make eventfiles
- Main file: Step09_make_eventfiles.py - this produces training and test files in the format required by NDL, with cues and outcomes in tab-separated columns for each learning event
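
For orientation, the sketch below writes a small event file in the layout pyndl reads, as far as I know: a gzipped, tab-separated file with a `cues` and an `outcomes` column, where multiple cues in one event are joined by underscores. The file name and the example events are made up, and the files produced by Step09_make_eventfiles.py may differ in detail.

```python
import gzip
import os
import tempfile


def write_event_file(events, path):
    """Write (cues, outcomes) pairs as a gzipped, tab-separated event file."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write("cues\toutcomes\n")
        for cues, outcomes in events:
            # Individual cues must not contain tabs or underscores themselves.
            handle.write("_".join(cues) + "\t" + "_".join(outcomes) + "\n")


events = [
    (["ona", "czyta", "ona#czyta"], ["imperf"]),
    (["on", "przeczytał"], ["perf"]),
]
path = os.path.join(tempfile.mkdtemp(), "train.tab.gz")
write_event_file(events, path)

# Read the file back to show what one learning event looks like on disk.
with gzip.open(path, "rt", encoding="utf-8") as handle:
    lines = handle.read().splitlines()
print(lines)
```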


### NDL

#### Train model
- Main file: TrainNDL.py - this script runs an NDL simulation on the chosen dataset (argument 1, `--stratification`), using the cues of interest (argument 3, `--type_of_cues`), and produces a results file containing the predicted aspect for each test sentence, and a weight file representing the association matrix of cues and outcomes.
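
NDL learns the cue-outcome association matrix with the Rescorla-Wagner update rule. The pure-Python sketch below shows that rule on toy data so the weight file is easier to interpret; it is not the package's training code, which relies on pyndl for efficiency. The learning rate, the toy events and the choice of predicting the outcome with the highest summed activation are assumptions.

```python
from collections import defaultdict


def train_rescorla_wagner(events, outcomes, learning_rate=0.1, lambda_=1.0):
    """Learn cue-outcome weights from (cues, present_outcomes) events."""
    weights = defaultdict(float)  # (cue, outcome) -> association strength
    for cues, present in events:
        cues = set(cues)
        for outcome in outcomes:
            # Activation is the summed weight of all cues present in the event.
            activation = sum(weights[(cue, outcome)] for cue in cues)
            target = lambda_ if outcome in present else 0.0
            delta = learning_rate * (target - activation)
            for cue in cues:
                weights[(cue, outcome)] += delta
    return weights


def predict(weights, cues, outcomes):
    """Return the outcome with the highest summed activation for the cues."""
    return max(outcomes, key=lambda o: sum(weights.get((cue, o), 0.0) for cue in set(cues)))


OUTCOMES = ["imperf", "perf"]
toy_events = [(["ona", "czyta"], ["imperf"]), (["ona", "przeczytała"], ["perf"])] * 50
weights = train_rescorla_wagner(toy_events, OUTCOMES)
print(predict(weights, ["ona", "czyta"], OUTCOMES))
print(predict(weights, ["ona", "przeczytała"], OUTCOMES))
```

The shared cue `ona` occurs with both outcomes, so it ends up with little discriminative weight, while the verb forms carry most of the association.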






## contributors

This package is based on code written for R by Adnane Ez-zizi (date of last change 06/08/2020). This code was corrected, updated and adapted as a Python package by Irene Testini, completed January 2023.

The tense-aspect annotation heuristics were provided by Dagmar Divjak.

Work by all contributors was funded by Leverhulme Trust Leadership Award RL-2016-001 to Dagmar Divjak.

Raw data

{
    "_id": null,
    "home_page": "https://github.com/pypa/sampleproject",
    "name": "ndl-aspect",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "NDL,NLP,Polish,Aspect",
    "author": "Irene Testini",
    "author_email": "",
    "download_url": "https://files.pythonhosted.org/packages/13/e0/4ab93b7bb4768c4453b7892466f421cb69b7a0e6f668f4d6808c3e8c90e4/ndl_aspect-0.0.6.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A package for training NDL models to predict aspect in Polish verbs",
    "version": "0.0.6",
    "project_urls": {
        "Download": "https://github.com/ooominds/Polish-Aspect/archive/refs/tags/v_01.tar.gz",
        "Homepage": "https://github.com/pypa/sampleproject"
    },
    "split_keywords": [
        "ndl",
        "nlp",
        "polish",
        "aspect"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "16ea1daf85acdad6b1ba67ae005a8b2cacf009c19899493d7f9c2b1373c87b3e",
                "md5": "f4d04e036109db419b1045938fabe599",
                "sha256": "7029ea498a3f2e9461a03a8846d7e17109198bbb6709c3cf359cbcd56e9c6476"
            },
            "downloads": -1,
            "filename": "ndl_aspect-0.0.6-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "f4d04e036109db419b1045938fabe599",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 4353,
            "upload_time": "2024-01-17T11:34:05",
            "upload_time_iso_8601": "2024-01-17T11:34:05.808571Z",
            "url": "https://files.pythonhosted.org/packages/16/ea/1daf85acdad6b1ba67ae005a8b2cacf009c19899493d7f9c2b1373c87b3e/ndl_aspect-0.0.6-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "13e04ab93b7bb4768c4453b7892466f421cb69b7a0e6f668f4d6808c3e8c90e4",
                "md5": "b704b7b5440d19543479e694524f412e",
                "sha256": "f71b19968895c783c4d25e41507052c21698187a62686cd5bc78a28c050cdb91"
            },
            "downloads": -1,
            "filename": "ndl_aspect-0.0.6.tar.gz",
            "has_sig": false,
            "md5_digest": "b704b7b5440d19543479e694524f412e",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 4612,
            "upload_time": "2024-01-17T11:34:07",
            "upload_time_iso_8601": "2024-01-17T11:34:07.156205Z",
            "url": "https://files.pythonhosted.org/packages/13/e0/4ab93b7bb4768c4453b7892466f421cb69b7a0e6f668f4d6808c3e8c90e4/ndl_aspect-0.0.6.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-01-17 11:34:07",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "pypa",
    "github_project": "sampleproject",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "tox": true,
    "lcname": "ndl-aspect"
}
