ctextcore


* Name: ctextcore
* Version: 0.0.2
* Summary: An open-source Python package for existing NCHLT core technologies for ten South African languages.
* Upload time: 2024-02-29 12:15:46
* Requires Python: >=3.8
* License: Apache License 2.0
* Keywords: nchlt, ctextcore, ctext, nlp, south african languages, afrikaans, isindebele, isixhosa, isizulu, setswana, sepedi, sesotho, siswati, tshivenḓa, xitsonga

## About The Project

This project is an open-source Python package for existing NCHLT core technologies for ten South African languages (Afrikaans, English, isiNdebele, isiXhosa, isiZulu, Sesotho sa Leboa, Sesotho, Setswana, Siswati, Tshivenḓa, Xitsonga). The technologies include Tokenisers, Sentence Separators, Part of Speech Taggers, Named Entity Recognisers, Phrase Chunkers, Optical Character Recognisers, and a Language Identifier, totalling 19 technologies.

## Getting Started

To get a local copy up and running, follow these steps.

### Prerequisites

* Python 3.8+ (https://www.python.org/downloads/)
* Java OpenJDK 11+ (https://openjdk.org)
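
Both prerequisites can be checked from a short script before installing. This is only a sketch: `check_prerequisites` is a hypothetical helper, not part of ctextcore, and it only confirms that a `java` executable is on the `PATH`, not that it is OpenJDK 11 or later.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 8)):
    """Return a dict describing whether each prerequisite appears to be met."""
    return {
        # Compare the running interpreter against the minimum version.
        "python": sys.version_info[:2] >= min_python,
        # shutil.which returns None when no `java` executable is on PATH.
        "java": shutil.which("java") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```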

### Installation

#### pip

```sh
pip install ctextcore
```

#### GitHub

```sh
# Download the source code from GitHub
git clone https://github.com/ctextdev/ctextcore.git
cd ctextcore

# Install from source
python -m pip install .

# Or install from source in Development Mode
python -m pip install -e .
```
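
After installing, one way to confirm that the package is visible to the current interpreter is to look it up with the standard library. A minimal sketch; `installed_version` is a hypothetical helper and not part of ctextcore.

```python
from importlib import metadata


def installed_version(package="ctextcore"):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    print(version if version else "ctextcore is not installed")
```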

## Usage 

### Importing the CTexT Core library

```Python
from ctextcore.core import CCore as core
server = core()
```

The `core` constructor accepts the following configuration arguments:

```Python
# Available arguments:
#   port=8079              Set the port the server should use
#   timeout=60000          Set the timeout of HTTP requests
#   threads=5              Set the total number of threads to use
#   memory="4G"            Set the maximum memory allowed to be used by the server
#   be_quiet=False         Set the logging output from the server
#   max_char_length=10000  Set the maximum character length

# Example: override the port and memory settings
server = core(port=8081, memory="16G")
```
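
When settings come from a configuration file or the command line, it can help to collect them in one dictionary and unpack it into the constructor. The sketch below takes the values listed above as a starting point. `build_server_options` is a hypothetical helper, and the final call is left commented out because it needs ctextcore and Java to be installed.

```python
# Values copied from the argument list above.
BASE_OPTIONS = {
    "port": 8079,
    "timeout": 60000,
    "threads": 5,
    "memory": "4G",
    "be_quiet": False,
    "max_char_length": 10000,
}


def build_server_options(**overrides):
    """Merge overrides into the base options, rejecting unknown names."""
    unknown = set(overrides) - set(BASE_OPTIONS)
    if unknown:
        raise ValueError(f"Unknown option(s): {sorted(unknown)}")
    return {**BASE_OPTIONS, **overrides}


options = build_server_options(port=8081, memory="16G")
print(options)

# from ctextcore.core import CCore as core
# server = core(**options)
```

Rejecting unknown names early catches typos such as `prt=8081` before the server is started.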

### Downloading models

#### Download all language models for a specific technology

```Python
# This call will download all the language models for POS.
server.download_model(tech='pos', language='all')
```

#### Download all technologies for a specific language

```Python
# This call will download all the technology models for isiZulu.
server.download_model(tech='all', language='zu')
```
    
#### Download a specific language model for a specific technology

```Python
# This call will download the POS technology model for Sesotho sa Leboa.
server.download_model(tech='pos', language='nso')
```

### Using a model

```Python
from pathlib import Path  # Path is required to be able to use OCR

# Run the isiZulu POS tagger on the input text.
text = 'E uma lungekho usuku olufakiwe, usuku lwakho lokubhalisa luyofakwa nge-othomathikhi kube usuku lokuqala lwenyanga elandelayo ukuze kungadaleki izikweletu.'
output_process = server.process_text(text_input=text, language='zu', tech='pos')
print(output_process)

# Run the Sesotho sa Leboa OCR on the image or PDF path provided in the text_input argument.
output_process = server.process_text(text_input=Path('<path-to-image-or-pdf>'), language='nso', tech='ocr')
print(output_process)

# Run LID on the input text; the confidence level should be above 50%.
text = 'Sizoqhubeka ukwenza ngcono ukusebenza kukagesi wethu kanye nokuthembela kugesi ophinde uvuseleleke.'
output_process = server.process_text(text_input=text, tech='lid', confidence=0.5)
print(output_process)
```
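
The `max_char_length` setting shown earlier suggests that long inputs may need to be split before processing. The sketch below is one way to do that on whitespace boundaries. `chunk_text` is a hypothetical helper, and whether the limit applies per request is an assumption that the text above does not confirm.

```python
def chunk_text(text, max_chars=10000):
    """Split text into whitespace-delimited chunks of at most max_chars characters.

    Runs of whitespace are collapsed to single spaces. A single word longer
    than max_chars is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


parts = chunk_text("usuku lwakho lokubhalisa luyofakwa", max_chars=15)
print(parts)

# Each chunk could then be passed to process_text, for example:
# results = [server.process_text(text_input=p, language='zu', tech='pos') for p in parts]
```

Character offsets in each result would then be relative to the chunk, not to the original text.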

#### Output formats

The ctextcore package offers three output formats: JSON, Delimited, and Array. The default output format is JSON; it can be changed by providing the `output_format` argument to the `process_text` method. An extra argument, `delimiter`, can be used together with the delimited output format to change the delimiter used in the output. The default delimiter is `_`.

```Python

# This call will run the Afrikaans POS tagger on the input text "Hierdie is 'n voorbeeldsin om die funksionaliteit te toets." and will return a delimited output.
output_process = server.process_text(text_input="Hierdie is 'n voorbeeldsin om die funksionaliteit te toets.", language='af', tech='pos', output_format="delimited", delimiter="|")
print(output_process)

```
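
A delimited result is a list of strings such as `'Hierdie|PA'` (see the output examples in this section). If the token and tag are needed separately again, splitting on the last occurrence of the delimiter keeps tokens that themselves contain the delimiter intact. A sketch; `split_delimited` is a hypothetical helper and the sample list is a shortened copy of the delimited output example.

```python
def split_delimited(items, delimiter="|"):
    """Turn ['token|TAG', ...] into [('token', 'TAG'), ...]."""
    # rpartition splits on the last delimiter, so a token like 'a|b' survives.
    return [tuple(item.rpartition(delimiter)[::2]) for item in items]


sample = ['Hierdie|PA', 'is|VTHOK', "'n|LO", 'voorbeeldsin|NSE']
print(split_delimited(sample))
```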

#### Output examples:

```Python

# JSON
[{'doc': {'p': {'lid': 'NONE', 'sent': {'tokens': [{'start_char': 0, 'pos': 'PA', 'end_char': 7, 'id': 1, 'text': 'Hierdie'}, {'start_char': 8, 'pos': 'VTHOK', 'end_char': 10, 'id': 2, 'text': 'is'}, {'start_char': 11, 'pos': 'LO', 'end_char': 13, 'id': 3, 'text': "'n"}, {'start_char': 14, 'pos': 'NSE', 'end_char': 26, 'id': 4, 'text': 'voorbeeldsin'}, {'start_char': 27, 'pos': 'SVS', 'end_char': 29, 'id': 5, 'text': 'om'}, {'start_char': 30, 'pos': 'LB', 'end_char': 33, 'id': 6, 'text': 'die'}, {'start_char': 34, 'pos': 'NSE', 'end_char': 49, 'id': 7, 'text': 'funksionaliteit'}, {'start_char': 50, 'pos': 'UPI', 'end_char': 52, 'id': 8, 'text': 'te'}, {'start_char': 53, 'pos': 'VTHSG', 'end_char': 58, 'id': 9, 'text': 'toets'}, {'start_char': 58, 'pos': 'ZE', 'end_char': 59, 'id': 10, 'text': '.'}]}}}}]

# Array
[('Hierdie', 'PA'), ('is', 'VTHOK'), ("'n", 'LO'), ('voorbeeldsin', 'NSE'), ('om', 'SVS'), ('die', 'LB'), ('funksionaliteit', 'NSE'), ('te', 'UPI'), ('toets', 'VTHSG'), ('.', 'ZE')]

# Delimited
['Hierdie|PA', 'is|VTHOK', "'n|LO", 'voorbeeldsin|NSE', 'om|SVS', 'die|LB', 'funksionaliteit|NSE', 'te|UPI', 'toets|VTHSG', '.|ZE']


```
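
The JSON output nests its tokens under the keys `doc`, `p`, `sent` and `tokens`, in that order. The sketch below walks that structure to recover `(text, pos)` pairs and checks the character offsets against the input sentence. It assumes every result has exactly the shape of the single-sentence example above; `extract_pairs` is a hypothetical helper, and the sample is a shortened copy of that example.

```python
def extract_pairs(result, tag_key="pos"):
    """Collect (text, tag) tuples from a JSON-format result."""
    pairs = []
    for entry in result:
        tokens = entry["doc"]["p"]["sent"]["tokens"]
        pairs.extend((tok["text"], tok[tag_key]) for tok in tokens)
    return pairs


sentence = "Hierdie is 'n voorbeeldsin om die funksionaliteit te toets."
sample = [{'doc': {'p': {'lid': 'NONE', 'sent': {'tokens': [
    {'start_char': 0, 'pos': 'PA', 'end_char': 7, 'id': 1, 'text': 'Hierdie'},
    {'start_char': 8, 'pos': 'VTHOK', 'end_char': 10, 'id': 2, 'text': 'is'},
    {'start_char': 11, 'pos': 'LO', 'end_char': 13, 'id': 3, 'text': "'n"},
]}}}}]

print(extract_pairs(sample))

# start_char and end_char index into the original sentence.
for tok in sample[0]['doc']['p']['sent']['tokens']:
    assert sentence[tok['start_char']:tok['end_char']] == tok['text']
```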

## Testing

The ctextcore package uses pytest (version 8.0.0 or above) as its testing framework. pytest must be installed before the package's unit tests can be run.

### Running all the unit tests of the ctextcore package

```sh
python -m pytest --pyargs ctextcore.tests
```

### Running individual unit tests of the ctextcore package

The ctextcore package contains the following unit tests:

* lid
* ner
* ocr
* pc
* pos
* sent
* tok

#### Running an individual unit test

Replace `test_name` with `test_` followed by one of the names listed above.

```sh
python -m pytest --pyargs ctextcore.tests.test_name
```

#### Example

```sh
python -m pytest --pyargs ctextcore.tests.test_lid
```
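
To run several of these tests from a script, the command can be assembled in Python and executed with `subprocess`. A sketch: `build_pytest_command` and `run_tests` are hypothetical helpers, and the `test_<name>` module naming is inferred from the `test_lid` example above.

```python
import subprocess
import sys

TEST_NAMES = ("lid", "ner", "ocr", "pc", "pos", "sent", "tok")


def build_pytest_command(name=None):
    """Build the pytest command for one named test module, or for all tests."""
    if name is not None and name not in TEST_NAMES:
        raise ValueError(f"Unknown test name: {name!r}")
    target = "ctextcore.tests" if name is None else f"ctextcore.tests.test_{name}"
    # sys.executable keeps the run inside the current Python environment.
    return [sys.executable, "-m", "pytest", "--pyargs", target]


def run_tests(names):
    """Run each named test module and return its exit code (0 means success)."""
    return {n: subprocess.run(build_pytest_command(n)).returncode for n in names}


print(build_pytest_command("lid"))
```

Calling `run_tests(["lid", "pos"])` requires pytest, ctextcore and the relevant models to be installed.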

## License

Licensed under the Apache License, Version 2.0. See `LICENSE.txt` for more information.

## Contact

Centre for Text Technology (CTexT) - ctextdev@gmail.com - https://humanities.nwu.ac.za/ctext


            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "ctextcore",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "",
    "keywords": "nchlt,ctextcore,CTexT,nlp,South African languages,Afrikaans,isiNdebele,isiXhosa,isiZulu,Setswana,Sepedi,Sesotho,Siswati,Tshiven\u1e13a,Xitsonga",
    "author": "",
    "author_email": "\"Centre for Text Technology (CTexT)\" <ctextdev@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/79/34/4adfc7975f3837b271b30525e999e5f90ee31b639c6d94b519856f5295f2/ctextcore-0.0.2.tar.gz",
    "platform": null,
    "description": "## About The Project\r\n\r\nThis project is an open-source Python package for existing NCHLT core technologies for ten South African \r\nlanguages (Afrikaans, English, isiNdebele, isiXhosa, isiZulu, Sesotho sa Leboa, Sesotho, Setswana, Siswati, Tshiven\u1e13a, Xitsonga). The technologies include the following: Tokenisers, Sentence Separators, Part of Speech Taggers, Named Entity \r\nRecognisers, Phrase Chunkers, Optical Character Recognisers, and a Language Identifier.\r\nTotalling 19 technologies.\r\n\r\n## Getting Started\r\n\r\nTo get a local copy up and running, follow these steps.\r\n\r\n### Prerequisites\r\n\r\n* Python 3.8+ (https://www.python.org/downloads/)\r\n* Java OpenJDK 11+ (https://openjdk.org)\r\n\r\n### Installation\r\n\r\n### pip\r\n\r\n```sh\r\npip install ctextcore\r\n```\r\n\r\n### GitHub\r\n\r\n```\r\n# Download the source code from GitHub\r\ngit clone https://github.com/ctextdev/ctextcore.git\r\n\r\n# Install from source\r\ncd ctextcore\r\npy -m pip install .\r\n\r\n# Install from source in Development Mode\r\ncd ctextcore\r\npy -m pip install -e .\r\n```\r\n\r\n## Usage \r\n\r\n### Importing the CTexT Core library\r\n\r\n```Python\r\nfrom ctextcore.core import CCore as core\r\nserver = core()\r\n```\r\n\r\nThe core method accepts the following configuration arguments:\r\n\r\n```Python\r\nport: 8079              # Set the port the server should use\r\ntimeout: 60000          # Set the timeout of HTTP requests\r\nthreads: 5              # Set the total number of threads to use\r\nmemory: \"4G\"            # Set the maximum memory allowed to be used by the server\r\nbe_quiet: False         # Set the logging output from the server\r\nmax_char_length: 10000  # Set the maximum character length\r\n\r\nserver = core(port=8081,memory=\"16G\",...)\r\n```\r\n\r\n### Downloading models\r\n\r\n#### Download all language models for a specific technology\r\n\r\n```Python\r\n# This call will download all the language models for 
POS.\r\nserver.download_model(tech='pos', language='all')\r\n```\r\n\r\n#### Download all technologies for a specific language\r\n\r\n```Python\r\n# This call will download all the technology models for isiZulu.\r\nserver.download_model(tech='all', language='zu')\r\n```\r\n    \r\n#### Download a specific language model for a specific technology\r\n\r\n```Python\r\n# This call will download the POS technology model for Sesotho sa Leboa.\r\nserver.download_model(tech='pos', language='nso')\r\n```\r\n\r\n### Using a model\r\n\r\n```Python\r\n# This call will run the isiZulu POS tagger on the input text 'E uma lungekho usuku olufakiwe, usuku lwakho lokubhalisa luyofakwa nge-othomathikhi kube usuku lokuqala lwenyanga elandelayo ukuze kungadaleki izikweletu.'.\r\noutput_process = server.process_text(text_input='E uma lungekho usuku olufakiwe, usuku lwakho lokubhalisa luyofakwa nge-othomathikhi kube usuku lokuqala lwenyanga elandelayo ukuze kungadaleki izikweletu.', language='zu', tech='pos')\r\nprint(output_process)\r\n\r\nfrom pathlib import Path # Path needs to be imported to be able to use OCR\r\n\r\n# This call will run the Sesotho sa Leboa OCR on the image or pdf path provided in the text_input argument.\r\noutput_process = server.process_text(text_input=Path('<path-to-image-or-pdf>'), language='nso', tech='ocr')\r\nprint(output_process)\r\n\r\n# This call will run LID on the input text 'Sizoqhubeka ukwenza ngcono ukusebenza kukagesi wethu kanye nokuthembela kugesi ophinde uvuseleleke.' 
and the confidence level should be above 50%.\r\noutput_process = server.process_text(text_input='Sizoqhubeka ukwenza ngcono ukusebenza kukagesi wethu kanye nokuthembela kugesi ophinde uvuseleleke.', tech='lid', confidence=0.5)\r\nprint(output_process)\r\n```\r\n\r\n#### Output formats\r\n\r\nThe ctextcore package offers three different output formats (JSON, Delimited, Array), the default output format is JSON and can be changed by providing the output_format argument in the process_text method. An extra argument, delimiter, can be used together with the delimited output format to change the delimiter used in the output. The default delimiter is _.\r\n\r\n```Python\r\n\r\n# This call will run the Afrikaans POS tagger on the input text 'Hierdie is ''n voorbeeldsin om die funksionaliteit te toets.' and will return a delimited output.\r\noutput_process = server.process_text(text_input='Hierdie is \\'n voorbeeldsin om die funksionaliteit te toets.', language='af', tech='pos', output_format=\"delimited\", delimiter=\"|\")\r\nprint(output_process)\r\n\r\n```\r\n\r\n#### Output examples:\r\n\r\n```Python\r\n\r\n# JSON\r\n[{'doc': {'p': {'lid': 'NONE', 'sent': {'tokens': [{'start_char': 0, 'pos': 'PA', 'end_char': 7, 'id': 1, 'text': 'Hierdie'}, {'start_char': 8, 'pos': 'VTHOK', 'end_char': 10, 'id': 2, 'text': 'is'}, {'start_char': 11, 'pos': 'LO', 'end_char': 13, 'id': 3, 'text': \"'n\"}, {'start_char': 14, 'pos': 'NSE', 'end_char': 26, 'id': 4, 'text': 'voorbeeldsin'}, {'start_char': 27, 'pos': 'SVS', 'end_char': 29, 'id': 5, 'text': 'om'}, {'start_char': 30, 'pos': 'LB', 'end_char': 33, 'id': 6, 'text': 'die'}, {'start_char': 34, 'pos': 'NSE', 'end_char': 49, 'id': 7, 'text': 'funksionaliteit'}, {'start_char': 50, 'pos': 'UPI', 'end_char': 52, 'id': 8, 'text': 'te'}, {'start_char': 53, 'pos': 'VTHSG', 'end_char': 58, 'id': 9, 'text': 'toets'}, {'start_char': 58, 'pos': 'ZE', 'end_char': 59, 'id': 10, 'text': '.'}]}}}}]\r\n\r\n# List\r\n[('Hierdie', 'PA'), ('is', 
'VTHOK'), (\"'n\", 'LO'), ('voorbeeldsin', 'NSE'), ('om', 'SVS'), ('die', 'LB'), ('funksionaliteit', 'NSE'), ('te', 'UPI'), ('toets', 'VTHSG'), ('.', 'ZE')]\r\n\r\n# Delimited\r\n['Hierdie|PA', 'is|VTHOK', \"'n|LO\", 'voorbeeldsin|NSE', 'om|SVS', 'die|LB', 'funksionaliteit|NSE', 'te|UPI', 'toets|VTHSG', '.|ZE']\r\n\r\n\r\n```\r\n\r\n## Testing\r\n\r\nThe ctextcore package uses pytest version 8.0.0 or above as a testing framework and is a required prerequisite to be able to run the unit tests of the package.\r\n\r\n### Running all the unit tests of the ctextcore package\r\n\r\n```sh\r\npython -m pytest --pyargs ctextcore.tests\r\n```\r\n\r\n### Running individual unit tests of the ctextcore package\r\n\r\nThe ctextcore package contains the following unit tests:\r\n\r\n* lid\r\n* ner\r\n* ocr\r\n* pc\r\n* pos\r\n* sent\r\n* tok\r\n\r\n#### Running an individual unit test\r\n\r\n```sh\r\npython -m pytest --pyargs ctextcore.tests.test_name\r\n```\r\n\r\n#### Example\r\n\r\n```sh\r\npython -m pytest --pyargs ctextcore.tests.test_lid\r\n```\r\n\r\n## License\r\n\r\nLicensed under the Apache License, Version 2.0. See `LICENSE.txt` for more information.\r\n\r\n## Contact\r\n\r\nCentre for Text Technology (CTexT) - ctextdev@gmail.com - https://humanities.nwu.ac.za/ctext\r\n\r\n",
    "bugtrack_url": null,
    "license": "Apache License 2.0",
    "summary": "An open-source Python package for existing NCHLT core technologies for ten South African languages.",
    "version": "0.0.2",
    "project_urls": null,
    "split_keywords": [
        "nchlt",
        "ctextcore",
        "ctext",
        "nlp",
        "south african languages",
        "afrikaans",
        "isindebele",
        "isixhosa",
        "isizulu",
        "setswana",
        "sepedi",
        "sesotho",
        "siswati",
        "tshiven\u1e13a",
        "xitsonga"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ca7f7da2bace7bbd65fd4f6d9d7812558d32c40db7a46ad7c2218d9157542743",
                "md5": "7b2882688df53375f98d9875d62a3d6a",
                "sha256": "bfe347dc3b76f909f526192a6c950e5c056a9a1fd909c5c118ef66dd7fce8999"
            },
            "downloads": -1,
            "filename": "ctextcore-0.0.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "7b2882688df53375f98d9875d62a3d6a",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 60565552,
            "upload_time": "2024-02-29T12:15:41",
            "upload_time_iso_8601": "2024-02-29T12:15:41.486063Z",
            "url": "https://files.pythonhosted.org/packages/ca/7f/7da2bace7bbd65fd4f6d9d7812558d32c40db7a46ad7c2218d9157542743/ctextcore-0.0.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "79344adfc7975f3837b271b30525e999e5f90ee31b639c6d94b519856f5295f2",
                "md5": "59f02af94ac218ea4dd7399bbb1bf0aa",
                "sha256": "159f7dd6ce0d7993fafe4ec46a95a2582362d8aa02fe9af912ec90c0dfb07a1f"
            },
            "downloads": -1,
            "filename": "ctextcore-0.0.2.tar.gz",
            "has_sig": false,
            "md5_digest": "59f02af94ac218ea4dd7399bbb1bf0aa",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 60377386,
            "upload_time": "2024-02-29T12:15:46",
            "upload_time_iso_8601": "2024-02-29T12:15:46.786825Z",
            "url": "https://files.pythonhosted.org/packages/79/34/4adfc7975f3837b271b30525e999e5f90ee31b639c6d94b519856f5295f2/ctextcore-0.0.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-02-29 12:15:46",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "ctextcore"
}
        