| Field | Value |
|:------|:------|
| Name | nlp-id |
| Version | 0.1.15.0 |
| Home page | https://github.com/kumparan/nlp-id |
| Summary | Kumparan's NLP Services |
| Upload time | 2023-07-11 02:31:22 |
| Authors | Frandy Eddy, Dhanang Hadhi Sasmita, Zavli Juwantara |
| License | MIT |
| Keywords | indonesian, bahasa, nlp |

# Kumparan's NLP Services

`nlp-id` is a collection of modules that provide Natural Language Processing functions for Bahasa Indonesia. This repository contains all source code related to these NLP services.

## Installation

To install `nlp-id`, use the following command:

    $ pip install nlp-id     
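Note that the distribution is installed as `nlp-id`, while the examples in this README import it as `nlp_id`. A minimal sketch for checking the installation from Python follows; the helper name `describe_install` is hypothetical and not part of the package, and it relies only on the standard library.

```python
from importlib import metadata, util


def describe_install(import_name="nlp_id", dist_name="nlp-id"):
    """Return the installed version string, or None if the module is missing."""
    if util.find_spec(import_name) is None:
        return None
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # Importable but without distribution metadata (e.g. a source checkout).
        return "unknown"


print(describe_install())
```

A result of `None` means the module cannot be imported in the current environment.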


## Usage

This section describes how to use the lemmatizer, tokenizer, POS tagger, and stopword modules.

### Lemmatizer

The lemmatizer returns the root form of every word in a sentence.

    from nlp_id.lemmatizer import Lemmatizer 
    lemmatizer = Lemmatizer() 
    lemmatizer.lemmatize('Saya sedang mencoba') 
    # saya sedang coba 
    
### Tokenizer

The tokenizer converts text into tokens such as words, punctuation, numbers, dates, emails, and URLs.
There are two kinds of tokenizer in this repository: the **standard tokenizer** and the **phrase tokenizer**.
The **standard tokenizer** splits the text into separate tokens in which every word token is a single word.
Tokens that start with *ku-* or end with *-ku*, *-mu*, *-nya*, *-lah*, or *-kah* are split when that affix is a personal pronoun or particle.

    from nlp_id.tokenizer import Tokenizer 
    tokenizer = Tokenizer() 
    tokenizer.tokenize('Lionel Messi pergi ke pasar di daerah Jakarta Pusat.') 
    # ['Lionel', 'Messi', 'pergi', 'ke', 'pasar', 'di', 'daerah', 'Jakarta', 'Pusat', '.']

    tokenizer.tokenize('Lionel Messi pergi ke rumahmu di daerah Jakarta Pusat.') 
    # ['Lionel', 'Messi', 'pergi', 'ke', 'rumah', 'mu', 'di', 'daerah', 'Jakarta', 'Pusat', '.']
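The suffix-splitting rule can be illustrated with a toy sketch. This is not the library's implementation: the root-word set `ROOT_WORDS` and the functions `split_clitic` and `toy_tokenize` are hypothetical, and the *ku-* prefix case is left out for brevity. The point it shows is that a suffix is detached only when what remains is a known word, so *rumahmu* is split while *buku* is not.

```python
import re

# Hypothetical mini-lexicon of root words; a real tokenizer needs a full dictionary.
ROOT_WORDS = {"rumah", "pergi", "makan"}
SUFFIXES = ("ku", "mu", "nya", "lah", "kah")


def split_clitic(token):
    """Split a trailing clitic only when the remainder is a known root word."""
    lowered = token.lower()
    for suffix in SUFFIXES:
        stem_len = len(token) - len(suffix)
        if lowered.endswith(suffix) and lowered[:stem_len] in ROOT_WORDS:
            return [token[:stem_len], token[stem_len:]]
    return [token]


def toy_tokenize(text):
    """Split text into words and punctuation marks, then detach clitics."""
    tokens = []
    for raw in re.findall(r"\w+|[^\w\s]", text):
        tokens.extend(split_clitic(raw))
    return tokens


print(toy_tokenize("Lionel Messi pergi ke rumahmu."))
print(split_clitic("buku"))  # "bu" is not a known root, so the token stays whole
```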
    
The **phrase tokenizer** tokenizes the text into separate tokens where the word tokens are phrases (single or multi-word tokens). 

    from nlp_id.tokenizer import PhraseTokenizer 
    tokenizer = PhraseTokenizer() 
    tokenizer.tokenize('Lionel Messi pergi ke pasar di daerah Jakarta Pusat.') 
    # ['Lionel Messi', 'pergi', 'ke', 'pasar', 'di', 'daerah', 'Jakarta Pusat', '.']
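To make the difference between the two tokenizers concrete, the sketch below merges single-word tokens into multi-word tokens with a greedy longest-match lookup. It does not reflect how the library detects phrases; the phrase inventory `KNOWN_PHRASES` and the function `merge_phrases` are assumptions made purely for illustration.

```python
# Hypothetical phrase inventory, stored as tuples of single-word tokens.
KNOWN_PHRASES = {("Lionel", "Messi"), ("Jakarta", "Pusat")}
MAX_PHRASE_LEN = max(len(phrase) for phrase in KNOWN_PHRASES)


def merge_phrases(tokens):
    """Greedily merge the longest known phrase starting at each position."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two-word phrases.
        for size in range(min(MAX_PHRASE_LEN, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + size]) in KNOWN_PHRASES:
                merged.append(" ".join(tokens[i:i + size]))
                i += size
                break
        else:
            # No phrase starts here, so keep the single-word token.
            merged.append(tokens[i])
            i += 1
    return merged


words = ["Lionel", "Messi", "pergi", "ke", "pasar", "di", "daerah", "Jakarta", "Pusat", "."]
print(merge_phrases(words))
```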
    
### POS Tagger

The POS tagger assigns a part-of-speech tag to each token in a text.
There are two kinds of POS tagger in this repository: the **standard POS tagger** and the **phrase POS tagger**.
The tokens in the **standard POS tagger** are single-word tokens, while the tokens in the **phrase POS tagger** are phrases (single or multi-word tokens).

    from nlp_id.postag import PosTag
    postagger = PosTag() 
    postagger.get_pos_tag('Lionel Messi pergi ke pasar di daerah Jakarta Pusat.') 
    # [('Lionel', 'NNP'), ('Messi', 'NNP'), ('pergi', 'VB'), ('ke', 'IN'), ('pasar', 'NN'), ('di', 'IN'), ('daerah', 'NN'),
    #  ('Jakarta', 'NNP'), ('Pusat', 'NNP'), ('.', 'SYM')]

    postagger.get_phrase_tag('Lionel Messi pergi ke pasar di daerah Jakarta Pusat.')
    # [('Lionel Messi', 'NP'), ('pergi', 'VP'), ('ke', 'IN'), ('pasar', 'NN'), ('di', 'IN'), ('daerah', 'NN'),
    #  ('Jakarta Pusat', 'NP'), ('.', 'SYM')]
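Both methods return a list of `(token, tag)` pairs, so the result can be filtered with ordinary Python. The sketch below works on the `get_pos_tag` output shown above, copied in as a literal so that it runs without the package; the helper `tokens_with_tag` is hypothetical.

```python
# Output copied from the get_pos_tag example above.
tagged = [
    ("Lionel", "NNP"), ("Messi", "NNP"), ("pergi", "VB"), ("ke", "IN"),
    ("pasar", "NN"), ("di", "IN"), ("daerah", "NN"),
    ("Jakarta", "NNP"), ("Pusat", "NNP"), (".", "SYM"),
]


def tokens_with_tag(pairs, wanted):
    """Return the tokens whose tag is in the wanted set, in original order."""
    return [token for token, tag in pairs if tag in wanted]


proper_nouns = tokens_with_tag(tagged, {"NNP"})
content_words = tokens_with_tag(tagged, {"NN", "NNP", "VB"})
print(proper_nouns)
print(content_words)
```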

    
The POS tagger uses the following tagset:

| No. | Tag | Description | Example |
|:-----:|:-----:|:--------|:------------|
| 1 | ADV | Adverb. Includes adverbs, modals, and auxiliary verbs. | sangat, hanya, justru, boleh, harus, mesti |
| 2 | CC  | Coordinating conjunction. Links two or more syntactically equivalent parts of a sentence, which can be independent clauses, phrases, or words. | dan, tetapi, atau |
| 3 | DT  | Determiner/article. A grammatical unit that limits the potential referent of a noun phrase; its basic role is to mark a noun phrase as definite or indefinite. | para, sang, si, ini, itu, nya |
| 4 | FW | Foreign word. A word that comes from a foreign language and is not yet included in the Indonesian dictionary. | workshop, business, e-commerce |
| 5 | IN  | Preposition. Links a word or phrase with the constituent in front of the preposition, forming a prepositional phrase. | dalam, dengan, di, ke |
| 6 | JJ | Adjective. A word that describes, modifies, or specifies a property of the head noun of the phrase. | bersih, panjang, jauh, marah |
| 7 | NEG | Negation | tidak, belum, jangan |
| 8 | NN | Noun. A word that refers to a human, animal, thing, concept, or understanding. | meja, kursi, monyet, perkumpulan |
| 9 | NNP | Proper noun. The specific name of a person, thing, place, event, etc. | Indonesia, Jakarta, Piala Dunia, Idul Fitri, Jokowi |
| 10 | NUM  | Number. Includes cardinal and ordinal numbers. | 9876, 2019, 0,5, empat |
| 11 | PR  | Pronoun. Includes personal and demonstrative pronouns. | saya, kami, kita, kalian, ini, itu, nya, yang |
| 12 | RP  | Particle. Confirms an interrogative, imperative, or declarative sentence. | pun, lah, kah |
| 13 | SC  | Subordinating conjunction. Links two or more clauses, one of which is a subordinate clause. | sejak, jika, seandainya, dengan, bahwa |
| 14 | SYM | Symbols and punctuation. | +, %, @ |
| 15 | UH | Interjection. Expresses a feeling or state of mind and has no syntactic relation to other words. | ayo, nah, ah |
| 16 | VB | Verb. Includes transitive verbs, intransitive verbs, active verbs, passive verbs, and copulas. | tertidur, bekerja, membaca |
| 17 | ADJP | Adjective Phrase. A group of words headed by an adjective that describes a noun or a pronoun | sangat tinggi |
| 18 | DP | Date Phrase. A date written with whitespace. | 1 Januari 2020 |
| 19 | NP | Noun Phrase. A phrase that has a noun (or indefinite pronoun) as its head | Jakarta Pusat, Lionel Messi |
| 20 | NUMP | Number Phrase.  | 10 juta |
| 21 | VP | Verb Phrase. A syntactic unit composed of at least one verb and its dependents | tidak makan |
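The tagset can also be kept as a lookup table when post-processing tagger output. The sketch below is an illustration rather than part of the package: the dictionary `TAG_NAMES` abbreviates the descriptions in the table, and treating tags 17 to 21 as the phrase-level group (`PHRASE_TAGS`) is an assumption based on their names. The input is the `get_phrase_tag` output shown earlier, copied in as a literal.

```python
from collections import Counter

# Short names taken from the tagset table above.
TAG_NAMES = {
    "ADV": "Adverb", "CC": "Coordinating conjunction", "DT": "Determiner/article",
    "FW": "Foreign word", "IN": "Preposition", "JJ": "Adjective",
    "NEG": "Negation", "NN": "Noun", "NNP": "Proper noun", "NUM": "Number",
    "PR": "Pronoun", "RP": "Particle", "SC": "Subordinating conjunction",
    "SYM": "Symbols and punctuation", "UH": "Interjection", "VB": "Verb",
    "ADJP": "Adjective phrase", "DP": "Date phrase", "NP": "Noun phrase",
    "NUMP": "Number phrase", "VP": "Verb phrase",
}
# Assumption: the tags named "... Phrase" form the phrase-level group.
PHRASE_TAGS = {"ADJP", "DP", "NP", "NUMP", "VP"}

# Output copied from the get_phrase_tag example.
phrase_tagged = [
    ("Lionel Messi", "NP"), ("pergi", "VP"), ("ke", "IN"), ("pasar", "NN"),
    ("di", "IN"), ("daerah", "NN"), ("Jakarta Pusat", "NP"), (".", "SYM"),
]


def summarize(pairs):
    """Count how often each tag occurs, keyed by its readable name."""
    counts = Counter(tag for _, tag in pairs)
    return {TAG_NAMES[tag]: n for tag, n in counts.items()}


phrases = [token for token, tag in phrase_tagged if tag in PHRASE_TAGS]
print(summarize(phrase_tagged))
print(phrases)
```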

### Stopword

`nlp-id` also provides a list of Indonesian stopwords.

    from nlp_id.stopword import StopWord 
    stopword = StopWord() 
    stopword.get_stopword() 
    # [{list_of_nlp_id_stopword}]    

Stopword removal strips every Indonesian stopword from the given text.

    from nlp_id.stopword import StopWord 
    text = "Lionel Messi pergi Ke pasar di area Jakarta Pusat" # single sentence
    stopword = StopWord() 
    stopword.remove_stopword(text)
    # Lionel Messi pergi pasar area Jakarta Pusat  
    
    paragraph = "Lionel Messi pergi Ke pasar di area Jakarta Pusat itu. Sedangkan Cristiano Ronaldo ke pasar Di area Jakarta Selatan. Dan mereka tidak bertemu begini-begitu."
    stopword.remove_stopword(paragraph)
    # Lionel Messi pergi pasar area Jakarta Pusat. Cristiano Ronaldo pasar area Jakarta Selatan. bertemu.
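The examples above show capitalized stopwords such as *Ke* and *Di* being removed as well, which suggests a case-insensitive comparison. A toy sketch of that idea follows. It is not the library's implementation: the three-word set `TOY_STOPWORDS` and the function `remove_toy_stopwords` are hypothetical, and punctuation handling is left out, so it only reproduces the single-sentence example.

```python
# Hypothetical mini stopword list; the real list comes from StopWord().get_stopword().
TOY_STOPWORDS = {"ke", "di", "itu"}


def remove_toy_stopwords(text, stopwords=TOY_STOPWORDS):
    """Drop whitespace-separated words found in the stopword set, ignoring case."""
    kept = [word for word in text.split() if word.lower() not in stopwords]
    return " ".join(kept)


sentence = "Lionel Messi pergi Ke pasar di area Jakarta Pusat"
print(remove_toy_stopwords(sentence))
```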
    
## Training and Evaluation

Our model was trained on stories from kumparan and reaches ~93% accuracy on our test set.
    
## Citation
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4556870.svg)](https://doi.org/10.5281/zenodo.4556870)
            
