# Kumparan's NLP Services
`nlp-id` is a collection of modules that provide Natural Language Processing functions for Bahasa Indonesia. This repository contains all source code related to these NLP services.
## Installation
To install `nlp-id`, use the following command:

    $ pip install nlp-id
## Usage
This section describes in more detail how to use the lemmatizer, tokenizer, POS tagger, and stopword utilities.
### Lemmatizer
The lemmatizer returns the root word of every word in a sentence.

    from nlp_id.lemmatizer import Lemmatizer
    lemmatizer = Lemmatizer()
    lemmatizer.lemmatize('Saya sedang mencoba')
    # saya sedang coba
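
To make the idea of reducing words to their roots concrete, here is a minimal, self-contained sketch of a dictionary-based approach. It is not the algorithm used by `nlp-id`; the names `ROOTS`, `PREFIXES`, `toy_lemma`, and `toy_lemmatize` are hypothetical, and the root dictionary and prefix list are tiny illustrative subsets.

```python
# Toy dictionary-based lemmatizer. This is an illustration only and
# does NOT reproduce how nlp-id's Lemmatizer works internally.
ROOTS = {"saya", "sedang", "coba"}  # hypothetical root-word dictionary
PREFIXES = ("meng", "meny", "mem", "men", "me", "ber", "ter", "di")  # illustrative subset


def toy_lemma(word: str) -> str:
    """Return the root of a single word, or the lowercased word if unknown."""
    lower = word.lower()
    if lower in ROOTS:
        return lower
    for prefix in PREFIXES:
        stem = lower[len(prefix):]
        # Accept the stripped form only when the dictionary confirms it.
        if lower.startswith(prefix) and stem in ROOTS:
            return stem
    return lower


def toy_lemmatize(sentence: str) -> str:
    """Lemmatize a whitespace-separated sentence word by word."""
    return " ".join(toy_lemma(word) for word in sentence.split())


print(toy_lemmatize("Saya sedang mencoba"))  # saya sedang coba
```

Checking every candidate stem against a dictionary keeps the sketch from stripping letters that only look like a prefix.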
### Tokenizer
The tokenizer converts text into tokens such as words, punctuation, numbers, dates, emails, and URLs.
There are two kinds of tokenizer in this repository: the **standard tokenizer** and the **phrase tokenizer**.
The **standard tokenizer** splits the text into tokens in which every word token is a single word.
A token that starts with *ku-* or ends with *-ku*, *-mu*, *-nya*, *-lah*, or *-kah* is split when that affix is a personal pronoun or a particle.

    from nlp_id.tokenizer import Tokenizer
    tokenizer = Tokenizer()
    tokenizer.tokenize('Lionel Messi pergi ke pasar di daerah Jakarta Pusat.')
    # ['Lionel', 'Messi', 'pergi', 'ke', 'pasar', 'di', 'daerah', 'Jakarta', 'Pusat', '.']

    tokenizer.tokenize('Lionel Messi pergi ke rumahmu di daerah Jakarta Pusat.')
    # ['Lionel', 'Messi', 'pergi', 'ke', 'rumah', 'mu', 'di', 'daerah', 'Jakarta', 'Pusat', '.']
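
The suffix-splitting rule can be sketched with the standard library alone. The snippet below is a rough illustration under stated assumptions, not the library's implementation: `KNOWN_STEMS` is a hypothetical mini-lexicon standing in for whatever `nlp-id` uses to decide that an ending is a pronoun or particle, `toy_tokenize` is a made-up name, and the *ku-* prefix case is left out for brevity.

```python
import re

# Hypothetical mini-lexicon of stems; a real tokenizer needs a full one.
KNOWN_STEMS = {"rumah", "buku"}
# Endings listed in the text above: -ku, -mu, -nya, -lah, -kah.
ENCLITICS = ("ku", "mu", "nya", "lah", "kah")


def toy_tokenize(text):
    """Split text into words and punctuation, detaching known enclitics."""
    tokens = []
    for raw in re.findall(r"\w+|[^\w\s]", text):
        for suffix in ENCLITICS:
            stem = raw[: -len(suffix)]
            # Split only when the remaining stem is a known word, so that
            # words such as "sekolah" are not cut into "seko" + "lah".
            if raw.lower().endswith(suffix) and stem.lower() in KNOWN_STEMS:
                tokens.extend([stem, raw[-len(suffix):]])
                break
        else:
            tokens.append(raw)
    return tokens


print(toy_tokenize("Lionel Messi pergi ke rumahmu di daerah Jakarta Pusat."))
# ['Lionel', 'Messi', 'pergi', 'ke', 'rumah', 'mu', 'di', 'daerah', 'Jakarta', 'Pusat', '.']
```

The lexicon check is what makes the split conditional: the ending alone is not enough to justify cutting a token.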
The **phrase tokenizer** splits the text into tokens in which a word token can be a phrase (one or more words).

    from nlp_id.tokenizer import PhraseTokenizer
    tokenizer = PhraseTokenizer()
    tokenizer.tokenize('Lionel Messi pergi ke pasar di daerah Jakarta Pusat.')
    # ['Lionel Messi', 'pergi', 'ke', 'pasar', 'di', 'daerah', 'Jakarta Pusat', '.']
### POS Tagger
The POS tagger assigns a Part-Of-Speech tag to each token in a text.
There are two kinds of POS tagger in this repository: the **standard POS tagger** and the **phrase POS tagger**.
The tokens in the **standard POS tagger** are single words, while the tokens in the **phrase POS tagger** are phrases (one or more words).

    from nlp_id.postag import PosTag
    postagger = PosTag()
    postagger.get_pos_tag('Lionel Messi pergi ke pasar di daerah Jakarta Pusat.')
    # [('Lionel', 'NNP'), ('Messi', 'NNP'), ('pergi', 'VB'), ('ke', 'IN'), ('pasar', 'NN'), ('di', 'IN'), ('daerah', 'NN'),
    #  ('Jakarta', 'NNP'), ('Pusat', 'NNP'), ('.', 'SYM')]

    postagger.get_phrase_tag('Lionel Messi pergi ke pasar di daerah Jakarta Pusat.')
    # [('Lionel Messi', 'NP'), ('pergi', 'VP'), ('ke', 'IN'), ('pasar', 'NN'), ('di', 'IN'), ('daerah', 'NN'),
    #  ('Jakarta Pusat', 'NP'), ('.', 'SYM')]
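
Because the tagger output is a plain list of `(token, tag)` tuples, it can be post-processed with ordinary Python. The sketch below hard-codes the standard tagger output shown in the example above so that it runs without the library; `tokens_with_tag` is a hypothetical helper, not part of `nlp-id`.

```python
from collections import Counter

# Standard POS tagger output, copied from the example above.
tagged = [
    ("Lionel", "NNP"), ("Messi", "NNP"), ("pergi", "VB"), ("ke", "IN"),
    ("pasar", "NN"), ("di", "IN"), ("daerah", "NN"),
    ("Jakarta", "NNP"), ("Pusat", "NNP"), (".", "SYM"),
]


def tokens_with_tag(pairs, tag):
    """Return the tokens that carry the given tag, in sentence order."""
    return [token for token, token_tag in pairs if token_tag == tag]


# Count how often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in tagged)

print(tag_counts.most_common(1))      # [('NNP', 4)]
print(tokens_with_tag(tagged, "NN"))  # ['pasar', 'daerah']
```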
Description of the tagset used by the POS tagger:
| No. | Tag | Description | Example |
|:-----:|:-----:|:--------|:------------|
| 1 | ADV | Adverb. Includes adverbs, modals, and auxiliary verbs. | sangat, hanya, justru, boleh, harus, mesti |
| 2 | CC | Coordinating conjunction. Links two or more syntactically equivalent parts of a sentence, which can be independent clauses, phrases, or words. | dan, tetapi, atau |
| 3 | DT | Determiner/article. A grammatical unit that limits the potential referent of a noun phrase; its basic role is to mark noun phrases as either definite or indefinite. | para, sang, si, ini, itu, nya |
| 4 | FW | Foreign word. A word that comes from a foreign language and is not yet included in the Indonesian dictionary. | workshop, business, e-commerce |
| 5 | IN | Preposition. Links a word or phrase with the constituent in front of the preposition, forming a prepositional phrase. | dalam, dengan, di, ke |
| 6 | JJ | Adjective. Adjectives are words that describe, modify, or specify some property of the head noun of the phrase. | bersih, panjang, jauh, marah |
| 7 | NEG | Negation. | tidak, belum, jangan |
| 8 | NN | Noun. Nouns are words that refer to a human, animal, thing, concept, or understanding. | meja, kursi, monyet, perkumpulan |
| 9 | NNP | Proper noun. A specific name of a person, thing, place, event, etc. | Indonesia, Jakarta, Piala Dunia, Idul Fitri, Jokowi |
| 10 | NUM | Number. Includes cardinal and ordinal numbers. | 9876, 2019, 0,5, empat |
| 11 | PR | Pronoun. Includes personal pronouns and demonstrative pronouns. | saya, kami, kita, kalian, ini, itu, nya, yang |
| 12 | RP | Particle. A particle that confirms interrogative, imperative, or declarative sentences. | pun, lah, kah |
| 13 | SC | Subordinating conjunction. Links two or more clauses, one of which is a subordinate clause. | sejak, jika, seandainya, dengan, bahwa |
| 14 | SYM | Symbols and punctuation. | +, %, @ |
| 15 | UH | Interjection. Expresses a feeling or state of mind and has no syntactic relation to other words. | ayo, nah, ah |
| 16 | VB | Verb. Includes transitive verbs, intransitive verbs, active verbs, passive verbs, and copulas. | tertidur, bekerja, membaca |
| 17 | ADJP | Adjective phrase. A group of words headed by an adjective that describes a noun or a pronoun. | sangat tinggi |
| 18 | DP | Date phrase. A date written with whitespace. | 1 Januari 2020 |
| 19 | NP | Noun phrase. A phrase that has a noun (or indefinite pronoun) as its head. | Jakarta Pusat, Lionel Messi |
| 20 | NUMP | Number phrase. | 10 juta |
| 21 | VP | Verb phrase. A syntactic unit composed of at least one verb and its dependents. | tidak makan |
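
The last five rows of the table (ADJP, DP, NP, NUMP, VP) are phrase-level tags, while the rest are word-level tags. As a small sketch of how that distinction might be used, the snippet below separates the two groups in the phrase tagger output shown earlier; the output is hard-coded so the code runs without the library, and `PHRASE_TAGS` is a name introduced here for illustration.

```python
# Phrase-level tags, taken from rows 17-21 of the tagset table.
PHRASE_TAGS = {"ADJP", "DP", "NP", "NUMP", "VP"}

# Phrase POS tagger output, copied from the get_phrase_tag example.
phrase_tagged = [
    ("Lionel Messi", "NP"), ("pergi", "VP"), ("ke", "IN"), ("pasar", "NN"),
    ("di", "IN"), ("daerah", "NN"), ("Jakarta Pusat", "NP"), (".", "SYM"),
]

# Tokens that the tagger labelled as phrases.
phrases = [token for token, tag in phrase_tagged if tag in PHRASE_TAGS]
# Tokens that kept a word-level tag.
words = [token for token, tag in phrase_tagged if tag not in PHRASE_TAGS]

print(phrases)  # ['Lionel Messi', 'pergi', 'Jakarta Pusat']
print(words)    # ['ke', 'pasar', 'di', 'daerah', '.']
```

Note that a phrase tag does not imply a multi-word token: `pergi` is a single word tagged `VP`.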
### Stopword
`nlp-id` also provides a list of Indonesian stopwords.

    from nlp_id.stopword import StopWord
    stopword = StopWord()
    stopword.get_stopword()
    # [{list_of_nlp_id_stopword}]
Stopword removal removes every Indonesian stopword from the given text.

    from nlp_id.stopword import StopWord
    text = "Lionel Messi pergi Ke pasar di area Jakarta Pusat"  # single sentence
    stopword = StopWord()
    stopword.remove_stopword(text)
    # Lionel Messi pergi pasar area Jakarta Pusat

    paragraph = "Lionel Messi pergi Ke pasar di area Jakarta Pusat itu. Sedangkan Cristiano Ronaldo ke pasar Di area Jakarta Selatan. Dan mereka tidak bertemu begini-begitu."  # multiple sentences
    stopword.remove_stopword(paragraph)
    # Lionel Messi pergi pasar area Jakarta Pusat. Cristiano Ronaldo pasar area Jakarta Selatan. bertemu.
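
The examples show that matching ignores case (`Ke` and `Di` are removed). A minimal sketch of case-insensitive removal is given below. It is not the library's implementation: `remove_stopwords` is a hypothetical function, the two-word stopword list is a stand-in for the list returned by `get_stopword()`, and unlike the library output above this simple version also drops any punctuation attached to a removed word.

```python
def remove_stopwords(text, stopwords):
    """Drop whitespace-separated words that match a stopword, ignoring case."""
    stop = {word.lower() for word in stopwords}
    kept = [
        word for word in text.split()
        # Ignore surrounding punctuation when comparing against the list.
        if word.strip(".,!?").lower() not in stop
    ]
    return " ".join(kept)


toy_stopwords = ["ke", "di"]  # stand-in for the real stopword list
text = "Lionel Messi pergi Ke pasar di area Jakarta Pusat"
print(remove_stopwords(text, toy_stopwords))
# Lionel Messi pergi pasar area Jakarta Pusat
```

Converting the list to a lowercase set once keeps each membership test cheap, which matters when the stopword list is long.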
## Training and Evaluation
Our model was trained on a dataset of stories from kumparan. It reaches about 93% accuracy on our test set.
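
The text does not say which model the figure refers to or how the accuracy was computed. For a tagger, one common reading is token-level accuracy: the share of tokens whose predicted tag equals the gold tag. The sketch below illustrates that reading only; `token_accuracy` and the sample tag sequences are made up for the example and are not taken from the project's evaluation.

```python
def token_accuracy(gold, predicted):
    """Fraction of positions where the predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Made-up tag sequences; the second prediction is wrong on purpose.
gold = ["NNP", "NNP", "VB", "IN", "NN"]
predicted = ["NNP", "NN", "VB", "IN", "NN"]
print(token_accuracy(gold, predicted))  # 0.8
```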
## Citation
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4556870.svg)](https://doi.org/10.5281/zenodo.4556870)
Raw data
{
"_id": null,
"home_page": "https://github.com/kumparan/nlp-id",
"name": "nlp-id",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "Indonesian,Bahasa,NLP",
"author": "Frandy Eddy, Dhanang Hadhi Sasmita, Zavli Juwantara",
"author_email": "eddy.frandy@gmail.com, dhananghadhi@gmail.com, juwantaraz@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/85/b9/9306d664c8bbea56d548f2e4b286a47145b2f9ef363a046b2e1f79beb3c6/nlp_id-0.1.15.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Kumparan's NLP Services",
"version": "0.1.15.0",
"project_urls": {
"Homepage": "https://github.com/kumparan/nlp-id"
},
"split_keywords": [
"indonesian",
"bahasa",
"nlp"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "85b99306d664c8bbea56d548f2e4b286a47145b2f9ef363a046b2e1f79beb3c6",
"md5": "ecd240129297e9cc74f19846ce8793e2",
"sha256": "2a5e1a3e87170eb3dbde178b30ffc9a63c234a2d1d52b8ff84407ed16f689b31"
},
"downloads": -1,
"filename": "nlp_id-0.1.15.0.tar.gz",
"has_sig": false,
"md5_digest": "ecd240129297e9cc74f19846ce8793e2",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 54776725,
"upload_time": "2023-07-11T02:31:22",
"upload_time_iso_8601": "2023-07-11T02:31:22.761160Z",
"url": "https://files.pythonhosted.org/packages/85/b9/9306d664c8bbea56d548f2e4b286a47145b2f9ef363a046b2e1f79beb3c6/nlp_id-0.1.15.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-07-11 02:31:22",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "kumparan",
"github_project": "nlp-id",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "nlp-id"
}