spanish-nlp


Name: spanish-nlp
Version: 0.2.11
Summary: A package for NLP in Spanish
Upload time: 2023-04-24 00:22:58
Requires Python: <3.11, >=3.6
Keywords: augmentation, clasificacion, classifier, español, language, lenguaje, nlp, pln, preprocesamiento, preprocess, spanish
# Spanish NLP

## Introduction

Spanish NLP is the first low-code Python library for Natural Language Processing in Spanish. It provides three main modules:

- **Preprocess**: offers several text preprocessing options to clean and prepare texts for further analysis.
- **Classify**: lets users quickly classify texts with different pre-trained models.
- **Augmentation**: generates synthetic data, which is useful for enlarging labeled datasets and improving results when training classification models.
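
The modules can be chained. The snippet below is a minimal sketch, not taken from the official documentation: it cleans a raw text and then classifies it, reusing the `SpanishPreprocess` and `SpanishClassifier` APIs shown in the Usage examples further down, and assuming that any `SpanishPreprocess` options left unspecified fall back to sensible defaults.

```python
from spanish_nlp import preprocess, classifiers

# Clean a raw text, then run it through a pre-trained classifier.
# Parameter and model names come from the examples below; unspecified
# SpanishPreprocess options are assumed to have defaults.
sp = preprocess.SpanishPreprocess(lower=True, remove_url=True, remove_emojis=True)
sc = classifiers.SpanishClassifier(model_name="sentiment_analysis", device="cpu")

raw = "¡Me encantó la película! 🍿 https://example.com"
clean = sp.transform(raw, debug=False)
print(sc.predict(clean))
```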

## Installation

Spanish NLP can be installed via pip:

```bash
pip install spanish_nlp
```

## Usage

### Preprocessing

For more information, see the [Jupyter Notebook example](https://github.com/jorgeortizfuentes/spanish_nlp/blob/main/examples/Preprocess.ipynb).

To preprocess text, import the preprocess module and instantiate `SpanishPreprocess` with the desired parameters:

```python
from spanish_nlp import preprocess
sp = preprocess.SpanishPreprocess(
        lower=False,
        remove_url=True,
        remove_hashtags=False,
        split_hashtags=True,
        normalize_breaklines=True,
        remove_emoticons=False,
        remove_emojis=False,
        convert_emoticons=False,
        convert_emojis=False,
        normalize_inclusive_language=True,
        reduce_spam=True,
        remove_vowels_accents=True,
        remove_multiple_spaces=True,
        remove_punctuation=True,
        remove_unprintable=True,
        remove_numbers=True,
        remove_stopwords=False,
        stopwords_list=None,
        lemmatize=False,
        stem=False,
        remove_html_tags=True,
)

test_text = """𝓣𝓮𝔁𝓽𝓸 𝓭𝓮 𝓹𝓻𝓾𝓮𝓫𝓪

<b>Holaaaaaaaa a todxs </b>, este es un texto de prueba :) a continuación les mostraré un poema de Roberto Bolaño llamado "Los perros románticos" 🤭👀😅

https://www.poesi.as/rb9301.htm

¡Me gustan los pingüinos! Sí, los PINGÜINOS 🐧🐧🐧 🐧 #VivanLosPinguinos #SíSeñor #PinguinosDelMundoUníos #ÑanduesDelMundoTambién

Si colaboras con este repositorio te puedes ganar $100.000 (en dinero falso). O tal vez 20 pingüinos. Mi teléfono es +561212121212"""

print(sp.transform(test_text, debug=False))
```

Output:

```bash
hola a todos este es un texto de prueba:) a continuacion los mostrare un poema de roberto bolaño llamado los perros romanticos 🤭 👀 😅
me gustan los pinguinos si los pinguinos 🐧 🐧 🐧 🐧 vivan los pinguinos si señor pinguinos del mundo unios ñandues del mundo tambien
si colaboras con este repositorio te puedes ganar en dinero falso o tal vez pinguinos mi telefono es
```
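
The same class can be configured more aggressively. The following is a sketch, not taken from the official examples: it reuses parameter names from the constructor above to lowercase the text, strip stopwords, and stem the remaining tokens, assuming that omitted parameters are optional and that `stopwords_list=None` falls back to a built-in Spanish stopword list.

```python
from spanish_nlp import preprocess

# A stricter configuration: lowercase, drop stopwords, and stem tokens.
# Parameter names come from the example above; omitted ones are assumed optional.
sp_aggressive = preprocess.SpanishPreprocess(
        lower=True,
        remove_url=True,
        remove_punctuation=True,
        remove_numbers=True,
        remove_stopwords=True,
        stopwords_list=None,  # assumed to fall back to a default stopword list
        stem=True,
        lemmatize=False,
        remove_html_tags=True,
)

print(sp_aggressive.transform("Los perros románticos corrían por la madrugada.", debug=False))
```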

### Classification

For more information, see the [Jupyter Notebook example](https://github.com/jorgeortizfuentes/spanish_nlp/blob/main/examples/Classify.ipynb).

#### Available classifiers

- Hate Speech (hate_speech)
- Incivility (incivility)
- Toxic Speech (toxic_speech)
- Sentiment Analysis (sentiment_analysis)
- Emotion Analysis (emotion_analysis)
- Irony Analysis (irony_analysis)
- Sexist Analysis (sexist_analysis)
- Racism Analysis (racism_analysis)

#### Classification Example

```python
from spanish_nlp import classifiers

sc = classifiers.SpanishClassifier(model_name="hate_speech", device='cpu')
# DISCLAIMER: The following message is merely an example of hate speech and does not represent the views of the author or contributors.
t1 =  "LAS MUJERES Y GAYS DEBERIAN SER EXTERMINADOS"
t2 = "El presidente convocó a una reunión a los representantes de los partidos políticos"
p1 = sc.predict(t1)
p2 = sc.predict(t2)

print("Text 1: ", t1)
print("Prediction 1: ", p1)
print("Text 2: ", t2)
print("Prediction 2: ", p2)
```

Output:

```bash
Text 1:  LAS MUJERES Y GAYS DEBERÍAN SER EXTERMINADOS
Prediction 1:  {'hate_speech': 0.7544152736663818, 'not_hate_speech': 0.24558477103710175}
Text 2:  El presidente convocó a una reunión a los representantes de los partidos políticos
Prediction 2:  {'not_hate_speech': 0.9793208837509155, 'hate_speech': 0.02067909575998783}
```
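
Any of the names listed above can be passed as `model_name`. Below is a brief sketch, assuming `predict` returns the same label-to-score dictionary shape for other models, that runs the sentiment analysis classifier over a small batch of texts:

```python
from spanish_nlp import classifiers

# Sentiment analysis over a small batch; the model name comes from the list above.
sentiment = classifiers.SpanishClassifier(model_name="sentiment_analysis", device="cpu")

texts = [
    "La atención al cliente fue excelente, volveré pronto.",
    "El servicio fue lento y la comida llegó fría.",
]

for text in texts:
    # predict is assumed to return a dict of label -> score, as in the hate_speech example.
    print(text, "->", sentiment.predict(text))
```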

### Augmentation

For more information, see the [Jupyter Notebook example](https://github.com/jorgeortizfuentes/spanish_nlp/blob/main/examples/Data%20Augmentation.ipynb).

#### Available Augmentation Models

- Spelling augmentation
  - Keyboard spelling method
  - OCR spelling method
  - Random spelling replace method
  - Grapheme spelling
  - Word spelling
  - Remove punctuation
  - Remove spaces
  - Remove accents
  - Lowercase
  - Uppercase
  - Randomcase
  - All method
- Masked augmentation
  - Substitute method (`sustitute` in the API)
  - Insert method
- Other models under development (such as Synonyms, WordEmbeddings, GenerativeOpenSource, GenerativeOpenAI, BackTranslation, AbstractiveSummarization)

#### Augmentation Models Examples

```python
from spanish_nlp import augmentation

# Spelling augmenter that introduces OCR-style character confusions
ocr = augmentation.Spelling(method="ocr",
                            stopwords="default",
                            aug_percent=0.3,
                            tokenizer="default")

# Spelling augmenter that swaps graphemes with common misspellings (e.g. b/v, s/z)
grapheme_spelling = augmentation.Spelling(method="grapheme_spelling",
                                          stopwords="default",
                                          aug_percent=0.3,
                                          tokenizer="default")

# Masked-language-model augmenter that substitutes tokens using BETO
masked_sustitute = augmentation.Masked(method="sustitute",
                                       model="dccuchile/bert-base-spanish-wwm-cased",
                                       tokenizer="default",
                                       stopwords="default",
                                       aug_percent=0.4,
                                       device="cpu",
                                       top_k=10)


text = "En aquel tiempo yo tenía veinte años y estaba loco. Había perdido un país pero había ganado un sueño. Y si tenía ese sueño lo demás no importaba. Ni trabajar ni rezar ni estudiar en la madrugada junto a los perros románticos."

new_texts = [text]
new_texts.append(ocr.augment(text, num_samples=1, num_workers=1))
new_texts.append(grapheme_spelling.augment(text, num_samples=1, num_workers=1))
new_texts.append(masked_sustitute.augment(text, num_samples=1))

for t in new_texts:
    print(t)
    print("---")
```

Output:

```bash
En aquel tiempo yo tenía veinte años y estaba loco. Había perdido un país pero había ganado un sueño. Y si tenía ese sueño lo demás no importaba. Ni trabajar ni rezar ni estudiar en la madrugada junto a los perros románticos.
---
['En a9uel tiempo yo tenía veint3 años y e8ta8a 1oco. Había Rerd1dQ un RaíB pePQ había ganado Vn su3ño. Y si tenía es3 BVeno lo 0emáB n0 iWRQPtaEa. N1 trabajar ni rezar ni 3s7ud1ar en la maOrVga0a junto a 1os p3rPo8 Pománt1Go5.']
---
['Em akel tiempo yo tenía veinte años y estaba loco. Había perdido un país pero  abía janado um sueño. Y si temía ese sueño lo demás no importava. Ni trabajar ni rezar ni estudiar em la nadrugada junto a los perros románticos.']
---
['En aquel tiempo yo tenía veinte años y estaba loco. Había perdido un país pero había ganado un sueño. Y si tenía mi sueño lo demás no importaba. ni trabajar ni rezar ni estudiar en la madrugada junto a los clubes románticos.']
---
```
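
Other spelling methods from the list above follow the same pattern. The sketch below assumes the "Keyboard spelling method" is selected with `method="keyboard"` (by analogy with `"ocr"` and `"grapheme_spelling"`) and that `augment` returns a list of strings, as in the output above.

```python
from spanish_nlp import augmentation

# Keyboard-typo augmenter; the method name "keyboard" is assumed from the list
# of available Spelling methods, with the same constructor arguments as above.
keyboard = augmentation.Spelling(method="keyboard",
                                 stopwords="default",
                                 aug_percent=0.2,
                                 tokenizer="default")

text = "Los perros románticos corrían por la madrugada."

# Generate several noisy variants of the same sentence.
for variant in keyboard.augment(text, num_samples=3, num_workers=1):
    print(variant)
```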

## License

Spanish NLP is licensed under the [GNU General Public License v3.0](https://github.com/jorgeortizfuentes/spanish_nlp/blob/main/LICENSE).

## Author

This project was developed by [Jorge Ortiz-Fuentes](https://ortizfuentes.com/), Linguist and Data Scientist from Chile.

## Acknowledgements

We would like to express our gratitude to the Millennium Institute for Foundational Research and the Department of Computer Science at the University of Chile for supporting the development of Spanish NLP. Special thanks to Felipe Bravo-Márquez, Ricardo Cordova and Hernán Sarmiento for their knowledge, support and invaluable contributions to the project.

## Contributing

Contributions to Spanish NLP are welcome! Please see the [contributing guide](https://github.com/users/jorgeortizfuentes/projects/1) for more information.

            
