#
<div align="center" markdown>
![project logo](https://raw.githubusercontent.com/huspacy/huspacy/develop/.github/resources/logo.png)
[![python version](https://img.shields.io/badge/Python-%3E=3.9-blue)](https://github.com/huspacy/huspacy)
[![spacy](https://img.shields.io/badge/built%20with-spaCy-09a3d5.svg)](https://spacy.io)
![PyPI - Wheel](https://img.shields.io/pypi/wheel/huspacy)
[![PyPI version](https://badge.fury.io/py/huspacy.svg)](https://pypi.org/project/huspacy/)
[![Demo](https://img.shields.io/badge/Try%20the-Demo-important)](https://huggingface.co/spaces/huspacy/demo)
<br/>
[![Build](https://github.com/huspacy/huspacy/actions/workflows/build.yml/badge.svg)](https://github.com/huspacy/huspacy/actions/workflows/build.yml)
[![Models](https://github.com/huspacy/huspacy/actions/workflows/test_models.yml/badge.svg)](https://github.com/huspacy/huspacy/actions/workflows/test_models.yml)
[![Downloads](https://static.pepy.tech/personalized-badge/huspacy?period=month&units=international_system&left_color=grey&right_color=green&left_text=Downloads/month)](https://pepy.tech/project/huspacy)
[![Downloads](https://static.pepy.tech/personalized-badge/huspacy?period=total&units=international_system&left_color=grey&right_color=green&left_text=Downloads)](https://pepy.tech/project/huspacy)
[![Hits](https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2Fhuspacy%2Fhuspacy&count_bg=%2379C83D&title_bg=%23555555&icon=&icon_color=%23E7E7E7&title=hits&edge_flat=true)](https://hits.seeyoufarm.com)
[![stars](https://img.shields.io/github/stars/huspacy/huspacy?style=social)](https://github.com/huspacy/huspacy)
</div>
HuSpaCy is a [spaCy](https://spacy.io) library providing industrial-strength Hungarian language processing facilities through spaCy models.
The released pipelines consist of a tokenizer, sentence splitter, lemmatizer, tagger (predicting morphological features as well), dependency parser and a named entity recognition module.
Word and phrase embeddings are also available through spaCy's API.
All models have high throughput, decent memory usage and close to state-of-the-art accuracy.
A live demo is available [here](https://huggingface.co/spaces/huspacy/demo), and model releases are published to the [Hugging Face Hub](https://huggingface.co/huspacy/).
This repository contains material to build HuSpaCy and all of its models in a reproducible way.
# Installation
To get started, first download one of the models. The easiest way is to install `huspacy` (from [PyPI](https://pypi.org/project/huspacy/)) and then fetch a model through its API.
```bash
pip install huspacy
```
```python
import huspacy
# Download the latest CPU optimized model
huspacy.download()
```
### Install the models directly
You can install the latest models directly from 🤗 Hugging Face Hub:
- CPU optimized [large model](https://huggingface.co/huspacy/hu_core_news_lg): `pip install https://huggingface.co/huspacy/hu_core_news_lg/resolve/main/hu_core_news_lg-any-py3-none-any.whl`
- GPU optimized [transformers model](https://huggingface.co/huspacy/hu_core_news_trf): `pip install https://huggingface.co/huspacy/hu_core_news_trf/resolve/main/hu_core_news_trf-any-py3-none-any.whl`
To speed up inference on GPU, CUDA must be installed as described in [https://spacy.io/usage](https://spacy.io/usage).
Read more about the models [here](https://huspacy.github.io/models).
# Quickstart
HuSpaCy is fully compatible with [spaCy's API](https://spacy.io/api/doc/); newcomers can get started with the [spaCy 101](https://spacy.io/usage/spacy-101) guide.
Although HuSpaCy models can be loaded with `spacy.load(...)`, the tool also provides convenience methods for accessing downloaded models.
```python
# Load the model using spacy.load(...)
import spacy
nlp = spacy.load("hu_core_news_lg")
```
```python
# Load the default large model (if downloaded)
import huspacy
nlp = huspacy.load()
```
```python
# Load the model directly as a module
import hu_core_news_lg
nlp = hu_core_news_lg.load()
```
To process texts, simply call the loaded model (i.e. the [`nlp` callable object](https://spacy.io/api/language#call)):
<!--pytest-codeblocks:cont-->
```python
doc = nlp("Csiribiri csiribiri zabszalma - négy csillag közt alszom ma.")
```
As HuSpaCy is built on spaCy, the returned [`doc` document](https://spacy.io/api/doc#_title) contains all the annotations given by the pipeline components.
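To illustrate those annotations, the following sketch gathers sentences, token-level attributes and named entities into plain Python data. The helper name `summarize_doc` is hypothetical and not part of HuSpaCy. The attributes it reads (`doc.sents`, `token.lemma_`, `token.pos_`, `token.morph`, `token.dep_`, `doc.ents`, `ent.label_`) are standard spaCy API.

```python
def summarize_doc(doc):
    """Collect the main annotations of a processed spaCy document.

    `summarize_doc` is a hypothetical helper, not a HuSpaCy function.
    """
    return {
        # Sentence boundaries set by the pipeline
        "sentences": [sent.text for sent in doc.sents],
        # (text, lemma, UD part-of-speech tag, morphological features, dependency label)
        "tokens": [
            (tok.text, tok.lemma_, tok.pos_, str(tok.morph), tok.dep_)
            for tok in doc
        ],
        # Named entities with their labels
        "entities": [(ent.text, ent.label_) for ent in doc.ents],
    }
```

Calling `summarize_doc(doc)` on the `doc` created above returns a dictionary that can be printed or serialized. The exact labels depend on the model that produced the document.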
API documentation is available on [our website](https://huspacy.github.io/).
# Models overview
We provide several pretrained models:
1. [`hu_core_news_lg`](https://huggingface.co/huspacy/hu_core_news_lg) is a CNN-based large model which achieves a good
balance between accuracy and processing speed. This default model provides tokenization, sentence splitting,
part-of-speech tagging (UD labels with detailed morphosyntactic features), lemmatization, dependency parsing and named
entity recognition and ships with pretrained word vectors.
2. [`hu_core_news_trf`](https://huggingface.co/huspacy/hu_core_news_trf) is built
on [huBERT](https://huggingface.co/SZTAKI-HLT/hubert-base-cc) and provides the same functionality as the large model
except for the word vectors. It offers much higher accuracy at the price of increased computational resource usage.
We suggest using it with GPU support.
3. [`hu_core_news_md`](https://huggingface.co/huspacy/hu_core_news_md) greatly improves on `hu_core_news_lg`'s
throughput at the cost of some accuracy. This model could be a good choice when processing speed is crucial.
4. [`hu_core_news_trf_xl`](https://huggingface.co/huspacy/hu_core_news_trf_xl) is an experimental model built
on [XLM-RoBERTa-large](https://huggingface.co/xlm-roberta-large). It provides the same functionality as
the `hu_core_news_trf` model, but offers slightly higher accuracy at the price of significantly increased
computational resource usage.
We suggest using it with GPU support.
HuSpaCy's model versions follow [spaCy's versioning scheme](https://spacy.io/models#model-versioning).
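Under that scheme, as we read spaCy's documentation, a model version `a.b.c` encodes the spaCy major version (`a`), the spaCy minor version (`b`) and the model's own revision (`c`). A minimal sketch of a compatibility check built on this assumption follows. `matches_spacy_release` is a hypothetical helper, not a HuSpaCy or spaCy function.

```python
def matches_spacy_release(model_version: str, spacy_version: str) -> bool:
    """Return True when the model's major.minor equals spaCy's major.minor.

    Assumes the `a.b.c` reading of the versioning scheme described above.
    """
    model_major_minor = model_version.split(".")[:2]
    spacy_major_minor = spacy_version.split(".")[:2]
    return model_major_minor == spacy_major_minor
```

With a loaded pipeline, the two inputs could come from `nlp.meta["version"]` and `spacy.__version__`.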
A demo of the models is available at [Hugging Face Spaces](https://huggingface.co/spaces/huspacy/demo).
To learn more about the models' architecture, see
[the relevant sections of spaCy's documentation](https://spacy.io/models#design).
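The static word vectors shipped with the CPU models can be queried through spaCy's standard `has_vector`, `is_punct` and `similarity` members. The sketch below ranks the token pairs of a processed document by similarity. `closest_pairs` is a hypothetical helper name, and the scores it returns depend entirely on the loaded model.

```python
def closest_pairs(doc, top_n: int = 3):
    """Return the `top_n` most similar token pairs as (score, text_a, text_b).

    `closest_pairs` is a hypothetical helper; it only relies on spaCy's
    Token.has_vector, Token.is_punct and Token.similarity.
    """
    # Skip punctuation and tokens the model has no vector for
    tokens = [tok for tok in doc if tok.has_vector and not tok.is_punct]
    pairs = [
        (a.similarity(b), a.text, b.text)
        for i, a in enumerate(tokens)
        for b in tokens[i + 1:]
    ]
    # Highest similarity first
    return sorted(pairs, reverse=True)[:top_n]
```

The helper is meant for documents produced by a model with static vectors, such as `hu_core_news_lg`. The transformer-based models do not include word vectors, as noted in the list above.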
### Comparison
| Models | `md` | `lg` | `trf` | `trf_xl` |
|-----------------|--------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|
| Embeddings | 100d floret | 300d floret | transformer:<br/>[`huBERT`](https://huggingface.co/SZTAKI-HLT/hubert-base-cc) | transformer:<br/>[`XLM-RoBERTa-large`](https://huggingface.co/xlm-roberta-large) |
| Target hardware | CPU | CPU | GPU | GPU |
| Accuracy | ⭑⭑⭑⭒ | ⭑⭑⭑⭑ | ⭑⭑⭑⭑⭒ | ⭑⭑⭑⭑⭑ |
| Resource usage | ⭑⭑⭑⭑⭑ | ⭑⭑⭑⭑ | ⭑⭑ | ⭒ |
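The comparison can be turned into a small selection helper. This is only a sketch: `choose_model` and `load_for_hardware` are hypothetical names, and the policy (transformer model on GPU, `md` on CPU when speed matters most, `lg` otherwise) is an assumption drawn from the table, not an official recommendation. `spacy.prefer_gpu()` is spaCy's standard call for enabling GPU use, and it should run before the pipeline is loaded.

```python
def choose_model(gpu_available: bool, prefer_speed: bool = False) -> str:
    """Pick a model name from the comparison table (assumed policy)."""
    if gpu_available:
        # The experimental trf_xl model is deliberately not auto-selected
        return "hu_core_news_trf"
    return "hu_core_news_md" if prefer_speed else "hu_core_news_lg"


def load_for_hardware(prefer_speed: bool = False):
    """Enable the GPU when possible, then load the matching pipeline."""
    import spacy  # imported lazily so choose_model works without spaCy

    # prefer_gpu() returns True if a GPU was activated, False otherwise
    gpu_available = spacy.prefer_gpu()
    # spacy.load raises OSError if the chosen model is not installed
    return spacy.load(choose_model(gpu_available, prefer_speed))
```

Keeping the policy in a pure function makes it easy to adjust, for example to return `hu_core_news_trf_xl` when accuracy outweighs resource usage.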
# Citation
If you use HuSpaCy or any of its models, please cite it as:
[![arxiv](http://img.shields.io/badge/cs.CL-arXiv%3A2308.12635-B31B1B.svg)](https://arxiv.org/abs/2308.12635)
```bibtex
@InProceedings{HuSpaCy:2023,
author= {Orosz, Gy{\"o}rgy and Szab{\'o}, Gerg{\H{o}} and Berkecz, P{\'e}ter and Sz{\'a}nt{\'o}, Zsolt and Farkas, Rich{\'a}rd},
editor= {Ek{\v{s}}tein, Kamil and P{\'a}rtl, Franti{\v{s}}ek and Konop{\'i}k, Miloslav},
title = {{Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines}},
booktitle = {{Text, Speech, and Dialogue}},
year = "2023",
publisher = {{Springer Nature Switzerland}},
address = {{Cham}},
pages = "58--69",
isbn = "978-3-031-40498-6"
}
```
[![arxiv](http://img.shields.io/badge/cs.CL-arXiv%3A2201.01956-B31B1B.svg)](https://arxiv.org/abs/2201.01956)
```bibtex
@InProceedings{HuSpaCy:2021,
title = {{HuSpaCy: an industrial-strength Hungarian natural language processing toolkit}},
booktitle = {{XVIII. Magyar Sz{\'a}m{\'\i}t{\'o}g{\'e}pes Nyelv{\'e}szeti Konferencia}},
author = {Orosz, Gy{\"o}rgy and Sz{\' a}nt{\' o}, Zsolt and Berkecz, P{\' e}ter and Szab{\' o}, Gerg{\H o} and Farkas, Rich{\' a}rd},
location = {{Szeged}},
pages = "59--73",
year = {2022},
}
```
# Contact
For feature requests, issues and bugs please use the [GitHub Issue Tracker](https://github.com/huspacy/huspacy/issues). Otherwise, reach out to us in the [Discussion Forum](https://github.com/huspacy/huspacy/discussions).
## Authors
HuSpaCy is developed by the [SzegedAI](https://szegedai.github.io/) team, coordinated by [Orosz György](mailto:gyorgy@orosz.link), within the [Hungarian AI National Laboratory, MILAB](https://mi.nemzetilabor.hu/) program.
# License
This library is released under the [Apache 2.0 License](https://github.com/huspacy/huspacy/blob/master/LICENSE).
Trained models have their own license ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)) as described on the [models page](https://huggingface.co/huspacy/).
Raw data
{
"_id": null,
"home_page": "https://github.com/huspacy/huspacy",
"name": "huspacy",
"maintainer": "Gy\u00f6rgy Orosz",
"docs_url": null,
"requires_python": "<3.14,>=3.9",
"maintainer_email": "gyorgy@orosz.link",
"keywords": "nlp, huspacy, Hungarian, text processing, text processing, language processing, text mining, tokenization, sentence boundary detection, sbd, sentence splitting, pos tagging, tagging, lemmatization, ner, named entity recognition, parsing, word embeddings, word vectors, spacy, spacy model",
"author": "SzegedAI, MILAB",
"author_email": "gyorgy@orosz.link",
"download_url": null,
"platform": null,
"bugtrack_url": null,
"license": "Apache-2.0",
"summary": "HuSpaCy: industrial strength Hungarian natural language processing",
"version": "0.12.0",
"project_urls": {
"Homepage": "https://github.com/huspacy/huspacy",
"Repository": "https://github.com/huspacy/huspacy"
},
"split_keywords": [
"nlp",
" huspacy",
" hungarian",
" text processing",
" text processing",
" language processing",
" text mining",
" tokenization",
" sentence boundary detection",
" sbd",
" sentence splitting",
" pos tagging",
" tagging",
" lemmatization",
" ner",
" named entity recognition",
" parsing",
" word embeddings",
" word vectors",
" spacy",
" spacy model"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "e956a0dce9dee23b68570facef00905df1fc8c2e345d570eae04689fd9fb7b6c",
"md5": "bc5fb9b8db75ab62f4fa65b0830ec0ff",
"sha256": "4b645c311822b3df4f694d8f8d5da96948647f91cc130463ae301cfce3689f48"
},
"downloads": -1,
"filename": "huspacy-0.12.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "bc5fb9b8db75ab62f4fa65b0830ec0ff",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<3.14,>=3.9",
"size": 92811,
"upload_time": "2024-10-28T10:30:55",
"upload_time_iso_8601": "2024-10-28T10:30:55.783007Z",
"url": "https://files.pythonhosted.org/packages/e9/56/a0dce9dee23b68570facef00905df1fc8c2e345d570eae04689fd9fb7b6c/huspacy-0.12.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-28 10:30:55",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "huspacy",
"github_project": "huspacy",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "huspacy"
}