spacymoji


| Field           | Value                                                                             |
| --------------- | --------------------------------------------------------------------------------- |
| Name            | spacymoji                                                                         |
| Version         | 3.1.0                                                                             |
| Home page       | https://github.com/explosion/spacymoji                                            |
| Summary         | spaCy pipeline component for adding emoji metadata to Doc, Token and Span objects |
| Upload time     | 2023-05-10 14:44:34                                                               |
| Author          | Explosion                                                                         |
| Requires Python | >=3.6                                                                             |
| License         | MIT                                                                               |
| Requirements    | spacy, emoji, pytest                                                              |

# spacymoji: emoji for spaCy

[spaCy](https://spacy.io) extension and pipeline component for adding emoji
metadata to `Doc` objects. It detects emoji consisting of one or more Unicode
characters and can optionally merge multi-character emoji (combined pictures,
emoji with skin tone modifiers) into one token. Human-readable emoji
descriptions are added as a custom attribute, and you can provide an optional
lookup table with your own descriptions. The extension sets the custom `Doc`,
`Token` and `Span` attributes `._.is_emoji`, `._.emoji_desc`, `._.has_emoji` and
`._.emoji`. For more on custom pipeline components and extension attributes, see
the [spaCy processing pipelines documentation](https://spacy.io/usage/processing-pipelines).
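
To see why merging matters, it can help to look at what a "multi-character" emoji
is made of. The sketch below uses only Python's standard library (it does not
touch spaCy or `spacymoji`) and inspects the two multi-character emoji that
appear in this README's examples. The helper name `code_points` is made up for
this illustration.

```python
import unicodedata


def code_points(text):
    """Return a (code point, Unicode name) pair for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


# Thumbs up followed by the dark skin tone modifier: one picture, two code points.
thumbs_up_dark = "\U0001F44D\U0001F3FF"
# Man, zero width joiner, microphone: one picture, three code points.
man_singer = "\U0001F468\u200D\U0001F3A4"

for sample in (thumbs_up_dark, man_singer):
    # Print only ASCII (lengths, code points, names) so the output is terminal-safe.
    print(len(sample), code_points(sample))
```

Each sample renders as a single picture but is several characters long as far as
Python is concerned, which is why a tokenizer may split it and why merging it
back into one token can be useful.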

Emoji are matched using spaCy's
[`PhraseMatcher`](https://spacy.io/api/phrasematcher), and looked up in the data
table provided by the [`emoji` package](https://github.com/carpedm20/emoji).

[![tests](https://github.com/explosion/spacymoji/actions/workflows/tests.yml/badge.svg)](https://github.com/explosion/spacymoji/actions/workflows/tests.yml)
[![Current Release Version](https://img.shields.io/github/release/explosion/spacymoji.svg?style=flat-square&logo=github)](https://github.com/explosion/spacymoji/releases)
[![pypi Version](https://img.shields.io/pypi/v/spacymoji.svg?style=flat-square&logo=pypi&logoColor=white)](https://pypi.org/project/spacymoji/)

# ⏳ Installation

`spacymoji` requires `spacy` v3.0.0 or higher (the package metadata pins
`spacy>=3.0.0,<4.0.0`). For spaCy v2.x, install `spacymoji==2.0.0`.

```bash
pip install spacymoji
```

# ☝️ Usage

Add the component anywhere in your pipeline using the string name of the
`"emoji"` component factory:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("emoji", first=True)
doc = nlp("This is a test 😻 👍🏿")
assert doc._.has_emoji is True
assert doc[2:5]._.has_emoji is True
assert doc[0]._.is_emoji is False
assert doc[4]._.is_emoji is True
assert doc[5]._.emoji_desc == "thumbs up dark skin tone"
assert len(doc._.emoji) == 2
assert doc._.emoji[1] == ("👍🏿", 5, "thumbs up dark skin tone")
```

`spacymoji` only cares about the token text, so you can use it on a blank
`Language` instance (it should work for all
[available languages](https://spacy.io/usage/models#languages)!), or in a
trained pipeline that you have loaded. If your pipeline includes a tagger,
parser and entity recognizer, add the emoji component with `first=True`, so the
spans are merged right after tokenization and _before_ the document is parsed.
If your text contains a lot of emoji, this might even give you a nice boost in
parser accuracy.

## Available attributes

The extension sets attributes on `Doc`, `Span` and `Token` objects. You can
change the attribute names (and other parameters of the emoji component) by
passing them via the `config` parameter of the `nlp.add_pipe(...)` method. For
more details on custom components and attributes, see the
[processing pipelines documentation](https://spacy.io/usage/processing-pipelines#custom-components).

| Attribute            | Type                       | Description                                                   |
| -------------------- | -------------------------- | ------------------------------------------------------------- |
| `Token._.is_emoji`   | bool                       | Whether the token is an emoji.                                |
| `Token._.emoji_desc` | str                        | A human-readable description of the emoji.                    |
| `Doc._.has_emoji`    | bool                       | Whether the document contains emoji.                          |
| `Doc._.emoji`        | List[Tuple[str, int, str]] | `(emoji, index, description)` tuples of the document's emoji. |
| `Span._.has_emoji`   | bool                       | Whether the span contains emoji.                              |
| `Span._.emoji`       | List[Tuple[str, int, str]] | `(emoji, index, description)` tuples of the span's emoji.     |
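
Since `Doc._.emoji` and `Span._.emoji` are documented as plain lists of
`(emoji, index, description)` tuples, they can be post-processed with ordinary
Python. The following is a minimal sketch that runs without spaCy: the sample
list is hand-written, its first tuple is the one from the usage example above,
and its second tuple (index 9) is a hypothetical extra occurrence. The helper
name `summarize` is also made up for this illustration.

```python
from collections import Counter

# Hand-written data shaped like the documented value of `Doc._.emoji`.
# The second entry is hypothetical and only there to show grouping.
sample_emoji = [
    ("\U0001F44D\U0001F3FF", 5, "thumbs up dark skin tone"),
    ("\U0001F44D\U0001F3FF", 9, "thumbs up dark skin tone"),
]


def summarize(emoji_tuples):
    """Count each description and collect the indices where it occurs."""
    counts = Counter(desc for _, _, desc in emoji_tuples)
    positions = {}
    for _, index, desc in emoji_tuples:
        positions.setdefault(desc, []).append(index)
    return counts, positions


counts, positions = summarize(sample_emoji)
print(dict(counts))
print(positions)
```

With a real pipeline, you would pass `doc._.emoji` (or your renamed attribute)
instead of `sample_emoji`.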

## Settings

You can configure the `emoji` factory by setting any of the following parameters
in the `config` dictionary:

| Setting       | Type                      | Description                                                                                                                            |
| ------------- | ------------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
| `attrs`       | Tuple[str, str, str, str] | Attributes to set on the `._` property. Defaults to `('has_emoji', 'is_emoji', 'emoji_desc', 'emoji')`.                                |
| `pattern_id`  | str                       | ID of match pattern, defaults to `'EMOJI'`. Can be changed to avoid ID conflicts.                                                      |
| `merge_spans` | bool                      | Merge spans containing multi-character emoji, defaults to `True`. Will only merge combined emoji resulting in one icon, not sequences. |
| `lookup`      | Dict[str, str]            | Optional lookup table that maps emoji strings to custom descriptions, e.g. translations or other annotations.                          |

```python
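import spacy

nlp = spacy.blank("en")  # fresh pipeline; only the token text matters
# Custom attribute names, plus a lookup table with a custom description
emoji_config = {"attrs": ("has_e", "is_e", "e_desc", "e"), "lookup": {"👨‍🎤": "David Bowie"}}
nlp.add_pipe("emoji", first=True, config=emoji_config)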
doc = nlp("We can be 👨‍🎤 heroes")
assert doc[3]._.is_e
assert doc[3]._.e_desc == "David Bowie"
```

If you're training a pipeline, you can define the component config in your
[`config.cfg`](https://spacy.io/usage/training):

```ini
[nlp]
pipeline = ["emoji", "ner"]
# ...

[components.emoji]
factory = "emoji"
merge_spans = false
```
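
The `[components.emoji]` block carries the same settings you would otherwise
pass through `config` in `nlp.add_pipe(...)`. As a rough sanity check of the
fragment above, the sketch below reads it with Python's standard library only.
It assumes the values are JSON-style literals, which holds for this fragment;
spaCy's own config loader does considerably more (validation, interpolation),
so treat this as an illustration of the fragment's shape rather than of how
spaCy reads it. The names `CONFIG_FRAGMENT` and `emoji_settings` are made up
for this example.

```python
import configparser
import json

# The config fragment from above, minus the elided "# ..." part.
CONFIG_FRAGMENT = """
[nlp]
pipeline = ["emoji", "ner"]

[components.emoji]
factory = "emoji"
merge_spans = false
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_FRAGMENT)

# Assumption: every value in this fragment is a valid JSON literal.
pipeline = json.loads(parser["nlp"]["pipeline"])
emoji_settings = {
    key: json.loads(value) for key, value in parser["components.emoji"].items()
}

print(pipeline)
print(emoji_settings)
```

Here `merge_spans = false` decodes to Python's `False`, and `"emoji"` is listed
before `"ner"` in the pipeline, matching the advice to run the emoji component
first.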

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/explosion/spacymoji",
    "name": "spacymoji",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": "",
    "keywords": "",
    "author": "Explosion",
    "author_email": "contact@explosion.ai",
    "download_url": "https://files.pythonhosted.org/packages/ef/25/fc60fecc03e34078f32402694139bab644e6f64a45341a3270539a93bf8b/spacymoji-3.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "spaCy pipeline component for adding emoji metadata to Doc, Token and Span objects",
    "version": "3.1.0",
    "project_urls": {
        "Homepage": "https://github.com/explosion/spacymoji"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "3c5dcf1f18f9c3a88fc2cd51aad40f7bfeb9657d3c2c937ff950ede3e6029079",
                "md5": "279745c4d6abdc0aebd70641e7c5c687",
                "sha256": "443df056e4bf23afb1f6ff8a372d9088e02d5eb2bd4a37a51fa0d19c35d0312b"
            },
            "downloads": -1,
            "filename": "spacymoji-3.1.0-py2.py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "279745c4d6abdc0aebd70641e7c5c687",
            "packagetype": "bdist_wheel",
            "python_version": "py2.py3",
            "requires_python": ">=3.6",
            "size": 8456,
            "upload_time": "2023-05-10T14:44:32",
            "upload_time_iso_8601": "2023-05-10T14:44:32.344479Z",
            "url": "https://files.pythonhosted.org/packages/3c/5d/cf1f18f9c3a88fc2cd51aad40f7bfeb9657d3c2c937ff950ede3e6029079/spacymoji-3.1.0-py2.py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ef25fc60fecc03e34078f32402694139bab644e6f64a45341a3270539a93bf8b",
                "md5": "da4cff8205125923f6006be335acb79b",
                "sha256": "55f171fd88bb1131ea7dd19754541c3f9206b19d608ed965b5f95e1e81107e94"
            },
            "downloads": -1,
            "filename": "spacymoji-3.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "da4cff8205125923f6006be335acb79b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 8992,
            "upload_time": "2023-05-10T14:44:34",
            "upload_time_iso_8601": "2023-05-10T14:44:34.119258Z",
            "url": "https://files.pythonhosted.org/packages/ef/25/fc60fecc03e34078f32402694139bab644e6f64a45341a3270539a93bf8b/spacymoji-3.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-05-10 14:44:34",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "explosion",
    "github_project": "spacymoji",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "spacy",
            "specs": [
                [
                    ">=",
                    "3.0.0"
                ],
                [
                    "<",
                    "4.0.0"
                ]
            ]
        },
        {
            "name": "emoji",
            "specs": [
                [
                    ">=",
                    "2.0.0"
                ],
                [
                    "<",
                    "3.0.0"
                ]
            ]
        },
        {
            "name": "pytest",
            "specs": [
                [
                    ">=",
                    "5.2.0"
                ]
            ]
        }
    ],
    "lcname": "spacymoji"
}
        