ftfy


- Name: ftfy
- Version: 6.2.0
- Summary: Fixes mojibake and other problems with Unicode, after the fact
- Upload time: 2024-03-15 22:38:57
- Author: Robyn Speer
- Requires Python: >=3.8,<4
- License: Apache-2.0

# ftfy: fixes text for you

[![PyPI package](https://badge.fury.io/py/ftfy.svg)](https://badge.fury.io/py/ftfy)
[![Docs](https://readthedocs.org/projects/ftfy/badge/?version=latest)](https://ftfy.readthedocs.org/en/latest/)

```python

>>> from ftfy import fix_encoding
>>> print(fix_encoding("(à¸‡'âŒ£')à¸‡"))
(ง'⌣')ง

```

The full documentation of ftfy is available at [ftfy.readthedocs.org](https://ftfy.readthedocs.org). The documentation covers a lot more than this README, so here are
some links into it:

- [Fixing problems and getting explanations](https://ftfy.readthedocs.io/en/latest/explain.html)
- [Configuring ftfy](https://ftfy.readthedocs.io/en/latest/config.html)
- [Encodings ftfy can handle](https://ftfy.readthedocs.io/en/latest/encodings.html)
- [“Fixer” functions](https://ftfy.readthedocs.io/en/latest/fixes.html)
- [Is ftfy an encoding detector?](https://ftfy.readthedocs.io/en/latest/detect.html)
- [Heuristics for detecting mojibake](https://ftfy.readthedocs.io/en/latest/heuristic.html)
- [Support for “bad” encodings](https://ftfy.readthedocs.io/en/latest/bad_encodings.html)
- [Command-line usage](https://ftfy.readthedocs.io/en/latest/cli.html)
- [Citing ftfy](https://ftfy.readthedocs.io/en/latest/cite.html)

## Testimonials

- “My life is livable again!”
  — [@planarrowspace](https://twitter.com/planarrowspace)
- “A handy piece of magic”
  — [@simonw](https://twitter.com/simonw)
- “Saved me a large amount of frustrating dev work”
  — [@iancal](https://twitter.com/iancal)
- “ftfy did the right thing right away, with no faffing about. Excellent work, solving a very tricky real-world (whole-world!) problem.”
  — Brennan Young
- “I have no idea when I’m gonna need this, but I’m definitely bookmarking it.”
  — [/u/ocrow](https://reddit.com/u/ocrow)
- “9.2/10”
  — [pylint](https://bitbucket.org/logilab/pylint/)

## What it does

Here are some examples (found in the real world) of what ftfy can do:

ftfy can fix mojibake (encoding mix-ups) by detecting patterns of characters that were clearly meant to be UTF-8 but were decoded as something else:

    >>> import ftfy
    >>> ftfy.fix_text('âœ” No problems')
    '✔ No problems'

Does this sound impossible? It's really not. UTF-8 is a well-designed encoding that makes it obvious when it's being misused, and a string of mojibake usually contains all the information we need to recover the original string.
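
To make the claim concrete, here is a minimal sketch that uses only Python's standard codecs. It assumes the mix-up was UTF-8 bytes read as Windows-1252, which is one common case. The helper names `garble` and `naive_fix` are made up for this illustration and are not part of ftfy, whose heuristics are considerably more careful.

```python
def garble(text: str) -> str:
    """Simulate the mix-up: encode as UTF-8, then misread as Windows-1252."""
    return text.encode("utf-8").decode("windows-1252")


def naive_fix(text: str) -> str:
    """Reverse one layer of the mix-up, with no safety checks at all."""
    return text.encode("windows-1252").decode("utf-8")


original = "✔ No problems"
broken = garble(original)
print(broken)             # âœ” No problems
print(naive_fix(broken))  # ✔ No problems
```

The garbled string still carries every byte of the original, which is why the round trip is lossless here.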

ftfy can fix multiple layers of mojibake simultaneously:

    >>> ftfy.fix_text('The Mona Lisa doesnÃƒÂ¢Ã¢â€šÂ¬Ã¢â€žÂ¢t have eyebrows.')
    "The Mona Lisa doesn't have eyebrows."

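As a rough illustration of what "multiple layers" means, the sketch below garbles a sentence three times in a row and then peels the layers off again. It assumes every layer is the same UTF-8-read-as-Windows-1252 mistake; `peel_layers` is a hypothetical helper, not an ftfy function, and unlike `ftfy.fix_text` it leaves the curly apostrophe curly.

```python
from typing import Tuple


def peel_layers(text: str, max_layers: int = 10) -> Tuple[str, int]:
    """Undo UTF-8-read-as-Windows-1252 layers until none can be undone."""
    layers = 0
    while layers < max_layers:
        try:
            candidate = text.encode("windows-1252").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # not reversible this way, so stop here
        if candidate == text:
            break  # pure ASCII round-trips unchanged
        text, layers = candidate, layers + 1
    return text, layers


# Build a triple-garbled string from a known-good one.
clean = "The Mona Lisa doesn\u2019t have eyebrows."
garbled = clean
for _ in range(3):
    garbled = garbled.encode("utf-8").decode("windows-1252")

fixed, count = peel_layers(garbled)
print(count)           # 3
print(fixed == clean)  # True
```
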
It can fix mojibake that has had "curly quotes" applied on top of it, which cannot be consistently decoded until the quotes are uncurled:

    >>> ftfy.fix_text("l’humanitÃ©")
    "l'humanité"

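The sketch below shows why the quotes get in the way, again assuming a single UTF-8-read-as-Windows-1252 layer. The curly quote encodes to a lone byte that is not valid UTF-8, so the naive reversal fails until the quote is straightened. Both helpers are illustrative names, not ftfy's API.

```python
def naive_fix(text: str) -> str:
    """Reverse one UTF-8-read-as-Windows-1252 layer, with no safety checks."""
    return text.encode("windows-1252").decode("utf-8")


def uncurl_then_fix(text: str) -> str:
    """Straighten single curly quotes first, then reverse the mix-up."""
    uncurled = text.replace("\u2018", "'").replace("\u2019", "'")
    return naive_fix(uncurled)


broken = "l\u2019humanit\u00c3\u00a9"  # l’humanitÃ©

try:
    naive_fix(broken)
    decoded_directly = True
except UnicodeDecodeError:
    # The curly quote became a stray byte that UTF-8 rejects.
    decoded_directly = False

print(decoded_directly)         # False
print(uncurl_then_fix(broken))  # l'humanité
```
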
ftfy can fix mojibake that would have included the character U+A0 (non-breaking space), but the U+A0 was turned into an ASCII space and then combined with another following space:

    >>> ftfy.fix_text('Ã\xa0 perturber la rÃ©flexion')
    'à perturber la réflexion'
    >>> ftfy.fix_text('Ã perturber la rÃ©flexion')
    'à perturber la réflexion'
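
One way to picture the second case: put the lost non-breaking space back before reversing the mix-up. The sketch below does that with a blunt string replacement. It is an assumption-laden illustration (`restore_nbsp_then_fix` is a made-up name), and applied blindly it would risk false positives that ftfy's real heuristics are designed to avoid.

```python
def naive_fix(text: str) -> str:
    """Reverse one UTF-8-read-as-Windows-1252 layer, with no safety checks."""
    return text.encode("windows-1252").decode("utf-8")


def restore_nbsp_then_fix(text: str) -> str:
    """Treat 'Ã' + space as a damaged 'Ã' + NBSP + space, then reverse."""
    repaired = text.replace("\u00c3 ", "\u00c3\u00a0 ")
    return naive_fix(repaired)


damaged = "\u00c3 perturber la r\u00c3\u00a9flexion"  # Ã perturber la rÃ©flexion
print(restore_nbsp_then_fix(damaged))  # à perturber la réflexion
```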

ftfy can also decode HTML entities that appear outside of HTML, even in cases where the entity has been incorrectly capitalized:

    >>> # by the HTML 5 standard, only 'P&Eacute;REZ' is acceptable
    >>> ftfy.fix_text('P&EACUTE;REZ')
    'PÉREZ'
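
For comparison, Python's standard `html.unescape` follows the HTML 5 entity table and leaves `&EACUTE;` untouched. The sketch below adds a lenient fallback for all-caps names. It is only an approximation of the idea described above: `unescape_lenient` is a hypothetical helper, and ftfy's own entity handling may differ in detail.

```python
import html
import re

# Named entities only; numeric references are left to html.unescape.
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def unescape_lenient(text: str) -> str:
    """Decode named entities, retrying all-caps names as 'Capitalized'."""

    def replace(match):
        whole, name = match.group(0), match.group(1)
        decoded = html.unescape(whole)
        if decoded != whole:
            return decoded  # a standard entity
        if name.isupper():
            retry = "&" + name.capitalize() + ";"
            decoded = html.unescape(retry)
            if decoded != retry:
                return decoded  # e.g. &EACUTE; handled as &Eacute;
        return whole  # unknown entity: leave it alone

    return ENTITY_RE.sub(replace, text)


print(html.unescape("P&EACUTE;REZ"))    # P&EACUTE;REZ
print(unescape_lenient("P&EACUTE;REZ"))  # PÉREZ
```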
  
These fixes are not applied in all cases, because ftfy has a strongly-held goal of avoiding false positives -- it should never change correctly-decoded text to something else.

The following text could be encoded in Windows-1252 and decoded in UTF-8, and it would decode as 'MARQUɅ'. However, the original text is already sensible, so it is unchanged.

    >>> ftfy.fix_text('IL Y MARQUÉ…')
    'IL Y MARQUÉ…'
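
The alternative reading mentioned above can be reproduced with the standard codecs alone. This short sketch shows that "it decodes without an error" is not enough evidence of mojibake on its own, which is why a fixer needs a heuristic for whether the original text already looks sensible.

```python
text = "IL Y MARQU\u00c9\u2026"  # IL Y MARQUÉ…

# É and … are single bytes in Windows-1252 (0xC9, 0x85), and that byte
# pair happens to be valid UTF-8 for U+0245.
reinterpreted = text.encode("windows-1252").decode("utf-8")
print(reinterpreted)  # IL Y MARQUɅ
```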

## Installing

ftfy is a Python 3 package that can be installed using `pip`:

    pip install ftfy

(Or use `pip3 install ftfy` on systems where Python 2 and 3 are both globally
installed and `pip` refers to Python 2.)

### Local development

ftfy is developed using `poetry`. Its `setup.py` is vestigial and is not the
recommended way to install it.

[Install Poetry](https://python-poetry.org/docs/master/#installing-with-the-official-installer), check out this repository, and run `poetry install` to install ftfy for local development, such as experimenting with the heuristic or running tests.

## Who maintains ftfy?

I'm Robyn Speer, also known as Elia Robyn Lake. You can find me
[on GitHub](https://github.com/rspeer) or [Cohost](https://cohost.org/arborelia).

## Citing ftfy

ftfy has been used as a crucial data processing step in major NLP research.

It's important to give credit appropriately to everyone whose work you build on
in research. This includes software, not just high-status contributions such as
mathematical models. All I ask when you use ftfy for research is that you cite
it.

ftfy has a citable record [on Zenodo](https://zenodo.org/record/2591652).
A citation of ftfy may look like this:

    Robyn Speer. (2019). ftfy (Version 5.5). Zenodo.
    http://doi.org/10.5281/zenodo.2591652

In BibTeX format, the citation is:

    @misc{speer-2019-ftfy,
      author       = {Robyn Speer},
      title        = {ftfy},
      note         = {Version 5.5},
      year         = 2019,
      howpublished = {Zenodo},
      doi          = {10.5281/zenodo.2591652},
      url          = {https://doi.org/10.5281/zenodo.2591652}
    }

## Important license clarifications

If you do not follow ftfy's license, you do not have a license to ftfy.

This sounds obvious and tautological, but there are people who think open source licenses mean that they can just do what they want, especially in the field of generative AI. It's a permissive license but you still have to follow it. The [Apache license](https://www.apache.org/licenses/LICENSE-2.0) is the only thing that gives you permission to use and copy ftfy; otherwise, all rights are reserved.

If you use or distribute ftfy, you must follow the terms of the [Apache license](https://www.apache.org/licenses/LICENSE-2.0), including that you must attribute the author of ftfy (Robyn Speer) correctly.

You _may not_ make a derived work of ftfy that obscures its authorship, such as by putting its code in an AI training dataset, including the code in AI training at runtime, or using a generative AI that copies code from such a dataset.

At my discretion, I may notify you of a license violation, and give you a chance to either remedy it or delete all copies of ftfy in your possession.


            
