eyecite
==========

:Version: 2.6.4
:Home page: https://github.com/freelawproject/eyecite
:Summary: Tool for extracting legal citations from text strings.
:Upload time: 2024-06-03 18:48:56
:Maintainer: Free Law Project
:Author: Free Law Project
:Requires Python: <4.0,>=3.10
:License: BSD-2-Clause
:Keywords: legal, courts, citations, extraction, cites

eyecite is an open source tool for extracting legal citations from text. It is used, among other things, to process millions of legal documents in the collections of `CourtListener <https://www.courtlistener.com/>`_ and Harvard's `Caselaw Access Project <https://case.law/>`_, and has been developed in collaboration with both projects.

eyecite recognizes a wide variety of citations commonly appearing in American legal decisions, including:

* full case: ``Bush v. Gore, 531 U.S. 98, 99-100 (2000)``
* short case: ``531 U.S., at 99``
* statutory: ``Mass. Gen. Laws ch. 1, § 2``
* law journal: ``1 Minn. L. Rev. 1``
* supra: ``Bush, supra, at 100``
* id.: ``Id., at 101``

All contributors, corrections, and additions are welcome!

If you use eyecite for your research, please consider citing our paper::

    @article{eyecite,
        title = {eyecite: A Tool for Parsing Legal Citations},
        author = {Cushman, Jack and Dahl, Matthew and Lissner, Michael},
        year = {2021},
        journal = {Journal of Open Source Software},
        volume = {6},
        number = {66},
        pages = {3617},
        url = {https://doi.org/10.21105/joss.03617},
    }

Functionality
=============

eyecite offers four core functions:

* `Extraction <https://freelawproject.github.io/eyecite/find.html>`_: Recognize and extract citations from text, using a database that has been trained on over 55 million existing citations (see all of the citation patterns eyecite looks for in `reporters_db <https://github.com/freelawproject/reporters-db>`_).
* `Aggregation <https://freelawproject.github.io/eyecite/resolve.html>`_: Aggregate citations with common references (e.g., `supra` and `id.` citations) based on their logical antecedents.
* `Annotation <https://freelawproject.github.io/eyecite/annotate.html>`_: Annotate citation-laden text with custom markup surrounding each citation, using a fast diffing algorithm.
* `Cleaning <https://freelawproject.github.io/eyecite/clean.html>`_: Clean and pre-process text for easy use with eyecite.

Read on below for how to get started quickly or for a short tutorial in using eyecite.

Contributions & Support
=======================

Please see the issues list on GitHub for things we need, or start a conversation if you have questions or need support.

If you are fixing bugs or adding features, before you make your first contribution, we'll need a signed contributor license agreement. See the template in the root of the repo for how to get that taken care of.

API
===
The API documentation is located here:

https://freelawproject.github.io/eyecite/

It is autogenerated whenever we release a new version. Unfortunately, for now we do not support old versions of the API documentation, but they can be browsed in the gh-pages branch if needed.


Quickstart
==========

Install eyecite::

    pip install eyecite


Here's a short example of extracting citations and their metadata from text using eyecite's main :code:`get_citations()` function::

    from eyecite import get_citations

    text = """
        Mass. Gen. Laws ch. 1, § 2 (West 1999) (barring ...).
        Foo v. Bar, 1 U.S. 2, 3-4 (1999) (overruling ...).
        Id. at 3.
        Foo, supra, at 5.
    """

    get_citations(text)

    # returns:
    [
        FullLawCitation(
            'Mass. Gen. Laws ch. 1, § 2',
            groups={'reporter': 'Mass. Gen. Laws', 'chapter': '1', 'section': '2'},
            metadata=Metadata(parenthetical='barring ...', pin_cite=None, year='1999', publisher='West', ...)
        ),
        FullCaseCitation(
            '1 U.S. 2',
            groups={'volume': '1', 'reporter': 'U.S.', 'page': '2'},
            metadata=Metadata(parenthetical='overruling ...', pin_cite='3-4', year='1999', court='scotus', plaintiff='Foo', defendant='Bar,', ...)
        ),
        IdCitation(
            'Id.',
            metadata=Metadata(pin_cite='at 3')
        ),
        SupraCitation(
            'supra,',
            metadata=Metadata(antecedent_guess='Foo', pin_cite='at 5', ...)
        )
    ]

Tutorial
==========

For a more full-featured walkthrough of how to use all of eyecite's functionality,
please see the `tutorial <TUTORIAL.ipynb>`_.

Documentation
=============

eyecite's full API is documented `here <https://freelawproject.github.io/eyecite/>`_, but here are details regarding its four core functions, its tokenization logic, and its debugging tools.

Extracting Citations
--------------------

:code:`get_citations()`, the main executable function, takes three parameters.

1. :code:`plain_text` ==> str: The text to parse. Should be cleaned first.
2. :code:`remove_ambiguous` ==> bool, default :code:`False`: Whether to remove citations
   that might refer to more than one reporter and can't be narrowed down by date.
3. :code:`tokenizer` ==> Tokenizer, default :code:`eyecite.tokenizers.default_tokenizer`: An instance of a Tokenizer object (see "Tokenizers" below).


Cleaning Input Text
-------------------

For a given citation text such as "... 1 Baldwin's Rep. 1 ...", eyecite expects that the text
will be "clean" before being passed to :code:`get_citations`. This means:

* Spaces will be single space characters, not multiple spaces or other whitespace.
* Quotes and hyphens will be standard quote and hyphen characters.
* No junk such as HTML tags inside the citation.

You can use :code:`clean_text` to help with this:

::

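    from eyecite import clean_text, get_citations

    def my_func(text):
        # example custom cleaner: any function taking and returning a string
        return text.strip()

    source_text = '<p>foo   1  U.S.  1   </p>'
    plain_text = clean_text(source_text, ['html', 'inline_whitespace', my_func])
    found_citations = get_citations(plain_text)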

See the `Annotating Citations <#annotating-citations>`_ section for how to insert links into the original text using
citations extracted from the cleaned text.

:code:`clean_text` currently accepts these values as cleaners:

1. :code:`inline_whitespace`: replace all runs of tab and space characters with a single space character
2. :code:`all_whitespace`: replace all runs of any whitespace character with a single space character
3. :code:`underscores`: remove two or more underscores, a common error in text extracted from PDFs
4. :code:`html`: remove non-visible HTML content using the lxml library
5. Custom function: any function taking a string and returning a string.
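
As an illustration of the custom-function option, the sketch below defines a cleaner that replaces typographic quotes and dashes with plain ASCII characters, in line with the "standard quote and hyphen characters" expectation above. The name :code:`normalize_punctuation` and its character table are assumptions made for this example and are not part of eyecite:

```python
# Hypothetical custom cleaner: a plain str -> str function.
_PUNCTUATION_TABLE = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
})


def normalize_punctuation(text: str) -> str:
    """Return text with typographic quotes and dashes replaced by ASCII ones."""
    return text.translate(_PUNCTUATION_TABLE)
```

A function like this could then be passed alongside the built-in cleaners, for example :code:`clean_text(source_text, ['html', 'inline_whitespace', normalize_punctuation])`.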


Annotating Citations
--------------------

For simple plain text, you can insert links to citations using the :code:`annotate_citations` function:

::

    from eyecite import get_citations, annotate_citations

    plain_text = 'bob lissner v. test 1 U.S. 12, 347-348 (4th Cir. 1982)'
    citations = get_citations(plain_text)
    linked_text = annotate_citations(plain_text, [[c.span(), "<a>", "</a>"] for c in citations])

    # returns:
    'bob lissner v. test <a>1 U.S. 12</a>, 347-348 (4th Cir. 1982)'

Each citation returned by :code:`get_citations` keeps track of where it was found in the source text.
As a result, :code:`annotate_citations` must be called with the *same* cleaned text used by :code:`get_citations`
to extract citations. Otherwise, the offsets returned by the citation's :code:`span` method will
not align with the text, and your annotations will be in the wrong place.
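
To see why the offsets must match the text, consider this simplified sketch of offset-based insertion. It is not eyecite's implementation: :code:`insert_markup` is a hypothetical helper that accepts the same :code:`[span, before, after]` triples used above, assumes the spans do not overlap, and works from the end of the text backwards so that earlier offsets stay valid:

```python
from typing import List, Tuple

Annotation = Tuple[Tuple[int, int], str, str]  # ((start, end), before, after)


def insert_markup(text: str, annotations: List[Annotation]) -> str:
    """Wrap each (start, end) slice of text in its before/after markup."""
    # Apply the right-most annotation first so earlier offsets are unaffected.
    for (start, end), before, after in sorted(
        annotations, key=lambda a: a[0], reverse=True
    ):
        text = text[:start] + before + text[start:end] + after + text[end:]
    return text


plain_text = "bob lissner v. test 1 U.S. 12, 347-348 (4th Cir. 1982)"
start = plain_text.index("1 U.S. 12")
span = (start, start + len("1 U.S. 12"))
print(insert_markup(plain_text, [(span, "<a>", "</a>")]))
```

If the same offsets were applied to a string that had been cleaned differently, the slice :code:`text[start:end]` would cover different characters, which is the misalignment described above.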

If you want to clean text and then insert annotations into the original text, you can pass
the original text in as :code:`source_text`:

::

    from eyecite import get_citations, annotate_citations, clean_text

    source_text = '<p>bob lissner v. <i>test   1 U.S.</i> 12,   347-348 (4th Cir. 1982)</p>'
    plain_text = clean_text(source_text, ['html', 'inline_whitespace'])
    citations = get_citations(plain_text)
    linked_text = annotate_citations(plain_text, [[c.span(), "<a>", "</a>"] for c in citations], source_text=source_text)

    # returns:
    '<p>bob lissner v. <i>test   <a>1 U.S.</i> 12</a>,   347-348 (4th Cir. 1982)</p>'

The above example extracts citations from :code:`plain_text` and applies them to
:code:`source_text`, using a diffing algorithm to insert annotations in the correct locations
in the original text.
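
The general idea of mapping offsets from cleaned text back to the original can be sketched with the standard library's :code:`difflib`. This is only an illustration: :code:`difflib` stands in for whatever diffing eyecite performs internally, and :code:`map_offset` is a hypothetical helper:

```python
from difflib import SequenceMatcher


def map_offset(blocks, offset: int) -> int:
    """Translate an index in the cleaned text to an index in the source text."""
    for a, b, size in blocks:
        if a <= offset < a + size:
            return b + (offset - a)
    raise ValueError(f"offset {offset} has no counterpart in the source text")


source_text = "<p>foo   1  U.S.  1</p>"
plain_text = "foo 1 U.S. 1"

# Matching blocks are (index in plain_text, index in source_text, length).
matcher = SequenceMatcher(None, plain_text, source_text, autojunk=False)
blocks = matcher.get_matching_blocks()

# Span of the citation in the cleaned text (the end index is exclusive).
start, end = 4, 12
source_start = map_offset(blocks, start)
source_end = map_offset(blocks, end - 1) + 1
print(repr(source_text[source_start:source_end]))
```

The end offset is mapped through the last character of the span rather than the position after it, because the character following the citation in the cleaned text may have no match in the source.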

There is also a :code:`full_span` attribute that can be used to get the indexes of the full citation, including the
pre- and post-citation attributes.

Wrapping HTML Tags
^^^^^^^^^^^^^^^^^^

Note that the above example includes mismatched HTML tags: "<a>1 U.S.</i> 12</a>".
To specify handling for unbalanced tags, use the :code:`unbalanced_tags` parameter:

* :code:`unbalanced_tags="skip"`: annotations that would result in unbalanced tags will not be inserted.
* :code:`unbalanced_tags="wrap"`: unbalanced tags will be wrapped, resulting in :code:`<a>1 U.S.</a></i><a> 12</a>`

Important: :code:`unbalanced_tags="wrap"` uses a simple regular expression and will only work for HTML where
angle brackets are properly escaped, such as the HTML emitted by :code:`lxml.html.tostring`. It is intended for
regularly formatted documents such as case text published by courts. It may have
unpredictable results for deliberately-constructed challenging inputs such as citations containing partial HTML
comments or :code:`<pre>` tags.
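
To make "unbalanced" concrete, the sketch below checks whether the tags inside a fragment open and close in matching order, using a simple regular expression and a stack. It is not the code eyecite uses: :code:`is_balanced` and :code:`TAG_RE` are hypothetical names, and like the real option the sketch assumes well-formed tags with escaped angle brackets:

```python
import re

# Matches opening, closing, and self-closing tags such as <i>, </i>, <br/>.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>")


def is_balanced(fragment: str) -> bool:
    """Return True if every tag in fragment is opened and closed in order."""
    stack = []
    for match in TAG_RE.finditer(fragment):
        closing, name, self_closing = match.groups()
        if self_closing:
            continue  # <br/> neither opens nor closes anything
        if closing:
            if not stack or stack.pop() != name.lower():
                return False
        else:
            stack.append(name.lower())
    return not stack


print(is_balanced("1 U.S.</i> 12"))      # text wrapped by <a>...</a> above
print(is_balanced("<i>1 U.S.</i> 12"))
```

The first fragment is the text enclosed by :code:`<a>` and :code:`</a>` in the example above; it contains a closing :code:`</i>` with no opening counterpart, so it is reported as unbalanced.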

Customizing Annotation
^^^^^^^^^^^^^^^^^^^^^^

If inserting text before and after isn't sufficient, supply a callable under the :code:`annotator` parameter
that takes :code:`(before, span_text, after)` and returns the annotated text:

::

    def annotator(before, span_text, after):
        return before + span_text.lower() + after
    linked_text = annotate_citations(plain_text, [[c.span(), "<a>", "</a>"] for c in citations], annotator=annotator)

    # returns:
    'bob lissner v. test <a>1 u.s. 12</a>, 347-348 (4th Cir. 1982)'

Resolving Citations
-------------------

Once you have extracted citations from a document, you may wish to resolve them to their common references.
To do so, just pass the results of :code:`get_citations()` into :code:`resolve_citations()`. This function will
do its best to resolve each "full," "short form," "supra," and "id" citation to a common :code:`Resource` object,
returning a dictionary that maps resources to lists of associated citations:

::

    from eyecite import get_citations, resolve_citations

    text = 'first citation: 1 U.S. 12. second citation: 2 F.3d 2. third citation: Id.'
    found_citations = get_citations(text)
    resolved_citations = resolve_citations(found_citations)

    # returns (pseudo):
    {
        <Resource object>: [FullCaseCitation('1 U.S. 12')],
        <Resource object>: [FullCaseCitation('2 F.3d 2'), IdCitation('Id.')]
    }
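
The grouping above can be pictured with a deliberately simplified model in which every full citation opens a group and each :code:`Id.` joins the group of the citation immediately before it. The tuple-based citations and the :code:`toy_resolve` function are assumptions made for illustration; eyecite's real resolution logic handles more citation types and more cases:

```python
from typing import Dict, List, Optional, Tuple

Citation = Tuple[str, str]  # (kind, text), e.g. ("full", "1 U.S. 12")


def toy_resolve(citations: List[Citation]) -> Dict[str, List[Citation]]:
    """Group citations, attaching each "id" cite to the preceding full cite."""
    groups: Dict[str, List[Citation]] = {}
    last_key: Optional[str] = None
    for kind, text in citations:
        if kind == "full":
            last_key = text
            groups.setdefault(last_key, []).append((kind, text))
        elif kind == "id" and last_key is not None:
            groups[last_key].append((kind, text))
    return groups


found = [("full", "1 U.S. 12"), ("full", "2 F.3d 2"), ("id", "Id.")]
print(toy_resolve(found))
```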

Importantly, eyecite performs these resolutions using only its immanent knowledge about each citation's
textual representation. If you want to perform more sophisticated resolution (e.g., by augmenting each
citation with information from a third-party API), simply pass custom :code:`resolve_id_citation()`,
:code:`resolve_supra_citation()`, :code:`resolve_shortcase_citation()`, and :code:`resolve_full_citation()`
functions to :code:`resolve_citations()` as keyword arguments. You can also configure those functions to
return a more complex resource object (such as a Django model), so long as that object inherits the
:code:`eyecite.models.ResourceType` type (which simply requires hashability). For example, you might implement
a custom full citation resolution function as follows, using the default resolution logic as a fallback:

::

    from eyecite.resolve import resolve_full_citation

    def my_resolve(full_cite):
        # special handling for resolution of known cases in our database
        # (MyOpinion stands in for your own model and lookup logic)
        resource = MyOpinion.objects.get(full_cite)
        if resource:
            return resource
        # allow normal clustering of other citations
        return resolve_full_citation(full_cite)

    resolve_citations(citations, resolve_full_citation=my_resolve)

    # returns (pseudo):
    {
        <MyOpinion object>: [<full_cite>, <short_cite>, <id_cite>],
        <Resource object>: [<full cite>, <short cite>],
    }
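
Since resolved resources are used as dictionary keys, a custom resource mainly needs to be hashable. One way to get that in plain Python is a frozen dataclass. :code:`OpinionResource` and its fields are hypothetical names for this sketch:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpinionResource:
    """Hashable stand-in for a record in your own database."""

    volume: str
    reporter: str
    page: str


first = OpinionResource("1", "U.S.", "2")
again = OpinionResource("1", "U.S.", "2")

# Equal field values hash alike, so both keys land in the same cluster.
clusters = {first: ["Foo v. Bar, 1 U.S. 2"]}
clusters.setdefault(again, []).append("Id. at 3")
print(clusters)
```

Because two instances with the same field values compare and hash equal, citations resolved to the same opinion end up under a single key.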

Tokenizers
----------

Internally, eyecite works by applying a list of regular expressions to the source text to convert it to a list
of tokens:

::

    In [1]: from eyecite.tokenizers import default_tokenizer

    In [2]: list(default_tokenizer.tokenize("Foo v. Bar, 123 U.S. 456 (2016). Id. at 457."))
    Out[2]:
    ['Foo',
     StopWordToken(data='v.', ...),
     'Bar,',
     CitationToken(data='123 U.S. 456', volume='123', reporter='U.S.', page='456', ...),
     '(2016).',
     IdToken(data='Id.', ...),
     'at',
     '457.']

Tokens are then scanned to determine values like the citation year or case name for citation resolution.
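
The regex-to-token step can be imitated in a few lines of standard-library Python. The sketch below is a toy model, not eyecite's tokenizer: the single pattern, :code:`ToyCitationToken`, and :code:`toy_tokenize` are all invented for illustration, and everything that is not a citation is simply split on whitespace:

```python
import re
from typing import List, NamedTuple, Union


class ToyCitationToken(NamedTuple):
    data: str
    volume: str
    reporter: str
    page: str


# One illustrative pattern; the real tokenizer applies many.
TOY_CITATION_RE = re.compile(
    r"(?P<volume>\d+) (?P<reporter>U\.S\.|F\.3d) (?P<page>\d+)"
)


def toy_tokenize(text: str) -> List[Union[str, ToyCitationToken]]:
    """Turn text into plain-word strings and citation tokens, in order."""
    tokens: List[Union[str, ToyCitationToken]] = []
    pos = 0
    for m in TOY_CITATION_RE.finditer(text):
        tokens.extend(text[pos:m.start()].split())  # words before the match
        tokens.append(
            ToyCitationToken(m.group(0), m["volume"], m["reporter"], m["page"])
        )
        pos = m.end()
    tokens.extend(text[pos:].split())  # words after the last match
    return tokens


print(toy_tokenize("Foo v. Bar, 123 U.S. 456 (2016)."))
```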

Alternate tokenizers can be substituted by providing a tokenizer instance to :code:`get_citations()`:

::

    from eyecite.tokenizers import HyperscanTokenizer
    hyperscan_tokenizer = HyperscanTokenizer(cache_dir='.hyperscan')
    cites = get_citations(text, tokenizer=hyperscan_tokenizer)

test_FindTest.py includes a simplified example of using a custom tokenizer that uses modified
regular expressions to extract citations with OCR errors.

eyecite ships with two tokenizers:

AhocorasickTokenizer (default)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The default tokenizer uses the pyahocorasick library to filter down eyecite's list of
extractor regexes. It then performs extraction using the builtin :code:`re` library.
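
The filter-then-extract idea can be sketched with the standard library alone. In the sketch below a plain substring test stands in for the Aho-Corasick automaton, and the two patterns and the :code:`extract` helper are invented for illustration:

```python
import re
from typing import List

# Each extractor pairs a literal that must be present with a full regex.
EXTRACTORS = {
    "U.S.": re.compile(r"\d+ U\.S\. \d+"),
    "F.3d": re.compile(r"\d+ F\.3d \d+"),
}


def extract(text: str) -> List[str]:
    """Run only the regexes whose literal string occurs in the text."""
    # Cheap pre-filter: skip any regex whose literal never appears.
    candidates = [
        regex for literal, regex in EXTRACTORS.items() if literal in text
    ]
    matches: List[str] = []
    for regex in candidates:
        matches.extend(m.group(0) for m in regex.finditer(text))
    return matches


print(extract("See 1 U.S. 12 and 531 U.S. 98."))
```

The benefit of the pre-filter grows with the number of patterns: most reporters never appear in a given document, so most regexes never need to run.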

HyperscanTokenizer
^^^^^^^^^^^^^^^^^^

The alternate HyperscanTokenizer compiles all extraction regexes into a hyperscan database
so they can be extracted in a single pass. This is far faster than the default tokenizer
(exactly how much faster depends on how many citation formats are included in the target text),
but requires the optional :code:`hyperscan` dependency that has limited platform support.
See the "Installation" section for hyperscan installation instructions and limitations.

Compiling the hyperscan database takes several seconds, so short-running scripts may want to
provide a cache directory where the database can be stored. The directory should be writeable
only by the user:

::

    hyperscan_tokenizer = HyperscanTokenizer(cache_dir='.hyperscan')
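
One way to prepare such a directory from Python is sketched below for POSIX systems. The helper name :code:`make_private_dir` is an assumption made for this example; only the :code:`cache_dir` argument comes from the snippet above:

```python
import os
import stat


def make_private_dir(path: str) -> str:
    """Create path if needed and restrict it to its owner (mode 0o700)."""
    os.makedirs(path, mode=0o700, exist_ok=True)
    # makedirs leaves an existing directory's mode unchanged, so set it here.
    os.chmod(path, stat.S_IRWXU)
    return path
```

The returned path can then be passed along, for example :code:`HyperscanTokenizer(cache_dir=make_private_dir('.hyperscan'))`.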


Debugging
---------

If you want to see what metadata eyecite is able to extract for each citation, you can use :code:`dump_citations`.
This is primarily useful for developing eyecite, but may also be useful for exploring what data is available to you::

    In [1]: from eyecite import dump_citations, get_citations

    In [2]: text="Mass. Gen. Laws ch. 1, § 2. Foo v. Bar, 1 U.S. 2, 3-4 (1999). Id. at 3. Foo, supra, at 5."

    In [3]: cites=get_citations(text)

    In [4]: print(dump_citations(cites, text))
    FullLawCitation: Mass. Gen. Laws ch. 1, § 2. Foo v. Bar, 1 U.S. 2, 3-4 (1
      * groups
        * reporter='Mass. Gen. Laws'
        * chapter='1'
        * section='2'
    FullCaseCitation: Laws ch. 1, § 2. Foo v. Bar, 1 U.S. 2, 3-4 (1999). Id. at 3. Foo, s
      * groups
        * volume='1'
        * reporter='U.S.'
        * page='2'
      * metadata
        * pin_cite='3-4'
        * year='1999'
        * court='scotus'
        * plaintiff='Foo'
        * defendant='Bar,'
      * year=1999
    IdCitation: v. Bar, 1 U.S. 2, 3-4 (1999). Id. at 3. Foo, supra, at 5.
      * metadata
        * pin_cite='at 3'
    SupraCitation: 2, 3-4 (1999). Id. at 3. Foo, supra, at 5.
      * metadata
        * antecedent_guess='Foo'
        * pin_cite='at 5'

In a real terminal, the :code:`span()` of each extracted citation will be highlighted.
You can use the :code:`context_chars=30` parameter to control how much text is shown before and after.


Installation
============
Installing eyecite is easy.

::

    poetry add eyecite


Or via pip::

    pip install eyecite


Or install the latest dev version from GitHub::

    pip install https://github.com/freelawproject/eyecite/archive/main.zip#egg=eyecite

Hyperscan installation
----------------------

To use :code:`HyperscanTokenizer` you must additionally install the Python `hyperscan <https://pypi.org/project/hyperscan/>`_
library and its dependencies. **python-hyperscan officially supports only x86 Linux,** though other configurations may be
possible.

Hyperscan installation example on x86 Ubuntu 20.04:

::

    apt install libhyperscan-dev
    pip install hyperscan

Hyperscan installation example on x86 Debian Buster:

::

    echo 'deb http://deb.debian.org/debian buster-backports main' > /etc/apt/sources.list.d/backports.list
    apt install -t buster-backports libhyperscan-dev
    pip install hyperscan

Hyperscan installation example with Homebrew on x86 macOS:

::

    brew install hyperscan
    pip install hyperscan


Deployment
==========

1. Update CHANGES.md.

2. Update version info in :code:`pyproject.toml` by running :code:`poetry version [major, minor, patch]`.

For an automated deployment, tag the commit with vx.y.z, and push it to main.
An automated deploy and documentation update will be completed for you.

For a manual deployment, run:

::

    poetry publish --build

You will probably also need to push new documentation files to the gh-pages branch.

Testing
=======
eyecite comes with a robust test suite of different citation strings that it is equipped to handle. Run these tests as follows:

::

    python3 -m unittest discover -s tests -p 'test_*.py'

If you would like to create mock citation objects to assist you in writing your own local tests, import and use the following functions for convenience:

::

    from eyecite.test_factories import (
        case_citation,
        id_citation,
        supra_citation,
        unknown_citation,
    )


Development
===========
When a pull request proposes changes to eyecite, a GitHub workflow
triggers automatically. The workflow, :code:`benchmark.yml`, tests
changes in accuracy and speed against the current main branch.

The results are committed to an artifacts branch, and a continually updated
comment on the PR reports the output.


License
=======
This repository is available under the permissive BSD license, making it easy and safe to incorporate in your own libraries.

Pull and feature requests welcome. Online editing in GitHub is possible (and easy!).

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/freelawproject/eyecite",
    "name": "eyecite",
    "maintainer": "Free Law Project",
    "docs_url": null,
    "requires_python": "<4.0,>=3.10",
    "maintainer_email": "info@free.law",
    "keywords": "legal, courts, citations, extraction, cites",
    "author": "Free Law Project",
    "author_email": "info@free.law",
    "download_url": "https://files.pythonhosted.org/packages/bc/09/7c6dd9420c5e073aade45d0106193b77352142b5205bc548bbc885576c1c/eyecite-2.6.4.tar.gz",
    "platform": null,
    "description": "eyecite\n==========\n\neyecite is an open source tool for extracting legal citations from text. It is used, among other things, to process millions of legal documents in the collections of `CourtListener <https://www.courtlistener.com/>`_ and Harvard's `Caselaw Access Project <https://case.law/>`_, and has been developed in collaboration with both projects.\n\neyecite recognizes a wide variety of citations commonly appearing in American legal decisions, including:\n\n* full case: ``Bush v. Gore, 531 U.S. 98, 99-100 (2000)``\n* short case: ``531 U.S., at 99``\n* statutory: ``Mass. Gen. Laws ch. 1, \u00a7 2``\n* law journal: ``1 Minn. L. Rev. 1``\n* supra: ``Bush, supra, at 100``\n* id.: ``Id., at 101``\n\nAll contributors, corrections, and additions are welcome!\n\nIf you use eyecite for your research, please consider citing our paper::\n\n    @article{eyecite,\n        title = {eyecite: A Tool for Parsing Legal Citations},\n        author = {Cushman, Jack and Dahl, Matthew and Lissner, Michael},\n        year = {2021},\n        journal = {Journal of Open Source Software},\n        volume = {6},\n        number = {66},\n        pages = {3617},\n        url = {https://doi.org/10.21105/joss.03617},\n    }\n\nFunctionality\n=============\n\neyecite offers four core functions:\n\n* `Extraction <https://freelawproject.github.io/eyecite/find.html>`_: Recognize and extract citations from text, using a database that has been trained on over 55 million existing citations (see all of the citation patterns eyecite looks for over in `reporters_db <https://github.com/freelawproject/reporters-db>`_).\n* `Aggregation <https://freelawproject.github.io/eyecite/resolve.html>`_: Aggregate citations with common references (e.g., `supra` and `id.` citations) based on their logical antecedents.\n* `Annotation <https://freelawproject.github.io/eyecite/annotate.html>`_: Annotate citation-laden text with custom markup surrounding each citation, using a fast diffing 
algorithm.\n* `Cleaning <https://freelawproject.github.io/eyecite/clean.html>`_: Clean and pre-process text for easy use with eyecite.\n\nRead on below for how to get started quickly or for a short tutorial in using eyecite.\n\nContributions & Support\n=======================\n\nPlease see the issues list on GitHub for things we need, or start a conversation if you have questions or need support.\n\nIf you are fixing bugs or adding features, before you make your first contribution, we'll need a signed contributor license agreement. See the template in the root of the repo for how to get that taken care of.\n\nAPI\n===\nThe API documentation is located here:\n\nhttps://freelawproject.github.io/eyecite/\n\nIt is autogenerated whenever we release a new version. Unfortunately, for now we do not support old versions of the API documentation, but it can be browsed in the gh-pages branch if needed.\n\n\nQuickstart\n==========\n\nInstall eyecite::\n\n    pip install eyecite\n\n\nHere's a short example of extracting citations and their metadata from text using eyecite's main :code:`get_citations()` function::\n\n    from eyecite import get_citations\n\n    text = \"\"\"\n        Mass. Gen. Laws ch. 1, \u00a7 2 (West 1999) (barring ...).\n        Foo v. Bar, 1 U.S. 2, 3-4 (1999) (overruling ...).\n        Id. at 3.\n        Foo, supra, at 5.\n    \"\"\"\n\n    get_citations(text)\n\n    # returns:\n    [\n        FullLawCitation(\n            'Mass. Gen. Laws ch. 1, \u00a7 2',\n            groups={'reporter': 'Mass. Gen. Laws', 'chapter': '1', 'section': '2'},\n            metadata=Metadata(parenthetical='barring ...', pin_cite=None, year='1999', publisher='West', ...)\n        ),\n        FullCaseCitation(\n            '1 U.S. 
2',\n            groups={'volume': '1', 'reporter': 'U.S.', 'page': '2'},\n            metadata=Metadata(parenthetical='overruling ...', pin_cite='3-4', year='1999', court='scotus', plaintiff='Foo', defendant='Bar,', ...)\n        ),\n        IdCitation(\n            'Id.',\n            metadata=Metadata(pin_cite='at 3')\n        ),\n        SupraCitation(\n            'supra,',\n            metadata=Metadata(antecedent_guess='Foo', pin_cite='at 5', ...)\n        )\n    ]\n\nTutorial\n==========\n\nFor a more full-featured walkthrough of how to use all of eyecite's functionality,\nplease see the `tutorial <TUTORIAL.ipynb>`_.\n\nDocumentation\n=============\n\neyecite's full API is documented `here <https://freelawproject.github.io/eyecite/>`_, but here are details regarding its four core functions, its tokenization logic, and its debugging tools.\n\nExtracting Citations\n--------------------\n\n:code:`get_citations()`, the main executable function, takes three parameters.\n\n1. :code:`plain_text` ==> str: The text to parse. Should be cleaned first.\n2. :code:`remove_ambiguous` ==> bool, default :code:`False`: Whether to remove citations\n   that might refer to more than one reporter and can't be narrowed down by date.\n3. :code:`tokenizer` ==> Tokenizer, default :code:`eyecite.tokenizers.default_tokenizer`: An instance of a Tokenizer object (see \"Tokenizers\" below).\n\n\nCleaning Input Text\n-------------------\n\nFor a given citation text such as \"... 1 Baldwin's Rep. 1 ...\", eyecite expects that the text\nwill be \"clean\" before being passed to :code:`get_citation`. This means:\n\n* Spaces will be single space characters, not multiple spaces or other whitespace.\n* Quotes and hyphens will be standard quote and hyphen characters.\n* No junk such as HTML tags inside the citation.\n\nYou can use :code:`clean_text` to help with this:\n\n::\n\n    from eyecite import clean_text, get_citations\n\n    source_text = '<p>foo   1  U.S.  
1   </p>'\n    plain_text = clean_text(text, ['html', 'inline_whitespace', my_func])\n    found_citations = get_citations(plain_text)\n\nSee the `Annotating Citations <#annotating-citations>`_ section for how to insert links into the original text using\ncitations extracted from the cleaned text.\n\n:code:`clean_text` currently accepts these values as cleaners:\n\n1. :code:`inline_whitespace`: replace all runs of tab and space characters with a single space character\n2. :code:`all_whitespace`: replace all runs of any whitespace character with a single space character\n3. :code:`underscores`: remove two or more underscores, a common error in text extracted from PDFs\n4. :code:`html`: remove non-visible HTML content using the lxml library\n5. Custom function: any function taking a string and returning a string.\n\n\nAnnotating Citations\n--------------------\n\nFor simple plain text, you can insert links to citations using the :code:`annotate_citations` function:\n\n::\n\n    from eyecite import get_citations, annotate_citations\n\n    plain_text = 'bob lissner v. test 1 U.S. 12, 347-348 (4th Cir. 1982)'\n    citations = get_citations(plain_text)\n    linked_text = annotate_citations(plain_text, [[c.span(), \"<a>\", \"</a>\"] for c in citations])\n\n    returns:\n    'bob lissner v. test <a>1 U.S. 12</a>, 347-348 (4th Cir. 1982)'\n\nEach citation returned by get_citations keeps track of where it was found in the source text.\nAs a result, :code:`annotate_citations` must be called with the *same* cleaned text used by :code:`get_citations`\nto extract citations. If you do not, the offsets returned by the citation's :code:`span` method will\nnot align with the text, and your annotations will be in the wrong place.\n\nIf you want to clean text and then insert annotations into the original text, you can pass\nthe original text in as :code:`source_text`:\n\n::\n\n    from eyecite import get_citations, annotate_citations, clean_text\n\n    source_text = '<p>bob lissner v. 
<i>test   1 U.S.</i> 12,   347-348 (4th Cir. 1982)</p>'\n    plain_text = clean_text(source_text, ['html', 'inline_whitespace'])\n    citations = get_citations(plain_text)\n    linked_text = annotate_citations(plain_text, [[c.span(), \"<a>\", \"</a>\"] for c in citations], source_text=source_text)\n\n    returns:\n    '<p>bob lissner v. <i>test   <a>1 U.S.</i> 12</a>,   347-348 (4th Cir. 1982)</p>'\n\nThe above example extracts citations from :code:`plain_text` and applies them to\n:code:`source_text`, using a diffing algorithm to insert annotations in the correct locations\nin the original text.\n\nThere is also a :code:`full_span` attribute that can be used to get the indexes of the full citation, including the\npre- and post-citation attributes.\n\nWrapping HTML Tags\n^^^^^^^^^^^^^^^^^^\n\nNote that the above example includes mismatched HTML tags: \"<a>1 U.S.</i> 12</a>\".\nTo specify handling for unbalanced tags, use the :code:`unbalanced_tags` parameter:\n\n* :code:`unbalanced_tags=\"skip\"`: annotations that would result in unbalanced tags will not be inserted.\n* :code:`unbalanced_tags=\"wrap\"`: unbalanced tags will be wrapped, resulting in :code:`<a>1 U.S.</a></i><a> 12</a>`\n\nImportant: :code:`unbalanced_tags=\"wrap\"` uses a simple regular expression and will only work for HTML where\nangle brackets are properly escaped, such as the HTML emitted by :code:`lxml.html.tostring`. It is intended for\nregularly formatted documents such as case text published by courts. 
It may have\nunpredictable results for deliberately-constructed challenging inputs such as citations containing partial HTML\ncomments or :code:`<pre>` tags.\n\nCustomizing Annotation\n^^^^^^^^^^^^^^^^^^^^^^\n\nIf inserting text before and after isn't sufficient, supply a callable under the :code:`annotator` parameter\nthat takes :code:`(before, span_text, after)` and returns the annotated text:\n\n::\n\n    def annotator(before, span_text, after):\n        return before + span_text.lower() + after\n    linked_text = annotate_citations(plain_text, [[c.span(), \"<a>\", \"</a>\"] for c in citations], annotator=annotator)\n\n    returns:\n    'bob lissner v. test <a>1 u.s. 12</a>, 347-348 (4th Cir. 1982)'\n\nResolving Citations\n-------------------\n\nOnce you have extracted citations from a document, you may wish to resolve them to their common references.\nTo do so, just pass the results of :code:`get_citations()` into :code:`resolve_citations()`. This function will\ndo its best to resolve each \"full,\" \"short form,\" \"supra,\" and \"id\" citation to a common :code:`Resource` object,\nreturning a dictionary that maps resources to lists of associated citations:\n\n::\n\n    from eyecite import get_citations, resolve_citations\n\n    text = 'first citation: 1 U.S. 12. second citation: 2 F.3d 2. third citation: Id.'\n    found_citations = get_citations(text)\n    resolved_citations = resolve_citations(found_citations)\n\n    returns (pseudo):\n    {\n        <Resource object>: [FullCaseCitation('1 U.S. 12')],\n        <Resource object>: [FullCaseCitation('2 F.3d 2'), IdCitation('Id.')]\n    }\n\nImportantly, eyecite performs these resolutions using only its immanent knowledge about each citation's\ntextual representation. 
If you want to perform more sophisticated resolution (e.g., by augmenting each\ncitation with information from a third-party API), simply pass custom :code:`resolve_id_citation()`,\n:code:`resolve_supra_citation()`, :code:`resolve_shortcase_citation()`, and :code:`resolve_full_citation()`\nfunctions to :code:`resolve_citations()` as keyword arguments. You can also configure those functions to\nreturn a more complex resource object (such as a Django model), so long as that object inherits the\n:code:`eyecite.models.ResourceType` type (which simply requires hashability). For example, you might implement\na custom full citation resolution function as follows, using the default resolution logic as a fallback:\n\n::\n\n    def my_resolve(full_cite):\n        # special handling for resolution of known cases in our database\n        resource = MyOpinion.objects.get(full_cite)\n        if resource:\n            return resource\n        # allow normal clustering of other citations\n        return resolve_full_citation(full_cite)\n\n    resolve_citations(citations, resolve_full_citation=my_resolve)\n\n    returns (pseudo):\n    {\n        <MyOpinion object>: [<full_cite>, <short_cite>, <id_cite>],\n        <Resource object>: [<full cite>, <short cite>],\n    }\n\nTokenizers\n----------\n\nInternally, eyecite works by applying a list of regular expressions to the source text to convert it to a list\nof tokens:\n\n::\n\n    In [1]: from eyecite.tokenizers import default_tokenizer\n\n    In [2]: list(default_tokenizer.tokenize(\"Foo v. Bar, 123 U.S. 456 (2016). Id. at 457.\"))\n    Out[2]:\n    ['Foo',\n     StopWordToken(data='v.', ...),\n     'Bar,',\n     CitationToken(data='123 U.S. 
456', volume='123', reporter='U.S.', page='456', ...),\n     '(2016).',\n     IdToken(data='Id.', ...),\n     'at',\n     '457.']\n\nTokens are then scanned to determine values like the citation year or case name for citation resolution.\n\nAlternate tokenizers can be substituted by providing a tokenizer instance to :code:`get_citations()`:\n\n::\n\n    from eyecite.tokenizers import HyperscanTokenizer\n    hyperscan_tokenizer = HyperscanTokenizer(cache_dir='.hyperscan')\n    cites = get_citations(text, tokenizer=hyperscan_tokenizer)\n\ntest_FindTest.py includes a simplified example of using a custom tokenizer that uses modified\nregular expressions to extract citations with OCR errors.\n\neyecite ships with two tokenizers:\n\nAhocorasickTokenizer (default)\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\nThe default tokenizer uses the pyahocorasick library to filter down eyecite's list of\nextractor regexes. It then performs extraction using the builtin :code:`re` library.\n\nHyperscanTokenizer\n^^^^^^^^^^^^^^^^^^\n\nThe alternate HyperscanTokenizer compiles all extraction regexes into a hyperscan database\nso they can be extracted in a single pass. This is far faster than the default tokenizer\n(exactly how much faster depends on how many citation formats are included in the target text),\nbut requires the optional :code:`hyperscan` dependency that has limited platform support.\nSee the \"Installation\" section for hyperscan installation instructions and limitations.\n\nCompiling the hyperscan database takes several seconds, so short-running scripts may want to\nprovide a cache directory where the database can be stored. 
The directory should be writeable\nonly by the user:\n\n::\n\n    hyperscan_tokenizer = HyperscanTokenizer(cache_dir='.hyperscan')\n\n\nDebugging\n---------\n\nIf you want to see what metadata eyecite is able to extract for each citation, you can use :code:`dump_citations`.\nThis is primarily useful for developing eyecite, but may also be useful for exploring what data is available to you::\n\n    In [1]: from eyecite import dump_citations, get_citations\n\n    In [2]: text=\"Mass. Gen. Laws ch. 1, \u00a7 2. Foo v. Bar, 1 U.S. 2, 3-4 (1999). Id. at 3. Foo, supra, at 5.\"\n\n    In [3]: cites=get_citations(text)\n\n    In [4]: print(dump_citations(get_citations(text), text))\n    FullLawCitation: Mass. Gen. Laws ch. 1, \u00a7 2. Foo v. Bar, 1 U.S. 2, 3-4 (1\n      * groups\n        * reporter='Mass. Gen. Laws'\n        * chapter='1'\n        * section='2'\n    FullCaseCitation: Laws ch. 1, \u00a7 2. Foo v. Bar, 1 U.S. 2, 3-4 (1999). Id. at 3. Foo, s\n      * groups\n        * volume='1'\n        * reporter='U.S.'\n        * page='2'\n      * metadata\n        * pin_cite='3-4'\n        * year='1999'\n        * court='scotus'\n        * plaintiff='Foo'\n        * defendant='Bar,'\n      * year=1999\n    IdCitation: v. Bar, 1 U.S. 2, 3-4 (1999). Id. at 3. Foo, supra, at 5.\n      * metadata\n        * pin_cite='at 3'\n    SupraCitation: 2, 3-4 (1999). Id. at 3. 
Foo, supra, at 5.\n      * metadata\n        * antecedent_guess='Foo'\n        * pin_cite='at 5'\n\nIn a real terminal, the :code:`span()` of each extracted citation will be highlighted.\nYou can use the :code:`context_chars=30` parameter to control how much text is shown before and after.\n\n\nInstallation\n============\nInstalling eyecite is easy.\n\n::\n\n    poetry add eyecite\n\n\nOr via pip::\n\n    pip install eyecite\n\n\nOr install the latest dev version from GitHub::\n\n    pip install https://github.com/freelawproject/eyecite/archive/main.zip#egg=eyecite\n\nHyperscan installation\n----------------------\n\nTo use :code:`HyperscanTokenizer` you must additionally install the Python `hyperscan <https://pypi.org/project/hyperscan/>`_\nlibrary and its dependencies. **python-hyperscan officially supports only x86 Linux,** though other configurations may be\npossible.\n\nHyperscan installation example on x86 Ubuntu 20.04:\n\n::\n\n    apt install libhyperscan-dev\n    pip install hyperscan\n\nHyperscan installation example on x86 Debian Buster:\n\n::\n\n    echo 'deb http://deb.debian.org/debian buster-backports main' > /etc/apt/sources.list.d/backports.list\n    apt install -t buster-backports libhyperscan-dev\n    pip install hyperscan\n\nHyperscan installation example with Homebrew on x86 macOS:\n\n::\n\n    brew install hyperscan\n    pip install hyperscan\n\n\nDeployment\n==========\n\n1. Update CHANGES.md.\n\n2. Update version info in :code:`pyproject.toml` by running :code:`poetry version [major, minor, patch]`.\n\nFor an automated deployment, tag the commit with vx.y.z, and push it to main.\nAn automated deploy and documentation update will be completed for you.\n\nFor a manual deployment, run:\n\n::\n\n    poetry publish --build\n\nYou will probably also need to push new documentation files to the gh-pages branch.\n\nTesting\n=======\neyecite comes with a robust test suite of different citation strings that it is equipped to handle. 
Run these tests as follows:\n\n::\n\n    python3 -m unittest discover -s tests -p 'test_*.py'\n\nIf you would like to create mock citation objects to assist you in writing your own local tests, import and use the following functions for convenience:\n\n::\n\n    from eyecite.test_factories import (\n        case_citation,\n        id_citation,\n        supra_citation,\n        unknown_citation,\n    )\n\n\nDevelopment\n===========\nWhen a pull request with changes to eyecite is opened, a GitHub\nworkflow is triggered automatically. The workflow, benchmark.yml, measures\nchanges in accuracy and speed relative to the current main branch.\n\nThe results are committed to an artifacts branch, and a comment\non the PR is kept up to date with the output.\n\n\nLicense\n=======\nThis repository is available under the permissive BSD license, making it easy and safe to incorporate in your own libraries.\n\nPull and feature requests welcome. Online editing in GitHub is possible (and easy!).\n",
    "bugtrack_url": null,
    "license": "BSD-2-Clause",
    "summary": "Tool for extracting legal citations from text strings.",
    "version": "2.6.4",
    "project_urls": {
        "Homepage": "https://github.com/freelawproject/eyecite",
        "Organisation Homepage": "https://free.law/",
        "Repository": "https://github.com/freelawproject/eyecite"
    },
    "split_keywords": [
        "legal",
        " courts",
        " citations",
        " extraction",
        " cites"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "fbfd3e36b43a50237b9c48a6dd86978b0d5b32851ef253ff3e07e357275277a7",
                "md5": "32e1ddc7beb6f48b294573d1c57a3df1",
                "sha256": "da6a100ca6c6fd05b9a6714fdcdaec8d2e5aa27fff550c9e6c41f75009bea81f"
            },
            "downloads": -1,
            "filename": "eyecite-2.6.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "32e1ddc7beb6f48b294573d1c57a3df1",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.10",
            "size": 41088,
            "upload_time": "2024-06-03T18:48:54",
            "upload_time_iso_8601": "2024-06-03T18:48:54.554426Z",
            "url": "https://files.pythonhosted.org/packages/fb/fd/3e36b43a50237b9c48a6dd86978b0d5b32851ef253ff3e07e357275277a7/eyecite-2.6.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "bc097c6dd9420c5e073aade45d0106193b77352142b5205bc548bbc885576c1c",
                "md5": "ff7c1760304e4d9e12019609a3be09fd",
                "sha256": "e3a7d8d7816ee58f2966b2c571df3f97dd19746c5ea5b951b30d4d82cabd8508"
            },
            "downloads": -1,
            "filename": "eyecite-2.6.4.tar.gz",
            "has_sig": false,
            "md5_digest": "ff7c1760304e4d9e12019609a3be09fd",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.10",
            "size": 42367,
            "upload_time": "2024-06-03T18:48:56",
            "upload_time_iso_8601": "2024-06-03T18:48:56.306661Z",
            "url": "https://files.pythonhosted.org/packages/bc/09/7c6dd9420c5e073aade45d0106193b77352142b5205bc548bbc885576c1c/eyecite-2.6.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-06-03 18:48:56",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "freelawproject",
    "github_project": "eyecite",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "eyecite"
}
        