juriscraper

:Name: juriscraper
:Version: 2.6.95
:Summary: An API to scrape American court websites for metadata.
:Upload time: 2025-11-02 18:41:06
:Requires Python: >=3.9
:Keywords: scraping, legal, pacer

+---------------+---------------------+-------------------+
| |Lint Badge|  | |Test Badge|        |  |Version Badge|  |
+---------------+---------------------+-------------------+


.. |Lint Badge| image:: https://github.com/freelawproject/juriscraper/workflows/Lint/badge.svg
.. |Test Badge| image:: https://github.com/freelawproject/juriscraper/workflows/Tests/badge.svg
.. |Version Badge| image:: https://badge.fury.io/py/juriscraper.svg


What is This?
=============

Juriscraper is a scraper library, started several years ago, that gathers judicial opinions, oral arguments, and PACER data from the American court system. It can currently scrape:

-  a variety of pages and reports within the PACER system
-  opinions from all major federal appellate courts
-  opinions from all state courts of last resort (typically their "Supreme Court") except for Georgia
-  oral arguments from all federal appellate courts that offer them

Juriscraper is one half of a two-part system. The other half is your code,
which calls Juriscraper. Your code is responsible for calling a scraper and
then downloading and saving its results. A reference implementation of the
caller has been developed and is in use at
`CourtListener.com <https://www.courtlistener.com>`__. The code for that
caller can be `found
here <https://github.com/freelawproject/courtlistener/blob/main/cl/scrapers/management/commands/cl_scrape_opinions.py>`__.
There is also a basic sample caller `included in
Juriscraper <https://github.com/freelawproject/juriscraper/blob/main/sample_caller.py>`__
that can be used for testing or as a starting point when developing your
own.
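
To make the division of labor concrete, here is a minimal sketch of what a
caller might look like. It is not the CourtListener caller or the bundled
``sample_caller.py``. ``StubSite``, ``run_caller``, and the dictionary keys
``case_names`` and ``download_urls`` are illustrative assumptions; the only
thing assumed about a scraper is that it has a ``parse()`` method and can be
iterated to yield one dictionary per item.

```python
from typing import Any, Callable, Dict, Iterator, List


class StubSite:
    """Hypothetical stand-in for a scraper object (not a Juriscraper class)."""

    def __init__(self) -> None:
        self._items: List[Dict[str, Any]] = []

    def parse(self) -> "StubSite":
        # A real scraper would fetch and parse a court web page here.
        self._items = [
            {
                "case_names": "Doe v. Roe",
                "download_urls": "https://example.com/opinion-1.pdf",
            }
        ]
        return self

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        return iter(self._items)


def run_caller(
    site: Any,
    download: Callable[[str], bytes],
    save: Callable[[Dict[str, Any], bytes], None],
) -> int:
    """Call a scraper, then download and save each result it reports."""
    site.parse()
    count = 0
    for item in site:
        content = download(item["download_urls"])  # the caller's job
        save(item, content)  # also the caller's job
        count += 1
    return count
```

Passing ``download`` and ``save`` in as functions keeps the sketch free of any
database or HTTP library, so you can swap in whatever storage and client you
use.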

Some of the design goals for this project are:

-  extensibility to support video, oral argument audio, etc.
-  extensibility to support geographies (US, Cuba, Mexico, California)
-  MIME type identification through magic numbers
-  generalized architecture with minimal code repetition
-  XPath-based scraping powered by lxml's HTML parser
-  return all metadata available on court websites (caller can pick
   what it needs)
-  no need for a database
-  clear log levels (DEBUG, INFO, WARN, CRITICAL)
-  as friendly as possible to court websites
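
As an illustration of the magic-number goal in the list above, the sketch
below guesses a MIME type from the first bytes of a downloaded file rather
than from its extension. This is not Juriscraper's implementation.
``MAGIC_NUMBERS`` and ``sniff_mime_type`` are hypothetical names, and the
table covers only a few well-known signatures.

```python
# Leading byte signatures ("magic numbers") for a few common file types.
MAGIC_NUMBERS = [
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),  # also the container for .docx
    (b"ID3", "audio/mpeg"),  # MP3 file that starts with an ID3v2 tag
]


def sniff_mime_type(
    content: bytes, default: str = "application/octet-stream"
) -> str:
    """Return a MIME type based on the leading bytes of ``content``."""
    for signature, mime_type in MAGIC_NUMBERS:
        if content.startswith(signature):
            return mime_type
    # Unknown signature: fall back to a generic binary type.
    return default
```

Checking the content itself is useful because court websites do not always
serve files with a reliable extension or ``Content-Type`` header.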

Installation & Dependencies
===========================

First step: Install Python 3.9+, then:

Install the dependencies
------------------------

On Debian- and Ubuntu-based distributions::

    sudo apt-get install libxml2-dev libxslt-dev libyaml-dev

On Arch-based distributions::

    sudo pacman -S libxml2 libxslt libyaml

On macOS with `Homebrew <https://brew.sh>`__::

    brew install libyaml


Then install the code
---------------------

::

    pip install juriscraper

You can set an environment variable for where you want to stash your logs (this
can be skipped, and ``/var/log/juriscraper/debug.log`` will be used as the
default if it exists on the filesystem)::

    export JURISCRAPER_LOG=/path/to/your/log.txt
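
The lookup order described above can be sketched roughly as follows. This is
an illustration, not Juriscraper's code. ``resolve_log_path`` is a
hypothetical helper, and returning ``None`` when neither location is
available is an assumption, since what happens in that case is not described
here.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_LOG = "/var/log/juriscraper/debug.log"


def resolve_log_path(
    environ: Optional[Mapping[str, str]] = None,
    default: str = DEFAULT_LOG,
) -> Optional[Path]:
    """Pick a log file: the environment variable first, then the default."""
    environ = os.environ if environ is None else environ
    configured = environ.get("JURISCRAPER_LOG")
    if configured:
        return Path(configured)
    default_path = Path(default)
    if default_path.exists():
        return default_path
    # Assumption: no usable location, so let the caller decide what to do.
    return None
```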

Finally, set up your WebDriver
------------------------------

Some websites are too difficult to crawl without some sort of automated
WebDriver. For these, Juriscraper either uses a locally installed copy of
geckodriver or can be configured to connect to a remote webdriver. If you prefer
the local installation, you can download geckodriver, the WebDriver for Firefox::

    # choose OS compatible package from:
    #   https://github.com/mozilla/geckodriver/releases/tag/v0.26.0
    # un-tar/zip your download
    sudo mv geckodriver /usr/local/bin

If you prefer to use a remote webdriver, like `Selenium's Docker image <https://hub.docker.com/r/selenium/standalone-firefox>`__, you can
configure it with the following variables:

``WEBDRIVER_CONN``: Use this to set the connection string to your remote
webdriver. By default, this is ``local``, meaning Juriscraper will look for a
local installation of geckodriver. Alternatively, set it to something like
``'http://YOUR_DOCKER_IP:4444/wd/hub'`` to switch to a remote driver at that
location.

``SELENIUM_VISIBLE``: Set this to any value to disable headless mode in your
Selenium driver, if it supports it. Otherwise, it defaults to headless.

For example, if you want to watch the browser while it runs, you can do so by
starting Selenium with::

    docker run \
        -p 4444:4444 \
        -p 5900:5900 \
        -v /dev/shm:/dev/shm \
        selenium/standalone-firefox-debug

That'll launch it on your local machine with two open ports. Port 4444 is the
image's default for accessing the webdriver. Port 5900 can be used to connect
via a VNC viewer and watch progress if the ``SELENIUM_VISIBLE`` variable is
set.

Once Selenium is running this way, you can test it with::

    WEBDRIVER_CONN='http://localhost:4444/wd/hub' \
        SELENIUM_VISIBLE=yes \
        python sample_caller.py -c juriscraper.opinions.united_states.state.kan_p

Kansas's precedential scraper uses a webdriver. If you do this and watch
Selenium, you should see it in action.
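
To summarize how the two variables described above interact, here is a small
sketch that turns them into settings. It is not Juriscraper's own code and
does not start a browser. ``WebDriverSettings`` and
``read_webdriver_settings`` are hypothetical names, and treating an empty
``SELENIUM_VISIBLE`` as unset is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class WebDriverSettings:
    remote_url: Optional[str]  # None means "use the local geckodriver"
    headless: bool


def read_webdriver_settings(
    environ: Optional[Mapping[str, str]] = None,
) -> WebDriverSettings:
    """Interpret WEBDRIVER_CONN and SELENIUM_VISIBLE from the environment."""
    environ = os.environ if environ is None else environ
    conn = environ.get("WEBDRIVER_CONN", "local")
    remote_url = None if conn == "local" else conn
    # Assumption: any non-empty value switches headless mode off.
    headless = not environ.get("SELENIUM_VISIBLE")
    return WebDriverSettings(remote_url=remote_url, headless=headless)
```

With no variables set, this yields a local, headless configuration. With the
values from the test command above, it yields a remote, visible one.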

Contributing
============

We welcome contributions! If you'd like to get involved, please take a look at our
`CONTRIBUTING.md <CONTRIBUTING.md>`__
guide for instructions on setting up your environment, running tests, and more.

License
=======

Juriscraper is licensed under the permissive BSD license.

|forthebadge made-with-python|

.. |forthebadge made-with-python| image:: http://ForTheBadge.com/images/badges/made-with-python.svg
    :target: https://www.python.org/

Raw data
========

::

    {
        "_id": null,
        "home_page": null,
        "name": "juriscraper",
        "maintainer": null,
        "docs_url": null,
        "requires_python": ">=3.9",
        "maintainer_email": null,
        "keywords": "scraping, legal, pacer",
        "author": null,
        "author_email": "Free Law Project <info@free.law>",
        "download_url": "https://files.pythonhosted.org/packages/ee/99/6f78f2c34bb8ff539787bad9e18b2e4295faa6600f5dce084d0330b69a53/juriscraper-2.6.95.tar.gz",
        "platform": null,
        "bugtrack_url": null,
        "license": null,
        "summary": "An API to scrape American court websites for metadata.",
        "version": "2.6.95",
        "project_urls": {
            "Repository": "https://github.com/freelawproject/juriscraper"
        },
        "split_keywords": [
            "scraping",
            " legal",
            " pacer"
        ],
        "urls": [
            {
                "comment_text": null,
                "digests": {
                    "blake2b_256": "8a633f532b255f22754db6cb9c3055c37ecb27960fe51f9615bea810c2caf3af",
                    "md5": "143e280ad2ebd844a2b2a0bb4c8f40fb",
                    "sha256": "8c9d99f8f124230d8f7c05c38fd2280cdd46f46206eb52bd62832aaea204c31b"
                },
                "downloads": -1,
                "filename": "juriscraper-2.6.95-py3-none-any.whl",
                "has_sig": false,
                "md5_digest": "143e280ad2ebd844a2b2a0bb4c8f40fb",
                "packagetype": "bdist_wheel",
                "python_version": "py3",
                "requires_python": ">=3.9",
                "size": 553853,
                "upload_time": "2025-11-02T18:41:03",
                "upload_time_iso_8601": "2025-11-02T18:41:03.591512Z",
                "url": "https://files.pythonhosted.org/packages/8a/63/3f532b255f22754db6cb9c3055c37ecb27960fe51f9615bea810c2caf3af/juriscraper-2.6.95-py3-none-any.whl",
                "yanked": false,
                "yanked_reason": null
            },
            {
                "comment_text": null,
                "digests": {
                    "blake2b_256": "ee996f78f2c34bb8ff539787bad9e18b2e4295faa6600f5dce084d0330b69a53",
                    "md5": "9e1ec5cea372ad459679f55b8baab3e6",
                    "sha256": "3e300c979c3edd98b35974535cd34e6d461fcc25d01b2c5afb96c57dc925e9fe"
                },
                "downloads": -1,
                "filename": "juriscraper-2.6.95.tar.gz",
                "has_sig": false,
                "md5_digest": "9e1ec5cea372ad459679f55b8baab3e6",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": ">=3.9",
                "size": 338169,
                "upload_time": "2025-11-02T18:41:06",
                "upload_time_iso_8601": "2025-11-02T18:41:06.100973Z",
                "url": "https://files.pythonhosted.org/packages/ee/99/6f78f2c34bb8ff539787bad9e18b2e4295faa6600f5dce084d0330b69a53/juriscraper-2.6.95.tar.gz",
                "yanked": false,
                "yanked_reason": null
            }
        ],
        "upload_time": "2025-11-02 18:41:06",
        "github": true,
        "gitlab": false,
        "bitbucket": false,
        "codeberg": false,
        "github_user": "freelawproject",
        "github_project": "juriscraper",
        "travis_ci": false,
        "coveralls": false,
        "github_actions": true,
        "tox": true,
        "lcname": "juriscraper"
    }