| Name | juriscraper |
| Version | 2.6.95 |
| home_page | None |
| Summary | An API to scrape American court websites for metadata. |
| upload_time | 2025-11-02 18:41:06 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.9 |
| license | None |
| keywords | scraping, legal, pacer |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
+---------------+---------------------+-------------------+
| |Lint Badge| | |Test Badge| | |Version Badge| |
+---------------+---------------------+-------------------+

.. |Lint Badge| image:: https://github.com/freelawproject/juriscraper/workflows/Lint/badge.svg
.. |Test Badge| image:: https://github.com/freelawproject/juriscraper/workflows/Tests/badge.svg
.. |Version Badge| image:: https://badge.fury.io/py/juriscraper.svg

What is This?
=============
Juriscraper is a scraper library, started several years ago, that gathers judicial opinions, oral arguments, and PACER data from the American court system. It is currently able to scrape:

- a variety of pages and reports within the PACER system
- opinions from all major federal appellate courts
- opinions from all state courts of last resort except for Georgia (typically each state's "Supreme Court")
- oral arguments from all federal appellate courts that offer them
Juriscraper is part of a two-part system. The second part is your code,
which calls Juriscraper. Your code is responsible for calling a scraper,
then downloading and saving its results. A reference implementation of the
caller has been developed and is in use at
`CourtListener.com <https://www.courtlistener.com>`__. The code for that
caller can be `found
here <https://github.com/freelawproject/courtlistener/blob/main/cl/scrapers/management/commands/cl_scrape_opinions.py>`__.
There is also a basic sample caller `included in
Juriscraper <https://github.com/freelawproject/juriscraper/blob/main/sample_caller.py>`__
that can be used for testing or as a starting point when developing your
own.
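The caller's job described above (iterate a scraper's results and persist them) can be sketched in a few lines. This is a hypothetical illustration, not the library's actual API: the `items` list stands in for what a scraper would yield after parsing, and the field names are made up for the example.

```python
import json
import tempfile
from pathlib import Path

def save_results(items, out_dir):
    """Persist scraped metadata to disk. A real caller would also fetch
    each download_url and store the document alongside the metadata."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, item in enumerate(items):
        # One JSON file per scraped case; field names are illustrative.
        (out / f"case_{i}.json").write_text(json.dumps(item, indent=2))
    return len(items)

# Stand-in for what a scraper might return after parsing a court page.
items = [
    {"case_name": "Smith v. Jones", "date": "2025-11-01",
     "download_url": "https://example.com/op1.pdf"},
]
out_dir = tempfile.mkdtemp()
print(save_results(items, out_dir))  # → 1
```

The key design point is that Juriscraper only parses; storage, deduplication, and scheduling all belong to the caller.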
Some of the design goals for this project are:

- extensibility to support video, oral argument audio, etc.
- extensibility to support other geographies (US, Cuba, Mexico, California)
- MIME type identification through magic numbers
- a generalized architecture with minimal code repetition
- XPath-based scraping powered by lxml's html parser
- returning all metadata available on court websites (the caller can pick
  what it needs)
- no need for a database
- clear log levels (DEBUG, INFO, WARN, CRITICAL)
- being as friendly as possible to court websites
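One of those goals, MIME type identification through magic numbers, just means sniffing the leading bytes of a downloaded file rather than trusting its extension or the server's Content-Type header. A minimal illustration of the idea (not Juriscraper's actual implementation):

```python
def sniff_mime(data: bytes) -> str:
    """Guess a MIME type from a file's leading 'magic' bytes."""
    if data.startswith(b"%PDF"):
        return "application/pdf"
    if data.startswith(b"PK\x03\x04"):  # zip container (e.g. .docx)
        return "application/zip"
    if data.lstrip()[:1] == b"<":       # crude HTML/XML check
        return "text/html"
    return "application/octet-stream"

print(sniff_mime(b"%PDF-1.7 ..."))  # → application/pdf
```

This matters for court scraping because a link labeled `.pdf` sometimes serves an HTML error page instead of a document.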
Installation & Dependencies
===========================

First, install Python 3.9+. Then:
Install the dependencies
------------------------

On Ubuntu-based distributions / Debian Linux::

    sudo apt-get install libxml2-dev libxslt-dev libyaml-dev

On Arch-based distributions::

    sudo pacman -S libxml2 libxslt libyaml

On macOS with `Homebrew <https://brew.sh>`__::

    brew install libyaml
Then install the code
---------------------

::

    pip install juriscraper

You can set an environment variable for where you want to stash your logs (this
can be skipped, and ``/var/log/juriscraper/debug.log`` will be used as the
default if it exists on the filesystem)::

    export JURISCRAPER_LOG=/path/to/your/log.txt
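The fallback behavior described above can be sketched as follows; the default path comes from the text, while the helper function itself is hypothetical, for illustration only:

```python
import os

def log_path(default="/var/log/juriscraper/debug.log"):
    """Return the log destination, preferring the JURISCRAPER_LOG env var."""
    path = os.environ.get("JURISCRAPER_LOG")
    if path:
        return path
    # Fall back to the default only if it already exists on the filesystem;
    # otherwise signal "no log file" by returning None.
    return default if os.path.exists(default) else None

os.environ["JURISCRAPER_LOG"] = "/tmp/my_juriscraper.log"
print(log_path())  # → /tmp/my_juriscraper.log
```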
Finally, do your WebDriver
--------------------------

Some websites are too difficult to crawl without some sort of automated
WebDriver. For these, Juriscraper either uses a locally installed copy of
geckodriver or can be configured to connect to a remote webdriver. If you
prefer the local installation, you can download the Selenium Firefox
geckodriver::

    # Choose an OS-compatible package from:
    # https://github.com/mozilla/geckodriver/releases/tag/v0.26.0
    # Un-tar/zip your download, then:
    sudo mv geckodriver /usr/local/bin
If you prefer to use a remote webdriver, like `Selenium's docker image
<https://hub.docker.com/r/selenium/standalone-firefox>`__, you can
configure it with the following variables:

``WEBDRIVER_CONN``: Use this to set the connection string to your remote
webdriver. By default, this is ``local``, meaning Juriscraper will look for a
local installation of geckodriver. Instead, you can set this to something like
``'http://YOUR_DOCKER_IP:4444/wd/hub'``, which will switch it to using a remote
driver and connect it to that location.

``SELENIUM_VISIBLE``: Set this to any value to disable headless mode in your
selenium driver, if it supports it. Otherwise, it defaults to headless.
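The way these two variables interact can be sketched as follows. This is an illustrative function, not the library's actual code; only the variable names and the ``local`` default come from the text above.

```python
import os

def webdriver_config():
    """Decide between a local geckodriver and a remote webdriver,
    based on the WEBDRIVER_CONN and SELENIUM_VISIBLE variables."""
    conn = os.environ.get("WEBDRIVER_CONN", "local")
    # Headless unless SELENIUM_VISIBLE is set to any value.
    headless = "SELENIUM_VISIBLE" not in os.environ
    if conn == "local":
        return {"mode": "local", "headless": headless}
    # Anything else is treated as a remote webdriver URL,
    # e.g. 'http://YOUR_DOCKER_IP:4444/wd/hub'.
    return {"mode": "remote", "url": conn, "headless": headless}

os.environ["WEBDRIVER_CONN"] = "http://localhost:4444/wd/hub"
os.environ["SELENIUM_VISIBLE"] = "yes"
print(webdriver_config())
```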
For example, if you want to watch a browser that would otherwise run headless,
you can do so by starting selenium with::

    docker run \
        -p 4444:4444 \
        -p 5900:5900 \
        -v /dev/shm:/dev/shm \
        selenium/standalone-firefox-debug
That'll launch it on your local machine with two open ports. 4444 is the
image's default port for accessing the webdriver. 5900 can be used to connect
via a VNC viewer to watch progress, provided the ``SELENIUM_VISIBLE``
variable is set.

Once you have selenium running like that, you can do a test like::

    WEBDRIVER_CONN='http://localhost:4444/wd/hub' \
    SELENIUM_VISIBLE=yes \
    python sample_caller.py -c juriscraper.opinions.united_states.state.kan_p
Kansas's precedential scraper uses a webdriver. If you do this and watch
selenium, you should see it in action.
Contributing
============

We welcome contributions! If you'd like to get involved, please take a look at
our `CONTRIBUTING.md <CONTRIBUTING.md>`__ guide for instructions on setting up
your environment, running tests, and more.
License
=======

Juriscraper is licensed under the permissive BSD license.

|forthebadge made-with-python|

.. |forthebadge made-with-python| image:: http://ForTheBadge.com/images/badges/made-with-python.svg
   :target: https://www.python.org/