recordsearch-data-scraper

Name: recordsearch-data-scraper
Version: 0.1.0
Home page: https://github.com/wragge/recordsearch_data_scraper
Summary: Tool for extracting machine-readable data from the National Archives of Australia online database, RecordSearch.
Upload time: 2023-01-20 01:08:55
Author: Tim Sherratt
Requires Python: >=3.8
License: MIT License
Keywords: nbdev, jupyter, notebook, python

RecordSearch Data Scraper
=========================

<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

The National Archives of Australia’s online database, RecordSearch,
contains lots of rich, historical data. Unfortunately there’s no API, so
we have to resort to screen scrapers to get it out in reusable form.
This is a library of scrapers to extract data about the main entities in
RecordSearch – Items, Series, and Agencies – from both individual
records and search results.

The main classes are:

- `RSItem()` – an individual item
- `RSItemSearch()` – an advanced search for items
- `RSSeries()` – an individual series
- `RSSeriesSearch()` – an advanced search for series
- `RSAgency()` – an individual agency
- `RSAgencySearch()` – an advanced search for agencies

RecordSearch makes use of an odd assortment of sessions, redirects, and
hidden forms, which make scraping a challenge. Please let me know if
something isn’t working as expected, as problems can be difficult to pin
down!

This is a replacement for the original Recordsearch_tools library. The
main changes are:

- Requirements have been updated (dropping RoboBrowser, which no longer
  seems to be maintained)
- The full range of search parameters is now supported for Items,
  Series, and Agencies
- There’s a built-in cache for improved efficiency and speed (see the
  timing sketch below)
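
Because responses are cached, repeating a request for the same record should come back almost instantly. A rough way to see this, using only the `RSItem` class shown below and the standard library, and assuming the cache is enabled by default:

``` python
import time

from recordsearch_data_scraper.scrapers import RSItem

# First request is fetched from RecordSearch itself
start = time.time()
RSItem('3445411')
print(f'First request: {time.time() - start:.2f}s')

# Repeating the request should be answered from the built-in cache
start = time.time()
RSItem('3445411')
print(f'Repeat request: {time.time() - start:.2f}s')
```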

See the
[documentation](https://wragge.github.io/recordsearch_data_scraper/) for
more details. And check out the [RecordSearch
section](https://glam-workbench.net/recordsearch/) of the GLAM Workbench
for examples of what’s possible.

## Install

`pip install recordsearch-data-scraper`

## How to use

Retrieve an item using its Item ID.

``` python
from recordsearch_data_scraper.scrapers import *

item = RSItem('3445411')
```

View the item data.

``` python
item.data
```

    {'title': 'WRAGGE Clement Lionel Egerton : SERN 647 : POB Cheadle England : POE Enoggera QLD : NOK  (Father) WRAGGE Clement Lindley',
     'identifier': '3445411',
     'series': 'B2455',
     'control_symbol': 'WRAGGE C L E',
     'digitised_status': True,
     'digitised_pages': 47,
     'access_status': 'Open',
     'access_decision_reasons': [],
     'location': 'Canberra',
     'retrieved': '2021-04-25T21:12:22.620414+10:00',
     'contents_date_str': '1914 - 1920',
     'contents_start_date': '1914',
     'contents_end_date': '1920',
     'access_decision_date_str': '12 Apr 2001',
     'access_decision_date': '2001-04-12'}
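
The `data` attribute is just a Python dictionary, so you can save or process it with standard tools. For example, a minimal sketch that writes the record to a JSON file (the filename is only an illustration):

``` python
import json

# Write the item metadata to disk as JSON
with open('item_3445411.json', 'w') as f:
    json.dump(item.data, f, indent=2, ensure_ascii=False)
```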

Search for items.

``` python
search = RSItemSearch(kw='wragge')
```

View the total number of items in the results set.

``` python
search.total_results
```

    209

Access the first page of results.

``` python
items = search.get_results()
```

View the first result.

``` python
items['results'][0]
```

    {'series': 'A2479',
     'control_symbol': '17/1306',
     'title': 'The Wragge Estate. Property for sale.',
     'identifier': '149309',
     'access_status': 'Open',
     'location': 'Canberra',
     'contents_date_str': '1917 - 1917',
     'contents_start_date': '1917',
     'contents_end_date': '1917',
     'digitised_status': True}
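
`get_results()` returns one page of results at a time, so to harvest a complete results set you can keep calling it until nothing more comes back. The sketch below is an assumption-laden example: it assumes successive calls fetch successive pages and that an empty `results` list marks the end of the set, so check the documentation before relying on it.

``` python
import time

from recordsearch_data_scraper.scrapers import RSItemSearch

search = RSItemSearch(kw='wragge')
all_items = []

while True:
    page = search.get_results()
    if not page['results']:
        # No more results, so stop harvesting
        break
    all_items += page['results']
    time.sleep(1)  # a polite pause between requests

# Should equal search.total_results (209 at the time of writing)
print(len(all_items))
```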

The Series and Agency classes follow exactly the same pattern. See the
[docs](https://wragge.github.io/recordsearch_data_scraper/) for more
examples.
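
For instance, here is a minimal sketch for series, assuming `RSSeries()` accepts a series identifier in the same way `RSItem()` accepts an item ID, and that `RSSeriesSearch()` takes the same `kw` parameter as `RSItemSearch()` (B2455 is the series of the item shown above):

``` python
from recordsearch_data_scraper.scrapers import RSSeries, RSSeriesSearch

# An individual series, identified by its series number
series = RSSeries('B2455')
series.data

# An advanced search for series using a keyword
series_search = RSSeriesSearch(kw='wragge')
series_search.total_results
```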



            
