description-harvester


Name: description-harvester
Version: 0.0.5
Home page: https://github.com/UAlbanyArchives/description_harvester
Summary: A tool for working with archival description for public access.
Upload time: 2024-08-22 13:25:32
Maintainer: None
Docs URL: None
Author: Gregory Wiedeman
Requires Python: >=3.7
License: None
Requirements: No requirements were recorded.
# description_harvester
A tool for working with archival description for public access. description_harvester reads archival description into a [minimalist data model for public-facing archival description](https://github.com/UAlbanyArchives/description_harvester/blob/main/description_harvester/models/description.py) and then converts it to the [Arclight data model](https://github.com/UAlbanyArchives/description_harvester/blob/main/description_harvester/models/arclight.py) and POSTs it into an Arclight Solr index using [PySolr](https://github.com/django-haystack/pysolr).
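
The read-then-convert flow described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the `Component` class, the `to_solr_doc` function, and the Solr field names are all hypothetical stand-ins for the linked description and Arclight models.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    """Hypothetical stand-in for the minimalist description model."""
    id: str
    title: str
    level: str
    children: List["Component"] = field(default_factory=list)


def to_solr_doc(component: Component) -> dict:
    """Convert a component (and its children) to a nested Solr-style dict.

    The field names are illustrative, not the real Arclight schema.
    """
    return {
        "id": component.id,
        "title_ssm": [component.title],
        "level_ssm": [component.level],
        "components": [to_solr_doc(child) for child in component.children],
    }


collection = Component(
    id="apap106",
    title="Example collection",
    level="collection",
    children=[Component(id="apap106_series1", title="Series 1", level="series")],
)
doc = to_solr_doc(collection)
print(doc["id"], len(doc["components"]))
```

In the real tool, the resulting document would then be sent to Solr with PySolr. That step is left out here so the sketch runs without a Solr server.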

description_harvester is designed to be extensible and to harvest archival description from a number of [sources](https://github.com/UAlbanyArchives/description_harvester/tree/main/description_harvester/inputs). Currently, the only available source harvests data from the [ArchivesSpace](https://github.com/archivesspace/archivesspace) [API](https://archivesspace.github.io/archivesspace/api/#introduction) using [ArchivesSnake](https://github.com/archivesspace-labs/ArchivesSnake). Modules for EAD2002 and other sources could be added in the future. It's also possible to add [output modules](https://github.com/UAlbanyArchives/description_harvester/tree/main/description_harvester/outputs) that serialize description to EAD or other formats, in addition to or instead of sending description to an Arclight Solr instance. This opens up new possibilities for managing description using low-barrier formats and tools.

The [main branch](https://github.com/UAlbanyArchives/description_harvester) is designed to be a drop-in replacement for the Arclight Traject indexer, while the [dao-indexing branch](https://github.com/UAlbanyArchives/description_harvester/tree/dao-indexing) tries to fully index digital objects from digital repositories and other sources, including item-level metadata fields, embedded text, OCR text, and transcriptions. 

This is still a bit drafty: it has only been tested on ASpace v2.8.0 and needs better error handling. Validation is also very minimal, but there is potential to add detailed validation with `jsonschema`.

### Installation

```bash
pip install description_harvester
```

First, you need to configure ArchivesSnake by creating a `~/.archivessnake.yml` file with your API credentials, as detailed in the [ArchivesSnake configuration docs](https://github.com/archivesspace-labs/ArchivesSnake#configuration).

Next, you also need a `~/.description_harvester.yml` file that lists your Solr URL and the core you want to index to. These values can also be overridden with command-line arguments.

```yml
solr_url: http://127.0.0.1:8983/solr
solr_core: blacklight-core
last_query: 0
```
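
To illustrate how command-line arguments might take precedence over the config file, here is a minimal sketch. The flag names `--solr_url` and `--core` and both helper functions are assumptions made for this example, not the tool's documented interface. The tiny parser only handles flat `key: value` lines, where a real tool would use a YAML library.

```python
import argparse


def parse_flat_yaml(text: str) -> dict:
    """Parse flat `key: value` lines only; not a general YAML parser."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon so URLs in the value stay intact.
        key, _, value = line.partition(":")
        config[key.strip()] = value.strip()
    return config


def resolve_settings(config_text: str, argv: list) -> dict:
    """Return Solr settings, letting arguments override the config file."""
    parser = argparse.ArgumentParser()
    # Hypothetical flag names, used only for illustration.
    parser.add_argument("--solr_url")
    parser.add_argument("--core")
    args = parser.parse_args(argv)
    config = parse_flat_yaml(config_text)
    return {
        "solr_url": args.solr_url or config["solr_url"],
        "solr_core": args.core or config["solr_core"],
    }


CONFIG = """solr_url: http://127.0.0.1:8983/solr
solr_core: blacklight-core
last_query: 0
"""

settings = resolve_settings(CONFIG, ["--core", "arclight-dev"])
print(settings)
```

Here the core comes from the argument, while the Solr URL falls back to the value in the file.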

### Indexing from ArchivesSpace API to Arclight

Once description_harvester is set up, you can index from the ASpace API to Arclight using the `to-arclight` command.

#### Index by id_0

You can provide one or more IDs to index using a resource's `id_0` field.

`harvest --id ua807`

`harvest --id mss123 apap106`

#### Index by URI

You can also use the integers from ASpace resource URIs, such as 263 for `https://my.aspace.edu/resources/263`.

`harvest --uri 435`

`harvest --uri 1 755`
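
If you have full resource URLs rather than bare integers, the number to pass to `--uri` can be pulled out with a small helper. This is a sketch only: `resource_number` is a hypothetical function, not part of description_harvester, and it assumes the integer is the last path segment.

```python
from urllib.parse import urlparse


def resource_number(url: str) -> int:
    """Return the trailing integer of an ASpace resource URL.

    Raises ValueError if the last path segment is not an integer.
    """
    path = urlparse(url).path.rstrip("/")
    return int(path.rsplit("/", 1)[-1])


print(resource_number("https://my.aspace.edu/resources/263"))
```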

#### Indexing by modified time

Index collections modified in the past hour: `harvest --hour`

Index collections modified in the past day: `harvest --today`

Index collections modified since the last run: `harvest --new`
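
One plausible way to turn these three options into a cutoff time is sketched below. It assumes that `last_query` in the config file holds the UNIX timestamp of the previous run and that "day" means the past 24 hours. The source states neither, and the `modified_since` function is hypothetical.

```python
import time


def modified_since(mode: str, last_query: int = 0, now: float = None) -> int:
    """Return a UNIX-timestamp cutoff for the given mode.

    `mode` is one of "hour", "today", or "new" (illustrative names).
    """
    now = int(time.time()) if now is None else int(now)
    if mode == "hour":
        return now - 60 * 60
    if mode == "today":
        return now - 24 * 60 * 60
    if mode == "new":
        # Assumed meaning: everything modified since the stored timestamp.
        return int(last_query)
    raise ValueError(f"unknown mode: {mode}")


print(modified_since("hour", now=10_000))
```

A cutoff like this could then be passed to the ArchivesSpace API when listing resources, so that only recently modified collections are re-indexed.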

#### Deleting collections

You can delete one or more collections using the `--delete` argument in addition to `--id`. This uses the Solr document ID, such as `apap106` for `https://my.arclight.edu/catalog/apap106`.

`harvest --id apap101 apap301 --delete`
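
For context, a delete by document ID can be expressed against Solr's JSON update handler as shown in this sketch. It only builds the request and does not send it. The tool itself uses PySolr and may issue the delete differently, and `build_delete_request` is a hypothetical helper.

```python
import json
from urllib.request import Request


def build_delete_request(solr_url: str, core: str, ids: list) -> Request:
    """Build (but do not send) a Solr delete-by-ID update request."""
    body = json.dumps({"delete": list(ids)}).encode("utf-8")
    return Request(
        f"{solr_url.rstrip('/')}/{core}/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_delete_request(
    "http://127.0.0.1:8983/solr", "blacklight-core", ["apap101", "apap301"]
)
print(request.full_url)
print(request.data.decode("utf-8"))
```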

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/UAlbanyArchives/description_harvester",
    "name": "description-harvester",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": null,
    "author": "Gregory Wiedeman",
    "author_email": "gwiedeman@albany.edu",
    "download_url": "https://files.pythonhosted.org/packages/10/26/1079f2d3e7a7233102b67496fce270a5fba90d135695a57dd2fc762ee56f/description_harvester-0.0.5.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "A tool for working with archival description for public access.",
    "version": "0.0.5",
    "project_urls": {
        "Homepage": "https://github.com/UAlbanyArchives/description_harvester"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e9f1ceabc73447fe47cf1721586266788f06db729cd869f378ec00fb6e52e838",
                "md5": "09cae9f31ec3d2ea4463d6830df134cd",
                "sha256": "fafc388b2f5c6f2d32e6dae102c31422c8ebcde817cd30decee4d5a4a15cc24c"
            },
            "downloads": -1,
            "filename": "description_harvester-0.0.5-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "09cae9f31ec3d2ea4463d6830df134cd",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 17154,
            "upload_time": "2024-08-22T13:25:30",
            "upload_time_iso_8601": "2024-08-22T13:25:30.660693Z",
            "url": "https://files.pythonhosted.org/packages/e9/f1/ceabc73447fe47cf1721586266788f06db729cd869f378ec00fb6e52e838/description_harvester-0.0.5-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "10261079f2d3e7a7233102b67496fce270a5fba90d135695a57dd2fc762ee56f",
                "md5": "921cedc5437f613698df85998a5e38db",
                "sha256": "ff31e2244112c3f219781cf8cc37e4162c886548c652aa166e77bf7ab761446e"
            },
            "downloads": -1,
            "filename": "description_harvester-0.0.5.tar.gz",
            "has_sig": false,
            "md5_digest": "921cedc5437f613698df85998a5e38db",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 15836,
            "upload_time": "2024-08-22T13:25:32",
            "upload_time_iso_8601": "2024-08-22T13:25:32.263746Z",
            "url": "https://files.pythonhosted.org/packages/10/26/1079f2d3e7a7233102b67496fce270a5fba90d135695a57dd2fc762ee56f/description_harvester-0.0.5.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-08-22 13:25:32",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "UAlbanyArchives",
    "github_project": "description_harvester",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "description-harvester"
}
        