bdrc-transfer

Name	bdrc-transfer
Version	0.1.18
Summary	Transfer library
upload_time	2023-03-29 18:23:58
author	jimk
requires_python	>=3.8

# BDRC Transfer Library

`bdrc-transfer` is a Python library and console script package that
provides SFTP and other services to implement the BDRC workflow to send
BDRC works to a remote site for OCR production, and receive, unpack, and distribute
the resulting works.

## Copyrighted Works

While fair use doctrine allows us to transmit our copies of images of copyrighted works,
there may be an issue with Google Books making them available to their community.

Google Books Library Partnership staff member Ben Bunnell described Google Books' copyright validation process
in an email to BDRC dated 13 Jan 2023:

> Hi Jim,
> We to use the metadata, but the main way is that everything that goes through the Google Books process includes a
> copyright verification check as part of the analysis stage. The first few pages of the book are presented to operators
> who verify publication dates and location of publication. This info goes through an automated flowchart that
> determines
> viewability in any given location.
>
> For cases where you think the copyright determination is incorrect, you or a general user can open the book on Google
> Books, then go to the gear icon (or three-dot menu icon depending on whether you're looking at the new Google Books
> interface) /Help/Report problems to request a second review.
>
> Best wishes,
> Ben

## Debian Installation

1. On Debian systems, the MySQL client library is needed:
   `sudo apt install default-libmysqlclient-dev`

2. Install audit-tool version `Version 1.0Beta_2022_05_12` or later (`audit-tool --version` shows the installed
   version). Use the latest version from the [Audit Tool Releases Page](https://github.com/buda-base/asset-manager/releases).

3. `pip[3] install [--upgrade] [--no-cache-dir] bdrc-transfer`

* Some systems only have `pip3`, not `pip`.
* `--upgrade` and `--no-cache-dir` make sure that the latest release is installed. `--no-cache-dir` is usually only
  required when testing local disk wheels. `--upgrade` is for installing from the PyPI repository.

Then, once only, run:
`gb-bdrc-init`
This copies a Google Books config from the install directory into the user's `.config/gb` folder, backing up
any config that is already there. The user is responsible for merging their site-specific changes.
## Getting Started

### Manual Workflow

This is a provisional workflow until all the steps can be automated. Development of automation
for "When the material is ready for conversion" and "Browse to the GB Converted Page" is underway.
The **Automated Workflow** section of this document will be updated as each release gets this support.

The Google Books manual workflow is:

1. Identify works to send to Google Books
2. Create and upload the metadata for that list  (`upload-metadata`)
3. Create a list of paths to the works on that list, and upload that (`upload-content`) **Note that a specially
   configured audit-tool validates the content before upload.**
4. Wait for some time for Google to process the content. This can be a day to a week.
5. When the material is ready for conversion,
    1. [GB TBRC InProcess Page](https://books.google.com/libraries/TBRC/_in_process) - select and save the 'text only'
       version
    2. Select the checkbox for every work (remember there may be multiple pages)
    3. Click "request conversion" for them
6. Wait for some time, and then use GRIN to get the list of files that GB has converted, and which are ready to
   download.
7. Browse to [GB TBRC Converted Page](https://books.google.com/libraries/TBRC/_converted). For each line you find:
    1. In the browser, select the ....pgp.gz file  (they're links) in turn and download it.
    2. On the command line:
        1. run `unpack` on the downloaded archive
        2. run `distribute-ocr` on the resulting work structure

### Automated Workflow

#### Preparation and configuration

1. Install bdrc-transfer >= 0.0.4. Version 0.0.4 implements the automated "conversion request" step (below).
2. Choose a user to host a `crontab` entry. The user's environment must contain the environment variables listed in
   **Runtime** below. The recommended way is to use the user's interactive **bash** environment, as shown here. Be sure
   that the file referenced in `BASH_ENV` passes control to a script which initializes all the variables (typically
   `.bashrc`, or some variant of it).

```shell
# m h  dom mon dow   command
# * *   *   *   *     BASH_ENV=~/.profile request-conversion-service
```

3. Schedule the crontab entry shown above

#### Workflow

The Google Books automated workflow is:

1. Identify works to send to Google Books
2. Create and upload the metadata for that list  (`upload-metadata`)
3. Create a list of paths to the works on that list, and upload that (`upload-content`) **Note that a specially
   configured audit-tool validates the content before upload.**
4. The crontab entry `request-conversion-service` (see above) will poll the Google Books server and look for volumes
   available for conversion, and will request them.
5. The crontab entry `process-converted` (in `bdrc-transfer 0.0.5`) will:
    1. Poll the Google Books server for volumes which are ready to download.
    2. Download, unpack, and distribute the OCR'd volume and support.

### Backlog processing

Some utilities help with setting up the process. For example, suppose items were manually downloaded and unpacked
earlier. To trigger a re-distribution, signal again that they have been downloaded. The command line tool
`mark-downloaded [-i [ paths | - ] path, .....` marks in the internal tracking system that those items have been
downloaded. The items must have the file name format `{parent_path}/WorkRid-ImageGroupRid.tar.gz`.
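
The file name rule above can be checked before items are passed to `mark-downloaded`. This is a minimal sketch, not part of `bdrc-transfer`: the helper name `split_download_name` is hypothetical, and the pattern assumes that Work RIDs start with `W` and Image Group RIDs start with `I`, as in the examples elsewhere in this document.

```python
import re
from pathlib import PurePosixPath
from typing import Optional, Tuple

# Assumption: a Work RID starts with "W" and an Image Group RID starts with "I",
# and neither contains "-", "/" or ".".
_NAME_RE = re.compile(r"^(W[^-/.]+)-(I[^-/.]+)\.tar\.gz$")


def split_download_name(path: str) -> Optional[Tuple[str, str]]:
    """Return (work_rid, image_group_rid) when the file name has the form
    {parent_path}/WorkRid-ImageGroupRid.tar.gz, otherwise None."""
    match = _NAME_RE.match(PurePosixPath(path).name)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Paths for which the helper returns `None` can be reported and skipped rather than sent to the tracking system.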

## Runtime

### Environment configuration

`bdrc-transfer` requires these environment variables, unless overridden on the command line.
(Overriding is not recommended in production)

* `GRIN_CONFIG` - Path to the configuration file, which contains authorization and other essential data.
  The name and contents of this file should be closely held in the development team. **Environment variables which
  v<= 0.0.4 read are now in this file.**

#### Logging

One requirement of this package is that there be a single, authoritative log of activities. In development,
there will be testing attempts. It should be easy to add the logs of these testing attempts to the single log.
Each `gb_ocr` operation defines a tuple of _activity_ and _destination_.

The _activity_ values are:

- upload
- request_conversion
- unpack
- distribute

and the _destination_ values are:

- metadata
- content

The resulting set of log files this package creates are:

- upload_metadata-activity.log
- upload_metadata-runtime.log
- upload_content-activity.log
- upload_content-runtime.log
- request_conversion-activity.log
- request_conversion-runtime.log
- transfer-activity.log
- transfer-runtime.log
- unpack-activity.log
- unpack-runtime.log
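
The names in this list appear to follow one pattern: the activity, an optional destination, and the log kind. The sketch below reproduces that pattern; it is inferred from the list above rather than taken from the package source, and `log_file_name` is a hypothetical helper.

```python
from typing import Optional


def log_file_name(activity: str, destination: Optional[str], kind: str) -> str:
    """Build a log file name such as upload_metadata-activity.log.

    kind is "activity" or "runtime". destination is None for operations
    that have none in the list above (for example request_conversion).
    """
    if kind not in ("activity", "runtime"):
        raise ValueError(f"unknown log kind: {kind}")
    stem = f"{activity}_{destination}" if destination else activity
    return f"{stem}-{kind}.log"
```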

##### Runtime log

This is a free-form console log for diagnostic and informational purposes.

##### Activity log

This is the canonical log file for the activity. Its structure is optimized for
programmatic import, not human readability.

**v 0.1.12 update:** The canonical log has been moved into a database. The database is accessed through the
`AORunActivityLog.db_activity_logger` class.

<s>
##### Log file naming
Log files are intended to be continuous, and are not concurrency safe. *Activity logs* are intended to be singular
across the whole BDRC network, so there *must* be only one activity instance writing at a time.
(As of 7 Jun 2022, this is not enforced)
</s>

### Available commands

    unpack
    relocate-downloads
    gb-convert
    move-downloads
    upload-metadata
    distribute-ocr
    upload-content
    request-conversion (request-conversion-service)
    request-download (request-download-service)

#### Common Options

All commands in this section share these common options:

```shell
optional arguments:
  -h, --help            show this help message and exit
  -l LOG_HOME, --log_home LOG_HOME
                        Where logs are stored - see manual
  -n, --dry_run         Connect only. Do not upload
  -d {info,warning,error,debug,critical}, --debug_level {info,warning,error,debug,critical}
                        choice values are from python logging module
  -z, --log_after_fact  (ex post facto) log a successful activity after it was performed out of band
  -i [FILE], --input_file [FILE]
                        files to read. use - for stdin
```
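
One way to share such options across several console scripts is an `argparse` parent parser. The sketch below is an illustration built from the help text above, not the package's actual code; the function names and the `info` default for `-d` are assumptions.

```python
import argparse


def common_parser() -> argparse.ArgumentParser:
    """Options shared by every command (add_help=False so that child parsers supply -h)."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("-l", "--log_home", help="Where logs are stored - see manual")
    parser.add_argument("-n", "--dry_run", action="store_true",
                        help="Connect only. Do not upload")
    parser.add_argument("-d", "--debug_level", default="info",  # default is an assumption
                        choices=["info", "warning", "error", "debug", "critical"],
                        help="choice values are from python logging module")
    parser.add_argument("-z", "--log_after_fact", action="store_true",
                        help="log a successful activity after it was performed out of band")
    parser.add_argument("-i", "--input_file", metavar="FILE", nargs="?",
                        help="files to read. use - for stdin")
    return parser


def upload_metadata_parser() -> argparse.ArgumentParser:
    """A command parser that inherits the common options and adds its own positional."""
    parser = argparse.ArgumentParser(prog="upload-metadata",
                                     description="Creates and sends metadata to gb",
                                     parents=[common_parser()])
    parser.add_argument("work_rid", nargs="?", help="Work ID")
    return parser
```

For example, `upload_metadata_parser().parse_args(["-n", "W1PD12345"])` yields a namespace with `dry_run` set and `work_rid` equal to `W1PD12345`.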

---

#### upload-metadata

```shell
usage: upload-metadata [-h] [-l LOG_HOME] [-n] [-d {info,warning,error,debug,critical}] [-z] [-i [FILE]] [work_rid]

Creates and sends metadata to gb

positional arguments:
  work_rid              Work ID
```

---

#### upload-content

```shell
❯ upload-content --help
usage: upload-content [-h] [-l LOG_HOME] [-n] [-d {info,warning,error,debug,critical}] [-z] [-i [FILE]] [-g] [work_path]

uploads the images in a work to GB. Can upload all or some image groups (see --image_group option)

... common arguments

  -g, --image_group     True if paths are to image group
```

---

#### unpack

```shell
usage: unpack [-h] [-l LOG_HOME] [-n] [-d {info,warning,error,debug,critical}] [-z] [-i [FILE]] [src]

Unpacks an artifact

positional arguments:
  src                   xxx.tar.gz.gpg file to unpack
```

Unpacks a downloaded GB processed artifact. (Note that the download is not FTP,
so there is no API to download. In 0.0.1, this is a manual operation.)

**See the section Distribution format for the output documentation**

---

#### gb-convert

This is a stub function, which simulates requesting a conversion from the Google Books
web UI. It simply logs the fact that the user has checked a whole list of items to convert.
Usually the user will have to download the list from GB, extract the image group RIDs, and feed them
into this program.

```shell
usage: gb-convert [-h] [-l LOG_HOME] [-n] [-d {info,warning,error,debug,critical}] [-z] [-i [FILE]] [image_group]

Requests conversion of an uploaded content image group

positional arguments:
  image_group           workRid-ImageGroupRid - no file suffixes
```

---

#### ftp-transfer

This is a low level utility function, which should not generally be used in the workflow.

```shell
usage: ftp-transfer [-h] [-l LOG_HOME] [-n] [-d {info,warning,error,debug,critical}] [-z] [-i [FILE]] [-m | -c] [-p | -g]
                    src [dest]

Uploads a file to a specific partner server, defined by a section in the config file

positional arguments:
  src                   source file for transfer
  dest                  [Optional] destination file - defaults to basename of source

optional arguments:
  ... common arguments
  -m, --metadata        Act on metadata target
  -c, --content         Act on the content target
  -p, --put             send to
  -g, --get             get from (NOT IMPLEMENTED)
```

### Launching

Define the environment variable `GB_CONFIG` to point to the configuration file for the project. The configuration file
is the access point to GB's SFTP host, and is tightly controlled.

### Activity Tracking and Logging

Activity tracking is the responsibility of the `log_ocr` package.
The `log_ocr` package has a public module `AORunLog.py` which contains the `AORunActivityLog` class. This class offers
three interfaces to its clients. These are separated into two groups: `logging` implementations and database
implementations.

#### Logging

These are Python `logging` instances, and offer the complete `logging` interface:

- `activity_logger`
- `runtime_logger`

#### Database implementation

The database implementation is a replacement for the activity logger, which is a simple canonical journal of GB OCR
processing.

- `activity_db_logger`: an instance of class `log_ocr.GbOcrTrack.GbOcrTracker`. It exposes the following
  methods:
    * `add_content_request`: Records a content process step:
        * upload
        * request_conversion
        * download image groups which GB has processed
        * distribute
    * `get_ready_to_convert`: Gets a list of image groups which GB has received, but for which we have not
      requested conversion.
    * `get_converted`: Gets a list of image groups which GB has converted, but which we have not downloaded.

The property `log_ocr.AORunLog.activity_db_logger` is the replacement for the "activity" tracking log discussed below.
It does not use the Python `logging` API, but its own specific methods, which are found in the `log_ocr` package.

### Logging

#### Log store

The log directory is taken from these sources, in increasing order of precedence:

1. The current working directory is the default, in the absence of the next entries.
2. The environment variable `RUN_ACTIVITY_LOG_HOME`.
3. The `-l/--log_home` argument to `ftp-transfer`. It overrides the environment variable if given.
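
The precedence described in this list can be sketched as follows. The function `resolve_log_home` is a hypothetical illustration, not the package's implementation; only the environment variable name `RUN_ACTIVITY_LOG_HOME` comes from this document.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_log_home(cli_log_home: Optional[str] = None,
                     env: Optional[Mapping[str, str]] = None) -> Path:
    """Pick the log directory: command line first, then environment, then cwd."""
    env = os.environ if env is None else env
    if cli_log_home:
        # -l/--log_home overrides everything else
        return Path(cli_log_home)
    env_home = env.get("RUN_ACTIVITY_LOG_HOME")
    if env_home:
        return Path(env_home)
    # Neither was given: fall back to the current working directory
    return Path.cwd()
```

Passing the environment as a parameter keeps the function easy to test without changing the real process environment.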

#### Log files

`ftp_transfer` logs two kinds of activity:

- Runtime logs (`transfer-runtime.log`) describe details of an operation. The content of this log is affected by
  the `-d` flag.
- Activity logs (`transfer-activity.log`) provide limited, but auditable, information on:
    - the activity subject (metadata or content)
    - the activity outcome (success or fail)

It is the caller's responsibility to aggregate activity logs into a coherent view of activity.

#### Log format

##### Runtime Format

short date time:level:message:content descriptor

Example:

```
06-03 15:29:INFO:upload success /Users/jimk/dev/tmp/aog1/META/marc-W2PD17457.xml:metadata
```

##### Activity Format

Date mm-DD-YYYY HH-MM-SS:operation:status:message:content descriptor

Example:

```
06-06-2022 20-28-06:get:error:/Users/jimk/dev/tmp/aog1/META/marc-W2PD17457.xml:metadata:
```
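
Because the activity format is meant for programmatic import, a line such as the example above can be split into typed fields. This is a minimal sketch under stated assumptions: `ActivityRecord` and `parse_activity_line` are hypothetical names, and the content descriptor is assumed to be the last field and to contain no colon.

```python
from datetime import datetime
from typing import NamedTuple


class ActivityRecord(NamedTuple):
    timestamp: datetime
    operation: str
    status: str
    message: str
    descriptor: str


def parse_activity_line(line: str) -> ActivityRecord:
    """Parse 'mm-DD-YYYY HH-MM-SS:operation:status:message:content descriptor'."""
    # The example line ends with a trailing ":"; drop it before splitting.
    body = line.strip().rstrip(":")
    stamp, operation, status, rest = body.split(":", 3)
    # Assumption: the descriptor is the final field, so split it off from the right.
    message, descriptor = rest.rsplit(":", 1)
    timestamp = datetime.strptime(stamp, "%m-%d-%Y %H-%M-%S")
    return ActivityRecord(timestamp, operation, status, message, descriptor)
```

Splitting the descriptor from the right leaves any colons inside the message intact.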

## Distribution Format

This section defines the format of the OCR distribution on BDRC's OCR servers. It is the final result of the
discussions in GitHub buda-base archive-ops-694 (no URL given, private repository).

The distribution format for a typical work, and one image group in that work, is
shown here:

```
❯ tree --filesfirst  Works
Works/
└── a9/
    └── W1PD12345/
        └── google_books/
            └── batch_2022/
                ├── info.json
                ├── info/
                │   ├── W1PD12345-I1PD12345/
                │   │      ├── gb-bdrc-map.json
                │   │      └── TBRC_W1PD12345-I1PD12345.xml
                │   └── W1PD12345-I1PD12..../               
                └── output/
                    ├── W1PD12345-I1PD12345/
                    │    ├── html.zip
                    │    ├── images.zip
                    │    └── txt.zip
                    └── W1PD12345-I1PD12..../


```

### Folder structure

### Work level folders

```
Works/{hash}/{wid}/{ocrmethod}/{batchid}/
```

where:

- `{hash}` is the well-known hash (the first two hexadecimal digits of the MD5 of the W id)
- `{wid}` is also well-known (ex: `W22084`)
- `{ocrmethod}` should be `vision/` for Google OCR (the Google Books delivery shown above uses `google_books`)
- `{batchid}` should be a unique batch id. It doesn't need to be in alphabetical order; it just needs to be unique per
  wid+ocrmethod (in the Google Books delivery, this is the literal 'batch_2022')

`{batchid}` contains one file and two folders:

- `info.json`
- `info`
- `output`

In the following discussion, `{wid}-{iid}` refers to the WorkRID-ImageGroupRID tuple as a string
(`W1PD12345-I1PD12345`, in this example)
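
The work-level path can be computed from its parts. The sketch below is illustrative only: `work_batch_path` is a hypothetical helper, and it assumes the MD5 is taken over the UTF-8 bytes of the W id and rendered in lowercase hexadecimal.

```python
import hashlib
from pathlib import PurePosixPath


def work_batch_path(wid: str, ocrmethod: str, batchid: str,
                    root: str = "Works") -> PurePosixPath:
    """Build Works/{hash}/{wid}/{ocrmethod}/{batchid}."""
    # Assumption: hash = first two hex digits of md5(W id as UTF-8 bytes)
    work_hash = hashlib.md5(wid.encode("utf-8")).hexdigest()[:2]
    return PurePosixPath(root) / work_hash / wid / ocrmethod / batchid
```

For example, `work_batch_path("W1PD12345", "google_books", "batch_2022")` gives a path of the same shape as the tree shown earlier.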

#### info.json

`{batchid}/info.json` contains:

```
{
  "timestamp" : "X",
  "html": "html.zip",
  "txt": "txt.zip",
  "images": "images.zip"
}
```

It is uploaded with every image group, so `timestamp` will always be the latest upload, even if not all the image
groups are present in OCR yet.
However, because our image group processing is independent, there's no flag to say when all the image groups in a run
are done (there's not even a notion of a run; buda-base/ao-google-books#23 requests that implementation).

The keys `html`, `txt`, and `images` are finding aids: they reference the filenames
under `output/{wid}-{iid}`. (Note this forces every image group under the Work to be in this structure.)
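
A consumer can use these finding aids to locate the archives for one image group. This is a minimal sketch that assumes the layout described in this section; `archive_paths` is a hypothetical helper and is not part of `bdrc-transfer`.

```python
import json
from pathlib import PurePosixPath
from typing import Dict


def archive_paths(info_json_text: str, batch_dir: str,
                  wid: str, iid: str) -> Dict[str, PurePosixPath]:
    """Map each finding-aid key in info.json to its archive under output/{wid}-{iid}."""
    info = json.loads(info_json_text)
    group_dir = PurePosixPath(batch_dir) / "output" / f"{wid}-{iid}"
    # The three keys named in this section; their values are file names, not paths.
    return {key: group_dir / info[key] for key in ("html", "txt", "images")}
```

The function only builds paths; whether an archive actually exists for a given image group still has to be checked on the server.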

#### info/

This is a directory of metadata. It contains, for each `{wid}-{iid}` that has been processed:

- `gb-bdrc-map.json`: mapping between the BUDA image list and the OCR derived image list. The BDRC Google Books process
  creates this artifact.
- `TBRC_{wid}-{iid}.xml`: The Google Books creation process delivers this file, which the BDRC Google Books process
  relocates here from its original position. This file contains PREMIS metadata for the image group.

#### output/

Output contains only folders, one for each `{wid}-{iid}/` in the work.
Each of these contains only three files, each of which is an archive of
Google Books generated content.

- `html.zip` - HOCR files (OCR content in HTML format)
- `images.zip` - Generated images from which Google Books derived the OCR
- `txt.zip` - Unicode text that Google Books generated

# Changelog

| Version | Changes                                                                                                                              |
|---------|--------------------------------------------------------------------------------------------------------------------------------------|
| 0.1.18  | [e0e4adc](https://github.com/buda-base/ao-google-books/commit/e0e4adc6d56379c21a599953547ced03df459cbe) Use better image detector    |
| 0.1.17  | [6f9a5b88](https://github.com/buda-base/ao-google-books/commit/6f9a5b88af70e14daa5e07c6e00fdda6b9584124) console logging of header   |
|         | [b110073](https://github.com/buda-base/ao-google-books/commit/b110073a66c747f6b758c81e28e82748e1cb1ef6) Staging for get custom query |
|         | [603d0ca](https://github.com/buda-base/ao-google-books/commit/603d0ca1d4a7538b2a17aee3093b39678696e149) Move ORM to bdrc-db-lib      |
| 0.1.16  | [d842f98](https://github.com/buda-base/ao-google-books/commit/d842f98149222a7673b69c87cc728e3d0bb9f542)                              |
|         | Segment request conversion requests                                                                                                  |
| 0.1.15  | Database object refactoring                                                                                                          |
| 0.1.8   | [5a6b000](https://github.com/buda-base/ao-google-books/commit/5a6b000c354522550c38c7514bb0c4a448c86617) Upload                       |
|         | standalone image groups                                                                                                              |

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "bdrc-transfer",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "",
    "keywords": "",
    "author": "jimk",
    "author_email": "jimk@tbrc.org",
    "download_url": "",
    "platform": null,
    "description": "# BDRC Transfer Library\n\n`bdrc-transfer` is a Python library and console script package that\nprovides SFTP and other services to implement the BDRC workflow to send\nBDRC works to a remote site for OCR production, and receive, unpack, and distribute\nthe resulting works.\n\n## Copyrighted Works\n\nWhile fair use doctrine allows us to transmit our copies of images of copyrighted works,\nthere may be an issue with Google Books making them available to their community.\n\nGoogle Books Library Partnership Staff Ben Bunnell described Google Books' copyrght validation process\nin an email to BDRC dated 13 Jan 2023:\n\n> Hi Jim,\n> We to use the metadata, but the main way is that everything that goes through the Google Books process includes a\n> copyright verification check as part of the analysis stage. The first few pages of the book are presented to operators\n> who verify publication dates and location of publication. This info goes through an automated flowchart that\n> determines\n> viewability in any given location.\n>\n> For cases where you think the copyright determination is incorrect, you or a general user can open the book on Google\n> Books, then go to the gear icon (or three-dot menu icon depending on whether you're looking at the new Google Books\n> interface) /Help/Report problems to request a second review.\n>\n> Best wishes,\n> Ben\n\n## Debian Installation\n\n1. On Debian systems, mysql library is needed:\n   ` sudo apt install default-libmysqlclient-dev`\n\n1. Install audit-tool version `Version 1.0Beta_2022_05_12` or later (`audit-tool --version` will show you the installed\n   version). Use the latest version from [Audit Tool Releases Page](https://github.com/buda-base/asset-manager/releases)\n\n2. `pip[3] install [--upgrade] [--no-cache-dir] bdrc-transfer`\n\n* some systems only have `pip3` not `pip`\n* `--upgrade and --no-cache-dir ` make sure that the latest release is installed. 
`no-cache-dir` is usually only\n  required when testing local disk wheels. `--upgrade` is for using the pyPi repository\n\nThen, once only, run:\n`gb-bdrc-init`\nThis copies a google books config from the install directory into the user's `.config/gb` folder, making\na backup copy if there is a copy before. The user is responsible for merging their site specific changes\n\n## Getting Started\n\n### Manual Workflow\n\nThis is a provisional workflow until all the steps can be automated. Development of automation\nfor \"When the material is ready for conversion\" and \"Browse to the GB Converted Page\" is underway.\nThe **Automated Workflow** section of his document will be updated as each release gets this support.\n\nThe Google Books manual workflow is:\n\n1. Identify works to send to Google Books\n2. Create and upload the metadata for that list  (`upload-metadata`)\n3. Create a list of paths to the works on that list, and upload that (`upload-content`) **Note that a specially\n   configured audit-tool validates the content before upload.**\n4. Wait for some time for Google to process the content. This can be a day to a week.\n5. When the material is ready for conversion,\n    1. [GB TBRC InProcess Page](https://books.google.com/libraries/TBRC/_in_process) - select and save the 'text only'\n       version\n    2. Select the checkbox for every work (remember there may be multiple pages)\n    3. click \"request conversion\" for them\n6. Wait for some time, and then use GRIN to get the list of files that GB has converted, and which are ready to\n   download,\n7. Browse to [GB TBRC Converted Page](https://books.google.com/libraries/TBRC/_converted). For each line you find:\n    1. In the browser, select the ....pgp.gz file  (they're links) in turn and download it.\n    2. On the command line:\n        1. run `unpack` on the downloaded archive\n        2. 
run `distribute_ocr` on the resulting work structure\n\n### Automated Workflow\n\n#### Preparation and configuration\n\n1. Install bdrc-transfer >= 0.0.4. v 0.0.4 implements the automated \"conversion request step\" (below)\n2. Choose a user to host a `crontab` entry. The user's environment must contain the environment variables listed in **\n   Runtime** below. The recommended way is to use the user's interactive **bash**  environment, as shown here. Be sure\n   that the file referenced in BASH_ENV passes control to some script which initializes all the variables. (Typically,\n   .bashrc, but probably some variant of it)\n\n```shell\n# m h  dom mon dow   command\n# * *   *   *   *     BASH_ENV=~/.profile request-conversion-service\n```\n\n3. Schedule the crontab entry shown above\n\n#### Workflow\n\nThe Google Books automated workflow is:\n\n1. Identify works to send to Google Books\n2. Create and upload the metadata for that list  (`upload-metadata`)\n3. Create a list of paths to the works on that list, and upload that (`upload-content`) **Note that a specially\n   configured audit-tool validates the content before upload.**\n4. The crontab entry `request-conversion-service` (see above) will poll the Google Books server and look for volumes\n   available for conversion, and will request them.\n5. The crontab entry `process-converted` (in `bdrc-transfer 0.0.5`) will:\n    1. Poll the Google Books server for volumes which are ready to download.\n    2. Download, unpack, and distribute the OCR'd volume and support.\n\n### Backlog processing\n\nThere are some utilities that can help in setting up the process\nFor example, we have manually downloaded and unpacked items before.\nTo trigger a re-distribution, we can signal again that they've been downloaded. The command line tool\n`mark-downloaded [-i [ paths | - ] path, .....` marks in the internal tracking system that those items have been\ndownloaded. 
The items\nmust have the file name format `{parent_path}/WorkRid-ImageGroupRid.tar.gz`\n\n## Runtime\n\n### Environment configuration\n\n`bdrc-transfer` requires these environment variables, unless overridden on the command line.\n(Overriding is not recommended in production)\n\n* `GRIN_CONFIG` - Path to the configuration file, which contains authorization and other essential data.\n  The name and contents of this file should be closely held in the development team. **Environment variables which\n  v<= 0.0.4 read are now in this file.**\n\n#### Logging\n\nOne requirement of this package is that there be a single, authoritative log of activities. In development,\nthere will be testing attempts. It should be easy to add the logs of these testing attempts to the single log.\nEach `gb_ocr` operation defines a tuple of _activity_ and _destination_.\n\nThe _activity_ values are:\n\n- upload\n- request_conversion\n- unpack\n- distribute\n\nand the _destination_ values are:\n\n- metadata\n- content\n\nThe resulting set of log files this package creates are:\n\n- upload_metadata-activity.log\n- upload_metadata-runtime.log\n- upload_content-activity.log\n- upload_content-runtime.log\n- request_conversion-activity.log\n- request_conversion-runtime.log\n- transfer-activity.log\n- transfer-runtime.log\n- unpack-activity.log\n- unpack-runtime.log\n\n##### Runtime log\n\nThis is a free-form console log for diagnostic and informational purposes.\n\n##### Activity log\n\nThis is the canonical log file for the activity. Each activity module in the `gb_ocr` Its structure is optimized for\nprogrammatic import, not human readability.\n\n** v 0.1.12 update ** The canonical log has been moved into a database. The database is accessed through the\n`AORunActivityLog.db_activity_logger` class.\n\n<s>\n##### Log file naming\nLog files are intended to be continuous, and are not concurrency safe. 
*Activity logs* are intended to be singular\nacross the whole BDRC network, so there *must* be only one activity instance writing at a time.\n(As of 7 Jun 2022, this is not enforced)\n</s>\n\n### Available commands\n\n                   unpack\n                   relocate-downloads\n                   gb-convert\n                   move-downloads\n                   upload-metadata\n                   distribute-ocr\n                   upload-content\n                    request-conversion (request-conversion-service)\n                    request-download (request-download-service)\n\n#### Common Options\n\nAll commands in this section share these common options:\n\n```shell\noptional arguments:\n  -h, --help            show this help message and exit\n  -l LOG_HOME, --log_home LOG_HOME\n                        Where logs are stored - see manual\n  -n, --dry_run         Connect only. Do not upload\n  -d {info,warning,error,debug,critical}, --debug_level {info,warning,error,debug,critical}\n                        choice values are from python logging module\n  -z, --log_after_fact  (ex post facto) log a successful activity after it was performed out of band\n  -i [FILE], --input_file [FILE]\n                        files to read. use - for stdin\n```\n\n---\n\n#### upload-metadata\n\n```shell\nusage: upload-metadata [-h] [-l LOG_HOME] [-n] [-d {info,warning,error,debug,critical}] [-z] [-i [FILE]] [work_rid]\n\nCreates and sends metadata to gb\n\npositional arguments:\n  work_rid              Work ID\n```\n\n---\n\n#### upload-content\n\n```shell\n\u276f upload-content --help\nusage: upload-content [-h] [-l LOG_HOME] [-n] [-d {info,warning,error,debug,critical}] [-z] [-i [FILE]] [-g] [work_path]\n\nuploads the images in a work to GB. Can upload all or some image groups (see --image_group option)\n\n... 
common arguments\n\n  -g, --image_group     True if paths are to image group\n```\n\n---\n\n#### unpack\n\n```shell\nusage: unpack [-h] [-l LOG_HOME] [-n] [-d {info,warning,error,debug,critical}] [-z] [-i [FILE]] [src]\n\nUnpacks an artifact\n\npositional arguments:\n  src                   xxx.tar.gz.gpg file to unpack\n```\n\nUnpacks a downloaded GB processed artifact (Note that the download is not FTP,\nso there is no API to download. In 0.0.1, this is a manual operation)\n\n**See the section Distribution format for the output documentation**\n\n---\n\n#### gb-convert\n\nThis is a stub function, which simulates requesting a conversion from the Google books\nweb UI. It simply logs the fact that the user has checked a whole list of items to convert.\nUsually the user will have to download the list from gb, extract the image group rids, and feed them\ninto this program.\n\n```shell\nusage: gb-convert [-h] [-l LOG_HOME] [-n] [-d {info,warning,error,debug,critical}] [-z] [-i [FILE]] [image_group]\n\nRequests conversion of an uploaded content image group\n\npositional arguments:\n  image_group           workRid-ImageGroupRid - no file suffixes\n```\n\n---\n\n#### ftp-transfer\n\nThis is a low level utility function, which should not generally be used in the workflow.\n\n```shell\nusage: ftp-transfer [-h] [-l LOG_HOME] [-n] [-d {info,warning,error,debug,critical}] [-z] [-i [FILE]] [-m | -c] [-p | -g]\n                    src [dest]\n\nUploads a file to a specific partner server, defined by a section in the config file\n\npositional arguments:\n  src                   source file for transfer\n  dest                  [Optional] destination file - defaults to basename of source\n\noptional arguments:\n                        files to read. 
use - for stdin\n  -m, --metadata        Act on metadata target\n  -c, --content         Act on the content target\n  -p, --put             send to\n  -g, --get             get from (NOT IMPLEMENTED)\n```\n\n### Launching\n\nDefine the environment variable  `GB_CONFIG` to point to the configuration file for the project. The configuration file\nis the access point to GB's sftp host, and is tightly controlled.\n\n### Activity Tracking and Logging\n\nActivity tracing is the responsibility of the `log_ocr` package.\nThe `log_ocr` has a public module `AORunLog.py` which contains the `AORunActivityLog` class. This class offers three\ninterfaces to its clients. These are separated into two groups: `logging` implementations, and database implementations\n\n#### Logging\n\nThese are Python `logging` instances, and offer the complete `logging` interface\n\n- `activity_logger`\n- `runtime_logger`\n\n#### Database implementation\n\nThe database implementation is a replacement for the activity logger, which is a simple canonical journal of GB OCR\nprocessing.\n\n- `activity_db_logger` This is an instance of class `log_ocr.GbOcrTrack.GbOcrTracker`. This exposes the following\n  methods:\n    * add_content_request - Records a content process step:\n        * upload\n        * request_conversion\n        * download image groups which GB has processed\n        * distribute\n- get_ready_to_convert: Gets a list of image groups which GB has received, but we have not requested conversion\n- get_converted: Gets a list of image groups which GB has converted, but we have not downloaded.\n\nThe property `log_ocr.AORunLog.activity_db_logger` is the replacement for the \"activity\" tracking log discussed below.\nIt does not use the python `logging` API, but its own specific methods, which are found in `log_ocr.\n\n### Logging\n\n#### Log store\n\nThe default directory for logging can be given in these directives:\n\n1. 
The current working directory is the default, in the absence of the next entries.\n2. Environment variable `RUN_ACTIVITY_LOG_HOME`.\n3. The `-l/--log_home` argument to `ftp-transfer`. Overrides the environment variable if given.\n\n#### Log files\n\n`ftp-transfer` logs two kinds of activity:\n\n- runtime logs, `transfer-runtime.log`, describing details of an operation. The content of this log is affected by\n  the `-d` flag.\n- activity logs, `transfer-activity.log`. They provide limited but auditable information on:\n    - the activity subject (metadata or content)\n    - the activity outcome (success or fail)\n      It is the caller's responsibility to aggregate activity logs into a coherent view of activity.\n\n#### Log format\n\n##### Runtime Format\n\nshort date time:level:message:content descriptor\n\nExample:\n\n```\n06-03 15:29:INFO:upload success /Users/jimk/dev/tmp/aog1/META/marc-W2PD17457.xml:metadata\n```\n\n##### Activity Format\n\nDate mm-DD-YYYY HH-MM-SS:operation:status:message:content descriptor\n\nExample:\n\n```\n06-06-2022 20-28-06:get:error:/Users/jimk/dev/tmp/aog1/META/marc-W2PD17457.xml:metadata:\n```\n\n## Distribution Format\n\nThis section defines the format of the OCR distribution on BDRC's OCR servers. 
It is the final result of the\ndiscussions in GitHub buda-base archive-ops-694 (no URL given, private repository).\n\nThe distribution format for a typical work, and one image group in that work, is\nshown here:\n\n```\n\u276f tree --filesfirst  Works\nWorks/\n\u2514\u2500\u2500 a9/\n    \u2514\u2500\u2500 W1PD12345/\n        \u2514\u2500\u2500 google_books/\n            \u2514\u2500\u2500 batch_2022/\n                \u251c\u2500\u2500 info.json\n                \u251c\u2500\u2500 info/\n                \u2502   \u251c\u2500\u2500 W1PD12345-I1PD12345/\n                \u2502   \u2502      \u251c\u2500\u2500 gb-bdrc-map.json\n                \u2502   \u2502      \u2514\u2500\u2500 TBRC_W1PD12345-I1PD12345.xml\n                \u2502   \u2514\u2500\u2500 W1PD12345-I1PD12..../               \n                \u2514\u2500\u2500 output/\n                    \u251c\u2500\u2500 W1PD12345-I1PD12345/\n                    \u2502    \u251c\u2500\u2500 html.zip\n                    \u2502    \u251c\u2500\u2500 images.zip\n                    \u2502    \u2514\u2500\u2500 txt.zip\n                    \u2514\u2500\u2500 W1PD12345-I1PD12..../\n\n\n```\n\n### Folder structure\n\n#### Work level folders\n\n```\nWorks/{hash}/{wid}/{ocrmethod}/{batchid}/\n```\n\nwhere:\n\n- `{hash}` is the well-known hash (the first two hex digits of the MD5 of the W id)\n- `{wid}` is also well-known (ex: `W22084`)\n- `{ocrmethod}` should be `vision/` for Google OCR\n- `{batchid}` should be a unique batch id; it doesn't need to be in alphabetical order, it just needs to be unique per\n  wid+ocrmethod (in the Google Books delivery, this is the literal 'batch_2022')\n\n`{batchid}` contains one file and two folders:\n\n- `info.json`\n- `info`\n- `output`\n\nIn the following discussion, `{wid}-{iid}` refers to the WorkRID-ImageGroupRID tuple as a string\n(`W1PD12345-I1PD12345`, in this example).\n\n#### info.json\n\n`{wid}/info.json` contains:\n\n```\n{\n  \"timestamp\" : \"X\",\n  \"html\": 
\"html.zip\",\n  \"txt\": \"txt.zip\",\n  \"images\": \"images.zip\"\n}\n```\n\nIt is uploaded with every image group, so `timestamp` will always reflect the latest upload, even if not all the image groups are\npresent in OCR yet.\nHowever, because our image group processing is independent, there's no flag to say when all the image groups in a run\nare done (there's not even a notion of a run; buda-base/ao-google-books#23 requests that implementation).\n\nThe keys `html`, `txt`, and `images` are finding aids: they reference the filenames\nunder `output/{wid}-{iid}`. (Note that this forces every image group under the Work to be in this structure.)\n\n#### info/\n\nThis is a directory of metadata. It contains, for each `{wid}-{iid}` that has been processed:\n\n- `gb-bdrc-map.json`: a mapping between the BUDA image list and the OCR-derived image list. The BDRC Google Books process\n  creates this artifact.\n- `TBRC_{wid}-{iid}.xml`: The Google Books creation process delivers this file, which the BDRC Google Books process\n  relocates from its original position to here. 
This file contains PREMIS metadata for the image group.\n\n#### output/\n\nOutput contains only folders for each `{wid}-{iid}/` in the work.\nEach of these contains only three files, each of which is an archive of\nGoogle Books-generated content.\n\n- `html.zip` - hOCR files (OCR content in HTML format)\n- `images.zip` - Generated images from which Google Books derived the OCR\n- `txt.zip` - Unicode text that Google Books generated\n\n# Changelog\n\n| Version | Changes |\n|---------|---------|\n| 0.1.18  | [e0e4adc](https://github.com/buda-base/ao-google-books/commit/e0e4adc6d56379c21a599953547ced03df459cbe) Use better image detector |\n| 0.1.17  | [6f9a5b88](https://github.com/buda-base/ao-google-books/commit/6f9a5b88af70e14daa5e07c6e00fdda6b9584124) Console logging of header |\n|         | [b110073](https://github.com/buda-base/ao-google-books/commit/b110073a66c747f6b758c81e28e82748e1cb1ef6) Staging for get custom query |\n|         | [603d0ca](https://github.com/buda-base/ao-google-books/commit/603d0ca1d4a7538b2a17aee3093b39678696e149) Move ORM to bdrc-db-lib |\n| 0.1.16  | [d842f98](https://github.com/buda-base/ao-google-books/commit/d842f98149222a7673b69c87cc728e3d0bb9f542) Segment request conversion requests |\n| 0.1.15  | Database object refactoring |\n| 0.1.8   | [5a6b000](https://github.com/buda-base/ao-google-books/commit/5a6b000c354522550c38c7514bb0c4a448c86617) Upload standalone image groups |\n\n\n\n\n",
    "bugtrack_url": null,
    "license": "",
    "summary": "Transfer library",
    "version": "0.1.18",
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "76bbfdbcd4bd05c21eedd4e21ea957cd8a8ed5cab8f0210ae54cea4df837fcac",
                "md5": "bd144ef3d98d0748df66db0cb6e60af9",
                "sha256": "1fbbb5ff091ffb8ab13c630efe7d25dc849bd9ef148b4b254a12d47dc51f7b4e"
            },
            "downloads": -1,
            "filename": "bdrc_transfer-0.1.18-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "bd144ef3d98d0748df66db0cb6e60af9",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 71797,
            "upload_time": "2023-03-29T18:23:58",
            "upload_time_iso_8601": "2023-03-29T18:23:58.260641Z",
            "url": "https://files.pythonhosted.org/packages/76/bb/fdbcd4bd05c21eedd4e21ea957cd8a8ed5cab8f0210ae54cea4df837fcac/bdrc_transfer-0.1.18-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-03-29 18:23:58",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "lcname": "bdrc-transfer"
}

jimk