bdrc-irat

Name	bdrc-irat
Version	0.9.0
home_page	https://github.com/buda-base/count-images/parallelscan
Summary	Image Repository analysis
upload_time	2023-08-25 14:17:14
maintainer
docs_url	None
author	jimk
requires_python	>=3.7
license
keywords
VCS
bugtrack_url
requirements	No requirements were recorded.

# Parallel scan
This folder contains utilities that survey the entire archive file system, S3, or both, depending on the task,
and collect and aggregate data. They are limited to the hard-coded roots of the archive (`/mnt/Archive[0..n]`) and of BUDA (`archive.tbrc.org/Works`).


It was written in Python to avoid nasty file globbing issues caused by reserved shell characters in file names,
and to exploit parallelism.

Thanks to Élie Roux for providing the released work and image group lists that served as the input.

The motivation was to have in one place a definition of graphics images, as ordinary file listing techniques left it up 
to the researcher to filter out the counts. In an ideal world, we would test and count by using a graphics library to
open and taste each file (the `scan-images` action `types` actually does this, to extract the image type from the
image file metadata), but this is hugely expensive in practice.

The rationale behind only including image files is that other files don't really matter, and don't generally hurt BUDA
performance. Although updates since 2019 are clean (audit-tool), there's no real use in cleaning out old files that aren't bothering
anybody. As well, correspondences between S3 `images/` folders and our file systems aren't perpetually guaranteed.

Image files are included in input calculations with a regular expression in `common.py`. This is a one-line change that
can update the scans as needed.

```python
# common.py
# re string
GRAPHICS_FILE_EXTS: str = r'.+\.(jpg|jpeg|tif|tiff|png|bmp|wmf|pdf)$'

# Example action_list.py
...
        img_re = re.compile(GRAPHICS_FILE_EXTS, re.IGNORECASE)
...
        s3_images_list.extend([x['Key'] for x in object_list if img_re.match(x['Key'])])
```
A regular expression was chosen over `fnmatch` for efficiency, and for being able to select case sensitivity (or not).
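To see the case-sensitivity switch in action, here is a minimal sketch that applies the same pattern to a few keys. The sample keys are made up for illustration; only `GRAPHICS_FILE_EXTS` comes from `common.py`.

```python
import re

# Same pattern as in common.py
GRAPHICS_FILE_EXTS: str = r'.+\.(jpg|jpeg|tif|tiff|png|bmp|wmf|pdf)$'

# Hypothetical S3 keys, for illustration only
sample_keys = [
    "Works/aa/W1/images/W1-I1/I10001.jpg",
    "Works/aa/W1/images/W1-I1/I10002.TIF",
    "Works/aa/W1/images/W1-I1/dimensions.json",
]

# Case-insensitive: what the scans use
img_re = re.compile(GRAPHICS_FILE_EXTS, re.IGNORECASE)
images = [k for k in sample_keys if img_re.match(k)]

# Case-sensitive: drop the flag, and the upper-case .TIF is no longer counted
strict_re = re.compile(GRAPHICS_FILE_EXTS)
strict_images = [k for k in sample_keys if strict_re.match(k)]

print(len(images), len(strict_images))
```

With `fnmatch`, the choice is made for you: `fnmatch.fnmatch` follows the operating system's case rules and `fnmatch.fnmatchcase` is always case sensitive.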

## Installation
Until I build a PyPI installation, you can run manually.
`git clone|pull` this repository, and run directly from the `parallelscan` folder.

Python 3 is required, and some elements may require Python 3.9 or later.
See [requirements.txt](requirements.txt) for the Python libraries you have to install. AO recommends using a venv.
We also strongly recommend installing `wheel` before the others (`pip 23` has gotten all rigid on us).

## Usage
The only executable in this folder is `scan-images`:

```zsh
❯ ./scan-images --h
usage: See parallelscan/README.doc for details

Runs image scanning tools against a set of works

optional arguments:
  -h, --help            show this help message and exit
  -a {list,types,sizes}, --action {list,types,sizes}
                        Available actions
  -w WORK_RIDS [WORK_RIDS ...], --work_rids WORK_RIDS [WORK_RIDS ...]
                        one or more work_rids
  -i INPUT_LIST_FILE, --input_list_file INPUT_LIST_FILE
                        file containing list of work_rids or paths (see -c)
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        [Optional] output file (default stdout)

```

Lying `argparse` says the `-a/--action` argument is optional; it is not.
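The reason for the lie: `argparse` groups every argument that starts with `-` under "optional arguments" (the heading became "options" in Python 3.10), whether or not it is required. The sketch below mirrors the help text above and marks the action with `required=True`. It is an illustration, not the actual `scan-images` parser; whether the real script enforces the action this way is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser modeled on the help output; not the real one."""
    parser = argparse.ArgumentParser(
        description="Runs image scanning tools against a set of works")
    # required=True makes argparse exit with an error when -a is missing,
    # but the flag is still listed in the "optional arguments" group.
    parser.add_argument("-a", "--action", choices=["list", "types", "sizes"],
                        required=True, help="Available actions")
    parser.add_argument("-w", "--work_rids", nargs="+",
                        help="one or more work_rids")
    parser.add_argument("-i", "--input_list_file",
                        help="file containing list of work_rids")
    parser.add_argument("-o", "--output_file",
                        help="[Optional] output file (default stdout)")
    return parser


args = build_parser().parse_args(["-a", "list", "-w", "W933"])
print(args.action, args.work_rids)
```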

## File arguments
* `-w/--work_rids` One or more Work RIDs, given on the command line.
* `-i/--input_list_file` File containing the list of entities to search. These are Work RIDs only.
* `-a/--action` One of the three possible actions, or modes. These are documented below.

## Actions
### list
This action counts image files by image group and emits four columns in a csv (see `published_work_file_counts.csv`).
Where the image group could not be found, blank columns are emitted. Where the image group was found, but contained no images,
the count is shown as 0.

## Outputs

The following sections describe the output of each of the actions:
- types
- list
- sizes

### types
This output was the original instance of the queue pattern. (see [buda-base/archive-ops#549](https://github.com/buda-base/archive-ops/issues/549))
The output is a list of individual files whose file extension does not match the image type 
as PIL sees it.

### list
Returns counts, by image group, of image files on the file system (archive) and on S3 (web).

```csv
work,ig,n_fs,n_s3
W00CHZ0103341,I1CZ35,210,210
W00EGS1016181,I1PD10388,183,183
W933,I5700,225,0
W933,5700,0,225
WEAP039-1-4-130,,,
WEAP039-1-4-140,,,
WEAP039-1-4-150,,,
WEAP039-1-4-160,,,
```
Takes about three and a half hours for the whole oeuvre (71611 image groups):

```shell
[EDT 08/23/23 18.41.41]:root:{p-00}-DEBUG- collected count W9140 sec: 29.049865
[EDT 08/23/23 18.41.41]:root:{p-06}-DEBUG- collected count W8LS31064 sec: 401.735611
[EDT 08/23/23 18.41.41]:root:{MainThread}-INFO-Done waiting
[EDT 08/23/23 18.41.41]:root:{MainThread}-INFO-< End ET: 12805.011623

~/prod/ao927  ----  took 3h 33m 26s   at 18:41:42 
```
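A quick way to read the `list` output without a notebook is to separate the rows with blank counts (image group not found) from the rows where the archive and S3 disagree. This is a minimal sketch using only the standard library; the function name `classify` is made up for illustration, and the sample rows are copied from the output above.

```python
import csv
import io

sample = """work,ig,n_fs,n_s3
W00CHZ0103341,I1CZ35,210,210
W00EGS1016181,I1PD10388,183,183
W933,I5700,225,0
W933,5700,0,225
WEAP039-1-4-130,,,
"""


def classify(rows):
    """Split rows into works with no image group found and count mismatches."""
    not_found, mismatched = [], []
    for row in rows:
        if row["n_fs"] == "" or row["n_s3"] == "":
            # Blank columns: the image group could not be found
            not_found.append(row["work"])
        elif int(row["n_fs"]) != int(row["n_s3"]):
            mismatched.append(
                (row["work"], row["ig"], int(row["n_fs"]), int(row["n_s3"])))
    return not_found, mismatched


not_found, mismatched = classify(csv.DictReader(io.StringIO(sample)))
print(not_found)
print(mismatched)
```

For a real run, pass an open file instead of the `io.StringIO` sample.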
### sizes
The `sizes` action aggregates, for each work:
- published image graphics files (children of the `images/` directory - assumes all image groups are published)
- other graphics files under the work

and emits a csv file with their total sizes and counts.

```csv
work,non-image-size,non-image-count,image-size,image-count
W00EGS1016242,36827083,6,31098214,234
W00EGS1016047,61075320,149,10262628,98
W00EGS1016181,15813459,191,3565566,183
W00CHZ0103343,105386611,270,6418626,264
W00EGS1016202,11987387,7,6267968,107
W00EGS1016199,109057027,151,2574614,144
W00EGS1016259,18303512,200,2172374,194
W00EGS1016255,26680295,6,20931234,462
```
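To get grand totals out of a `sizes` file, each numeric column can simply be summed. The sketch below assumes every column except `work` is an integer, as in the sample above; the function name `totals` is made up, and the sizes are summed as plain numbers because the unit is not stated here.

```python
import csv
import io
from collections import Counter

sample = """work,non-image-size,non-image-count,image-size,image-count
W00EGS1016242,36827083,6,31098214,234
W00EGS1016047,61075320,149,10262628,98
"""


def totals(lines):
    """Sum every numeric column of a sizes csv, keyed by column name."""
    sums = Counter()
    for row in csv.DictReader(lines):
        for column, value in row.items():
            if column != "work":
                sums[column] += int(value)
    return dict(sums)


grand = totals(io.StringIO(sample))
print(grand)
```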

## Data analysis
See `data/counts.ipynb` for turning these into meaningful data. (`.ipynb` is a Jupyter notebook.) Running `pip install jupyter && jupyter notebook` brings
up a web browser; click `counts.ipynb` to open the notebook. More can be gained by understanding the pandas DataFrame API,
but that's for later.


## Adding tests
These tests exploit parallelism heavily by using producer and consumer queues. The idea is that calculating is
very expensive, but reporting is not. So the runs were written as a poor man's Hadoop, calling the producer action
'map' and the consumer 'reduce'.
For most actions, the 'reduce' step only writes output. You don't want to do that in the producing thread
because the output file gets quite large, and each open has to seek to the end.
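The shape of that pattern, as a minimal sketch: several 'map' threads do the expensive work and put results on a queue, and a single 'reduce' thread is the only one that reports. The names `run_map_reduce`, `map_fn` and `reduce_fn` are made up for illustration and are not the functions in this repository; a new test would supply its own pair in whatever form the real code expects.

```python
import queue
import threading

_DONE = object()  # sentinel: tells the reducer that every mapper has finished


def run_map_reduce(work_rids, map_fn, reduce_fn, n_workers=4):
    """Run map_fn over work_rids in parallel; feed each result to reduce_fn."""
    in_q = queue.Queue()
    out_q = queue.Queue()
    for rid in work_rids:
        in_q.put(rid)

    def mapper():
        # Producer: the expensive calculation happens here
        while True:
            try:
                rid = in_q.get_nowait()
            except queue.Empty:
                return
            out_q.put(map_fn(rid))

    def reducer():
        # Consumer: the only thread that touches the output
        while True:
            item = out_q.get()
            if item is _DONE:
                return
            reduce_fn(item)

    mappers = [threading.Thread(target=mapper, name=f"p-{i:02}")
               for i in range(n_workers)]
    consumer = threading.Thread(target=reducer, name="reducer")
    consumer.start()
    for t in mappers:
        t.start()
    for t in mappers:
        t.join()
    out_q.put(_DONE)  # all producers are done, let the consumer drain and stop
    consumer.join()


results = []
run_map_reduce(["W1", "W22", "W333"], lambda rid: (rid, len(rid)), results.append)
print(sorted(results))
```

Because only the reducer writes, the output file can be opened once and kept open for the whole run.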
