# iRODS Automated Ingest Framework

The automated ingest framework gives iRODS an enterprise solution that solves two major use cases: getting existing data under management and ingesting incoming data hitting a landing zone.

Based on the Python iRODS Client and Celery, this framework can scale up to match the demands of data coming off instruments, satellites, or parallel filesystems.

The example diagrams below show a filesystem scanner and a landing zone.

![Automated Ingest: Filesystem Scanner Diagram](capability_automated_ingest_filesystem_scanner.jpg)

![Automated Ingest: Landing Zone Diagram](capability_automated_ingest_landing_zone.jpg)

## Supported/Tested Python versions
- 3.7
- 3.8
- 3.9
- 3.10

## Usage options

### Redis options
| option | effect | default |
| ----   |  ----- |  ----- |
| redis_host | Domain or IP address of Redis host | localhost |
| redis_port | Port number for Redis | 6379 |
| redis_db | Redis DB number to use for ingest | 0 |
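
For example, to point an ingest job at a Redis instance on another host (the hostname here is illustrative):

```
python -m irods_capability_automated_ingest.irods_sync start /data /tempZone/home/rods/data \
    --redis_host redis.example.org --redis_port 6379 --redis_db 0
```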

### S3 options
Scanning an S3 bucket minimally requires `--s3_keypair` and a source path of the form `/bucket_name/path/to/root/folder`.

| option | effect | default |
| ----   |  ----- |  ----- |
| s3_keypair | path to S3 keypair file | None |
| s3_endpoint_domain | S3 endpoint domain | s3.amazonaws.com |
| s3_region_name | S3 region name | us-east-1 |
| s3_proxy_url | URL to proxy for S3 access | None |
| s3_insecure_connection | Do not use SSL when connecting to S3 endpoint | False |
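
For example, a hypothetical scan of a bucket named `my-bucket` (the keypair path and prefix are illustrative):

```
python -m irods_capability_automated_ingest.irods_sync start /my-bucket/path/to/root/folder /tempZone/home/rods/data \
    --s3_keypair /secrets/s3_keypair --s3_region_name us-east-1
```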

### Logging/Profiling options
| option | effect | default |
| ----   |  ----- |  ----- |
| log_filename | Path to output file for logs | None |
| log_level | Minimum level of message to log | None |
| log_interval | Time interval with which to rollover ingest log file | None |
| log_when | Type/units of log_interval (see TimedRotatingFileHandler) | None |

`--profile` allows you to use vis to visualize a profile of the Celery workers over the lifetime of an ingest job.

| option | effect | default |
| ----   |  ----- |  ----- |
| profile_filename | Specify name of profile filename (JSON output) | None |
| profile_level | Minimum level of message to log for profiling | None |
| profile_interval | Time interval with which to rollover ingest profile file | None |
| profile_when | Type/units of profile_interval (see TimedRotatingFileHandler) | None |
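
For example, a sketch that profiles a job and rolls the output file over daily (the filename is illustrative; see TimedRotatingFileHandler for valid `--profile_when` values):

```
python -m irods_capability_automated_ingest.irods_sync start /data /tempZone/home/rods/data \
    --profile --profile_filename profile.json --profile_interval 1 --profile_when d
```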

### Ingest start options
These options are used at the "start" of an ingest job.

| option | effect | default |
| ----   |  ----- |  ----- |
| job_name | Reference name for ingest job | a generated uuid |
| interval | Restart interval (in seconds). If absent, will only sync once. | None |
| file_queue | Name for the file queue. | file |
| path_queue | Name for the path queue. | path |
| restart_queue | Name for the restart queue. | restart |
| event_handler | Path to event handler file | None (see "event_handler methods" below) |
| synchronous | Block until sync job is completed | False |
| progress | Show progress bar and task counts (must have --synchronous flag) | False |
| ignore_cache | Ignore last sync time in cache - like starting a new sync | False |
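
For example, a named job that re-syncs every hour (the job name is illustrative):

```
python -m irods_capability_automated_ingest.irods_sync start /data /tempZone/home/rods/data \
    --job_name hourly_scan --interval 3600
```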

### Optimization options
| option | effect | default |
| ----   |  ----- |  ----- |
| exclude_file_type | types of files to exclude: regular, directory, character, block, socket, pipe, link | None |
| exclude_file_name | a list of space-separated python regular expressions defining the file names to exclude such as "(\S+)exclude" "(\S+)\.hidden" | None |
| exclude_directory_name | a list of space-separated python regular expressions defining the directory names to exclude such as "(\S+)exclude" "(\S+)\.hidden" | None |
| files_per_task | Number of paths to process in a given task on the queue | 50 |
| initial_ingest | Use this flag on initial ingest to avoid check for data object paths already in iRODS | False |
| irods_idle_disconnect_seconds | Seconds to hold open iRODS connection while idle | 60 |
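
For example, a sketch that skips symlinks, excludes names ending in `.hidden`, and packs more paths into each task (the pattern and count are illustrative):

```
python -m irods_capability_automated_ingest.irods_sync start /data /tempZone/home/rods/data \
    --exclude_file_type link --exclude_file_name "(\S+)\.hidden" \
    --files_per_task 200
```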

## Available `--event_handler` methods

| method | effect | default |
| ---- | ----- | ----- |
| pre_data_obj_create | user-defined python | none |
| post_data_obj_create | user-defined python | none |
| pre_data_obj_modify | user-defined python | none |
| post_data_obj_modify | user-defined python | none |
| pre_coll_create | user-defined python | none |
| post_coll_create | user-defined python | none |
| pre_coll_modify | user-defined python | none |
| post_coll_modify | user-defined python | none |
| character_map | user-defined python | none |
| as_user | takes action as this iRODS user | authenticated user |
| target_path | sets the physical path on the iRODS server, which can differ from the client mount path | client mount path |
| to_resource | defines target resource request of operation | as provided by client environment |
| operation | defines the mode of operation | `Operation.REGISTER_SYNC` |
| max_retries | defines max number of retries on failure | 0 |
| timeout | defines seconds until job times out | 3600 |
| delay | defines seconds between retries | 0 |

Event handlers can use `logger` to write logs. See `structlog` for available logging methods and signatures.
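
As a minimal sketch, assuming the same `(session, meta, **options)` signature used by the `target_path` example later in this README (the resource name is illustrative):

```python
from irods_capability_automated_ingest.core import Core
from irods_capability_automated_ingest.utils import Operation

class event_handler(Core):

    @staticmethod
    def operation(session, meta, **options):
        # copy new files into the vault and re-copy files that have changed
        return Operation.PUT_SYNC

    @staticmethod
    def to_resource(session, meta, **options):
        # hypothetical resource name; adjust for your zone
        return 'ingestResc'
```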

### Character Mapping option
If an application requires that iRODS logical paths produced by the ingest process exclude certain subsets of the
range of possible Unicode characters, add a `character_map` method that returns a dict object. For example:

```
    import re
    from irods_capability_automated_ingest.core import Core

    class event_handler(Core):
        @staticmethod
        def character_map():
            return {
                re.compile('[^a-zA-Z0-9]'): '_'
            }
        # ...
```
The returned dictionary, in this case, indicates that the ingest process should replace all non-alphanumeric (as
well as non-ASCII) characters with an underscore wherever they occur in an otherwise normally generated logical path.
The substitution process also applies to the intermediate (i.e. collection name) elements of a logical path, and a suffix is
appended to affected path elements to avoid potential collisions with other remapped object names.

Each key of the returned dictionary indicates a character or set of characters needing substitution.
Possible key types include:

   1. character
   ```
       # substitute backslashes with underscores
       '\\': '_'
   ```
   2. tuple of characters
   ```
       # any character of the tuple is replaced by a Unicode small script x
       ('\\','#','-'): '\u2093'
   ```
   3. regular expression
   ```
       # any character with a code point of 256 (0x100) or above becomes an underscore
       re.compile('[\u0100-\U0010ffff]'): '_'
   ```
   4. callable accepting a character argument and returning a boolean
   ```
       # ASCII codes above 'z' become ':'
       (lambda c: ord(c) in range(ord('z')+1,128)): ':'
   ```

In the event that the order-of-substitution is significant, the method may instead return a list of key-value tuples.
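
A minimal sketch of the ordered form (the mappings themselves are illustrative):

```
    import re
    from irods_capability_automated_ingest.core import Core

    class event_handler(Core):
        @staticmethod
        def character_map():
            # applied in order: '#' first becomes a hyphen, then any
            # remaining non-alphanumeric character becomes an underscore
            return [
                ('#', '-'),
                (re.compile('[^a-zA-Z0-9]'), '_'),
            ]
```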

### UnicodeEncodeError
Any file whose filesystem path causes a UnicodeEncodeError to be raised during ingest (e.g. by the inclusion of an
unencodable UTF-8 sequence) will be automatically renamed using a base64 sequence representing the original,
unmodified vault path.

Additionally, data objects that have had their names remapped, whether via the `character_map` method or via a
UnicodeEncodeError, will be annotated with an AVU of the form

   Attribute:   "irods::automated_ingest::" + ANNOTATION_REASON
   Value:       A PREFIX plus the base64-converted "bad filepath"
   Units:       "python3.base64.b64encode(full_path_of_source_file)"

where:
   - ANNOTATION_REASON is either "UnicodeEncodeError" or "character_map", depending on why the remapping occurred.
   - PREFIX is either "irods_UnicodeEncodeError_" or blank (""), again depending on the cause of the remapping.

Note that the UnicodeEncodeError type of remapping is unconditional, whereas the character remapping is contingent on
an event handler's character_map method being defined.  Also, if a UnicodeEncodeError-style ingest is performed on a
given object, this precludes character mapping being done for the object.
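
To find objects annotated in this way, one could query on the attribute prefix (a sketch using `iquest`):

```
iquest "select COLL_NAME, DATA_NAME, META_DATA_ATTR_NAME where META_DATA_ATTR_NAME like 'irods::automated_ingest::%'"
```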

### operation mode

| operation |  new files  | updated files  |
| ----   |   ----- | ----- |
| `Operation.REGISTER_SYNC` (default)   |  registers in catalog | updates size in catalog |
| `Operation.REGISTER_AS_REPLICA_SYNC`  |   registers first or additional replica | updates size in catalog |
| `Operation.PUT`  |   copies file to target vault, and registers in catalog | no action |
| `Operation.PUT_SYNC`  |   copies file to target vault, and registers in catalog | copies entire file again, and updates catalog |
| `Operation.PUT_APPEND`  |   copies file to target vault, and registers in catalog | copies only appended part of file, and updates catalog |
| `Operation.NO_OP` | no action | no action |

`--event_handler` usage examples can be found [in the examples directory](irods_capability_automated_ingest/examples).

## Deployment

### Basic: manual redis, Celery, pip

Running the sync job and Celery workers requires a valid iRODS environment file for an authenticated iRODS user on each node.

#### Starting Redis Server
Install redis-server (EL/CentOS):
```
sudo yum install redis-server
```
or (Debian/Ubuntu):
```
sudo apt-get install redis-server
```
Or, build it yourself: https://redis.io/topics/quickstart

Start redis:
```
redis-server
```
Or, daemonized:
```
sudo service redis-server start
```
```
sudo systemctl start redis
```

The [Redis documentation](https://redis.io/topics/admin) also recommends an additional step:
> Make sure to set the Linux kernel overcommit memory setting to 1. Add vm.overcommit_memory = 1 to /etc/sysctl.conf and then reboot or run the command sysctl vm.overcommit_memory=1 for this to take effect immediately.

This allows the Linux kernel to overcommit virtual memory even if this exceeds the physical memory on the host machine. See [kernel.org documentation](https://www.kernel.org/doc/Documentation/vm/overcommit-accounting) for more information.
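
In shell terms, that is:

```
echo "vm.overcommit_memory = 1" | sudo tee -a /etc/sysctl.conf
sudo sysctl vm.overcommit_memory=1
```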

**Note:** If running in a distributed environment, make sure Redis server accepts connections by editing the `bind` line in /etc/redis/redis.conf or /etc/redis.conf.
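
For example, to listen on all interfaces (only do this on a trusted network, since Redis has no authentication by default):

```
# /etc/redis/redis.conf
bind 0.0.0.0
```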

#### Setting up virtual environment
You may need to upgrade pip:
```
pip install --upgrade pip
```

Install virtualenv:
```
pip install virtualenv
```

Create a virtualenv with python3:
```
virtualenv -p python3 rodssync
```

Activate virtual environment:
```
source rodssync/bin/activate
```

#### Install this package
```
pip install irods_capability_automated_ingest
```

Set up environment for Celery:
```
export CELERY_BROKER_URL=redis://<redis host>:<redis port>/<redis db> # e.g. redis://127.0.0.1:6379/0
export PYTHONPATH=`pwd`
```

Start celery worker(s):
```
celery -A irods_capability_automated_ingest.sync_task worker -l error -Q restart,path,file -c <num workers> 
```
**Note:** Make sure queue names match those of the ingest job (default queue names shown here).
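
For example, if the workers consume custom queue names (names here are illustrative), the job must be started with the same names:

```
celery -A irods_capability_automated_ingest.sync_task worker -l error -Q my_restart,my_path,my_file -c 4
python -m irods_capability_automated_ingest.irods_sync start /data /tempZone/home/rods/data \
    --restart_queue my_restart --path_queue my_path --file_queue my_file
```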

#### Run tests
**Note:** The test suite requires Python version >=3.7.

**Note:** The tests should be run without running Celery workers or with an unused redis database.
```
python -m irods_capability_automated_ingest.test.test_irods_sync
```
See [docker/test/README.md](docker/test/README.md) for how to run in a dockerized environment.

#### Start sync job
```
python -m irods_capability_automated_ingest.irods_sync start <source dir> <destination collection>
```
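
A fuller example, combining some of the options described above (the handler module and job name are illustrative; the event handler file must be importable, e.g. via PYTHONPATH):

```
python -m irods_capability_automated_ingest.irods_sync start /data /tempZone/home/rods/data \
    --event_handler event_handler --job_name my_job --synchronous --progress
```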

#### List jobs
```
python -m irods_capability_automated_ingest.irods_sync list
```

#### Stop jobs
```
python -m irods_capability_automated_ingest.irods_sync stop <job name>
```

#### Watch jobs (same as using `--progress`)
```
python -m irods_capability_automated_ingest.irods_sync watch <job name>
```

### Intermediate: dockerized, manual configuration (needs to be updated for Celery)
See [docker/README.md](docker/README.md)

### Advanced: kubernetes (needs to be updated for Celery)

This does not assume that your iRODS installation is in kubernetes.

#### `kubeadm`

Set up `GlusterFS` and `Heketi`.

Create a storage class.

Create a persistent volume claim named `data`.

#### install minikube and helm

Set memory to at least 8 GB and CPUs to at least 4:

```
minikube start --memory 8192 --cpus 4
```

#### enable ingress on minikube

```
minikube addons enable ingress
```

#### mount host dirs

This is where your data and event handler live. In this setup, we assume that your event handler is under `/tmp/host/event_handler` and your data is under `/tmp/host/data`. We will mount `/tmp/host/data` into `/host/data` in minikube, which in turn mounts `/host/data` into `/data` in the containers:

`/tmp/host/data` -> minikube `/host/data` -> container `/data`.

and similarly,

`/tmp/host/event_handler` -> minikube `/host/event_handler` -> container `/event_handler`. Your setup may differ.

```
mkdir -p /tmp/host/event_handler
mkdir -p /tmp/host/data
```

`/tmp/host/event_handler/event_handler.py`
```python
from irods_capability_automated_ingest.core import Core
from irods_capability_automated_ingest.utils import Operation

class event_handler(Core):

    @staticmethod
    def target_path(session, meta, **options):
        # assumes meta['path'] holds the scanned physical path as seen by the server
        return meta['path']

```

```
minikube mount /tmp/host:/host --gid 998 --uid 998 --9p-version=9p2000.L
```

#### enable incubator

```
helm repo add incubator https://kubernetes-charts-incubator.storage.googleapis.com/
```

#### build local docker images (optional)
If you want to use local docker images, you can build them against minikube's docker daemon as follows:

`fish`
```
eval (minikube docker-env)
```

`bash`
```
eval $(minikube docker-env)
```

```
cd <repo>/docker/irods-postgresql
docker build . -t irods-provider-postgresql:4.2.2
```

```
cd <repo>/docker/irods-cockroachdb
docker build . -t irods-provider-cockroachdb:4.3.0
```

```
cd <repo>/docker
docker build . -t irods_capability_automated_ingest:0.1.0
```

```
cd <repo>/docker/rq
docker build . -t irods_rq:0.1.0
```

```
cd <repo>/docker/rq-scheduler
docker build . -t irods_rq-scheduler:0.1.0
```

#### install irods

##### `postgresql`
```
cd <repo>/kubernetes/irods-provider-postgresql
helm dependency update
```

```
cd <repo>/kubernetes
helm install ./irods-provider-postgresql --name irods
```

##### `cockroachdb`
```
cd <repo>/kubernetes/irods-provider-cockroachdb
helm dependency update
```

```
cd <repo>/kubernetes
helm install ./irods-provider-cockroachdb --name irods
```

When reinstalling, run:

```
kubectl delete --all pv
kubectl delete --all pvc 
```

#### update irods configurations

Set configurations in `<repo>/kubernetes/chart/values.yaml` or with the `--set` command line argument.

#### install chart

The chart lives in `<repo>/kubernetes/chart`. We call our release `icai`.
```
cd <repo>/kubernetes
helm install ./chart --name icai
```

#### scale rq workers
```
kubectl scale deployment.apps/icai-rq-deployment --replicas=<n>
```

#### access by REST API (recommended)

##### submit job
`submit.yaml`
```yaml
root: /data
target: /tempZone/home/rods/data
interval: <interval>
append_json: <yaml>
timeout: <timeout>
all: <all>
event_handler: <event_handler>
event_handler_data: |
    from irods_capability_automated_ingest.core import Core
    from irods_capability_automated_ingest.utils import Operation

    class event_handler(Core):

        @staticmethod
        def target_path(session, meta, **options):
            # assumes meta['path'] holds the scanned physical path
            return meta['path']

```

`fish`
```
curl -XPUT "http://"(minikube ip)"/job/<job name>" -H "Content-Type: application/x-yaml" --data-binary "@submit.yaml"
```

`bash`
```
curl -XPUT "http://$(minikube ip)/job/<job name>" -H "Content-Type: application/x-yaml" --data-binary "@submit.yaml"
```

`fish`
```
curl -XPUT "http://"(minikube ip)"/job" -H "Content-Type: application/x-yaml" --data-binary "@submit.yaml"
```

`bash`
```
curl -XPUT "http://$(minikube ip)/job" -H "Content-Type: application/x-yaml" --data-binary "@submit.yaml"
```

##### list job
`fish`
```
curl -XGET "http://"(minikube ip)"/job"
```

`bash`
```
curl -XGET "http://$(minikube ip)/job"
```

##### delete job
`fish`
```
curl -XDELETE "http://"(minikube ip)"/job/<job name>"
```

`bash`
```
curl -XDELETE "http://$(minikube ip)/job/<job name>"
```

#### access by command line (not recommended)

##### submit job
```
kubectl run --rm -i icai --image=irods_capability_automated_ingest:0.1.0 --restart=Never -- start /data /tempZone/home/rods/data -i <interval> --event_handler=event_handler --job_name=<job name> --redis_host icai-redis-master
```

##### list job
```
kubectl run --rm -i icai --image=irods_capability_automated_ingest:0.1.0 --restart=Never -- list --redis_host icai-redis-master
```

##### delete job
```
kubectl run --rm -i icai --image=irods_capability_automated_ingest:0.1.0 --restart=Never -- stop <job name> --redis_host icai-redis-master
```

#### install logging tool

Reinstall the chart with `log_level` set to `INFO`. First, delete the existing release:
```
helm del --purge icai
```

Then reinstall:
```
cd <repo>/kubernetes
helm install ./chart --set log_level=INFO --name icai
```

Set kernel parameters required by Elasticsearch:

```
minikube ssh 'echo "sysctl -w vm.max_map_count=262144" | sudo tee -a /var/lib/boot2docker/bootlocal.sh'
minikube stop
minikube start
```

```
cd <repo>/kubernetes
helm install ./elk --name icai-elk
```


##### Grafana

Look up the service port:
```
kubectl get svc icai-elk-grafana
```

Forward the port:
```
kubectl port-forward svc/icai-elk-grafana 8000:80
```

If the chart was installed with `--set grafana.adminPassword=""`, a random password is generated; look it up with:
```
kubectl get secret --namespace default icai-elk-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
```

Open `localhost:8000` in a browser, log in with username `admin` and password `admin`, and click on `icai dashboard`.


##### Kibana

Uncomment the kibana sections in the yaml files under the `<repo>/kubernetes/elk` directory.

Look up the service port:
```
kubectl get svc icai-elk-kibana
```

Forward the port:
```
kubectl port-forward svc/icai-elk-kibana 8000:443
```

Open `localhost:8000` in a browser.




            
