invenio-record-importer-kcworks

- Name: invenio-record-importer-kcworks
- Version: 0.2.20a7
- Summary: Bulk import records into an InvenioRDM instance configured for Knowledge Commons Works
- Upload time: 2024-09-23 20:57:16
- Requires Python: >=3.9
- License: MIT License, Copyright (c) 2024 MESH Research
- Keywords: invenio, invenio-rdm, knowledge-commons-works, bulk-import, record-importer
# INVENIO RECORD IMPORTER for Knowledge Commons Works

Version 0.2.20-alpha7

*ALPHA QUALITY SOFTWARE - USE AT OWN RISK*

This is a command line utility to perform a bulk import of records into Knowledge Commons Works, an InvenioRDM instance. The utility could be adapted to work with other InvenioRDM installations, but in its current form it assumes the customized metadata schema, database structures, and other configuration employed in Knowledge Commons Works.

This utility is designed to serve two functions.
(1) The first is to act as a bulk importer for any batch of records coming from an external service or source.
(2) The second is a specific legacy function: converting legacy Humanities Commons CORE deposits to the metadata schema used in KC Works and preparing a JSON file for use by the importer.

This module adds a CLI command to InvenioRDM's internal command-line interface via the `console_commands` entry point. The command must be run within a Knowledge Commons Works instance of InvenioRDM. From the command line,
within the instance directory, run

```shell
pipenv run invenio importer
```

## Commands

The importer has three primary commands, described in detail below: `serialize` prepares metadata records for import into InvenioRDM, `read` displays the metadata for records that are being imported, and `load` loads the serialized metadata records into InvenioRDM.

Two additional commands are provided for creating synthetic usage statistics to preserve stats coming from a legacy repository. These commands are `stats` and `aggregations`. They are described in the [Usage Statistics Preservation](#usage-statistics-preservation) section below.

## Installation

This module should already be installed in a standard Knowledge Commons Works instance of InvenioRDM.
To install from PyPI, run

```shell
pip install invenio-record-importer-kcworks
```

To install for development of this module, clone the repository and run

```shell
pipenv install -e .
```
from within the repository.

## Setup

Prior to running the importer, the required configuration variables listed below must be set, either in the `invenio.cfg` file or as environment variables. A JSON Lines file named `records-for-import.json`, containing the serialized metadata records, must also be placed in the folder identified by the RECORD_IMPORTER_DATA_DIR environment variable. All files for the records to be imported should be placed in the folder identified by the RECORD_IMPORTER_FILES_LOCATION environment variable.
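
For example, assuming RECORD_IMPORTER_DATA_DIR points at a hypothetical `/opt/invenio/var/import` directory, with the upload files kept in a `files` subfolder designated by RECORD_IMPORTER_FILES_LOCATION, the layout would look something like this:

```shell
/opt/invenio/var/import/
├── records-for-import.json   # serialized metadata records, one JSON object per line
└── files/                    # upload files for the records being imported
```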

## Dependencies

In addition to the normal InvenioRDM packages and KCWorks, this module relies in particular
on the following packages, which are also installed in the standard KCWorks environment:

- invenio-remote-user-data-kcworks
- invenio-group-collections-kcworks

Fetching of user data and creation of group collections rely on these packages being properly configured.

## Configuration

The importer relies on several environment variables. These can be set in the `invenio.cfg` file of the InvenioRDM instance, or in a `.env` file in the base directory of the InvenioRDM instance. If they are set in the `.env` file they must be prefixed with `INVENIO_`.

| Variable name                   | Required | Description                                                                                                                                                        |
| ------------------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| RECORD_IMPORTER_ADMIN_EMAIL                     | Y | The email address of the admin user in the InvenioRDM instance. This defaults to the global ADMIN_EMAIL Flask environment variable.                                                                                                     |
| RECORD_IMPORTER_DATA_DIR       | Y | The folder where the file with serialized metadata records can be found, named `records-for-import.json`                                                           |
| RECORD_IMPORTER_FILES_LOCATION | N | The folder where the files for upload with the new deposits may be found. It defaults to a subfolder of the RECORD_IMPORTER_DATA_DIR directory.                                                                                         |
| RECORD_IMPORTER_LOGS_LOCATION   | N | The full path to the local directory where the record import log files will be written. It defaults to the `logs` folder of the `invenio-record-importer-kcworks` module.                                                                                          |
| RECORD_IMPORTER_OVERRIDES_FOLDER | N | The full path to the local directory where the overrides files can be found. It defaults to the `overrides` subfolder of the folder at RECORD_IMPORTER_DATA_DIR.                                                                                       |
| RECORD_IMPORTER_CREATED_LOG_PATH | N | The full path to the local file where the created records log will be written. It defaults to the `record_importer_created_records.jsonl` file in the RECORD_IMPORTER_LOGS_LOCATION folder.                                                                                       |
| RECORD_IMPORTER_FAILED_LOG_PATH | N | The full path to the local file where the failed records log will be written. It defaults to the `record_importer_failed_records.jsonl` file in the RECORD_IMPORTER_LOGS_LOCATION folder.                                                                                       |
| RECORD_IMPORTER_SERIALIZED_PATH | N | The full path to the local file where the serialized records will be written. It defaults to the `record_importer_serialized_records.jsonl` file in the RECORD_IMPORTER_DATA_DIR folder.                                                                                       |
| RECORD_IMPORTER_SERIALIZED_FAILED_PATH | N | The full path to the local file where the serialized failed records will be written. It defaults to the `record_importer_failed_serialized.jsonl` file in the RECORD_IMPORTER_LOGS_LOCATION folder.                                                                                       |

The required folders must of course be created before the importer is run. The importer will not create these folders if they do not exist. The various log files and serialized records files will be created by the importer if they do not already exist.

Remember that in containerized environments, the directories holding data and logs must be persistent. In a Docker environment, these directories should be mounted as volumes in the container.
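
For example, a minimal `.env` file might contain the following (all paths here are hypothetical placeholders):

```shell
INVENIO_RECORD_IMPORTER_DATA_DIR=/opt/invenio/var/import
INVENIO_RECORD_IMPORTER_FILES_LOCATION=/opt/invenio/var/import/files
INVENIO_RECORD_IMPORTER_LOGS_LOCATION=/opt/invenio/var/import/logs
```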

## Serializer usage

The `serialize` command is run within the knowledge_commons_repository ui container like this:

```shell
invenio importer serialize
```

This command serializes metadata records for import into InvenioRDM. It reads records from a source file, processes them, and writes the serialized records to a JSON Lines file. The command also performs metadata repair and logs any problematic records.

### Command-line flags

| Flag                           | Short flag | Description                                                                                                                     |
| ------------------------------ | ---------- | ------------------------------------------------------------------------------------------------------------------------------- |
| --start-index INTEGER          | -s         | The index of the first record to serialize (1-based). Defaults to 1.                                                             |
| --end-index INTEGER            | -e         | The index of the last record to serialize (inclusive). If not provided, will serialize to the end of the input file.            |
| --verbose / --no-verbose       | -v / -q    | Enable or disable verbose output. Defaults to False.                                                                             |
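
For example, to serialize only the first 100 records in the source file with verbose output:

```shell
invenio importer serialize --start-index 1 --end-index 100 --verbose
```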

### Metadata repair

The serializer will attempt to repair any metadata fields that are missing or have incorrect values. If a record has a missing or incorrect metadata field, the serializer will attempt to fill in the missing field with a value from a related field.

### Logging

Details about the program's progress are sent to Invenio's logging system as it runs. After each serializer run, a list of records with problematic metadata is written to the file at RECORD_IMPORTER_SERIALIZED_FAILED_PATH. Each line of this file is a json object listing metadata fields that the program has flagged as problematic for each record. The file is overwritten each time the serializer is run.

Note that in most cases a record with problematic metadata will still be serialized and included in the output file. The problems flagged by the serializer are usually limited to a single metadata field. Many are not true errors but rather pieces of metadata that require human review.

## Reader usage

The record reader is a convenience utility for retrieving the metadata for records that are being imported into InvenioRDM. It is not necessary to run the reader before running the loader, but it can be useful for debugging purposes. By default records are identified by their index in the jsonl file of records for import, but they can also be identified by their source id. The reader will output the metadata for the specified records to the console. By default it will print the serialized metadata as it will be fed into the loader, but it can also print the raw metadata as it exists prior to serialization.

The `read` command is run within the knowledge_commons_repository instance directory like this:

```shell
pipenv run invenio importer read {RECORDS} {FLAGS}
```

### Command-line arguments

An optional list of positional arguments specifying which records to read. Defaults to [].

### Command-line flags

| Flag                           | Short flag | Description                                                                                                                     |
| ------------------------------ | ---------- | ------------------------------------------------------------------------------------------------------------------------------- |
| --raw_input                    | -r         | If set, print the raw metadata for the specified records as it exists prior to serialization. Defaults to False.                 |
| --use-sourceids                | -s         | If set, the positional arguments are interpreted as ids in the source system instead of positional indices. By default these ids are interpreted as DOI identifiers. If a different id scheme is desired, it may be set using the `--scheme` flag. Defaults to False. |
| --scheme TEXT                  | -m         | The identifier scheme to use for the records when the --use-sourceids flag is set. Defaults to "doi".                            |
| --field-path TEXT              | -f         | The dot-separated path to a specific metadata field to be printed. If not specified, the entire record will be printed.          |
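
For example, to print just the title field of the record on line 5 of the import file, one might run:

```shell
pipenv run invenio importer read 5 --field-path metadata.title
```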

## Loader usage

The record loader is run within the knowledge_commons_repository instance directory like this:

```shell
pipenv run invenio importer load {RECORDS} {FLAGS}
```

This command loads serialized metadata records into InvenioRDM. It reads records from a JSON Lines file, processes them, and creates or updates records in InvenioRDM accordingly.

### Command-line arguments

An optional list of positional arguments specifying which records to load. Defaults to [].

If no positional arguments are provided, all records will be loaded.

If positional arguments are provided, they should be either integers specifying the line numbers of the records to load, or source ids specifying the ids of the records to load in the source system. These will be interpreted as line numbers in the jsonl file of records for import (beginning at 1) unless the --use-sourceids flag is set.

If a range is specified in the RECORDS by linking two integers with a hyphen, the program will load all records between the two indices, inclusive. If the range ends in a hyphen with no second integer, the program will load all records from the start index to the end of the input file.

### Command-line flags

| Flag                           | Short flag | Description                                                                                                                     |
| ------------------------------ | ---------- | ------------------------------------------------------------------------------------------------------------------------------- |
| --no-updates                   | -n         | If set, do not update existing records where a record with the same DOI already exists. Defaults to False.                       |
| --retry-failed                 | -r         | If set, try to load in all previously failed records that have not already been repaired successfully. Defaults to False.        |
| --use-sourceids                | -s         | If set, the positional arguments are interpreted as ids in the source system instead of positional indices. Defaults to False.   |
| --scheme TEXT                  | -m         | The identifier scheme to use for the records when the --use-sourceids flag is True. Defaults to "hclegacy-pid".                  |
| --aggregate                    | -a         | If set, run Invenio's usage statistics aggregation after importing the records. Defaults to False.                               |
| --start_date TEXT              |            | The start date for the usage statistics aggregation. Must be in the format "YYYY-MM-DD". Defaults to None.                       |
| --end_date TEXT                |            | The end date for the usage statistics aggregation. Must be in the format "YYYY-MM-DD". Defaults to None.                         |
| --clean_filenames              | -c         | If set, clean the filenames of the files to be uploaded. Defaults to False.                                                      |
| --verbose / --no-verbose       | -v / -q    | Enable or disable verbose output. Defaults to False.                                                                             |
| --stop_on_error / --no-stop_on_error | -e / -E | If set, stop the loading process if an error is encountered. Defaults to False.                                                |

### Examples:

To load records 1, 2, 3, and 5, run:

```shell
pipenv run invenio importer load 1 2 3 5
```

A range can be specified in the RECORDS by linking two integers with a hyphen. For example, to load only the first 100 records, run:

```shell
pipenv run invenio importer load 1-100
```

If the range ends in a hyphen with no second integer, the program will load all records from the start index to the end of the input file. For example, to load all records from 100 to the end of the file, run:

```shell
pipenv run invenio importer load 100-
```

Records may be loaded by id in the source system instead of by index. For example, to load records with ids hc:4723, hc:8271, and hc:2246, run:

```shell
pipenv run invenio importer load --use-sourceids hc:4723 hc:8271 hc:2246
```
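
The documented flags can also be combined. For example, to retry all previously failed records while skipping any whose DOIs already exist in Invenio, run:

```shell
pipenv run invenio importer load --retry-failed --no-updates
```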

### Source file locations

The `load` command must be run from the base knowledge_commons_repository directory. It will look for the exported records in the directory specified by the RECORD_IMPORTER_DATA_DIR environment variable. It will look for the files to be uploaded in the directory specified by the RECORD_IMPORTER_FILES_LOCATION environment variable.

### Overriding metadata during loading

The loader can be configured to override metadata fields during loading. This is done by creating a jsonl file containing the metadata fields to be overridden, and placing it in the directory specified by the RECORD_IMPORTER_OVERRIDES_FOLDER environment variable. The file should be named `record-importer-overrides_mySourceService.json`, where `mySourceService` is the name of the source from which the import data is coming. If this source is a service configured with KCWorks as a SAML authentication provider, the name of the source should be the same as the name of the service in the SAML configuration.

The file should contain one json object per line, with no newline characters within each object. Each object must have the key "source_id" which contains the identifier of the record in the source system. The object may also contain the keys "overrides", "skip", and "notes". The "overrides" value must be a json object containing the metadata fields to be overridden. The "skip" value must be a boolean indicating whether the record should be skipped during loading. The "notes" value is a string that is ignored by invenio-record-importer-kcworks but allows explanatory notes to be kept together with the override data.

The format of the overrides file should be as follows:

```json
{"source_id": "hc:12345", "overrides": {"metadata|title": "My overridden title", "custom_fields|hclegacy:file_location": "my/overridden/filename.pdf"}, "notes": "optional notes"}
{"source_id": "hc:678910", "skip": true, "notes": "optional notes"}
```

Note that the metadata field names should be path strings with steps separated by a pipe character. So if you want to update the "title" subfield of the top-level "metadata", the path string should be "metadata|title". If one level in the metadata hierarchy is a list/array, a number may be included in the string indicating which index in the list should be updated. So to update the "date" subfield of the list of "dates" in the "metadata" field, we would use the path string "metadata|dates|1|date".

Whatever value is provided for each path string will *entirely replace* the value at that point in the metadata hierarchy. So if I update "metadata|dates" I will need to provide the *entire list* of dates, even if there are items that should remain the same. If I only want to update one item I must specify it with a more specific path string like "metadata|dates|0".
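
For example, to correct one date in a two-item `metadata|dates` list, the override must repeat the unchanged item as well (the values here are hypothetical):

```json
{"source_id": "hc:12345", "overrides": {"metadata|dates": [{"date": "2019-01-01", "type": {"id": "issued"}}, {"date": "2020-06-15", "type": {"id": "updated"}}]}}
```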

Note that in some cases field names are namespaced by the schema they belong to. For example, the "file_location" field is in the "hclegacy" schema, so the path string for this field would be "custom_fields|hclegacy:file_location".

To uncover the metadata structure of the Invenio record being overridden, use the `read` command to print the metadata of the record to the terminal.
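
To illustrate the path semantics (this is a minimal sketch for explanation, not the importer's actual implementation), applying such an override to a record dictionary could be done like this:

```python
def apply_override(record: dict, path: str, value) -> dict:
    """Replace the value at a pipe-separated path, e.g. "metadata|dates|1|date".

    Numeric steps index into lists; the value at the final step is
    replaced entirely, matching the documented override semantics.
    """
    steps = path.split("|")
    target = record
    for step in steps[:-1]:
        # Descend into a list when the step is a number, a dict otherwise.
        target = target[int(step)] if step.isdigit() else target[step]
    last = steps[-1]
    if last.isdigit():
        target[int(last)] = value
    else:
        target[last] = value
    return record


# Hypothetical usage:
record = {"metadata": {"title": "Old", "dates": [{"date": "2019"}, {"date": "2020"}]}}
apply_override(record, "metadata|title", "My overridden title")
apply_override(record, "metadata|dates|1|date", "2021")
```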

### Skipping records during loading

To skip a record from the serialized metadata during loading, add a line to the overrides file with the "skip" key set to true. The record will be skipped during loading, but will still be recorded in the created records log. If a record is skipped, it will not be included in the failed records log and will be removed from that log if it has previously failed.

### Authentication

Since the program interacts directly with InvenioRDM (and not via the REST API) it does not require separate authentication.

### Collections and record owners

Where necessary this program will create top-level domain communities, assign the records to the correct domain communities, create new Invenio users corresponding to the users who uploaded the original deposits, and transfer ownership of the Invenio record to the correct users. If the source of the records is associated with a SAML authentication IDP, these new users will be set to authenticate using their account with that IDP.

If the record was part of any group collections in the source system, the program will assign the record to the equivalent KCWorks group collections, creating new collections if necessary.

### Recovering existing records

If a record with the same DOI already exists in Invenio, the program will try to update the existing record with any new metadata and/or files, creating a new draft of published records if necessary. Unpublished existing drafts will be submitted to the appropriate community and published. Alternatively, if the --no-updates flag is set, the program will skip any records whose DOIs match records that already exist in Invenio.

### Logging

Details about the program's progress are sent to Invenio's logging system as it runs. In addition, a running list of all records that have been created (i.e., for which a load attempt has been made) is recorded in the file `record_importer_created_records.jsonl` in the RECORD_IMPORTER_LOGS_LOCATION directory. A record of all records that have failed to load is kept in the file `record_importer_failed_records.jsonl` in the same directory. If failed records are later successfully repaired, they will be removed from the failed records file.

## Usage Statistics Preservation

Two additional commands are provided for creating synthetic usage statistics to preserve stats coming from a legacy repository. These commands are `stats` and `aggregations`.

### Stats Command

The `stats` command is used to generate synthetic usage statistics for imported records. It's run within the knowledge_commons_repository instance directory like this:

```shell
invenio importer stats {FLAGS}
```

This command creates synthetic usage events for records imported from a legacy system. It reads the created records log and generates view and download events based on the legacy statistics.

This operation is idempotent, so it can be run multiple times without causing errors or creating duplicate events. Each time it will simply collect the events currently in the system for the given date range.

#### Command-line flags

| Flag                           | Short flag | Description                                                                                                                     |
| ------------------------------ | ---------- | ------------------------------------------------------------------------------------------------------------------------------- |
| --start-date TEXT              | -s         | The start date for generating events. Must be in the format "YYYY-MM-DD". Defaults to "2013-01-01".                              |
| --end-date TEXT                | -e         | The end date for generating events. Must be in the format "YYYY-MM-DD". Defaults to the current date.                            |
| --verbose        | -v         | Enable verbose output. Defaults to False.                                                                                        |
| --record-ids     | -r         | A comma-separated list of record ids to create usage statistics for. If not specified, all records from the specified source will be used. |
| --record-source  | -S         | The source of the records to create usage statistics for. Defaults to 'knowledgeCommons'. |
| --from-db        | -d         | If True, the usage statistics will be created from the database. Defaults to False. |
| --downloads-field | -D         | The field in each record to use for the number of downloads. Defaults to 'hclegacy:total_downloads'. |
| --views-field    | -V         | The field in each record to use for the number of record views. Defaults to 'hclegacy:total_views'. |
| --date-field     | -t         | The field in each record to use for the record creation date. Defaults to 'metadata.publication_date'. |

##### Explanation of flags:

- `--start-date`: This flag allows you to specify the earliest date for which synthetic events should be generated.

- `--end-date`: This flag sets the latest date for which synthetic events should be generated.

- `--verbose`: When verbose mode is enabled, the command will provide detailed output about its progress, including information about each record being processed.

- `--record-ids`: This flag allows you to specify a list of record ids to create usage statistics for. If not specified, all records from the specified source will be used.

- `--record-source`: This flag sets the source of the records to create usage statistics for. If not specified, this will default to 'knowledgeCommons'. If records are being read from a file, the source file should be a JSONL file with this string as its name (without the .jsonl extension).

- `--from-db`: This flag determines whether the usage statistics will be created from the database or from a file. If True, the usage statistics will be created from the database. If False, the usage statistics will be created from the events in the file specified by the RECORD_IMPORTER_USAGE_STATS_PATH config variable.

- `--downloads-field`: This flag sets the field in each record to use for the number of downloads. If the --from-db flag is True, the field should be found in the database record metadata. If the --from-db flag is False, the field should be found in the record data object read from the import file. Defaults to 'hclegacy:total_downloads'.

- `--views-field`: This flag sets the field in each record to use for the number of record views. If the --from-db flag is True, the field should be found in the database record metadata. If the --from-db flag is False, the field should be found in the record data object read from the import file. Defaults to 'hclegacy:total_views'.

- `--date-field`: This flag sets the field in each record to use for the record creation date. If the --from-db flag is True, the field should be found in the database record metadata. If the --from-db flag is False, the field should be found in the record data object read from the import file. Defaults to 'metadata.publication_date'.
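
For example, to generate synthetic events for two specific legacy records, reading the legacy counts from the default fields:

```shell
invenio importer stats --record-ids hc:4723,hc:8271 --start-date "2013-01-01"
```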


### Aggregations Command

The `aggregations` command is used to aggregate the synthetic usage statistics generated by the `stats` command. It's run within the knowledge_commons_repository instance directory like this:

```shell
invenio importer aggregations {FLAGS}
```

This command aggregates usage statistics for records in the InvenioRDM instance. It processes the usage events and creates aggregated statistics.

This operation is idempotent, so it can be run multiple times without causing errors or creating duplicate aggregations. Each time it will simply collect the events currently in the system for the given date range.

This is a very time-consuming operation, but the work is run as background tasks so as not to interfere with other operations. It can, however, overwhelm the search index with too many requests, so the operation is divided into one-year chunks.

The `start-date` and `end-date` parameters allow a range of dates to be specified for which to create aggregations. A start date is required: if it is not provided in the CLI call, it will be taken from the RECORD_IMPORTER_START_DATE config variable, and if that is not set either, an error will be raised. An end date is optional; if no end date is specified, the current date is used.

**Note:** Currently this operation is intensive on memory and search index resources and can fail as a result. If you are aggregating a large number of stats events, this can be mitigated by running the operation for smaller time ranges sequentially.


#### Command-line flags

| Flag                           | Short flag | Description                                                                                                                     |
| ------------------------------ | ---------- | ------------------------------------------------------------------------------------------------------------------------------- |
| --start-date TEXT              | -s         | The start date for aggregating events. Must be in the format "YYYY-MM-DD". If not provided on the command line, the start date is taken from the RECORD_IMPORTER_START_DATE config variable. |
| --end-date TEXT                | -e         | The end date for aggregating events. Must be in the format "YYYY-MM-DD". If not provided, will aggregate up to the current date. |
| --verbose / --no-verbose       | -v / -q    | Enable or disable verbose output. Defaults to False.                                                                             |

##### Explanation of flags:

- `--start-date`: This flag determines the earliest date from which to start aggregating events. Any events before this date will not be included in the aggregation. If it is not specified on the command line, the start date is taken from the RECORD_IMPORTER_START_DATE config variable; if that is not set either, an error is raised.

- `--end-date`: This flag sets the latest date up to which events should be aggregated. Events after this date will not be included. If not provided, the command will aggregate events up to the current date, including the most recent data available.

- `--verbose / --no-verbose`: When enabled, this flag causes the command to provide detailed information about the aggregation process. This can include progress updates, information about each record being processed, and any issues encountered. It's particularly useful for monitoring long-running aggregations or troubleshooting.

### Usage Statistics Workflow

To preserve usage statistics from a legacy system:

1. First, run the `stats` command to generate synthetic usage events based on the legacy statistics:

   ```shell
   invenio importer stats --start-date "2013-01-01" --end-date "2023-12-31"
   ```

2. Then, run the `aggregations` command to aggregate these synthetic events:

   ```shell
   invenio importer aggregations --start-date "2013-01-01" --end-date "2023-12-31"
   ```

This process will create and aggregate synthetic usage statistics that reflect the usage patterns from the legacy system. By adjusting the date ranges with the `--start-date` and `--end-date` flags, you can control exactly which period of legacy data is represented in your new system.

## Utility Commands

### Count Objects Command

The `count` command is used to count the number of objects in the JSON file prepared for import. It's run within the knowledge_commons_repository instance directory like this:

```shell
invenio importer count
```

This command reads the serialized records file and counts the number of objects (records) it contains. The file location is specified by the `RECORD_IMPORTER_SERIALIZED_PATH` config variable.

#### Usage

This command doesn't take any arguments or flags. Simply run it to get a count of the objects in the import file.

#### Output

The command will print the total number of objects found in the specified file. For example:

```
Total objects in /path/to/your/file.json: 1000
```

If the file is not found at the specified location, an error message will be displayed.

### Delete Records Command

The `delete` command is used to delete one or more records from InvenioRDM by their record ID. It's run within the knowledge_commons_repository instance directory like this:

```shell
invenio importer delete {RECORD_IDS}
```

This command allows you to remove specific records from the InvenioRDM database.

#### Command-line arguments

A list of record IDs to be deleted. You can specify multiple IDs separated by spaces.

#### Usage Examples

To delete a single record:

```shell
invenio importer delete record_id_1
```

To delete multiple records:

```shell
invenio importer delete record_id_1 record_id_2 record_id_3
```

#### Output

The command will print the results of the deletion operation, showing which records were successfully deleted and any errors that occurred during the process.

### Caution

The delete operation is irreversible. Make sure you have backups of your data before deleting records, especially in a production environment.


## Developing this Module

### Testing

This project includes a test suite to ensure the functionality of the importer. To run the tests, use the `run-tests.sh` script provided in the root directory of the project.

#### Running Tests

To run the tests, execute the following command from the project root:

```bash
./run-tests.sh
```

This script performs the following actions:

1. Checks if the required Python version (3.12) is installed.
2. Sets up a virtual environment using pipenv if it doesn't exist.
3. Installs the project dependencies.
4. Runs the pytest suite with coverage reporting.

#### Test Coverage

The `run-tests.sh` script includes coverage reporting. After running the tests, you'll see a coverage report in the terminal output. This report shows which parts of the code were executed during the tests and helps identify areas that may need additional testing.

#### Continuous Integration

The test suite is also integrated into our KCWorks CI/CD pipeline. Any push to the repository or pull request will trigger the test suite to run, ensuring that new changes don't break existing functionality.

#### Writing New Tests

When adding new features or fixing bugs, it's important to add corresponding tests. Place new test files in the `tests/` directory, following the existing naming conventions (e.g., `test_<module_name>.py`).

Remember to run the test suite locally before pushing changes to ensure all tests pass.

### Versioning

This project uses [bumpver](https://github.com/mbarkhau/bumpver) to manage versioning. The version is stored in the `pyproject.toml` file and is updated automatically when a new release is made.

For example, to update the version numbers for a new 'patch' release, from 0.2.20 to 0.2.21, run the following command:

```bash
pipenv run bumpver update --patch
```

To increment the pre-release tag number, from 'alpha6' to 'alpha7', run the following command:

```bash
pipenv run bumpver update --tag-num
```

This will update the version throughout this project's files, create a new commit, and tag the commit with the new version.

## Copyright

Copyright 2023-24 MESH Research. Released under the MIT license.

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "invenio-record-importer-kcworks",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "invenio, invenio-rdm, knowledge-commons-works, bulk-import, record-importer",
    "author": null,
    "author_email": "Mesh Research <scottia4@msu.edu>",
    "download_url": "https://files.pythonhosted.org/packages/5f/12/f92b229d0c42a07a23451ce45caf56a6acda93a70e4f5638a13d91f5be56/invenio_record_importer_kcworks-0.2.20a7.tar.gz",
    "platform": null,
    "description": "# INVENIO RECORD IMPORTER for Knowledge Commons Works\n\nVersion 0.2.20-alpha7\n\n*ALPHA QUALITY SOFTWARE - USE AT OWN RISK*\n\nThis is a command line utility to perform a bulk import of records into Knowledge Commons Works, an InvenioRDM instance. The utility could be adapted to work with other InvenioRDM installations, but in its current form it assumes the customized metadata schema, database structures, and other configuration employed in Knowledge Commons Works.\n\nThis utility is designed to play two functions.\n(1) The first is to serve as a bulk importer for any batch of records coming from an external service or source.\n(2) The second is a specific legacy function--to convert legacy Humanities Commons CORE deposits to the metadata schema used in KC Works and prepare a json file for use by the importer.\n\nThis module adds a cli command to InvenioRDM's internal via the `console_commands` entry point. The command must be run within a Knowledge Commons Works instance of InvenioRDM. From the command line,\nwithin the instance directory, run\n\n```shell\npipenv run invenio importer\n```\n\n## Commands\n\nThe importer has three primary commands: `serialize`, `read`, and `load`. These are described in detail below. The `serialize` command is used to serialize metadata records for import into InvenioRDM. The `read` command is used to read the metadata for records that are being imported. The `load` command is used to load the serialized metadata records into InvenioRDM.\n\nTwo additional commands are provided for creating synthetic usage statistics to preserve stats coming from a legacy repository. These commands are `stats` and `aggregations`. They are described in the [Usage Statistics Preservation](#usage-statistics-preservation) section below.\n\n## Installation\n\nThis module should already be installed in a standard Knowledge Commons Works instance of InvenioRDM.\nTo install from PyPI, run\n\n```shell\npip install invenio-record-importer-kcworks\n```\n\nTo install for development of this module, clone the repository and run\n\n```shell\npipenv install -e .\n```\nfrom within the repository.\n\n## Setup\n\nPrior to running the importer, the required configuration variables listed below must be set either in the `invenio.cfg` file or as environment variables. A jsonlines file containing the serialized metadata records named `records-for-import.json` must also be placed in the folder identified by the RECORD_IMPORTER_DATA_DIR environment variable. All files for the records to be imported should be placed in the folder identified by the RECORD_IMPORTER_FILES_LOCATION environment variable.\n\n## Dependencies\n\nIn addition to the normal InvenioRDM packages and KCWorks, this module relies in particular\non the following packages, which are also installed in the standard KCWorks environment:\n\n- invenio-remote-user-data-kcworks\n- invenio-group-collections-kcworks\n\nFetching of user data and creation of group collections relies on these packages being properly configured.\n\n## Configuration\n\nThe importer relies on several environment variables. These can be set in the `invenio.cfg` file of the InvenioRDM instance, or in a `.env` file in the base directory of the InvenioRDM instance. 
If they are set in the `.env` file they must be prefixed with `INVENIO_`.\n\n| Variable name                   | Required | Description                                                                                                                                                        |\n| ------------------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |\n| RECORD_IMPORTER_ADMIN_EMAIL                     | Y | The email address of the admin user in the InvenioRDM instance. This defaults to the global ADMIN_EMAIL Flask environment variable.                                                                                                     |\n| RECORD_IMPORTER_DATA_DIR       | Y | The folder where the file with serialized metadata records can be found, named `records-for-import.json`                                                           |\n| RECORD_IMPORTER_FILES_LOCATION | N | The folder where the files for upload withe the new deposits may be found. It defaults to a subfolder of the RECORD_IMPORTER_DATA_DIR directory.                                                                                         |\n| RECORD_IMPORTER_LOGS_LOCATION   | N | The full path to the local directory where the record import log files will be written. It defaults to the `logs` folder of the `invenio-record-importer-kcworks` modules.                                                                                          |\n| RECORD_IMPORTER_OVERRIDES_FOLDER | N | The full path to the local directory where the overrides files can be found. It defaults to the `overrides` subfolder of the folder at RECORD_IMPORTER_DATA_DIR.                                                                                       |\n| RECORD_IMPORTER_CREATED_LOG_PATH | N | The full path to the local file where the created records log will be written. It defaults to the `record_importer_created_records.jsonl` file in the RECORD_IMPORTER_LOGS_LOCATION folder.                                                                                       |\n| RECORD_IMPORTER_FAILED_LOG_PATH | N | The full path to the local file where the failed records log will be written. It defaults to the `record_importer_failed_records.jsonl` file in the RECORD_IMPORTER_LOGS_LOCATION folder.                                                                                       |\n| RECORD_IMPORTER_SERIALIZED_PATH | N | The full path to the local file where the serialized records will be written. It defaults to the `record_importer_serialized_records.jsonl` file in the RECORD_IMPORTER_DATA_DIR folder.                                                                                       |\n| RECORD_IMPORTER_SERIALIZED_FAILED_PATH | N | The full path to the local file where the serialized failed records will be written. It defaults to the `record_importer_failed_serialized.jsonl` file in the RECORD_IMPORTER_LOGS_LOCATION folder.                                                                                       |\n\nThe required folders must of course be created before the importer is run. The importer will not create these folders if they do not exist. The various log files and serialized records files will be created by the importer if they do not already exist.\n\nRemember that in containerized environments, the directories holding data and logs must be persistent. 
In a Docker environment, these directories should be mounted as volumes in the container.\n\n## Serializer usage\n\nThe `serialize` command is run within the knowledge_commons_repository ui container like this:\n\n```shell\ninvenio importer serialize\n```\n\nThis command serializes metadata records for import into InvenioRDM. It reads records from a source file, processes them, and writes the serialized records to a JSON Lines file. The command also performs metadata repair and logs any problematic records.\n\n### Command-line flags\n\n| Flag                           | Short flag | Description                                                                                                                     |\n| ------------------------------ | ---------- | ------------------------------------------------------------------------------------------------------------------------------- |\n| --start-index INTEGER          | -s         | The index of the first record to serialize (1-based). Defaults to 1.                                                             |\n| --end-index INTEGER            | -e         | The index of the last record to serialize (inclusive). If not provided, will serialize to the end of the input file.            |\n| --verbose / --no-verbose       | -v / -q    | Enable or disable verbose output. Defaults to False.                                                                             |\n\n### Metadata repair\n\nThe serializer will attempt to repair any metadata fields that are missing or have incorrect values. If a record has a missing or incorrect metadata field, the serializer will attempt to fill in the missing field with a value from a related field.\n\n### Logging\n\nDetails about the program's progress are sent to Invenio's logging system as it runs. After each serializer run, a list of records with problematic metadata is written to the file at RECORD_IMPORTER_SERIALIZED_FAILED_PATH. Each line of this file is a json object listing metadata fields that the program has flagged as problematic for each record. The file is overwritten each time the serializer is run.\n\nNote that in most cases a record with problematic metadata will still be serialized and included in the output file. The problems flagged by the serializer are usually limited to a single metadata field. Many are not true errors but rather pieces of metadata that require human review.\n\n## Reader usage\n\nThe record reader is a convenience utility for retrieving the metadata for records that are being imported into InvenioRDM. It is not necessary to run the reader before running the loader, but it can be useful for debugging purposes. By default records are identified by their index in the jsonl file of records for import, but they can also be identified by their source id. The reader will output the metadata for the specified records to the console. By default it will print the serialized matadata as it will be fed into the loader, but it can also print the raw metadata as it exists prior to serialization.\n\nThe `read` command is run within the knowledge_commons_repository instance directory like this:\n\n```shell\npipenv run invenio importer read {RECORDS} {FLAGS}\n```\n\n### Command-line arguments\n\nA list of the provided positional arguments specifying which records to read. 
Defaults to [].\n\n### Command-line flags\n\n| Flag                           | Short flag | Description                                                                                                                     |\n| ------------------------------ | ---------- | ------------------------------------------------------------------------------------------------------------------------------- |\n| --raw_input (bool, optional)           | -r | If True, print the raw metadata for the specified records as it exists prior to serialization. Defaults to False.                                                   |\n| --use-sourceids (bool, optional) | -s | If True, the positional arguments are interpreted as ids in the source system instead of positional indices. By default these ids are interpreted as DOI identifiers. If a different id scheme is desired, this may be set using the `--scheme` flag. Defaults to False. |\n| --scheme (str, optional) | -m | The identifier scheme to use for the records when the --use-sourceids flag is True. Defaults to \"doi\". |\n| --field-path (str, optional) | -f | The dot-separated path to a specific metadata field to be printed. If not specified, the entire record will be printed. |\n\n## Loader usage\n\nThe record loader is run within the knowledge_commons_repository instance directory like this:\n\n```shell\npipenv run invenio importer load {RECORDS} {FLAGS}\n```\n\nThis command loads serialized metadata records into InvenioRDM. It reads records from a JSON Lines file, processes them, and creates or updates records in InvenioRDM accordingly.\n\n### Command-line arguments\n\nA list of the provided positional arguments specifying which records to load. Defaults to [].\n\nIf no positional arguments are provided, all records will be loaded.\n\nIf positional arguments are provided, they should be either integers specifying the line numbers of the records to load, or source ids specifying the ids of the records to load in the source system. These will be interpreted as line numbers in the jsonl file of records for import (beginning at 1) unless the --use-sourceids flag is set.\n\nIf a range is specified in the RECORDS by linking two integers with a hyphen, the program will load all records between the two indices, inclusive. If the range ends in a hyphen with no second integer, the program will load all records from the start index to the end of the input file.\n\n### Command-line flags\n\n| Flag                           | Short flag | Description                                                                                                                     |\n| ------------------------------ | ---------- | ------------------------------------------------------------------------------------------------------------------------------- |\n| --no-updates                   | -n         | If set, do not update existing records where a record with the same DOI already exists. Defaults to False.                       |\n| --retry-failed                 | -r         | If set, try to load in all previously failed records that have not already been repaired successfully. Defaults to False.        |\n| --use-sourceids                | -s         | If set, the positional arguments are interpreted as ids in the source system instead of positional indices. Defaults to False.   |\n| --scheme TEXT                  | -m         | The identifier scheme to use for the records when the --use-sourceids flag is True. Defaults to \"hclegacy-pid\".                  
|\n| --aggregate                    | -a         | If set, run Invenio's usage statistics aggregation after importing the records. Defaults to False.                               |\n| --start_date TEXT              |            | The start date for the usage statistics aggregation. Must be in the format \"YYYY-MM-DD\". Defaults to None.                       |\n| --end_date TEXT                |            | The end date for the usage statistics aggregation. Must be in the format \"YYYY-MM-DD\". Defaults to None.                         |\n| --clean_filenames              | -c         | If set, clean the filenames of the files to be uploaded. Defaults to False.                                                      |\n| --verbose / --no-verbose       | -v / -q    | Enable or disable verbose output. Defaults to False.                                                                             |\n| --stop_on_error / --no-stop_on_error | -e / -E | If set, stop the loading process if an error is encountered. Defaults to False.                                                |\n\n### Examples:\n\nTo load records 1, 2, 3, and 5, run:\n\n```shell\npipenv run invenio importer load 1 2 3 5\n```\n\nA range can be specified in the RECORDS by linking two integers with a hyphen. For example, to load only the first 100 records, run:\n\n```shell\npipenv run invenio importer load 1-100\n```\n\nIf the range ends in a hyphen with no second integer, the program will load all records from the start index to the end of the input file. For example, to load all records from 100 to the end of the file, run:\n\n```shell\npipenv run invenio importer load 100-\n```\n\nRecords may be loaded by id in the source system instead of by index. For example, to load records with ids hc:4723, hc:8271, and hc:2246, run:\n\n```shell\npipenv run invenio importer load --use-sourceids hc:4723 hc:8271 hc:2246\n```\n\n### Source file locations\n\nThe `load` command must be run from the base knowledge_commons_repository directory. It will look for the exported records in the directory specified by the RECORD_IMPORTER_DATA_DIR environment variable. It will look for the files to be uploaded in the directory specified by the RECORD_IMPORTER_FILES_LOCATION environment variable.\n\n### Overriding metadata during loading\n\nThe loader can be configured to override metadata fields during loading. This is done by creating a jsonl file containing the metadata fields to be overridden, and placing it in the directory specified by the RECORD_IMPORTER_OVERRIDES_FOLDER environment variable. The file should be named `record-importer-overrides_mySourceService.json`, where `mySourceService` is the name of the source from which the import data is coming. If this source is a service configured with KCWorks as a SAML authentication provider, the name of the source should be the same as the name of the service in the SAML configuration.\n\nThe file should contain one json object per line, with no newline characters within each object. Each object must have the key \"source_id\" which contains the identifier of the record in the source system. The object may also contain the keys \"overrides\", \"skip\", and \"notes\". The \"overrides\" value must be a json object containing the metadata fields to be overridden. The \"skip\" value must be a boolean indicating whether the record should be skipped during loading. 
The \"notes\" value is a string that is ignored by invenio-record-importer-kcworks but allows explanatory notes to be kept together with the override data.\n\nThe format of the overrides file should be as follows:\n\n```json\n{\"source_id\": \"hc:12345\", \"overrides\": {\"metadata|title\": \"My overridden title\", \"custom_fields|hclegacy:file_location\": \"my/overridden/filename.pdf\"}, \"notes\": \"optional notes\"}\n{\"source_id\": \"hc:678910\", \"skip\": true, \"notes\": \"optional notes\"}\n```\n\nNote that the metadata field names should be path strings with steps separated by a pipe character. So if you want to update the \"title\" subfield of the top-level \"metadata\", the path string should be \"metadata|title\". If one level in the metadata hierarchy is a list/array, a number may be included in the string indicating which index in the list should be updated. So to update the \"date\" subfield of the list of \"dates\" in the \"metadata\" field, we would use the path string \"metadata|dates|1|date\".\n\nWhatever value is provided for each path string will *entirely replace* the value at that point in the metadata hierarchy. So if I update \"metadata|dates\" I will need to provide the *entire list* of dates, even if there are items that should remain the same. If I only want to update one item I must specify it with a more specific path string like \"metadata|dates|0\".\n\nNote that in some cases field names are namespaced by the schema they belong to. For example, the \"file_location\" field is in the \"hclegacy\" schema, so the path string for this field would be \"custom_fields|hclegacy:file_location\".\n\nTo uncover the metadata structure of the Invenio record being overridden, use the `read` command to print the metadata of the record to the terminal.\n\n### Skipping records during loading\n\nTo skip a record from the serialized metadata during loading, add a line to the overrides file with the \"skip\" key set to true. The record will be skipped during loading, but will still be recorded in the created records log. If a record is skipped, it will not be included in the failed records log and will be removed from that log if has previously failed.\n\n### Authentication\n\nSince the program interacts directly with InvenioRDM (and not via the REST API) it does not require separate authentication.\n\n### Collections and record owners\n\nWhere necessary this program will create top-level domain communities, assign the records to the correct domain communities, create new Invenio users corresponding to the users who uploaded the original deposits, and transfer ownership of the Invenio record to the correct users. If the source of the records is associated with a SAML authentication IDP, these new users will be set to authenticate using their account with that IDP.\n\nIf the record was part of any group collections in the source system, the program will assign the record to the equivalent KCWorks group collections, creating new collections if necessary.\n\n### Recovering existing records\n\nIf a record with the same DOI already exists in Invenio, the program will try to update the existing record with any new metadata and/or files, creating a new draft of published records if necessary. Unpublished existing drafts will be submitted to the appropriate community and published. 
## Usage Statistics Preservation

Two additional commands are provided for creating synthetic usage statistics to preserve stats coming from a legacy repository. These commands are `stats` and `aggregations`.

### Stats Command

The `stats` command is used to generate synthetic usage statistics for imported records. It's run within the knowledge_commons_repository instance directory like this:

```shell
invenio importer stats {FLAGS}
```

This command creates synthetic usage events for records imported from a legacy system. It reads the created records log and generates view and download events based on the legacy statistics.

This operation is idempotent, so it can be run multiple times without causing errors or creating duplicate events for the given date range.

#### Command-line flags

| Flag              | Short flag | Description                                                                                                                                 |
| ----------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| --start-date TEXT | -s         | The start date for generating events. Must be in the format "YYYY-MM-DD". Defaults to "2013-01-01".                                          |
| --end-date TEXT   | -e         | The end date for generating events. Must be in the format "YYYY-MM-DD". Defaults to the current date.                                        |
| --verbose         | -v         | Enable verbose output. Defaults to False.                                                                                                    |
| --record-ids      | -r         | A comma-separated list of record ids to create usage statistics for. If not specified, all records from the specified source will be used.   |
| --record-source   | -S         | The source of the records to create usage statistics for. Defaults to 'knowledgeCommons'.                                                    |
| --from-db         | -d         | If True, the usage statistics will be created from the database. Defaults to False.                                                          |
| --downloads-field | -D         | The field in each record to use for the number of downloads. Defaults to 'hclegacy:total_downloads'.                                         |
| --views-field     | -V         | The field in each record to use for the number of record views. Defaults to 'hclegacy:total_views'.                                          |
| --date-field      | -t         | The field in each record to use for the record creation date. Defaults to 'metadata.publication_date'.                                       |

##### Explanation of flags:

- `--start-date`: The earliest date for which synthetic events should be generated.

- `--end-date`: The latest date for which synthetic events should be generated.

- `--verbose`: When verbose mode is enabled, the command provides detailed output about its progress, including information about each record being processed.

- `--record-ids`: A list of record ids to create usage statistics for. If not specified, all records from the specified source will be used.

- `--record-source`: The source of the records to create usage statistics for. If not specified, this defaults to 'knowledgeCommons'. If records are being read from a file, the source file should be a JSONL file with this string as its name (without the .jsonl extension).

- `--from-db`: Determines whether the usage statistics are created from the database or from a file. If True, the usage statistics are created from the database. If False, they are created from the events in the file specified by the RECORD_IMPORTER_USAGE_STATS_PATH config variable.

- `--downloads-field`: The field in each record to use for the number of downloads. If --from-db is True, the field should be found in the database record metadata; if False, in the record data object read from the import file. Defaults to 'hclegacy:total_downloads'.

- `--views-field`: The field in each record to use for the number of record views. If --from-db is True, the field should be found in the database record metadata; if False, in the record data object read from the import file. Defaults to 'hclegacy:total_views'.

- `--date-field`: The field in each record to use for the record creation date. If --from-db is True, the field should be found in the database record metadata; if False, in the record data object read from the import file. Defaults to 'metadata.publication_date'.
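Conceptually, generating synthetic events means expanding a single legacy total (e.g. the value of `hclegacy:total_views`) into individual dated view or download events. The following Python sketch shows one naive way such a spread could be computed; it is illustrative only and is not the algorithm this package uses:

```python
import random
from datetime import date, timedelta

def spread_events(total: int, start: date, end: date) -> list[date]:
    """Distribute `total` synthetic events uniformly at random across a date range."""
    span_days = (end - start).days
    return sorted(
        start + timedelta(days=random.randrange(span_days + 1))
        for _ in range(total)
    )

# e.g. 42 legacy views for a record first published on 2015-03-01
view_dates = spread_events(42, date(2015, 3, 1), date(2023, 12, 31))
print(view_dates[:5])
```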
### Aggregations Command

The `aggregations` command is used to aggregate the synthetic usage statistics generated by the `stats` command. It's run within the knowledge_commons_repository instance directory like this:

```shell
invenio importer aggregations {FLAGS}
```

This command aggregates usage statistics for records in the InvenioRDM instance. It processes the usage events and creates aggregated statistics.

This operation is idempotent, so it can be run multiple times without causing errors or creating duplicate aggregations. Each time it will simply collect the events currently in the system for the given date range.

This is a very time-consuming operation. The aggregations are run as background tasks so as not to interfere with other operations, but they can overwhelm the search index with too many requests, so the operation is divided into one-year chunks.

The `start-date` and `end-date` parameters allow you to specify a range of dates for which to create aggregations. A start date is required. If it is not provided in the cli call, the start date will be taken from the RECORD_IMPORTER_START_DATE config variable; if that is not set either, an error will be raised. An end date is optional. If no end date is specified, the current date is used.

**Note:** Currently this operation is intensive on memory and search index resources and can fail as a result. If you are aggregating a large number of stats events, this can be mitigated by running the operation for smaller time ranges sequentially.

#### Command-line flags

| Flag                     | Short flag | Description                                                                                                                          |
| ------------------------ | ---------- | ------------------------------------------------------------------------------------------------------------------------------------ |
| --start-date TEXT        | -s         | The start date for aggregating events. Must be in the format "YYYY-MM-DD". If not provided, the RECORD_IMPORTER_START_DATE config variable is used. |
| --end-date TEXT          | -e         | The end date for aggregating events. Must be in the format "YYYY-MM-DD". If not provided, events are aggregated up to the current date. |
| --verbose / --no-verbose | -v / -q    | Enable or disable verbose output. Defaults to False.                                                                                   |

##### Explanation of flags:

- `--start-date`: The earliest date from which to start aggregating events. Any events before this date will not be included in the aggregation. If not specified, the date is taken from the RECORD_IMPORTER_START_DATE config variable.

- `--end-date`: The latest date up to which events should be aggregated. Events after this date will not be included. If not provided, the command aggregates events up to the current date, including the most recent data available.

- `--verbose / --no-verbose`: When enabled, the command provides detailed information about the aggregation process, including progress updates, information about each record being processed, and any issues encountered. It's particularly useful for monitoring long-running aggregations or troubleshooting.
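To illustrate the one-year chunking described above, here is a minimal Python sketch of how a date range might be split into calendar-year chunks before aggregation. The `year_chunks` helper is hypothetical, not the package's implementation:

```python
from datetime import date

def year_chunks(start: date, end: date) -> list[tuple[date, date]]:
    """Split a date range into chunks that never cross a calendar-year boundary."""
    chunks = []
    chunk_start = start
    while chunk_start <= end:
        chunk_end = min(date(chunk_start.year, 12, 31), end)
        chunks.append((chunk_start, chunk_end))
        chunk_start = date(chunk_start.year + 1, 1, 1)
    return chunks

for chunk in year_chunks(date(2013, 6, 1), date(2015, 2, 28)):
    print(chunk)
# (2013-06-01, 2013-12-31), (2014-01-01, 2014-12-31), (2015-01-01, 2015-02-28)
```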
### Usage Statistics Workflow

To preserve usage statistics from a legacy system:

1. First, run the `stats` command to generate synthetic usage events based on the legacy statistics:

   ```shell
   invenio importer stats --start-date "2013-01-01" --end-date "2023-12-31"
   ```

2. Then, run the `aggregations` command to aggregate these synthetic events:

   ```shell
   invenio importer aggregations --start-date "2013-01-01" --end-date "2023-12-31"
   ```

This process will create and aggregate synthetic usage statistics that reflect the usage patterns from the legacy system. By adjusting the date ranges with the `--start-date` and `--end-date` flags, you can control exactly which period of legacy data is represented in your new system.

## Utility Commands

### Count Objects Command

The `count` command is used to count the number of objects in the JSON file prepared for import. It's run within the knowledge_commons_repository instance directory like this:

```shell
invenio importer count
```

This command reads the serialized records file and counts the number of objects (records) it contains. The file location is specified by the `RECORD_IMPORTER_SERIALIZED_PATH` config variable.

#### Usage

This command doesn't take any arguments or flags. Simply run it to get a count of the objects in the import file.

#### Output

The command will print the total number of objects found in the specified file. For example:

```
Total objects in /path/to/your/file.json: 1000
```

If the file is not found at the specified location, an error message will be displayed.
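For reference, here is a minimal standalone Python sketch of the same count, assuming the serialized file contains a single JSON array of record objects (the actual file layout may differ):

```python
import json
from pathlib import Path

# Hypothetical standalone equivalent of `invenio importer count`.
serialized_path = Path("/path/to/your/file.json")  # RECORD_IMPORTER_SERIALIZED_PATH
try:
    records = json.loads(serialized_path.read_text())
    print(f"Total objects in {serialized_path}: {len(records)}")
except FileNotFoundError:
    print(f"File not found: {serialized_path}")
```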
### Delete Records Command

The `delete` command is used to delete one or more records from InvenioRDM by their record IDs. It's run within the knowledge_commons_repository instance directory like this:

```shell
invenio importer delete {RECORD_IDS}
```

This command allows you to remove specific records from the InvenioRDM database.

#### Command-line arguments

A list of record IDs to be deleted. You can specify multiple IDs separated by spaces.

#### Usage Examples

To delete a single record:

```shell
invenio importer delete record_id_1
```

To delete multiple records:

```shell
invenio importer delete record_id_1 record_id_2 record_id_3
```

#### Output

The command will print the results of the deletion operation, showing which records were successfully deleted and any errors that occurred during the process.

#### Caution

The delete operation is irreversible. Make sure you have backups of your data before deleting records, especially in a production environment.

## Developing this Module

### Testing

This project includes a test suite to ensure the functionality of the importer. To run the tests, use the `run-tests.sh` script provided in the root directory of the project.

#### Running Tests

To run the tests, execute the following command from the project root:

```bash
./run-tests.sh
```

This script performs the following actions:

1. Checks if the required Python version (3.12) is installed.
2. Sets up a virtual environment using pipenv if it doesn't exist.
3. Installs the project dependencies.
4. Runs the pytest suite with coverage reporting.

#### Test Coverage

The `run-tests.sh` script includes coverage reporting. After running the tests, you'll see a coverage report in the terminal output. This report shows which parts of the code were executed during the tests and helps identify areas that may need additional testing.

#### Continuous Integration

The test suite is also integrated into our KCWorks CI/CD pipeline. Any push to the repository or pull request will trigger the test suite to run, ensuring that new changes don't break existing functionality.

### Writing New Tests

When adding new features or fixing bugs, it's important to add corresponding tests. Place new test files in the `tests/` directory, following the existing naming conventions (e.g., `test_<module_name>.py`). A minimal example is sketched below.

Remember to run the test suite locally before pushing changes to ensure all tests pass.
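As a shape for such a test file, here is a minimal hypothetical example; the function under test is a stand-in, not part of this package:

```python
# tests/test_example.py -- hypothetical test following the naming
# convention test_<module_name>.py described above.

def normalize_title(raw: str) -> str:
    """Stand-in for a function under test (illustrative only)."""
    return raw.strip()

def test_normalize_title_strips_whitespace():
    assert normalize_title("  My Title  ") == "My Title"
```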
### Versioning

This project uses [bumpver](https://github.com/mbarkhau/bumpver) to manage versioning. The version is stored in the `pyproject.toml` file and is updated automatically when a new release is made.

For example, to update the version numbers for a new 'patch' release, from 0.2.20 to 0.2.21, run the following command:

```bash
pipenv run bumpver update --patch
```

To update the tag number, from 'alpha6' to 'alpha7', run the following command:

```bash
pipenv run bumpver update --tag-num
```

This will update the version throughout this project's files, create a new commit, and tag the commit with the new version.

## Copyright

Copyright 2023-24 MESH Research. Released under the MIT license.
    "bugtrack_url": null,
    "license": "MIT License  Copyright (c) 2024 MESH Research  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
    "summary": "Bulk import records into a InvenioRDM instance configured for Knowledge Commons Works",
    "version": "0.2.20a7",
    "project_urls": {
        "Homepage": "https://github.com/Knowledge-Commons-Works/invenio-record-importer-kcworks"
    },
    "split_keywords": [
        "invenio",
        " invenio-rdm",
        " knowledge-commons-works",
        " bulk-import",
        " record-importer"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "1d07fe0983ce3ee6dc6ce0e0bdff9ebb628a223799ddd6292b55e643f69222fa",
                "md5": "2cd9841e46635e0bc76bc22b50738cc8",
                "sha256": "34664f78e35cf3034ff6bac122ff598c8782362f8ef29347dd3f28e07215d694"
            },
            "downloads": -1,
            "filename": "invenio_record_importer_kcworks-0.2.20a7-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "2cd9841e46635e0bc76bc22b50738cc8",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 2979683,
            "upload_time": "2024-09-23T20:57:14",
            "upload_time_iso_8601": "2024-09-23T20:57:14.282531Z",
            "url": "https://files.pythonhosted.org/packages/1d/07/fe0983ce3ee6dc6ce0e0bdff9ebb628a223799ddd6292b55e643f69222fa/invenio_record_importer_kcworks-0.2.20a7-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5f12f92b229d0c42a07a23451ce45caf56a6acda93a70e4f5638a13d91f5be56",
                "md5": "fd8d63584e531e22d08330decb2bc305",
                "sha256": "16c50b209b42c6aea94e3c9e184ef45ac0b0bd133e70bd261efef9911200748a"
            },
            "downloads": -1,
            "filename": "invenio_record_importer_kcworks-0.2.20a7.tar.gz",
            "has_sig": false,
            "md5_digest": "fd8d63584e531e22d08330decb2bc305",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 1952896,
            "upload_time": "2024-09-23T20:57:16",
            "upload_time_iso_8601": "2024-09-23T20:57:16.987079Z",
            "url": "https://files.pythonhosted.org/packages/5f/12/f92b229d0c42a07a23451ce45caf56a6acda93a70e4f5638a13d91f5be56/invenio_record_importer_kcworks-0.2.20a7.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-09-23 20:57:16",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Knowledge-Commons-Works",
    "github_project": "invenio-record-importer-kcworks",
    "github_not_found": true,
    "lcname": "invenio-record-importer-kcworks"
}
        
Elapsed time: 1.04717s