..
    Copyright (C) 2022-2023 CERN.


    Invenio-RDM-Migrator is free software; you can redistribute it and/or
    modify it under the terms of the MIT License; see LICENSE file for more
    details.

=====================
 Invenio-RDM-Migrator
=====================

.. image:: https://github.com/inveniosoftware/invenio-rdm-migrator/workflows/CI/badge.svg
        :target: https://github.com/inveniosoftware/invenio-rdm-migrator/actions?query=workflow%3ACI+branch%3Amaster

.. image:: https://img.shields.io/github/tag/inveniosoftware/invenio-rdm-migrator.svg
        :target: https://github.com/inveniosoftware/invenio-rdm-migrator/releases

.. image:: https://img.shields.io/pypi/dm/invenio-rdm-migrator.svg
        :target: https://pypi.python.org/pypi/invenio-rdm-migrator

.. image:: https://img.shields.io/github/license/inveniosoftware/invenio-rdm-migrator.svg
        :target: https://github.com/inveniosoftware/invenio-rdm-migrator/blob/master/LICENSE

InvenioRDM module for data migration.


Development
===========

Install
-------

Make sure that you have `libpq-dev` installed on your system. See
`psycopg installation instructions <https://www.psycopg.org/install/>`_
for more information.

Choose a version of search and database, then run:

.. code-block:: console

    pip install -e .


Tests
-----

.. code-block:: console

    ./run-tests.sh

How to run it
=============

To run the migration you need:

- A running InvenioRDM instance.
- If your data contains references to other records (e.g. vocabularies),
  you also need to run the setup step:

.. code-block:: console

    invenio-cli services setup --force --no-demo-data

- Install Invenio-RDM-Migrator; any other dependencies must be handled
  in the Pipfile of your instance.

.. code-block:: console

    $ pip install invenio-rdm-migrator

- Create/edit the configuration file on your instance, for example
  `streams.yaml`:

.. code-block:: yaml

    data_dir: /path/to/data
    tmp_dir: /path/to/tmp
    state_dir: /path/to/state
    log_dir: /path/to/logs
    db_uri: postgresql+psycopg2://inveniordm:inveniordm@localhost:5432/inveniordm
    old_secret_key: CHANGE_ME
    new_secret_key: CHANGE_ME
    records:
        extract:
            filename: /path/to/records.json


- You will need to create a small Python script that puts together the
  different blocks of the ETL. You can find an example at
  `my-site/site/my_site/migrator/__main__.py`.

.. code-block:: python

    # Note: the import paths for JSONLExtract and Runner may vary by version;
    # ZenodoToRDMRecordTransform is source-specific and comes from your own
    # migration package (e.g. zenodo-rdm).
    from invenio_rdm_migrator.extract import JSONLExtract
    from invenio_rdm_migrator.streams import Runner, StreamDefinition
    from invenio_rdm_migrator.streams.records import RDMRecordCopyLoad
    from zenodo_rdm.migrator.transform import ZenodoToRDMRecordTransform

    if __name__ == "__main__":
        RecordStreamDefinition = StreamDefinition(
            name="records",
            extract_cls=JSONLExtract,
            transform_cls=ZenodoToRDMRecordTransform,
            load_cls=RDMRecordCopyLoad,
        )

        runner = Runner(
            stream_definitions=[
                RecordStreamDefinition,
            ],
            config_filepath="path/to/your/streams.yaml",
        )

        runner.run()

- Finally, you can execute the above code. Since it is in the `__main__` file
  of the Python package, you can run it as a module:

.. code-block:: console

    $ python -m my_site.migrator

- Once the migration has completed, you can reindex the data in your instance.
  Following the records example above, it would look like:

.. code-block:: console

    $ invenio-cli pyshell

    In [1]: from invenio_access.permissions import system_identity
    In [2]: from invenio_rdm_records.proxies import current_rdm_records_service
    In [3]: current_rdm_records_service.rebuild_index(identity=system_identity)

ETL (Extract/Transform/Load) architecture
=========================================

There are four packages in this module: `extract`, `transform`, `load`, and
`streams`. The first three correspond to the three steps of an ETL process.
The `streams` package contains the logic to run the process and different
stream-specific implementations of ETL classes (e.g. `records`).

Extract
-------

The extract is the first part of the data processing stream. Its
functionality is quite simple: return an iterator (e.g. of records), where each
yielded value is a dictionary. Note that the data in this step is *transformed*
in format (e.g. JSON, XML), not in content. For example, an implementation of
`XMLExtract` could look as follows:

.. code-block:: python

    import xmltodict  # illustrative; any XML-to-dict parser works here

    class XMLExtract(Extract):

        def run(self):
            """Yield one dictionary per XML entry in the file."""
            with open("file.xml") as file:
                for entry in file:
                    yield xmltodict.parse(entry)
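
The shipped `JSONLExtract` follows the same interface for line-delimited JSON.
A minimal sketch of that idea (the constructor argument is an assumption; check
the `extract` package for the actual class):

.. code-block:: python

    import json

    from invenio_rdm_migrator.extract import Extract  # base class; path may vary

    class JSONLExtract(Extract):

        def __init__(self, filepath):
            self.filepath = filepath

        def run(self):
            """Yield one dictionary per line of a JSONL file."""
            with open(self.filepath) as file:
                for line in file:
                    yield json.loads(line)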

Transform
---------

The transformer is in charge of modifying the content to suit, in this case,
the InvenioRDM data model (e.g. for records) so it can be imported into the DB.
It will loop through the entries (i.e. the iterator returned by the extract
class), transform each one, and yield it (e.g. the record). Diving deeper into
the record example:

To transform something to an RDM record, you need to implement
`streams/records/transform.py:RDMRecordTransform`. For each record it will
yield what is considered a semantically "full" record: the record itself,
its parent, its draft if it exists, and the files related to them.

.. code-block:: python

    {
        "record": self._record(entry),
        "draft": self._draft(entry),
        "parent": self._parent(entry),
    }

This means that you will need to implement a function for each key. Note
that only `_record` and `_parent` must return content; the others can return
`None`.
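
For orientation, a minimal hypothetical subclass could look as follows (field
names and mappings are made up; the real method contracts live in
`streams/records/transform.py`):

.. code-block:: python

    from invenio_rdm_migrator.streams.records.transform import RDMRecordTransform

    class MyRecordTransform(RDMRecordTransform):
        """Map source entries onto record/draft/parent dictionaries."""

        def _record(self, entry):
            # map source fields onto the InvenioRDM record shape
            return {"id": entry["recid"], "metadata": {"title": entry["title"]}}

        def _draft(self, entry):
            return None  # this source has no drafts

        def _parent(self, entry):
            return {"id": entry["parent_id"]}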

Some of these functions can themselves use a `transform/base:Entry`
transformer. An *entry* transformer is an extra layer of abstraction that
provides an interface with the methods needed to generate valid data for part
of the `Transform` class. In the record example, you can implement
`transform.base:RDMRecordEntry`, which can be used in the
`RDMRecordTransform._record` function mentioned in the code snippet above. Note
that implementing this interface will produce valid *data* for a record.
However, there is no ABC for *metadata*. It is an open question how much we
should define these interfaces without duplicating the already existing
Marshmallow schemas of InvenioRDM.
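
As a sketch, assuming the `Entry` interface boils down to a single `transform`
method (check `transform/base.py` for the exact contract; the output fields
here are illustrative):

.. code-block:: python

    from invenio_rdm_migrator.transform.base import Entry

    class MyRecordEntry(Entry):
        """Produce valid record *data* from a raw source entry."""

        def transform(self, entry):
            return {
                "created": entry["created"],
                "updated": entry["updated"],
                "json": {"metadata": {"title": entry["title"]}},
            }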

At this point you might be wondering "Why not Marshmallow then?". The answer
is "separation of responsibilities, performance, and simplicity". The latter
lies in the fact that most of the data transformation is custom, so we would
end up with a schema full of `Method` fields, which would not differ much from
what we have now but would hurt performance (Marshmallow is slow...). As for
responsibilities, validation (mostly referential, e.g. of vocabularies) can
only happen at (or after) *load* time, where the knowledge and application
context of the RDM instance are available.

Note that no validation, not even structural, is done in this step.

Load
----

The final step to have the records available in the RDM instance is to load
them. There are two types of loading: *bulk* or *transactions*.

Bulk
....

Bulk loading inserts data into the database table by table using `COPY`. Since
the order of the tables is not guaranteed, it is necessary to drop foreign keys
before loading; they can be restored afterwards. In addition, dropping indices
increases performance, since they are only calculated once, when they are
restored after loading.

Bulk loading is done using the `load.postgresql.bulk:PostgreSQLCopyLoad` class,
which carries out two steps:

1. Prepare the data, writing one DB row per line in a CSV file:

.. code-block:: console

    /path/to/data/tables1668697280.943311/
    ├── pidstore_pid.csv
    ├── rdm_parents_metadata.csv
    ├── rdm_records_metadata.csv
    └── rdm_versions_state.csv

2. Perform the actual loading, using `COPY`. Inserting all rows at once is more
   efficient than performing one `INSERT` per row.

Internally, the `prepare` function makes use of `TableGenerator`
implementations and then yields the list of CSV files, so the `load` step only
iterates through the filenames, not the actual entries.
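
Conceptually, the `COPY` step amounts to streaming each CSV file into its
table. A minimal sketch with `psycopg2` (connection string and table list are
illustrative; foreign keys are assumed to have been dropped beforehand):

.. code-block:: python

    import psycopg2

    TABLES = ["pidstore_pid", "rdm_parents_metadata", "rdm_records_metadata"]

    conn = psycopg2.connect("postgresql://inveniordm:inveniordm@localhost/inveniordm")
    with conn, conn.cursor() as cur:
        for table in TABLES:
            with open(f"/path/to/data/{table}.csv") as f:
                # stream the whole file into the table with a single COPY
                cur.copy_expert(f"COPY {table} FROM STDIN WITH (FORMAT csv)", f)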

A `TableGenerator` will, for each value in the iterator, yield one or more
rows (lines to be written to a CSV file). For example, for a record it will
yield the recid, DOI, and OAI persistent identifiers, the record and parent
metadata, etc., each of which is written to the respective CSV file.
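
A hypothetical sketch of that fan-out behaviour (the real classes live under
the bulk load generators; names and row shapes here are made up):

.. code-block:: python

    class RecordTableGenerator:
        """Fan one record entry out into rows for several tables."""

        def rows(self, entry):
            # yield (table name, row) pairs; one entry produces many rows
            yield ("pidstore_pid", {"pid_type": "recid", "pid_value": entry["recid"]})
            yield ("rdm_records_metadata", {"id": entry["id"], "json": entry["metadata"]})
            yield ("rdm_parents_metadata", {"id": entry["parent_id"]})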


Transactions
............

Another option is to migrate transactions. For example, once you have migrated
the initial data in bulk, you can migrate the changes that were persisted while
the bulk migration happened. That can be achieved by migrating transactions. A
transaction is a group of operations; an operation can be understood as an SQL
statement and thus has two values: the operation type (create, update, delete)
and its data, represented as a database model.

Transaction loading is done using the `load.postgresql.transactions:PostgreSQLExecuteLoad`
class, which carries out two steps similar to the ones above:

1. Prepare the data, storing in memory a series of `Operation`\s.
2. Perform the actual loading by adding or removing from the session, or updating the
   corresponding object. Each operation is flushed to the database to avoid foreign key
   violations. However, each transaction is atomic, meaning that an error in one of the
   operations will cause the full transaction to fail as a group.
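
That flush-per-operation pattern, sketched with plain SQLAlchemy (the
`Operation` attribute names are assumptions based on the structure described
below; only `obj` is documented):

.. code-block:: python

    from sqlalchemy.orm import Session

    def load_transaction(session: Session, operations):
        """Apply one transaction atomically, flushing after each operation."""
        try:
            for op in operations:
                if op.type == "c":
                    session.add(op.obj)      # create
                elif op.type == "u":
                    session.merge(op.obj)    # update
                elif op.type == "d":
                    session.delete(op.obj)   # delete
                session.flush()  # surface foreign key violations per operation
            session.commit()     # the whole group succeeds...
        except Exception:
            session.rollback()   # ...or fails together
            raise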

Internally, the load will use an instance of
`load.postgresql.transactions.generators.group:TxGenerator` to prepare the
operations. This class contains a mapping between table names and
`load.postgresql.transactions.generators.row:RowGenerators`, which will return a list of
operations with the data as a database model in the `obj` attribute.

Note that the `TxGenerator` is tightly coupled to the
`transform.transactions.Tx` since it expects the dictionaries to have a
specific structure:

.. code-block::

    {
        "tx_id": the actual transaction id, useful for debug and error handling
        "action": this information refers to the semantic meaning of the group
                       for example: record metadata update or file upload
        "operations": [
            {
                "op": c (create), u (update), d (delete)
                "table": the name of the table in the source system (e.g. Zenodo)
                "data": the transformed data, this can use any `Transform` implementation
            }
        ]
    }
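
For illustration, a concrete (hypothetical) transaction following that
structure could look like:

.. code-block:: python

    {
        "tx_id": 533724568,
        "action": "record-metadata-update",
        "operations": [
            {
                "op": "u",
                "table": "records_metadata",
                "data": {"id": "abcd-1234", "json": {"title": "A new title"}},
            }
        ],
    }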

State
=====

During a migration run, there is a need to share information across different
streams, or across different generators in the same stream. For example, the
records stream needs to access the UUID-to-slug map that was populated by the
communities stream, and the drafts generator needs to know which parent records
were created by the records generator in order to keep the version state
consistent.

All this information is persisted to a SQLite database. The state database is
kept in memory while each stream is processed, and it is persisted to disk if
the stream finishes without errors. The state is saved under the name of the
stream (e.g. `records.db`) to avoid overwriting a previous state. Therefore, a
migration can be restarted from any stream.

There are two ways to add more information to the state:

- Full entities, for example records or users, require their own DB table. Those must be
  defined at `state.py:State._initialize_db`. In addition, to abstract the access to that
  table, a state entity is required. It needs to be initialized in the `Runner.py:Runner`
  constructor and added to the `state_entities` dictionary.
- Independent values, for example the maximum value of generated primary keys, can be
  stored in the `global_state`. This state has two columns, key and value; adding
  information to it would look like `{key: name_of_the_value, value: actual_value}`, as
  sketched below.
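
A conceptual sketch of that key/value idea, reduced to plain `sqlite3` (the
migrator's actual state API differs; this only shows the shape of the data):

.. code-block:: python

    import sqlite3

    conn = sqlite3.connect("state.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS global_state (key TEXT PRIMARY KEY, value)"
    )

    # store an independent value, e.g. the maximum generated primary key
    conn.execute(
        "INSERT OR REPLACE INTO global_state (key, value) VALUES (?, ?)",
        ("max_pid_pk", 1_000_000),
    )
    conn.commit()

    (value,) = conn.execute(
        "SELECT value FROM global_state WHERE key = ?", ("max_pid_pk",)
    ).fetchone()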

Notes
=====

**Using Python generators**

Using generators instead of lists allows us to iterate through the data only
once and perform the E-T-L steps on each item as it flows by, instead of one
loop for E, one for T, and one for L. In addition, it allows us to keep the
CSV files open during writing and close them only at the end (opening and
closing files is an expensive operation when done 3M times).
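
A minimal sketch of the single-pass idea:

.. code-block:: python

    def extract():              # E: yield raw entries one by one
        for i in range(3):
            yield {"id": i}

    def transform(entries):     # T: reshape each entry as it flows by
        for entry in entries:
            yield {"id": entry["id"], "title": f"Record {entry['id']}"}

    def load(entries):          # L: consume the stream in a single pass
        for entry in entries:
            print(entry)

    load(transform(extract()))  # one iteration drives all three steps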

..
    Copyright (C) 2022-2023 CERN.


    Invenio-RDM-Migrator is free software; you can redistribute it and/or
    modify it under the terms of the MIT License; see LICENSE file for more
    details.

Changes
=======

Version 4.4.1

- Fix default value for nullable model fields.

Version 4.4.0

- Add GitHub stream.
- Add ``verified_at`` and ``blocked_at`` for user models.
- Handle parent DOIs for records.
- Add media files to records and drafts.
- Add ``deletion_status`` to record models.
- Switch to ``orjson`` for JSON dumping/loading.
- Add multi-processing for transform.
- Refactor state to also use Python dict for caching.

Version 4.3.0

- Add community basic CRUD actions.
- Add DB session fixtures.

Version 4.2.0

- Rename `FileUploadAction` to `DraftFileUploadAction`.

Version 4.1.0

- Add file upload action.
- Add draft edit action.

Version 4.0.0

- Namespace actions by load and transform.

Version 3.1.0

- Add `DatetimeMixin` to transform timestamps into iso formatted date strings.
- Add `JSONLoadMixin` to load dictionaries from strings.

Version 3.0.0

- `Operation` instances have split the model and the data into two attributes.
- Add user actions.
- The `PostgreSQLTx` `resolve_references` function now has a default behaviour (`pass`).
- Add nullable configuration to draft and user related models.
- Minor bug fixes.

Version 2.0.0

- Make state globally available.
- Refactor transactions into actions. Create transaction and load data classes.
- Remove empty Kafka extract module.
- Improve error handling and create specialized classes.
- Move `dict_set` to utils.
- Remove Python 3.8 from test matrix.

Version 1.0.0

- Initial public release.



            
