rcsb.db

Name	rcsb.db
Version	1.721
home_page	https://github.com/rcsb/py-rcsb_db
Summary	RCSB Python Database Access and Loading Utility Classes
upload_time	2024-05-14 12:26:53
maintainer	None
docs_url	None
author	John Westbrook
requires_python	None
license	Apache 2.0
requirements	No requirements were recorded.

# RCSB DB

## A collection of Python Database Utility Classes

[![Build Status](https://dev.azure.com/rcsb/RCSB%20PDB%20Python%20Projects/_apis/build/status/rcsb.py-rcsb_db?branchName=master)](https://dev.azure.com/rcsb/RCSB%20PDB%20Python%20Projects/_build/latest?definitionId=12&branchName=master)

## Introduction

This module contains a collection of utility classes for processing and loading PDB repository and
derived data content using relational and document database servers.  One target data store for
these tools is a document database used to exchange content within the RCSB PDB data pipeline.

### Installation

Download the library source software from the project repository:

```bash

git clone --recurse-submodules https://github.com/rcsb/py-rcsb_db.git

```

Optionally, run the test suite (Python versions 2.7, 3.6, and 3.7) using
[setuptools](https://setuptools.readthedocs.io/en/latest/) or
[tox](http://tox.readthedocs.io/en/latest/example/platform.html):

```bash
python setup.py test

# or simply run

tox
```

Installation is via the program [pip](https://pypi.python.org/pypi/pip).  To run tests
from the source tree, the package must be installed in editable mode (i.e. -e):

```bash
pip install -e .
```

#### Installing in Ubuntu Linux (tested in 18.04)

You will need to install a few system packages before `pip install .` can work:

```bash

sudo apt install default-libmysqlclient-dev flex bison

```

#### Installing on macOS

Using and developing this package on macOS requires a number of packages that are not
distributed as part of the base macOS operating system.
The following steps provide one approach to creating the development environment for this
package.  First, install the Apple [XCode](https://developer.apple.com/xcode/) package and associated command-line tools.
This will provide essential compilers and supporting tools.  The [HomeBrew](https://brew.sh/) package
manager provides further access to a variety of common open source services and tools.
Follow the instructions provided at the [HomeBrew](https://brew.sh/) site to
install this system.  Once HomeBrew is installed, you can further install the
[MariaDB](https://mariadb.com/kb/en/library/installing-mariadb-on-macos-using-homebrew/) and
[MongoDB](https://docs.mongodb.com/manual/tutorial/install-mongodb-on-os-x/) packages, which
are required to support the ExDB tools.  HomeBrew also provides a variety of options for
managing [Python virtual environments](https://gist.github.com/Geoyi/f55ed54d24cc9ff1c14bd95fac21c042).

### Configuration File

RCSB/PDB repository path details are stored as configuration options.
An example configuration file included in this package is viewable under `rcsb/db/config`: [exdb-config-example.yml](https://github.com/rcsb/py-rcsb_db/blob/master/rcsb/db/config/exdb-config-example.yml). This example references dictionary resources and mock repository data
provided in the package in `rcsb/mock-data/*`. The `site_info_configuration` section
in this file provides database server connection details and common path details.
This is followed by sections specifying the dictionaries, helper functions, and
configuration used to define the schema for each supported content type
(e.g., pdbx_core, chem_comp_core, bird_chem_comp_core, ...).

### Command Line Interfaces

#### Schema File Generation

A convenience CLI `schema_update_cli` is provided for generating operational schema from
PDBx/mmCIF dictionary metadata.  Schema are encoded for the ExDB API (rcsb) and as
document schema in JSON and BSON formats.  The JSON and BSON schema can be used to
validate the loadable document objects produced for the collections served by MongoDB.

```bash
schema_update_cli --help
usage: schema_update_cli [-h] [--update_chem_comp_ref]
                         [--update_chem_comp_core_ref]
                         [--update_bird_chem_comp_ref]
                         [--update_bird_chem_comp_core_ref]
                         [--update_bird_ref] [--update_bird_family_ref]
                         [--update_pdbx] [--update_pdbx_core]
                         [--update_repository_holdings]
                         [--update_entity_sequence_clusters]
                         [--update_data_exchange] [--update_ihm_dev]
                         [--update_drugbank_core] [--update_config_all]
                         [--update_config_deployed] [--update_config_test]
                         [--config_path CONFIG_PATH]
                         [--config_name CONFIG_NAME]
                         [--cache_path SCHEMA_CACHE_PATH]
                         [--schema_types SCHEMA_TYPES]
                         [--schema_levels SCHEMA_LEVELS] [--debug] [--mock]

optional arguments:
  -h, --help            show this help message and exit
  --update_chem_comp_ref
                        Update schema for Chemical Component reference
                        definitions
  --update_chem_comp_core_ref
                        Update core schema for Chemical Component reference
                        definitions
  --update_bird_chem_comp_ref
                        Update schema for Bird Chemical Component reference
                        definitions
  --update_bird_chem_comp_core_ref
                        Update core schema for Bird Chemical Component
                        reference definitions
  --update_bird_ref     Update schema for Bird reference definitions
  --update_bird_family_ref
                        Update schema for Bird Family reference definitions
  --update_pdbx         Update schema for PDBx entry data
  --update_pdbx_core    Update schema for PDBx core entry/entity data
  --update_repository_holdings
                        Update schema for repository holdings
  --update_entity_sequence_clusters
                        Update schema for entity sequence clusters
  --update_data_exchange
                        Update schema for data exchange status
  --update_ihm_dev      Update schema for I/HM dev entry data
  --update_drugbank_core
                        Update DrugBank schema
  --update_config_all   Update using configuration settings (e.g.
                        DATABASE_NAMES_ALL)
  --update_config_deployed
                        Update using configuration settings (e.g.
                        DATABASE_NAMES_DEPLOYED)
  --update_config_test  Update using configuration settings (e.g.
                        DATABASE_NAMES_TEST)
  --config_path CONFIG_PATH
                        Path to configuration options file
  --config_name CONFIG_NAME
                        Configuration section name
  --cache_path CACHE_PATH
                        Schema cache directory path
  --schema_types SCHEMA_TYPES
                        Schema encoding (rcsb|json|bson) (comma separated)
  --schema_levels SCHEMA_LEVELS
                        Schema validation level (full|min) (comma separated)
  --debug               Turn on verbose logging
  --mock                Use MOCK repository configuration for dependencies and
                        testing
________________________________________________________________________________

```

##### Example Usage

For example, the following command will generate the JSON and BSON schema for the collections in the
pdbx_core schema.

```bash
schema_update_cli  --mock --schema_types json,bson \
                   --schema_levels full  \
                   --update_pdbx_core   \
                   --cache_path . \
                   --config_path ./rcsb/db/config/exdb-config-example.yml  \
                   --config_name site_info_configuration
```

#### ExDB Loading

A convenience CLI `exdb_repo_load_cli` is provided to support loading PDB repositories
containing entry and chemical reference data content types in the form of document collections
compatible with MongoDB.

```bash
exdb_repo_load_cli --help

usage: exdb_repo_load_cli [-h] [--op OP_TYPE] [--load_type LOAD_TYPE]
                          [--database DATABASE_NAME]
                          [--config_path CONFIG_PATH]
                          [--config_name CONFIG_NAME] [--db_type DB_TYPE]
                          [--num_proc NUM_PROC] [--chunk_size CHUNK_SIZE]
                          [--document_style DOCUMENT_STYLE]
                          [--disable_read_back_check] [--schema_level SCHEMA_LEVEL]
                          [--load_id_list_path LOAD_ID_LIST_PATH]
                          [--load_file_list_path LOAD_FILE_LIST_PATH]
                          [--fail_file_list_path FAIL_FILE_LIST_PATH]
                          [--save_file_list_path SAVE_FILE_LIST_PATH]
                          [--file_limit FILE_LIMIT]
                          [--prune_document_size PRUNE_DOCUMENT_SIZE]
                          [--debug] [--mock] [--cache_path CACHE_PATH]
                          [--rebuild_cache] [--rebuild_schema]
                          [--vrpt_repo_path VRPT_REPO_PATH]

optional arguments:
  -h, --help            show this help message and exit
  --op {pdbx_loader,build_resource_cache,pdbx_db_wiper,pdbx_id_list_splitter,pdbx_loader_check,etl_entity_sequence_clusters,etl_repository_holdings}
                        Loading operation to perform
  --load_type {replace,full}
                        Type of load ('replace' for incremental and
                        multi-worker load, 'full' for complete and
                        fresh single-worker load)
  --database {pdbx_core,pdbx_comp_model_core,bird_chem_comp_core,chem_comp,chem_comp_core,bird_chem_comp,bird,bird_family,ihm_dev}
                        Database to load (most common choices are:
                        'pdbx_core', 'pdbx_comp_model_core', or
                        'bird_chem_comp_core')
  --config_path CONFIG_PATH
                        Path to configuration options file
  --config_name CONFIG_NAME
                        Configuration section name
  --document_style DOCUMENT_STYLE
                        Document organization
                        (rowwise_by_name_with_cardinality|rowwise_by_name|
                        columnwise_by_name|rowwise_by_id|rowwise_no_name)
  --cache_path CACHE_PATH
                        Cache path for resource files
  --num_proc NUM_PROC   Number of processes to execute (default=2)
  --chunk_size CHUNK_SIZE
                        Number of files loaded per process
  --max_step_length MAX_STEP_LENGTH
                        Maximum subList size (default=500)
  --schema_level SCHEMA_LEVEL
                        Schema validation level (full|min)
  --collection_list COLLECTION_LIST
                        Specific collections to load
  --load_id_list_path LOAD_ID_LIST_PATH
                        Input file containing the list of IDs to load
                        in the current iteration by a single worker
  --holdings_file_path HOLDINGS_FILE_PATH
                        File containing the complete list of all IDs
                        (or holdings files) that will be loaded
  --load_file_list_path LOAD_FILE_LIST_PATH
                        Input file containing load file path list
                        (override automatic repository scan)
  --fail_file_list_path FAIL_FILE_LIST_PATH
                        Output file containing file paths that fail
                        to load
  --save_file_list_path SAVE_FILE_LIST_PATH
                        Save repo file paths from automatic file
                        system scan in this path
  --load_file_list_dir LOAD_FILE_LIST_DIR
                        Directory path for storing load file lists
  --num_sublists NUM_SUBLISTS
                        Number of sublists to create/load for the
                        associated database
  --force_reload        Force re-load of provided ID list (i.e.,
                        don't just load delta; useful for manual/test
                        runs).
  --provider_types_exclude
                        Resource provider types to exclude
  --db_type DB_TYPE     Database server type (default=mongo)
  --file_limit FILE_LIMIT
                        Load file limit for testing
  --prune_document_size PRUNE_DOCUMENT_SIZE
                        Prune large documents to this size limit (MB)
  --regex_purge         Perform additional regex-based purge of all
                        pre-existing documents for loadType != 'full'
  --data_selectors  [ ...]
                        Data selectors, space-separated.
  --disable_read_back_check
                        Disable read back check on all documents
  --disable_merge_validation_reports
                        Disable merging of validation report data
                        with the primary content type
  --debug               Turn on verbose logging
  --mock                Use MOCK repository configuration for testing
  --rebuild_cache       Rebuild cached resource files
  --rebuild_schema      Rebuild schema on-the-fly if not cached
  --vrpt_repo_path VRPT_REPO_PATH
                        Path to validation report repository
________________________________________________________________________________
```

##### Example Usage

The following commands demonstrate how each type of operation (`--op`) is used for loading PDB repository data to ExDB. For all commands, the following environment variables must first be set:

```bash
export CONFIG_SUPPORT_TOKEN_ENV=personal_token_used_for_decrypting_config_variables
export OE_LICENSE=/path/to/oe_license.txt
export NLTK_DATA=/path/to/nltk_data
```
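
Because every operation below depends on these three variables, it can help to fail early when one is missing. The following is a minimal sketch of such a check, not part of the `rcsb.db` package; the helper name `missing_env_vars` is hypothetical, and only the variable names come from the list above.

```python
import os
from typing import Iterable, List, Mapping, Optional

# Variable names taken from the export statements above.
REQUIRED_ENV_VARS = ("CONFIG_SUPPORT_TOKEN_ENV", "OE_LICENSE", "NLTK_DATA")


def missing_env_vars(
    names: Iterable[str] = REQUIRED_ENV_VARS,
    environ: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the names that are unset or empty in the given environment.

    Hypothetical helper for illustration; defaults to os.environ.
    """
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]
```

A wrapper script could call `missing_env_vars()` before invoking `exdb_repo_load_cli` and stop with an error message if the returned list is not empty.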

`--op build_resource_cache` - Build the external resource cache that will be used for and integrated with the loading of PDB structure data.
```bash
exdb_repo_load_cli --op "build_resource_cache" \
--config_path "/opt/etl-scratch/config/exdb-loader-config.yml" \
--config_name "site_info_remote_configuration" \
--num_proc 6  \
--cache_path "/opt/etl-scratch/data/CACHE"
```

`--op pdbx_db_wiper` - Wipe the pre-existing database (and all of its collections).
```bash
exdb_repo_load_cli --op "pdbx_db_wiper" \
--database "pdbx_core" \
--config_path "/opt/etl-scratch/config/exdb-loader-config.yml" \
--config_name "site_info_remote_configuration" \
--cache_path "/opt/etl-scratch/data/CACHE"
```

`--op pdbx_id_list_splitter` - Split the full list of input IDs into smaller, equally-sized sublists.
```bash
exdb_repo_load_cli --op "pdbx_id_list_splitter" \
--database "pdbx_core" \
--config_path "/opt/etl-scratch/config/exdb-loader-config.yml" \
--config_name "site_info_remote_configuration" \
--cache_path "/opt/etl-scratch/data/CACHE" \
--load_file_list_dir "/opt/etl-scratch/work-dir/load_file_lists" \
--holdings_file_path "https://files.wwpdb.org/pub/pdb/holdings/released_structures_last_modified_dates.json.gz" \
--num_sublists 10
```
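
To illustrate what splitting an ID list into near-equal sublists means, here is a minimal sketch. It is an assumption about the general idea only and is not the algorithm used by `exdb_repo_load_cli`; the function name `split_into_sublists` is hypothetical.

```python
from typing import List, Sequence


def split_into_sublists(ids: Sequence[str], num_sublists: int) -> List[List[str]]:
    """Split ids into num_sublists contiguous sublists whose sizes differ by at most one.

    Illustrative only; the real CLI may order or group IDs differently.
    """
    if num_sublists < 1:
        raise ValueError("num_sublists must be at least 1")
    # Every sublist gets `base` items; the first `extra` sublists get one more.
    base, extra = divmod(len(ids), num_sublists)
    sublists: List[List[str]] = []
    start = 0
    for index in range(num_sublists):
        size = base + (1 if index < extra else 0)
        sublists.append(list(ids[start:start + size]))
        start += size
    return sublists
```

With five IDs and `num_sublists=2`, this sketch yields one sublist of three IDs and one of two, so that each worker receives a comparable share of the load.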

`--op pdbx_loader` - Load a list of entry IDs to ExDB.
```bash
exdb_repo_load_cli --op "pdbx_loader" \
--database "pdbx_core" \
--load_type replace  \
--config_path /opt/etl-scratch/config/exdb-loader-config.yml \
--config_name site_info_remote_configuration \
--num_proc 8  \
--chunk_size 5  \
--max_step_length 500 \
--load_id_list_path "/opt/etl-scratch/work-dir/load_file_lists/pdbx_core_ids-1.txt" \
--cache_path "/opt/etl-scratch/data/CACHE"
```
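
When several sublists have been produced by the splitter, the loader command above is typically repeated once per sublist. The sketch below shows one way to drive that from Python. It only uses options that appear in the help text above; the helper names are hypothetical, and the file naming pattern `pdbx_core_ids-N.txt` is an assumption extrapolated from the single path shown in the example.

```python
import subprocess
from typing import List


def build_loader_argv(
    id_list_path: str,
    config_path: str,
    config_name: str,
    cache_path: str,
    database: str = "pdbx_core",
    num_proc: int = 8,
    chunk_size: int = 5,
    max_step_length: int = 500,
) -> List[str]:
    """Build the argument vector for one `--op pdbx_loader` run (hypothetical helper)."""
    return [
        "exdb_repo_load_cli",
        "--op", "pdbx_loader",
        "--database", database,
        "--load_type", "replace",
        "--config_path", config_path,
        "--config_name", config_name,
        "--num_proc", str(num_proc),
        "--chunk_size", str(chunk_size),
        "--max_step_length", str(max_step_length),
        "--load_id_list_path", id_list_path,
        "--cache_path", cache_path,
    ]


def load_all_sublists(list_dir: str, num_sublists: int, **kwargs: str) -> None:
    """Run the loader once per sublist, stopping at the first failure."""
    for number in range(1, num_sublists + 1):
        # Assumed naming pattern, based on the example path above.
        id_list_path = f"{list_dir}/{kwargs.get('database', 'pdbx_core')}_ids-{number}.txt"
        subprocess.run(build_loader_argv(id_list_path, **kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` makes a failed load raise an exception instead of being silently skipped.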

`--op pdbx_loader_check` - Check the resulting ExDB database to confirm that all expected documents were loaded.
```bash
exdb_repo_load_cli --op "pdbx_loader_check" \
--database "pdbx_core" \
--config_path "/opt/etl-scratch/config/exdb-loader-config.yml" \
--config_name "site_info_remote_configuration" \
--cache_path "/opt/etl-scratch/data/CACHE" \
--load_file_list_dir "/opt/etl-scratch/work-dir/load_file_lists" \
--holdings_file_path "https://files.wwpdb.org/pub/pdb/holdings/released_structures_last_modified_dates.json.gz" \
--num_sublists 10
```
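
Conceptually, a load check compares the set of IDs that should have been loaded with the set actually present in the database. The following sketch illustrates that comparison only; it is not the implementation behind `pdbx_loader_check`, the function name is hypothetical, and treating IDs as case-insensitive is an assumption.

```python
from typing import Iterable, List


def find_missing_ids(expected_ids: Iterable[str], loaded_ids: Iterable[str]) -> List[str]:
    """Return expected IDs that are absent from the loaded IDs, sorted and upper-cased.

    Illustrative only; IDs are compared case-insensitively (an assumption).
    """
    expected = {entry_id.upper() for entry_id in expected_ids}
    loaded = {entry_id.upper() for entry_id in loaded_ids}
    return sorted(expected - loaded)
```

An empty result would indicate that every expected document was found; any returned IDs could be written to a new ID list and passed back to the loader.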

#### Repository Scanning

Part of the schema definition process supported by this module involves refining
the dictionary metadata with more specific data typing and coverage details.
A scanning tool is provided to collect and organize these details for the
other ETL tools in this package.  The following convenience CLI, `repo_scan_cli`,
is provided to scan supported PDB repository content and update data type and coverage details.

```bash
repo_scan_cli --help

usage: repo_scan_cli [-h] [--scanType SCANTYPE]
                     [--scan_chem_comp_ref | --scan_chem_comp_core_ref | --scan_bird_chem_comp_ref | --scan_bird_chem_comp_core_ref | --scan_bird_ref | --scan_bird_family_ref | --scan_entry_data | --scan_ihm_dev]
                     [--config_path CONFIG_PATH] [--config_name CONFIG_NAME]
                     [--input_file_list_path INPUT_FILE_LIST_PATH]
                     [--output_file_list_path OUTPUT_FILE_LIST_PATH]
                     [--fail_file_list_path FAIL_FILE_LIST_PATH]
                     [--scan_data_file_path SCAN_DATA_FILE_PATH]
                     [--coverage_file_path COVERAGE_FILE_PATH]
                     [--type_map_file_path TYPE_MAP_FILE_PATH]
                     [--num_proc NUM_PROC] [--chunk_size CHUNK_SIZE]
                     [--file_limit FILE_LIMIT] [--debug] [--mock]
                     [--working_path WORKING_PATH]

optional arguments:
  -h, --help            show this help message and exit
  --scanType SCANTYPE   Repository scan type (full|incr)
  --scan_chem_comp_ref  Scan Chemical Component reference definitions (public
                        subset)
  --scan_chem_comp_core_ref
                        Scan Chemical Component Core reference definitions
                        (public subset)
  --scan_bird_chem_comp_ref
                        Scan Bird Chemical Component reference definitions
                        (public subset)
  --scan_bird_chem_comp_core_ref
                        Scan Bird Chemical Component Core reference
                        definitions (public subset)
  --scan_bird_ref       Scan Bird reference definitions (public subset)
  --scan_bird_family_ref
                        Scan Bird Family reference definitions (public subset)
  --scan_entry_data     Scan PDB entry data (current released subset)
  --scan_ihm_dev        Scan PDBDEV I/HM entry data (current released subset)
  --config_path CONFIG_PATH
                        Path to configuration options file
  --config_name CONFIG_NAME
                        Configuration section name
  --input_file_list_path INPUT_FILE_LIST_PATH
                        Input file containing file paths to scan
  --output_file_list_path OUTPUT_FILE_LIST_PATH
                        Output file containing file paths scanned
  --fail_file_list_path FAIL_FILE_LIST_PATH
                        Output file containing file paths that fail scan
  --scan_data_file_path SCAN_DATA_FILE_PATH
                        Output working file storing scan data (Pickle)
  --coverage_file_path COVERAGE_FILE_PATH
                        Coverage map (JSON) output path
  --type_map_file_path TYPE_MAP_FILE_PATH
                        Type map (JSON) output path
  --num_proc NUM_PROC   Number of processes to execute (default=2)
  --chunk_size CHUNK_SIZE
                        Number of files loaded per process
  --file_limit FILE_LIMIT
                        Load file limit for testing
  --debug               Turn on verbose logging
  --mock                Use MOCK repository configuration for testing
  --working_path WORKING_PATH
                        Working path for temporary files
________________________________________________________________________________

```

#### ETL Processing

The following CLI provides preliminary access to ETL functions for processing
derived content types such as sequence comparison data.

```bash
etl_exec_cli --help
usage: etl_exec_cli [-h] [--full] [--etl_entity_sequence_clusters]
                    [--etl_repository_holdings] [--data_set_id DATA_SET_ID]
                    [--sequence_cluster_data_path SEQUENCE_CLUSTER_DATA_PATH]
                    [--sandbox_data_path SANDBOX_DATA_PATH]
                    [--config_path CONFIG_PATH] [--config_name CONFIG_NAME]
                    [--db_type DB_TYPE] [--read_back_check]
                    [--num_proc NUM_PROC] [--chunk_size CHUNK_SIZE]
                    [--document_limit DOCUMENT_LIMIT]
                    [--prune_document_size PRUNE_DOCUMENT_SIZE] [--debug]
                    [--mock] [--cache_path CACHE_PATH] [--rebuild_cache]

optional arguments:
  -h, --help            show this help message and exit
  --full                Fresh full load in a new tables/collections (Default)
  --etl_entity_sequence_clusters
                        ETL entity sequence clusters
  --etl_repository_holdings
                        ETL repository holdings
  --data_set_id DATA_SET_ID
                        Data set identifier (default= 2018_14 for current
                        week)
  --sequence_cluster_data_path SEQUENCE_CLUSTER_DATA_PATH
                        Sequence cluster data path (default set by
                        configuration
  --sandbox_data_path SANDBOX_DATA_PATH
                        Date exchange sandboxPath data path (default set by
                        configuration
  --config_path CONFIG_PATH
                        Path to configuration options file
  --config_name CONFIG_NAME
                        Configuration section name
  --db_type DB_TYPE     Database server type (default=mongo)
  --read_back_check     Perform read back check on all documents
  --num_proc NUM_PROC   Number of processes to execute (default=2)
  --chunk_size CHUNK_SIZE
                        Number of files loaded per process
  --document_limit DOCUMENT_LIMIT
                        Load document limit for testing
  --prune_document_size PRUNE_DOCUMENT_SIZE
                        Prune large documents to this size limit (MB)
  --debug               Turn on verbose logging
  --mock                Use MOCK repository configuration for testing
  --cache_path CACHE_PATH
                        Path containing cache directories
  --rebuild_cache       Rebuild cached resource files
________________________________________________________________________________

```

### Additional Examples

(*Note: The examples below are outdated and may not function as described. They are only kept here for historical reference.*)

If you are working in the source repository, then you can run the CLI commands in the following manner.
The following examples load data from the mock repositories in the source distribution, assuming you have a local
default installation of MongoDB (no username or password assigned).

To run the command-line interface `exdb_repo_load_cli` outside of the source distribution, you will need to
create a configuration file with the appropriate path details and authentication credentials.

For instance, to perform a fresh/full load of all of the chemical component definitions in the mock repository:

```bash

cd rcsb/db/cli
python RepoLoadExec.py --full  --load_chem_comp_ref  \
                      --config_path ../config/exdb-config-example.yml \
                      --config_name site_info_configuration \
                      --fail_file_list_path failed-cc-path-list.txt \
                      --read_back_check
```

The following illustrates a full load of the mock structure data repository, followed by a reload with replacement of
this same data.

```bash

cd rcsb/db/cli
python RepoLoadExec.py  --mock --full  --load_entry_data \
                     --config_path ../config/exdb-config-example.yml \
                     --config_name site_info_configuration \
                     --save_file_list_path  LATEST_PDBX_LOAD_LIST.txt \
                     --fail_file_list_path failed-entry-path-list.txt

python RepoLoadExec.py --mock --replace  --load_entry_data \
                      --config_path ../config/exdb-config-example.yml \
                      --config_name site_info_configuration \
                      --load_file_list_path  LATEST_PDBX_LOAD_LIST.txt \
                      --fail_file_list_path failed-entry-path-list.txt
```

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/rcsb/py-rcsb_db",
    "name": "rcsb.db",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": null,
    "author": "John Westbrook",
    "author_email": "john.westbrook@rcsb.org",
    "download_url": "https://files.pythonhosted.org/packages/91/9a/f16885dc597d993e259b2f23adb03b78805121b38e9ba31a5974bbd4436e/rcsb_db-1.721.tar.gz",
    "platform": null
}
--database {pdbx_core,pdbx_comp_model_core,bird_chem_comp_core,chem_comp,chem_comp_core,bird_chem_comp,bird,bird_family,ihm_dev}\n                        Database to load (most common choices are:\n                        'pdbx_core', 'pdbx_comp_model_core', or\n                        'bird_chem_comp_core')\n  --config_path CONFIG_PATH\n                        Path to configuration options file\n  --config_name CONFIG_NAME\n                        Configuration section name\n  --document_style DOCUMENT_STYLE\n                        Document organization (rowwise_by_name_with_c\n                        ardinality|rowwise_by_name|columnwise_by_name\n                        |rowwise_by_id|rowwise_no_name)\n  --cache_path CACHE_PATH\n                        Cache path for resource files\n  --num_proc NUM_PROC   Number of processes to execute (default=2)\n  --chunk_size CHUNK_SIZE\n                        Number of files loaded per process\n  --max_step_length MAX_STEP_LENGTH\n                        Maximum subList size (default=500)\n  --schema_level SCHEMA_LEVEL\n                        Schema validation level (full|min)\n  --collection_list COLLECTION_LIST\n                        Specific collections to load\n  --load_id_list_path LOAD_ID_LIST_PATH\n                        Input file containing the list of IDs to load\n                        in the current iteration by a single worker\n  --holdings_file_path HOLDINGS_FILE_PATH\n                        File containing the complete list of all IDs\n                        (or holdings files) that will be loaded\n  --load_file_list_path LOAD_FILE_LIST_PATH\n                        Input file containing load file path list\n                        (override automatic repository scan)\n  --fail_file_list_path FAIL_FILE_LIST_PATH\n                        Output file containing file paths that fail\n                        to load\n  --save_file_list_path SAVE_FILE_LIST_PATH\n                        Save repo file 
paths from automatic file\n                        system scan in this path\n  --load_file_list_dir LOAD_FILE_LIST_DIR\n                        Directory path for storing load file lists\n  --num_sublists NUM_SUBLISTS\n                        Number of sublists to create/load for the\n                        associated database\n  --force_reload        Force re-load of provided ID list (i.e.,\n                        don't just load delta; useful for manual/test\n                        runs).\n  --provider_types_exclude\n                        Resource provider types to exclude\n  --db_type DB_TYPE     Database server type (default=mongo)\n  --file_limit FILE_LIMIT\n                        Load file limit for testing\n  --prune_document_size PRUNE_DOCUMENT_SIZE\n                        Prune large documents to this size limit (MB)\n  --regex_purge         Perform additional regex-based purge of all\n                        pre-existing documents for loadType != 'full'\n  --data_selectors  [ ...]\n                        Data selectors, space-separated.\n  --disable_read_back_check\n                        Disable read back check on all documents\n  --disable_merge_validation_reports\n                        Disable merging of validation report data\n                        with the primary content type\n  --debug               Turn on verbose logging\n  --mock                Use MOCK repository configuration for testing\n  --rebuild_cache       Rebuild cached resource files\n  --rebuild_schema      Rebuild schema on-the-fly if not cached\n  --vrpt_repo_path VRPT_REPO_PATH\n                        Path to validation report repository\n________________________________________________________________________________\n```\n\n##### Example Usage\nThe following commands demonstrate how each type of operation (`--op`) is used for loading of PDB repository data to ExDB. 
For all commands, the following environment variables must first be set:\n\n```bash\nexport CONFIG_SUPPORT_TOKEN_ENV=personal_token_used_for_decrypting_config_variables\nexport OE_LICENSE=/path/to/oe_license.txt\nexport NLTK_DATA=/path/to/nltk_data\n```\n\n`--op build_resource_cache` - Build the external resource cache used for, and integrated with, the loading of PDB structure data.\n```bash\nexdb_repo_load_cli --op \"build_resource_cache\" \\\n--config_path \"/opt/etl-scratch/config/exdb-loader-config.yml\" \\\n--config_name \"site_info_remote_configuration\" \\\n--num_proc 6  \\\n--cache_path \"/opt/etl-scratch/data/CACHE\" \\\n\n```\n\n`--op pdbx_db_wiper` - Wipe the pre-existing database (and all of its collections).\n```bash\nexdb_repo_load_cli --op \"pdbx_db_wiper\" \\\n--database \"pdbx_core\" \\\n--config_path \"/opt/etl-scratch/config/exdb-loader-config.yml\" \\\n--config_name \"site_info_remote_configuration\" \\\n--cache_path \"/opt/etl-scratch/data/CACHE\" \\\n\n```\n\n`--op pdbx_id_list_splitter` - Split the full list of input IDs into smaller, equally-sized sublists.\n```bash\nexdb_repo_load_cli --op \"pdbx_id_list_splitter\" \\\n--database \"pdbx_core\" \\\n--config_path \"/opt/etl-scratch/config/exdb-loader-config.yml\" \\\n--config_name \"site_info_remote_configuration\" \\\n--cache_path \"/opt/etl-scratch/data/CACHE\" \\\n--load_file_list_dir \"/opt/etl-scratch/work-dir/load_file_lists\" \\\n--holdings_file_path \"https://files.wwpdb.org/pub/pdb/holdings/released_structures_last_modified_dates.json.gz\" \\\n--num_sublists 10 \\\n\n```\n\n`--op pdbx_loader` - Load a list of entry IDs to ExDB.\n```bash\nexdb_repo_load_cli --op \"pdbx_loader\" \\\n--database \"pdbx_core\" \\\n--load_type replace  \\\n--config_path /opt/etl-scratch/config/exdb-loader-config.yml \\\n--config_name site_info_remote_configuration \\\n--num_proc 8  \\\n--chunk_size 5  \\\n--max_step_length 500 \\\n--load_id_list_path 
\"/opt/etl-scratch/work-dir/load_file_lists/pdbx_core_ids-1.txt\" \\\n--cache_path \"/opt/etl-scratch/data/CACHE\" \\\n\n```\n\n`--op pdbx_loader_check` - Check the resulting ExDB database to confirm that all expected documents were loaded.\n```bash\nexdb_repo_load_cli --op \"pdbx_loader_check\" \\\n--database \"pdbx_core\" \\\n--config_path \"/opt/etl-scratch/config/exdb-loader-config.yml\" \\\n--config_name \"site_info_remote_configuration\" \\\n--cache_path \"/opt/etl-scratch/data/CACHE\" \\\n--load_file_list_dir \"/opt/etl-scratch/work-dir/load_file_lists\" \\\n--holdings_file_path \"https://files.wwpdb.org/pub/pdb/holdings/released_structures_last_modified_dates.json.gz\" \\\n--num_sublists 10 \\\n\n```\n\n#### Repository Scanning\n\nPart of the schema definition process supported by this module involves refining\nthe dictionary metadata with more specific data typing and coverage details.\nA scanning tool is provided to collect and organize these details for the\nother ETL tools in this package.  
The following convenience CLI, `repo_scan_cli`,\nis provided to scan supported PDB repository content and update data type and coverage details.\n\n```bash\nrepo_scan_cli --help\n\nusage: repo_scan_cli [-h] [--scanType SCANTYPE]\n                     [--scan_chem_comp_ref | --scan_chem_comp_core_ref | --scan_bird_chem_comp_ref | --scan_bird_chem_comp_core_ref | --scan_bird_ref | --scan_bird_family_ref | --scan_entry_data | --scan_ihm_dev]\n                     [--config_path CONFIG_PATH] [--config_name CONFIG_NAME]\n                     [--input_file_list_path INPUT_FILE_LIST_PATH]\n                     [--output_file_list_path OUTPUT_FILE_LIST_PATH]\n                     [--fail_file_list_path FAIL_FILE_LIST_PATH]\n                     [--scan_data_file_path SCAN_DATA_FILE_PATH]\n                     [--coverage_file_path COVERAGE_FILE_PATH]\n                     [--type_map_file_path TYPE_MAP_FILE_PATH]\n                     [--num_proc NUM_PROC] [--chunk_size CHUNK_SIZE]\n                     [--file_limit FILE_LIMIT] [--debug] [--mock]\n                     [--working_path WORKING_PATH]\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --scanType SCANTYPE   Repository scan type (full|incr)\n  --scan_chem_comp_ref  Scan Chemical Component reference definitions (public\n                        subset)\n  --scan_chem_comp_core_ref\n                        Scan Chemical Component Core reference definitions\n                        (public subset)\n  --scan_bird_chem_comp_ref\n                        Scan Bird Chemical Component reference definitions\n                        (public subset)\n  --scan_bird_chem_comp_core_ref\n                        Scan Bird Chemical Component Core reference\n                        definitions (public subset)\n  --scan_bird_ref       Scan Bird reference definitions (public subset)\n  --scan_bird_family_ref\n                        Scan Bird Family reference definitions (public subset)\n  
--scan_entry_data     Scan PDB entry data (current released subset)\n  --scan_ihm_dev        Scan PDBDEV I/HM entry data (current released subset)\n  --config_path CONFIG_PATH\n                        Path to configuration options file\n  --config_name CONFIG_NAME\n                        Configuration section name\n  --input_file_list_path INPUT_FILE_LIST_PATH\n                        Input file containing file paths to scan\n  --output_file_list_path OUTPUT_FILE_LIST_PATH\n                        Output file containing file paths scanned\n  --fail_file_list_path FAIL_FILE_LIST_PATH\n                        Output file containing file paths that fail scan\n  --scan_data_file_path SCAN_DATA_FILE_PATH\n                        Output working file storing scan data (Pickle)\n  --coverage_file_path COVERAGE_FILE_PATH\n                        Coverage map (JSON) output path\n  --type_map_file_path TYPE_MAP_FILE_PATH\n                        Type map (JSON) output path\n  --num_proc NUM_PROC   Number of processes to execute (default=2)\n  --chunk_size CHUNK_SIZE\n                        Number of files loaded per process\n  --file_limit FILE_LIMIT\n                        Load file limit for testing\n  --debug               Turn on verbose logging\n  --mock                Use MOCK repository configuration for testing\n  --working_path WORKING_PATH\n                        Working path for temporary files\n________________________________________________________________________________\n\n```\n\n#### ETL Processing\nThe following CLI provides preliminary access to ETL functions for processing\nderived content types such as sequence comparison data.\n\n```bash\netl_exec_cli --help\nusage: etl_exec_cli [-h] [--full] [--etl_entity_sequence_clusters]\n                    [--etl_repository_holdings] [--data_set_id DATA_SET_ID]\n                    [--sequence_cluster_data_path SEQUENCE_CLUSTER_DATA_PATH]\n                    [--sandbox_data_path SANDBOX_DATA_PATH]\n         
           [--config_path CONFIG_PATH] [--config_name CONFIG_NAME]\n                    [--db_type DB_TYPE] [--read_back_check]\n                    [--num_proc NUM_PROC] [--chunk_size CHUNK_SIZE]\n                    [--document_limit DOCUMENT_LIMIT]\n                    [--prune_document_size PRUNE_DOCUMENT_SIZE] [--debug]\n                    [--mock] [--cache_path CACHE_PATH] [--rebuild_cache]\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --full                Fresh full load in a new tables/collections (Default)\n  --etl_entity_sequence_clusters\n                        ETL entity sequence clusters\n  --etl_repository_holdings\n                        ETL repository holdings\n  --data_set_id DATA_SET_ID\n                        Data set identifier (default= 2018_14 for current\n                        week)\n  --sequence_cluster_data_path SEQUENCE_CLUSTER_DATA_PATH\n                        Sequence cluster data path (default set by\n                        configuration\n  --sandbox_data_path SANDBOX_DATA_PATH\n                        Date exchange sandboxPath data path (default set by\n                        configuration\n  --config_path CONFIG_PATH\n                        Path to configuration options file\n  --config_name CONFIG_NAME\n                        Configuration section name\n  --db_type DB_TYPE     Database server type (default=mongo)\n  --read_back_check     Perform read back check on all documents\n  --num_proc NUM_PROC   Number of processes to execute (default=2)\n  --chunk_size CHUNK_SIZE\n                        Number of files loaded per process\n  --document_limit DOCUMENT_LIMIT\n                        Load document limit for testing\n  --prune_document_size PRUNE_DOCUMENT_SIZE\n                        Prune large documents to this size limit (MB)\n  --debug               Turn on verbose logging\n  --mock                Use MOCK repository configuration for testing\n  --cache_path CACHE_PATH\n       
                 Path containing cache directories\n  --rebuild_cache       Rebuild cached resource files\n________________________________________________________________________________\n\n```\n\n### Additional Examples\n\n(*Note: The examples below are outdated and may not function as described. They are kept here only for historical reference.*)\n\nIf you are working in the source repository, then you can run the CLI commands in the following manner.\nThe following examples load data from the mock repositories in the source distribution, assuming you have a local\ndefault installation of MongoDB (no user/pw assigned).\n\nTo run the command-line interface `exdb_repo_load_cli` outside of the source distribution, you will need to\ncreate a configuration file with the appropriate path details and authentication credentials.\n\nFor instance, to perform a fresh/full load of all of the chemical component definitions in the mock repository:\n\n```bash\n\ncd rcsb/db/cli\npython RepoLoadExec.py --full  --load_chem_comp_ref  \\\n                      --config_path ../config/exdb-config-example.yml \\\n                      --config_name site_info_configuration \\\n                      --fail_file_list_path failed-cc-path-list.txt \\\n                      --read_back_check\n```\n\nThe following illustrates a full load of the mock structure data repository followed by a reload with replacement of\nthis same data.\n\n```bash\n\ncd rcsb/db/cli\npython RepoLoadExec.py  --mock --full  --load_entry_data \\\n                     --config_path ../config/exdb-config-example.yml \\\n                     --config_name site_info_configuration \\\n                     --save_file_list_path  LATEST_PDBX_LOAD_LIST.txt \\\n                     --fail_file_list_path failed-entry-path-list.txt\n\npython RepoLoadExec.py --mock --replace  --load_entry_data \\\n                      --config_path ../config/exdb-config-example.yml \\\n                      --config_name site_info_configuration \\\n 
                     --load_file_list_path  LATEST_PDBX_LOAD_LIST.txt \\\n                      --fail_file_list_path failed-entry-path-list.txt\n```\n",
    "bugtrack_url": null,
    "license": "Apache 2.0",
    "summary": "RCSB Python Database Access and Loading Utility Classes",
    "version": "1.721",
    "project_urls": {
        "Homepage": "https://github.com/rcsb/py-rcsb_db"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "919af16885dc597d993e259b2f23adb03b78805121b38e9ba31a5974bbd4436e",
                "md5": "514336c61bb648df60d81d02e746c52f",
                "sha256": "c8950e41522fd19ad38f39bcd138befbb5069bbcc62be5b243d26e81a2db6cf9"
            },
            "downloads": -1,
            "filename": "rcsb_db-1.721.tar.gz",
            "has_sig": false,
            "md5_digest": "514336c61bb648df60d81d02e746c52f",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 170450,
            "upload_time": "2024-05-14T12:26:53",
            "upload_time_iso_8601": "2024-05-14T12:26:53.171759Z",
            "url": "https://files.pythonhosted.org/packages/91/9a/f16885dc597d993e259b2f23adb03b78805121b38e9ba31a5974bbd4436e/rcsb_db-1.721.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-05-14 12:26:53",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "rcsb",
    "github_project": "py-rcsb_db",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "tox": true,
    "lcname": "rcsb.db"
}
