- Name: refscan
- Version: 0.2.0
- Summary: Command-line program that scans the NMDC MongoDB database for referential integrity violations
- Upload time: 2025-01-13 07:48:39
- Requires Python: `<4.0,>=3.9`
- Keywords: mongodb, mongo, relationships, reference, database, data, referential integrity, scan
- Download: https://files.pythonhosted.org/packages/25/f2/a5d17bc4f234da68a00dc18715a9dca226e5b44973807dc3e3508f7b9cd1/refscan-0.2.0.tar.gz

# refscan

`refscan` is a command-line tool people can use to **scan** the [NMDC](https://microbiomedata.org/) MongoDB database
for referential integrity violations.

```mermaid
%% This is the source code of a Mermaid diagram, which GitHub will render as a diagram.
%% Note: PyPI does not render Mermaid diagrams, and instead displays their source code.
%%       Reference: https://github.com/pypi/warehouse/issues/13083
graph LR
    schema[LinkML<br>schema]
    database[(MongoDB<br>database)]
    script[["refscan"]]
    violations["List of<br>violations"]
    references["List of<br>references"]:::dashed_border
    schema --> script
    database --> script
    script -.-> references
    script --> violations
    
    classDef dashed_border stroke-dasharray: 5 5
```

In addition to scanning the NMDC MongoDB database for referential integrity violations, `refscan` can
generate **graphs** (diagrams) depicting which collections' documents (or which classes' instances) can contain
references to which _other_ collections' documents (or classes' instances) while still being
schema compliant.

<!-- Note: We removed the hard-coded Table of Contents because—nowadays—GitHub automatically derives/presents one. -->

## How it works

Here is a summary of how each of `refscan`'s main functions works under the hood.

### Scan

`refscan` scans the database in two stages:
1. It uses the LinkML schema to determine where references _can_ exist in a MongoDB database that conforms to the schema.
   > **Example:** The schema might say that, if a document in the `biosample_set` collection has a field named
   > `associated_studies`, that field must contain a list of `id`s of documents in the `study_set` collection.
2. It scans the MongoDB database to check the integrity of all the references that _do_ exist.
   > **Example:** For each document in the `biosample_set` collection that _has_ a field named `associated_studies`,
   > for each value in that field, confirm there _is_ a document having that `id` in the `study_set` collection.
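
The two stages above can be sketched in plain Python. This is a minimal illustration, not `refscan`'s actual
implementation: the `REFERENCE_RULES` table, the `find_violations` function, and the in-memory `database` dictionary
are hypothetical stand-ins for the schema-derived list of references and for the MongoDB collections.

```python
# Stage 1 (assumed result): where references can exist, as could be derived from the schema.
# Each rule says: this field of documents in the source collection holds `id`s of documents in a target collection.
REFERENCE_RULES = [
    {"source_collection": "biosample_set", "field": "associated_studies", "target_collections": ["study_set"]},
]


def find_violations(database, rules):
    """Return one record per reference that points to a document that does not exist."""
    violations = []
    for rule in rules:
        # Collect the `id`s present in the collections this reference is allowed to point to.
        target_ids = {doc["id"] for name in rule["target_collections"] for doc in database.get(name, [])}
        for doc in database.get(rule["source_collection"], []):
            if rule["field"] not in doc:
                continue  # only references that actually exist are checked
            value = doc[rule["field"]]
            referenced_ids = value if isinstance(value, list) else [value]
            for referenced_id in referenced_ids:
                if referenced_id not in target_ids:
                    violations.append(
                        {
                            "source_collection": rule["source_collection"],
                            "source_id": doc["id"],
                            "field": rule["field"],
                            "missing_id": referenced_id,
                        }
                    )
    return violations


# Stage 2: scan a toy "database" (a dict of lists standing in for MongoDB collections).
database = {
    "study_set": [{"id": "study-1"}],
    "biosample_set": [
        {"id": "biosample-1", "associated_studies": ["study-1"]},
        {"id": "biosample-2", "associated_studies": ["study-1", "study-2"]},
        {"id": "biosample-3"},
    ],
}
violations = find_violations(database, REFERENCE_RULES)
print(violations)
```

In this toy data, only the reference from `biosample-2` to `study-2` is reported, because no document with that `id`
exists in `study_set`.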

### Graph

`refscan` generates a graph in three stages:
1. It uses the LinkML schema to determine where references _can_ exist in a MongoDB database that conforms to the schema.
2. It formats that list of references into a data structure compatible with [`Cytoscape.js`](https://js.cytoscape.org/).
3. It outputs an HTML document that uses `Cytoscape.js` to visualize that data structure as a graph.
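
The second stage can be illustrated with a short Python sketch. It assumes the list of references has already been
reduced to `(source, target)` pairs; the `to_cytoscape_elements` function and the sample pair are hypothetical and
are not taken from `refscan`'s source code. The output follows the general `Cytoscape.js` convention in which each
element has a `data` object, nodes carry an `id`, and edges carry a `source` and a `target`.

```python
import json


def to_cytoscape_elements(references):
    """Convert (source, target) pairs into Cytoscape.js-style node and edge elements."""
    names = sorted({name for pair in references for name in pair})
    nodes = [{"data": {"id": name}} for name in names]
    edges = [
        {"data": {"id": f"{source}->{target}", "source": source, "target": target}}
        for source, target in references
    ]
    return nodes + edges


# Hypothetical stage 1 output: which collections can refer to which other collections.
references = [("biosample_set", "study_set")]
elements = to_cytoscape_elements(references)

# Stage 3 would embed this JSON in an HTML page that passes it to Cytoscape.js as its `elements`.
elements_json = json.dumps(elements, indent=2)
print(elements_json)
```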

## Assumptions

`refscan` was designed under the assumption that **every document** in **every collection described by the schema** has
a **field named `type`**, whose value is the [class_uri](https://linkml.io/linkml/code/metamodel.html#linkml_runtime.linkml_model.meta.ClassDefinition.class_uri) of the schema class the document represents an instance
of. `refscan` uses that `class_uri` value (in that `type` field) to determine the _name_ of that schema class,
whose definition `refscan` then uses to determine _which fields_ of that document can contain references.
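
As a rough illustration of that lookup, the sketch below resolves a document's `type` value to a class name and then
to the fields that can contain references. Both lookup tables and the `get_reference_fields` function are
hypothetical; in practice, that information would be derived from the LinkML schema.

```python
# Hypothetical mapping from `class_uri` to class name, as could be derived from a schema.
CLASS_NAME_BY_URI = {"nmdc:Biosample": "Biosample", "nmdc:Study": "Study"}

# Hypothetical mapping from class name to the names of its fields that can contain references.
REFERENCE_FIELDS_BY_CLASS = {"Biosample": ["associated_studies"], "Study": []}


def get_reference_fields(document):
    """Return the names of the fields that can contain references, according to the document's class."""
    class_uri = document["type"]  # raises KeyError if the `type` assumption does not hold
    class_name = CLASS_NAME_BY_URI[class_uri]
    return REFERENCE_FIELDS_BY_CLASS[class_name]


document = {"id": "biosample-1", "type": "nmdc:Biosample", "associated_studies": ["study-1"]}
print(get_reference_fields(document))
```

A document that lacks a `type` field makes this lookup fail, which is why the assumption matters.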

## Usage

### Install

Assuming you have `pipx` installed, you can install the tool by running the following command:

```shell
pipx install refscan
```

> [`pipx`](https://pipx.pypa.io/stable/) is a tool people can use to
> [download and install](https://pipx.pypa.io/stable/#where-does-pipx-install-apps-from)
> Python scripts that are hosted on PyPI.
> You can [install `pipx`](https://pipx.pypa.io/stable/installation/) by running `$ python -m pip install pipx`.

### Run

Once installed, you can display the tool's `--help` snippet by running:

```shell
refscan --help
```

At the time of this writing, the tool's `--help` snippet is:

```console
 Usage: refscan [OPTIONS] COMMAND [ARGS]...

╭─ Options ──────────────────────────────────────────────────────────────────────────────╮
│ --help          Show this message and exit.                                            │
╰────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ─────────────────────────────────────────────────────────────────────────────╮
│ version   Show version number and exit.                                                │
│ scan      Scan the NMDC MongoDB database for referential integrity violations.         │
│ graph     Generate an interactive graph of the references described by a schema.       │
╰────────────────────────────────────────────────────────────────────────────────────────╯
```

<!-- Note: The above snippet was captured from a terminal window whose width was 90 characters. -->

Each command has its own `--help` snippet.

#### The `scan` command

At the time of this writing, the `--help` snippet for the `scan` command is:

```console
 Usage: refscan scan [OPTIONS]

 Scan the NMDC MongoDB database for referential integrity violations.

╭─ Options ──────────────────────────────────────────────────────────────────────────────╮
│ *  --schema                               FILE  Filesystem path at which the YAML file │
│                                                 representing the schema is located.    │
│                                                 [default: None]                        │
│                                                 [required]                             │
│    --database-name                        TEXT  Name of the database.                  │
│                                                 [default: nmdc]                        │
│    --mongo-uri                            TEXT  Connection string for accessing the    │
│                                                 MongoDB server. If you have Docker     │
│                                                 installed, you can spin up a temporary │
│                                                 MongoDB server at the default URI by   │
│                                                 running: $ docker run --rm --detach -p │
│                                                 27017:27017 mongo                      │
│                                                 [env var: MONGO_URI]                   │
│                                                 [default: mongodb://localhost:27017]   │
│    --verbose                                    Show verbose output.                   │
│    --skip-source-collection,--skip        TEXT  Name of collection you do not want to  │
│                                                 search for referring documents. Option │
│                                                 can be used multiple times.            │
│                                                 [default: None]                        │
│    --reference-report                     FILE  Filesystem path at which you want the  │
│                                                 program to generate its reference      │
│                                                 report.                                │
│                                                 [default: references.tsv]              │
│    --violation-report                     FILE  Filesystem path at which you want the  │
│                                                 program to generate its violation      │
│                                                 report.                                │
│                                                 [default: violations.tsv]              │
│    --no-scan                                    Generate a reference report, but do    │
│                                                 not scan the database for violations.  │
│    --locate-misplaced-documents                 For each referenced document not found │
│                                                 in any of the collections the schema   │
│                                                 allows, also search for it in all      │
│                                                 other collections.                     │
│    --help                                       Show this message and exit.            │
╰────────────────────────────────────────────────────────────────────────────────────────╯
```

<!-- Note: The above snippet was captured from a terminal window whose width was 90 characters. -->
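
If you want to launch a scan from a Python script, one possible approach is to build the argument list and pass it
to `subprocess`. The sketch below only uses options shown in the `--help` snippet above; the `build_scan_command`
helper and the collection names are hypothetical.

```python
import shlex


def build_scan_command(schema_path, skip_collections=(), database_name="nmdc"):
    """Build the argument list for a `refscan scan` invocation."""
    command = ["refscan", "scan", "--schema", schema_path, "--database-name", database_name]
    for collection_name in skip_collections:
        # `--skip-source-collection` can be used multiple times, once per collection.
        command += ["--skip-source-collection", collection_name]
    return command


command = build_scan_command("schema.yaml", skip_collections=["collection_a", "collection_b"])
print(shlex.join(command))

# To actually run it (requires `refscan` to be installed and a reachable MongoDB server):
#   import subprocess
#   subprocess.run(command, check=True)
```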

##### The MongoDB connection string (`--mongo-uri`)

As documented in the `--help` snippet above, you can provide the MongoDB connection string to the tool via either
(a) the `--mongo-uri` option; or (b) an environment variable named `MONGO_URI`. The latter can come in handy
when the connection string contains information, such as a password, that you don't want to type into every
`refscan` command. Keep in mind that the `export` command itself may still be recorded in your shell history.

Here's how you could create that environment variable:

```shell
export MONGO_URI='mongodb://username:password@localhost:27017'
```
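
The way an option, an environment variable, and a default value typically combine can be sketched as follows. This
is an assumption about the usual precedence for command-line tools (explicit option first, then environment
variable, then default), not an excerpt from `refscan`; the `resolve_mongo_uri` function is hypothetical.

```python
import os

DEFAULT_MONGO_URI = "mongodb://localhost:27017"  # the default shown in the `--help` snippet


def resolve_mongo_uri(cli_value=None, environ=os.environ):
    """Prefer an explicit CLI value, then the MONGO_URI environment variable, then the default."""
    if cli_value is not None:
        return cli_value
    return environ.get("MONGO_URI", DEFAULT_MONGO_URI)


print(resolve_mongo_uri(environ={}))
print(resolve_mongo_uri(environ={"MONGO_URI": "mongodb://example-host:27017"}))
```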

##### The schema (`--schema`)

As documented in the `--help` snippet above, you can provide the path to a YAML-formatted LinkML schema file to the tool
via the `--schema` option.

<details>

<summary>
Show/hide tips for getting a schema file
</summary>

---

If you have `curl` installed, you can download a YAML file from GitHub by running the following command (after replacing
the `{...}` placeholders and customizing the path):

```shell
# Download the raw content of https://github.com/{user_or_org}/{repo}/blob/{branch}/path/to/schema.yaml
curl -o schema.yaml https://raw.githubusercontent.com/{user_or_org}/{repo}/{branch}/path/to/schema.yaml
```

For example:

```shell
# Download the raw content of https://github.com/microbiomedata/nmdc-schema/blob/main/nmdc_schema/nmdc_materialized_patterns.yaml
curl -o schema.yaml https://raw.githubusercontent.com/microbiomedata/nmdc-schema/main/nmdc_schema/nmdc_materialized_patterns.yaml

# Download the raw content of https://github.com/microbiomedata/nmdc-schema/blob/v11.2.1/nmdc_schema/nmdc_materialized_patterns.yaml
curl -o schema.yaml https://raw.githubusercontent.com/microbiomedata/nmdc-schema/v11.2.1/nmdc_schema/nmdc_materialized_patterns.yaml
```

---
</details>

##### Output

While `refscan` is running, it will display console output indicating what it's currently doing.

![Screenshot of refscan console output](./docs/refscan-screenshot.png)

Once the scan is complete, the reference report (TSV file) and violation report (TSV file) will be available
at `references.tsv` and `violations.tsv` in the current directory (or at custom paths, if any were specified via the
`--reference-report` and `--violation-report` options).
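
Because the reports are TSV files, they can be loaded with Python's standard `csv` module. The sketch below assumes
that each report begins with a header row; it makes no assumption about the column names, and the `read_report`
function is hypothetical.

```python
import csv
import os


def read_report(path):
    """Read a TSV report into a list of dictionaries keyed by the file's header row."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter="\t"))


# Summarize the violation report if one exists at the default path.
if os.path.exists("violations.tsv"):
    rows = read_report("violations.tsv")
    print(f"{len(rows)} violation(s) reported")
```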

#### The `graph` command

At the time of this writing, the `--help` snippet for the `graph` command is:

```console
 Usage: refscan graph [OPTIONS]

 Generate an interactive graph of the references described by a schema.

╭─ Options ──────────────────────────────────────────────────────────────────────────────╮
│ *  --schema         FILE                Filesystem path at which the YAML file         │
│                                         representing the schema is located.            │
│                                         [default: None]                                │
│                                         [required]                                     │
│    --graph          FILE                Filesystem path at which you want refscan to   │
│                                         generate the graph.                            │
│                                         [default: graph.html]                          │
│    --subject        [collection|class]  Whether you want each node of the graph to     │
│                                         represent a collection or a class.             │
│                                         [default: collection]                          │
│    --verbose                            Show verbose output.                           │
│    --help                               Show this message and exit.                    │
╰────────────────────────────────────────────────────────────────────────────────────────╯
```

<!-- Note: The above snippet was captured from a terminal window whose width was 90 characters. -->

### Update

You can update the tool to [the latest version available on PyPI](https://pypi.org/project/refscan/) by running:

```shell
pipx upgrade refscan
```

### Uninstall

You can uninstall the tool from your computer by running:

```shell
pipx uninstall refscan
```

## Development

We use [Poetry](https://python-poetry.org/) to both (a) manage dependencies and (b) build distributable packages that can be published to PyPI.

- `pyproject.toml`: Configuration file for Poetry and other tools (was initialized via `$ poetry init`)
- `poetry.lock`: List of dependencies, both direct and [indirect/transitive](https://en.wikipedia.org/wiki/Transitive_dependency)

### Clone repository

```shell
git clone https://github.com/microbiomedata/refscan.git
cd refscan
```

### Create virtual environment

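Create a Poetry virtual environment and attach to its shell (in Poetry 2.0 and later, the `shell` command
is provided by the separate `poetry-plugin-shell` plugin):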

```shell
poetry shell
```

> You can see information about the Poetry virtual environment by running: `$ poetry env info`

> You can detach from the Poetry virtual environment's shell by running: `$ exit`

From now on, we'll refer to the Poetry virtual environment's shell as the "Poetry shell."

### Install dependencies

At the Poetry shell, install the project's dependencies:

```shell
poetry install
```

### Make changes

Edit the tool's source code and documentation however you want.

While editing the tool's source code, you can run the tool as you normally would (here, via `poetry run`) in order to test things out:

```shell
poetry run refscan --help
```

### Run tests

We use [pytest](https://docs.pytest.org/en/8.2.x/) as the testing framework for `refscan`.

Tests are defined in the `tests` directory.

You can run the tests by running the following command from the root directory of the repository:

```shell
poetry run pytest
```

### Format code

We use [`black`](https://black.readthedocs.io/en/stable/) as the code formatter for `refscan`.

We do not use it with its default options. Instead, we raise its maximum line length from the default 88 characters
to 120 characters. That option is defined in the `[tool.black]` section of `pyproject.toml`.

You can format all the Python code in the repository by running this command
from the root directory of the repository:

```shell
poetry run black .
```

#### Check format

You can _check_ the format of the Python code by including the `--check` option, like this:

```shell
poetry run black --check .
```

## Building and publishing

### Build for production

Whenever someone publishes a [GitHub Release](https://github.com/microbiomedata/refscan/releases) in this repository,
a [GitHub Actions workflow](.github/workflows/build-and-publish-package-to-pypi.yml)
will automatically build a package and publish it to [PyPI](https://pypi.org/project/refscan/).
That package will have a version identifier that matches the name of the Git tag associated with the Release.

### Test the build process locally

In case you want to test the build process locally, you can do so by running:

```shell
poetry build
```

> That will create both a
> [source distribution](https://setuptools.pypa.io/en/latest/deprecated/distutils/sourcedist.html#creating-a-source-distribution)
> file (whose name ends with `.tar.gz`) and a
> [wheel](https://packaging.python.org/en/latest/specifications/binary-distribution-format/#binary-distribution-format)
> file (whose name ends with `.whl`) in the `dist` directory.


            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "refscan",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.9",
    "maintainer_email": null,
    "keywords": "mongodb, mongo, relationships, reference, database, data, referential integrity, scan",
    "author": null,
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/25/f2/a5d17bc4f234da68a00dc18715a9dca226e5b44973807dc3e3508f7b9cd1/refscan-0.2.0.tar.gz",
    "platform": null,
    "description": "# refscan\n\n`refscan` is a command-line tool people can use to **scan** the [NMDC](https://microbiomedata.org/) MongoDB database\nfor referential integrity violations.\n\n```mermaid\n%% This is the source code of a Mermaid diagram, which GitHub will render as a diagram.\n%% Note: PyPI does not render Mermaid diagrams, and instead displays their source code.\n%%       Reference: https://github.com/pypi/warehouse/issues/13083\ngraph LR\n    schema[LinkML<br>schema]\n    database[(MongoDB<br>database)]\n    script[[\"refscan\"]]\n    violations[\"List of<br>violations\"]\n    references[\"List of<br>references\"]:::dashed_border\n    schema --> script\n    database --> script\n    script -.-> references\n    script --> violations\n    \n    classDef dashed_border stroke-dasharray: 5 5\n```\n\nIn addition to using refscan to scan the NMDC MongoDB database for referential integrity violations,\npeople can use `refscan` to generate **graphs** (diagrams) depicting which collections' documents (or which classes'\ninstances) can contain references to which _other_ collections' documents (or classes' instances) while still being\nschema compliant.\n\n<!-- Note: We removed the hard-coded Table of Contents because\u2014nowadays\u2014GitHub automatically derives/presents one. -->\n\n## How it works\n\nHere is a summary of how each of `refscan`'s main functions works under the hood.\n\n### Scan\n\n`refscan` does this in two stages:\n1. It uses the LinkML schema to determine where references _can_ exist in a MongoDB database that conforms to the schema.\n   > **Example:** The schema might say that, if a document in the `biosample_set` collection has a field named\n   > `associated_studies`, that field must contain a list of `id`s of documents in the `study_set` collection.\n2. 
It scans the MongoDB database to check the integrity of all the references that _do_ exist.\n   > **Example:** For each document in the `biosample_set` collection that _has_ a field named `associated_studies`,\n   > for each value in that field, confirm there _is_ a document having that `id` in the `study_set` collection.\n\n### Graph\n\n`refscan` does this in three stages:\n1. It uses the LinkML schema to determine where references _can_ exist in a MongoDB database that conforms to the schema.\n2. It formats that list of references into a data structure compatible with [`Cytoscape.js`](https://js.cytoscape.org/).\n3. It outputs an HTML document that uses `Cytoscape.js` to visualize that data structure as a graph.\n\n## Assumptions\n\n`refscan` was designed under the assumption that **every document** in **every collection described by the schema** has\na **field named `type`**, whose value is the [class_uri](https://linkml.io/linkml/code/metamodel.html#linkml_runtime.linkml_model.meta.ClassDefinition.class_uri) of the schema class the document represents an instance\nof. 
`refscan` uses that `class_uri` value (in that `type` field) to determine the _name_ of that schema class,\nwhose definition `refscan` then uses to determine _which fields_ of that document can contain references.\n\n## Usage\n\n### Install\n\nAssuming you have `pipx` installed, you can install the tool by running the following command:\n\n```shell\npipx install refscan\n```\n\n> [`pipx`](https://pipx.pypa.io/stable/) is a tool people can use to\n> [download and install](https://pipx.pypa.io/stable/#where-does-pipx-install-apps-from)\n> Python scripts that are hosted on PyPI.\n> You can [install `pipx`](https://pipx.pypa.io/stable/installation/) by running `$ python -m pip install pipx`.\n\n### Run\n\nOnce installed, you can display the tool's `--help` snippet by running:\n\n```shell\nrefscan --help\n```\n\nAt the time of this writing, the tool's `--help` snippet is:\n\n```console\n Usage: refscan [OPTIONS] COMMAND [ARGS]...\n\n\u256d\u2500 Options \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 --help          Show this message and exit.                                            
\u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n\u256d\u2500 Commands \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 version   Show version number and exit.                                                \u2502\n\u2502 scan      Scan the NMDC MongoDB database for referential integrity violations.         \u2502\n\u2502 graph     Generate an interactive graph of the references described by a schema.       \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n```\n\n<!-- Note: The above snippet was captured from a terminal window whose width was 90 characters. 
-->\n\nEach command has its own `--help` snippet.\n\n#### The `scan` command\n\nAt the time of this writing, the `--help` snippet for the `scan` command is:\n\n```console\n Usage: refscan scan [OPTIONS]\n\n Scan the NMDC MongoDB database for referential integrity violations.\n\n\u256d\u2500 Options \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 *  --schema                               FILE  Filesystem path at which the YAML file \u2502\n\u2502                                                 representing the schema is located.    \u2502\n\u2502                                                 [default: None]                        \u2502\n\u2502                                                 [required]                             \u2502\n\u2502    --database-name                        TEXT  Name of the database.                  \u2502\n\u2502                                                 [default: nmdc]                        \u2502\n\u2502    --mongo-uri                            TEXT  Connection string for accessing the    \u2502\n\u2502                                                 MongoDB server. 
If you have Docker     \u2502\n\u2502                                                 installed, you can spin up a temporary \u2502\n\u2502                                                 MongoDB server at the default URI by   \u2502\n\u2502                                                 running: $ docker run --rm --detach -p \u2502\n\u2502                                                 27017:27017 mongo                      \u2502\n\u2502                                                 [env var: MONGO_URI]                   \u2502\n\u2502                                                 [default: mongodb://localhost:27017]   \u2502\n\u2502    --verbose                                    Show verbose output.                   \u2502\n\u2502    --skip-source-collection,--skip        TEXT  Name of collection you do not want to  \u2502\n\u2502                                                 search for referring documents. Option \u2502\n\u2502                                                 can be used multiple times.            \u2502\n\u2502                                                 [default: None]                        \u2502\n\u2502    --reference-report                     FILE  Filesystem path at which you want the  \u2502\n\u2502                                                 program to generate its reference      \u2502\n\u2502                                                 report.                                \u2502\n\u2502                                                 [default: references.tsv]              \u2502\n\u2502    --violation-report                     FILE  Filesystem path at which you want the  \u2502\n\u2502                                                 program to generate its violation      \u2502\n\u2502                                                 report.                                
\u2502\n\u2502                                                 [default: violations.tsv]              \u2502\n\u2502    --no-scan                                    Generate a reference report, but do    \u2502\n\u2502                                                 not scan the database for violations.  \u2502\n\u2502    --locate-misplaced-documents                 For each referenced document not found \u2502\n\u2502                                                 in any of the collections the schema   \u2502\n\u2502                                                 allows, also search for it in all      \u2502\n\u2502                                                 other collections.                     \u2502\n\u2502    --help                                       Show this message and exit.            \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n```\n\n<!-- Note: The above snippet was captured from a terminal window whose width was 90 characters. -->\n\n##### The MongoDB connection string (`--mongo-uri`)\n\nAs documented in the `--help` snippet above, you can provide the MongoDB connection string to the tool via either\n(a) the `--mongo-uri` option; or (b) an environment variable named `MONGO_URI`. 
The latter can come in handy\nwhen the MongoDB connection string contains information you don't want to appear in your shell history,\nsuch as a password.\n\nHere's how you could create that environment variable:\n\n```shell  \nexport MONGO_URI='mongodb://username:password@localhost:27017'\n```\n\n##### The schema (`--schema`)\n\nAs documented in the `--help` snippet above, you can provide the path to a YAML-formatted LinkML schema file to the tool\nvia the `--schema` option.\n\n<details>\n\n<summary>\nShow/hide tips for getting a schema file\n</summary>\n\n---\n\nIf you have `curl` installed, you can download a YAML file from GitHub by running the following command (after replacing\nthe `{...}` placeholders and customizing the path):\n\n```shell\n# Download the raw content of https://github.com/{user_or_org}/{repo}/blob/{branch}/path/to/schema.yaml\ncurl -o schema.yaml https://raw.githubusercontent.com/{user_or_org}/{repo}/{branch}/path/to/schema.yaml\n```\n\nFor example:\n\n```shell\n# Download the raw content of https://github.com/microbiomedata/nmdc-schema/blob/main/nmdc_schema/nmdc_materialized_patterns.yaml\ncurl -o schema.yaml https://raw.githubusercontent.com/microbiomedata/nmdc-schema/main/nmdc_schema/nmdc_materialized_patterns.yaml\n\n# Download the raw content of https://github.com/microbiomedata/nmdc-schema/blob/v11.2.1/nmdc_schema/nmdc_materialized_patterns.yaml\ncurl -o schema.yaml https://raw.githubusercontent.com/microbiomedata/nmdc-schema/v11.2.1/nmdc_schema/nmdc_materialized_patterns.yaml\n```\n\n---\n</details>\n\n##### Output\n\nWhile `refscan` is running, it will display console output indicating what it's currently doing.\n\n![Screenshot of refscan console output](./docs/refscan-screenshot.png)\n\nOnce the scan is complete, the reference report (TSV file) and violation report (TSV file) will be available\nin the current directory (or in custom directories, if any were specified via CLI options).\n\n#### The `graph` command\n\nAt the time of 
this writing, the `--help` snippet for the `graph` command is:\n\n```console\n Usage: refscan graph [OPTIONS]\n\n Generate an interactive graph of the references described by a schema.\n\n\u256d\u2500 Options \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 *  --schema         FILE                Filesystem path at which the YAML file         \u2502\n\u2502                                         representing the schema is located.            \u2502\n\u2502                                         [default: None]                                \u2502\n\u2502                                         [required]                                     \u2502\n\u2502    --graph          FILE                Filesystem path at which you want refscan to   \u2502\n\u2502                                         generate the graph.                            \u2502\n\u2502                                         [default: graph.html]                          \u2502\n\u2502    --subject        [collection|class]  Whether you want each node of the graph to     \u2502\n\u2502                                         represent a collection or a class.             \u2502\n\u2502                                         [default: collection]                          \u2502\n\u2502    --verbose                            Show verbose output.                           \u2502\n\u2502    --help                               Show this message and exit.                    
\u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n```\n\n<!-- Note: The above snippet was captured from a terminal window whose width was 90 characters. -->\n\n### Update\n\nYou can update the tool to [the latest version available on PyPI](https://pypi.org/project/refscan/) by running:\n\n```shell\npipx upgrade refscan\n```\n\n### Uninstall\n\nYou can uninstall the tool from your computer by running:\n\n```shell\npipx uninstall refscan\n```\n\n## Development\n\nWe use [Poetry](https://python-poetry.org/) to both (a) manage dependencies and (b) build distributable packages that can be published to PyPI.\n\n- `pyproject.toml`: Configuration file for Poetry and other tools (was initialized via `$ poetry init`)\n- `poetry.lock`: List of dependencies, both direct and [indirect/transitive](https://en.wikipedia.org/wiki/Transitive_dependency)\n\n### Clone repository\n\n```shell\ngit clone https://github.com/microbiomedata/refscan.git\ncd refscan\n```\n\n### Create virtual environment\n\nCreate a Poetry virtual environment and attach to its shell:\n\n```shell\npoetry shell\n```\n\n> You can see information about the Poetry virtual environment by running: `$ poetry env info`\n\n> You can detach from the Poetry virtual environment's shell by running: `$ exit`\n\nFrom now on, I'll refer to the Poetry virtual environment's shell as the \"Poetry shell.\"\n\n### Install dependencies\n\nAt the Poetry shell, install the project's dependencies:\n\n```shell\npoetry install\n```\n\n### Make 
changes\n\nEdit the tool's source code and documentation however you want.\n\nWhile editing the tool's source code, you can run the tool as you normally would (prefixed with `poetry run`) to test your changes:\n\n```shell\npoetry run refscan --help\n```\n\n### Run tests\n\nWe use [pytest](https://docs.pytest.org/en/8.2.x/) as the testing framework for `refscan`.\n\nTests are defined in the `tests` directory.\n\nYou can run the tests with the following command, issued from the root directory of the repository:\n\n```shell\npoetry run pytest\n```\n\n### Format code\n\nWe use [`black`](https://black.readthedocs.io/en/stable/) as the code formatter for `refscan`.\n\nWe do not use it with its default options. Instead, we set an option that allows lines to be up to 120 characters long\ninstead of the default limit of 88 characters. That option is defined in the `[tool.black]` section of `pyproject.toml`.\n\nYou can format all the Python code in the repository by running this command\nfrom the root directory of the repository:\n\n```shell\npoetry run black .\n```\n\n#### Check format\n\nYou can _check_ the format of the Python code without modifying any files by including the `--check` option, like this:\n\n```shell\npoetry run black --check .\n```\n\n## Building and publishing\n\n### Build for production\n\nWhenever someone publishes a [GitHub Release](https://github.com/microbiomedata/refscan/releases) in this repository,\na [GitHub Actions workflow](.github/workflows/build-and-publish-package-to-pypi.yml)\nwill automatically build a package and publish it to [PyPI](https://pypi.org/project/refscan/).\nThat package will have a version identifier that matches the name of the Git tag associated with the Release.\n\n### Test the build process locally\n\nIf you want to test the build process locally, you can do so by running:\n\n```shell\npoetry build\n```\n\n> That will create both a\n> [source distribution](https://setuptools.pypa.io/en/latest/deprecated/distutils/sourcedist.html#creating-a-source-distribution)\n> file (whose name 
ends with `.tar.gz`) and a\n> [wheel](https://packaging.python.org/en/latest/specifications/binary-distribution-format/#binary-distribution-format)\n> file (whose name ends with `.whl`) in the `dist` directory.\n\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "Command-line program that scans the NMDC MongoDB database for referential integrity violations",
    "version": "0.2.0",
    "project_urls": {
        "Documentation": "https://github.com/microbiomedata/refscan",
        "Homepage": "https://github.com/microbiomedata/refscan",
        "Repository": "https://github.com/microbiomedata/refscan"
    },
    "split_keywords": [
        "mongodb",
        " mongo",
        " relationships",
        " reference",
        " database",
        " data",
        " referential integrity",
        " scan"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "46ce0b7d6b72cef3d70898fae3ccda50e332fb94531c1c3800771b00964dac6c",
                "md5": "09fcf1af7f929fe5aa935bbd33bdd339",
                "sha256": "32eb44e3ce6772d2ad8f291fd426892989a268f48898da85240d85a1d69e4c1e"
            },
            "downloads": -1,
            "filename": "refscan-0.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "09fcf1af7f929fe5aa935bbd33bdd339",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.9",
            "size": 27907,
            "upload_time": "2025-01-13T07:48:35",
            "upload_time_iso_8601": "2025-01-13T07:48:35.886188Z",
            "url": "https://files.pythonhosted.org/packages/46/ce/0b7d6b72cef3d70898fae3ccda50e332fb94531c1c3800771b00964dac6c/refscan-0.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "25f2a5d17bc4f234da68a00dc18715a9dca226e5b44973807dc3e3508f7b9cd1",
                "md5": "0137ff6fc24f9756610c66638b028db7",
                "sha256": "4a115c4e39c5c9d3439275ff03cb7b87367f61d47b4974ba3ba9a777a94ecdfc"
            },
            "downloads": -1,
            "filename": "refscan-0.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "0137ff6fc24f9756610c66638b028db7",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.9",
            "size": 25044,
            "upload_time": "2025-01-13T07:48:39",
            "upload_time_iso_8601": "2025-01-13T07:48:39.838438Z",
            "url": "https://files.pythonhosted.org/packages/25/f2/a5d17bc4f234da68a00dc18715a9dca226e5b44973807dc3e3508f7b9cd1/refscan-0.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-01-13 07:48:39",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "microbiomedata",
    "github_project": "refscan",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "refscan"
}
        