es-pii-tool


Namees-pii-tool JSON
Version 0.9.0 PyPI version JSON
download
home_pageNone
SummaryRedacting field data from your Elasticsearch indices and Searchable Snapshots
upload_time2024-10-02 00:03:30
maintainerNone
docs_urlNone
authorNone
requires_python>=3.8
licenseApache-2.0
keywords elasticsearch index pii redact
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # Elasticsearch PII Tool

Did you find PII (Personally Identifiable Information) in your Elasticsearch
indices that doesn't belong there? This is the tool for you!

`es-pii-tool` can help you redact information from even Searchable Snapshot
mounted indices. It works with deeply nested fields, too!


## Client Configuration

The tool connects using the [`es_client` Python module](https://es-client.readthedocs.io/).

You can use command-line options, or a YAML configuration file to configure the client connection.
If using a configuration file is desired, the configuration file structure requires
`elasticsearch` at the root level as follows:

```yaml
---
elasticsearch:
  client:
    hosts: https://10.11.12.13:9200
    cloud_id:
    request_timeout: 60
    verify_certs:
    ca_certs:
    client_cert:
    client_key:
  other_settings:
    username:
    password:
    api_key:
      id:
      api_key:
      token:

logging:
  loglevel: INFO
  logfile: /path/to/file.log
  logformat: default
  blacklist: []
```


## `REDACTIONS_FILE` Configuration

```
  ---
  redactions:
    - job_name_20240930_redact_hot:
        pattern: hot-*
        query: {'match': {'message': 'message1'}}
        fields: ['message']
        message: REDACTED
        expected_docs: 1
        restore_settings: {'index.routing.allocation.include._tier_preference': 'data_warm,data_hot,data_content'}
    - job_name_20240930_redact_cold:
        pattern: restored-cold-*
        query: {'match': {'nested.key': 'nested19'}}
        fields: ['nested.key']
        message: REDACTED
        expected_docs: 1
        restore_settings: {'index.routing.allocation.include._tier_preference': 'data_warm,data_hot,data_content'}
        forcemerge:
          max_num_segments: 1
    - job_name_20240930_redact_frozen:
        pattern: partial-frozen-*
        query: {'range': {'number': {'gte': 8, 'lte': 11}}}
        fields: ['deep.l1.l2.l3']
        message: REDACTED
        expected_docs: 4
        forcemerge:
          only_expunge_deletes: True
```

### Job Name

The job name _must_ be unique. Progress throughout a redaction job is tracked in 
a tracking index, the default name being `redactions-tracker`. If job progress is
interrupted for any reason, `es-pii-tool` will attempt to resume where it left off.

### `pattern`

The `pattern` setting defines which indices will be searched for documents to be
redacted. This can be [anything that Elasticsearch
supports](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-multiple-indices.html),
including wildcard globbing and comma separated values.

### `query`

The Elasticsearch query used to isolate the documents to redact. This should be
in the [Elasticsearch Query
DSL](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html)
format.

### `fields`

The fields to be redacted. If the fields are not at the root-level of the document,
deeply nested objects can be referred to using dotted notation.

This is an array value, so multiple lines can be specified.

**NOTE:** All documents to be redacted must have all of the fields defined in the
`fields` array or an error condition will result.

### `message`

This value will replace whatever was in each of the `fields` specified. The default
value is `REDACTED`. It can be any string value. Partial or sub-string replacements
are not supported at this time.

### `expected_docs`

This must match the exact number of hits that will result from executing `query`.

**NOTE:** This value cannot exceed 10,000. This is the default limit that
Elasticsearch imposes for the maximum number of results from a single query. In
the event that you have a single query that produces more than 10,000 hits, and
all need to be redacted, it is suggested that you find some other limiting
condition or filter to reduce the total hits below 10,000.

### `restore_settings`

When redacting data from indices in the `cold` or `frozen` tier, it is required
to fully restore them from the snapshot repository first. If you wish to apply
any specific settings to these indices to be restored, this is where to set them.

The primary use case for `restore_settings` is to force the indices to restore
to the `data_warm` tier, or some other targeted data node, but it can be used
to apply any other desired setting.

### `forcemerge`

This is only used for redacting searchable snapshot indices in the `cold` or
`frozen` tiers. Best practices is to force merge your indices to 1 segment per
shard before taking a snapshot and mounting in the `cold` or `frozen` tier. This
tool defaults to performing this force merge after redacting your data. But you
can set the force merge to `only_expunge_deletes`.

In Elasticsearch, a document, once indexed, is immutable. In order to redact a
document, Elasticsearch effectively deletes the original document and reindexes
the modified document in its place.

#### `max_num_segments`

The default value for this setting is `1`, meaning 1 segment per shard. This is
a rather "expensive" operation in terms of disk I/O, and if you are only removing
a small number of documents from an index or shard, it may not make much sense
to perform a full force merge. To avoid this, use the `only_expunge_deletes`
setting.

#### `only_expunge_deletes`

This is a boolean setting, either `true` or `false` (or `True` or `False`).

If `true`, the force merge will only delete the bits marked for deletion,
which are the original documents that were deleted as part of the redaction
process.

## Running `es_pii_tool`

### Command Line Execution

The script is run by executing `pii-tool` at the command-line. There are multiple
levels of configuration for `pii-tool`. The top level is for client connection to
Elasticsearch configuration.

Subsequent commands include `show-all-options` and `file-based` (meaning
redaction configuration derived from a YAML configuration file).

#### Client configuration 


```
pii-tool --help
Usage: pii-tool [OPTIONS] COMMAND [ARGS]...

  Elastic PII Tool

Options:
  --config PATH                   Path to configuration file.
  --hosts TEXT                    Elasticsearch URL to connect to.
  --cloud_id TEXT                 Elastic Cloud instance id
  --api_token TEXT                The base64 encoded API Key token
  --id TEXT                       API Key "id" value
  --api_key TEXT                  API Key "api_key" value
  --username TEXT                 Elasticsearch username
  --password TEXT                 Elasticsearch password
  --request_timeout FLOAT         Request timeout in seconds
  --verify_certs / --no-verify_certs
                                  Verify SSL/TLS certificate(s)
  --ca_certs TEXT                 Path to CA certificate file or directory
  --client_cert TEXT              Path to client certificate file
  --client_key TEXT               Path to client key file
  --loglevel [DEBUG|INFO|WARNING|ERROR|CRITICAL]
                                  Log level
  --logfile TEXT                  Log file
  --logformat [default|json|ecs]  Log output format
  -v, --version                   Show the version and exit.
  -h, --help                      Show this message and exit.

Commands:
  file-based        Redact from YAML config file
  show-all-options  Show all client configuration options
```

#### `show-all-options`

As the list of configuration options for just the client connection are quite
long, the default `--help` output is somewhat truncated. You can show all of the
configuration options, as well as reveal their environment variable names by
running: `pii-tool show-all-options`:

```
Usage: pii-tool show-all-options [OPTIONS]

  ALL OPTIONS SHOWN

  The full list of options available for configuring a connection at the command-line.

Options:
  --config PATH                   Path to configuration file.  [env var: ESCLIENT_CONFIG]
  --hosts TEXT                    Elasticsearch URL to connect to.  [env var: ESCLIENT_HOSTS]
  --cloud_id TEXT                 Elastic Cloud instance id  [env var: ESCLIENT_CLOUD_ID]
  --api_token TEXT                The base64 encoded API Key token  [env var: ESCLIENT_API_TOKEN]
  --id TEXT                       API Key "id" value  [env var: ESCLIENT_ID]
  --api_key TEXT                  API Key "api_key" value  [env var: ESCLIENT_API_KEY]
  --username TEXT                 Elasticsearch username  [env var: ESCLIENT_USERNAME]
  --password TEXT                 Elasticsearch password  [env var: ESCLIENT_PASSWORD]
  --bearer_auth TEXT              Bearer authentication token  [env var: ESCLIENT_BEARER_AUTH]
  --opaque_id TEXT                X-Opaque-Id HTTP header value  [env var: ESCLIENT_OPAQUE_ID]
  --request_timeout FLOAT         Request timeout in seconds  [env var: ESCLIENT_REQUEST_TIMEOUT]
  --http_compress / --no-http_compress
                                  Enable HTTP compression  [env var: ESCLIENT_HTTP_COMPRESS]
  --verify_certs / --no-verify_certs
                                  Verify SSL/TLS certificate(s)  [env var: ESCLIENT_VERIFY_CERTS]
  --ca_certs TEXT                 Path to CA certificate file or directory  [env var: ESCLIENT_CA_CERTS]
  --client_cert TEXT              Path to client certificate file  [env var: ESCLIENT_CLIENT_CERT]
  --client_key TEXT               Path to client key file  [env var: ESCLIENT_CLIENT_KEY]
  --ssl_assert_hostname TEXT      Hostname or IP address to verify on the node's certificate.  [env var:
                                  ESCLIENT_SSL_ASSERT_HOSTNAME]
  --ssl_assert_fingerprint TEXT   SHA-256 fingerprint of the node's certificate. If this value is given then root-of-trust
                                  verification isn't done and only the node's certificate fingerprint is verified.  [env var:
                                  ESCLIENT_SSL_ASSERT_FINGERPRINT]
  --ssl_version TEXT              Minimum acceptable TLS/SSL version  [env var: ESCLIENT_SSL_VERSION]
  --master-only / --no-master-only
                                  Only run if the single host provided is the elected master  [env var: ESCLIENT_MASTER_ONLY]
  --skip_version_test / --no-skip_version_test
                                  Elasticsearch version compatibility check  [env var: ESCLIENT_SKIP_VERSION_TEST]
  --loglevel [DEBUG|INFO|WARNING|ERROR|CRITICAL]
                                  Log level  [env var: ESCLIENT_LOGLEVEL]
  --logfile TEXT                  Log file  [env var: ESCLIENT_LOGFILE]
  --logformat [default|json|ecs]  Log output format  [env var: ESCLIENT_LOGFORMAT]
  --blacklist TEXT                Named entities will not be logged  [env var: ESCLIENT_BLACKLIST]
  -h, --help                      Show this message and exit.
```

You will note that each of these has an `env var: ESCLIENT_` in this view. 

##### Environment Variables

If set as an environment variable, setting the associated command-line flag becomes
unnecessary.

For example, if I had a client configuration YAML file at `/path/to/config.yaml`,
and I set the `ESCLIENT_CONFIG` environment variable, e.g.

```
export ESCLIENT_CONFIG=/path/to/config.yaml
```

Then I could execute `pii-tool` without needing to use the
`--config /path/to/config.yaml` flag at the command line.

This is extremely useful with Docker-based execution!

Learn more in the [`es_client` documentation](https://es-client.readthedocs.io/en/latest/envvars.html).

#### `file-based`

The sub-command `file-based` is used *after* setting the client connection
parameters:

```
$ pii-tool file-based --help
Usage: run_script.py file-based [OPTIONS] REDACTIONS_FILE

  Redact from YAML config file

Options:
  --dry-run              Do not perform any changes.  [env var: PII_TOOL_DRY_RUN]
  --tracking-index TEXT  Name for the tracking index.  [env var: PII_TOOL_TRACKING_INDEX; default: redactions-tracker]
  -h, --help             Show this message and exit.
```

You will note that there are environment variables here, too!

### Docker Execution

The Docker image requires a volume map to `/.config` on the container (for now).

This is where any configuration files should go, including client YAML files and
the redactions file for use with the `file-based` command.

The command, presuming you have a file called `REDACTIONS_FILE.yaml` in `$(pwd)`:

```shell
docker run \
  --rm \  # Remove the container after execution
  -it \   # Not strictly necessary, but helpful as it is an interactive terminal
  -v $(pwd)/:/.config \  # Map the present-working directory to /.config
  untergeek/es-pii-tool:${TAG} \
    --hosts https://127.0.0.1:9200 \ 
    --username USERNAME \
    --password HIDDEN \
    file-based /.config/REDACTIONS_FILE.yaml
```

Any and all command-line flags are usable. But there's a better way...

#### Environment variables

As mentioned previously, environment variables can make things really easy for
Docker and Kubernetes:

```shell
docker run \
  --rm \  # Remove the container after execution
  -it \   # Not strictly necessary, but helpful as it is an interactive terminal
  -v $(pwd)/:/.config \  # Map the present-working directory to /.config
  -e ESCLIENT_HOSTS=${ESCLIENT_HOSTS} \ 
  -e ESCLIENT_USERNAME=${ESCLIENT_USERNAME} \ 
  -e ESCLIENT_PASSWORD=${ESCLIENT_PASSWORD} \ 
  untergeek/es-pii-tool:0.9.0 file-based /.config/REDACTIONS_FILE.yaml
```

## Testing

### The `docker_test` directory

Hopefully you won't ever have to modify anything here. In order to run the native
tests, you really *should* have Docker running locally, or at least accessible.

If not, you should plan on exporting the following environment variables:

* `TEST_ES_SERVER` Should be set to the full URL of the Elasticsearch cluster
* `TEST_USER` The Elasticsearch test username
* `TEST_PASS` The Elasticsearch password for `TEST_USER`
* `CA_CRT` Potentially optional, if your cluster is signed by a known signer
* `TEST_ES_REPO` The name of the repository on `TEST_ES_SERVER`

Each of these should be put in `.env` in the root directory of the project with
`export` in front, e.g.

```
export TEST_ES_SERVER=https://127.0.0.1:9200
...
```

Why in `.env`? Because `pytest` looks there for it.

Your test cluster would also need to have at least a Trial X-Pack license, as
searchable snapshots requires this.

Now, if this seems a bit daunting, it is. The `docker_test` scripts automate this
for you.

#### Manual Setup

```
docker_test/create.sh VERSION [SCENARIO]
```

`VERSION` needs to be a recognized image tag at
`docker.elastic.co/elasticsearch/elasticsearch`.

`SCENARIO` is technically optional, but is not optional for this project.

In order to test redacting from searchable snapshots from the `frozen` tier, you
need to have at least one `data_frozen` node. The scenario that does that is
`frozen_node`, e.g.:

```
$ docker_test/create.sh 8.15.1 frozen_node

Using scenario: frozen_node
Using Elasticsearch version 8.15.1

Creating 2 node(s)...

Starting node 1...
Container ID: ecf1e6d45bbc8d0617f75f47101372fd723a1424b27f7c250d1c7553f096960c
Getting Elasticsearch credentials from container es-pii-tool-test-0...
- 18s elapsed (typically 15s - 25s)...
Trial license started...
Snapshot repository initialized...

Node 1 started.
Node 2 started.

All nodes ready to test!

Environment variables are in .env
```

This script will create the first node, and collect the credentials created during
creation. It will then start an X-Pack trial license and initialize a snapshot
repository. Then in the same subnet, it will spin up another node, with
`node_role: ["data_frozen"]`. In effect, it creates exactly what testing the
`pii_tool` needs.

**NOTE:** It should take somewhere between 20 and 30 seconds to have both nodes
up and running. The script keeps a lovely spinner running while it's waiting and
trying to capture the credentials so you can see that it's working. It should be done
rather quickly after that.

#### Manual Teardown

```
$ docker_test/destroy.sh
Stopping all containers...
Removing all containers...
Deleting remaining files and directories
Cleanup complete.
```

This cleans up all containers, env var files, and the "repo" directory used as
a shared fs snapshot repository. No extra flags needed.

### Testing

With `docker_test` so integral to testing `es-pii-tool`, an effort was made to make
the setup and teardown of the Docker containers automatic.

```
$ pytest --docker_create true --docker_destroy true
```

If `--docker_create true` and/or `--docker_destroy true` are omitted, the tests
will assume you already have a functional test environment and the environment
variables are configured in `.env`.

The output looks like this:

```
$ pytest --docker_create true --docker_destroy true
Running: "/Users/buh/WORK/es-pii-tool/docker_test/create.sh 8.15.1 frozen_node"

Using scenario: frozen_node
Using Elasticsearch version 8.15.1

Creating 2 node(s)...

Starting node 1...
Container ID: f4e006bc94cdc8fa0977d9ab5bae509250f6d5fd2e0b8de6a3bbc994c2ee684a
Getting Elasticsearch credentials from container es-pii-tool-test-0...
- 18s elapsed (typically 15s - 25s)...
Trial license started...
Snapshot repository initialized...

Node 1 started.
Node 2 started.

All nodes ready to test!

Environment variables are in .env[-shell]
===================================================== test session starts =====================================================
platform darwin -- Python 3.12.2, pytest-8.1.1, pluggy-1.5.0
rootdir: /Users/buh/WORK/es-pii-tool
configfile: pytest.ini
plugins: cov-5.0.0, anyio-4.3.0, returns-0.22.0
collected 70 items

tests/integration/test_cold.py .........................                                                                [ 35%]
tests/integration/test_frozen.py .........................                                                              [ 71%]
tests/integration/test_hot.py ....................                                                                      [100%]

 -- docker_test environment destroyed.


=============================================== 70 passed in 148.31s (0:02:28) ================================================

```

#### Errors during testing

While uncommon, occasionally a test will hang. While this could happen for a number
of reasons, the common one is that Docker could not access a resource properly and
a process timed out. 

If this happens, chances are good that it will look like other tests failed. This
is because the primary test scenario runs 4 passes across a series of 3 indices.
Depending on the circumstances, 2 of those indices might be frozen, cold, part of
a data_stream, etc.

Don't panic if a test times out. You *will* need to manually destroy the Docker
environment using `docker_test/destroy.sh` as the testing environment defaults to
*not* destroying the test environment so you can inspect it, if necessary.

After doing so, re-run the test. If it repeatedly fails, then something else may
be going on. This is also why being able to inspect the *not* automatically
destroyed Elasticsearch environment in Docker is a good thing.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "es-pii-tool",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "elasticsearch, index, pii, redact",
    "author": null,
    "author_email": "Elastic <info@elastic.co>",
    "download_url": "https://files.pythonhosted.org/packages/e0/54/5014f2eaf933c0433f8f3326da747ea1bc7ede6c42fb47cc432f8b67f7bb/es_pii_tool-0.9.0.tar.gz",
    "platform": null,
    "description": "# Elasticsearch PII Tool\n\nDid you find PII (Personally Identifiable Information) in your Elasticsearch\nindices that doesn't belong there? This is the tool for you!\n\n`es-pii-tool` can help you redact information from even Searchable Snapshot\nmounted indices. It works with deeply nested fields, too!\n\n\n## Client Configuration\n\nThe tool connects using the [`es_client` Python module](https://es-client.readthedocs.io/).\n\nYou can use command-line options, or a YAML configuration file to configure the client connection.\nIf using a configuration file is desired, the configuration file structure requires\n`elasticsearch` at the root level as follows:\n\n```yaml\n---\nelasticsearch:\n  client:\n    hosts: https://10.11.12.13:9200\n    cloud_id:\n    request_timeout: 60\n    verify_certs:\n    ca_certs:\n    client_cert:\n    client_key:\n  other_settings:\n    username:\n    password:\n    api_key:\n      id:\n      api_key:\n      token:\n\nlogging:\n  loglevel: INFO\n  logfile: /path/to/file.log\n  logformat: default\n  blacklist: []\n```\n\n\n## `REDACTIONS_FILE` Configuration\n\n```\n  ---\n  redactions:\n    - job_name_20240930_redact_hot:\n        pattern: hot-*\n        query: {'match': {'message': 'message1'}}\n        fields: ['message']\n        message: REDACTED\n        expected_docs: 1\n        restore_settings: {'index.routing.allocation.include._tier_preference': 'data_warm,data_hot,data_content'}\n    - job_name_20240930_redact_cold:\n        pattern: restored-cold-*\n        query: {'match': {'nested.key': 'nested19'}}\n        fields: ['nested.key']\n        message: REDACTED\n        expected_docs: 1\n        restore_settings: {'index.routing.allocation.include._tier_preference': 'data_warm,data_hot,data_content'}\n        forcemerge:\n          max_num_segments: 1\n    - job_name_20240930_redact_frozen:\n        pattern: partial-frozen-*\n        query: {'range': {'number': {'gte': 8, 'lte': 11}}}\n        fields: ['deep.l1.l2.l3']\n        message: REDACTED\n        expected_docs: 4\n        forcemerge:\n          only_expunge_deletes: True\n```\n\n### Job Name\n\nThe job name _must_ be unique. Progress throughout a redaction job is tracked in \na tracking index, the default name being `redactions-tracker`. If job progress is\ninterrupted for any reason, `es-pii-tool` will attempt to resume where it left off.\n\n### `pattern`\n\nThe `pattern` setting defines which indices will be searched for documents to be\nredacted. This can be [anything that Elasticsearch\nsupports](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-multiple-indices.html),\nincluding wildcard globbing and comma separated values.\n\n### `query`\n\nThe Elasticsearch query used to isolate the documents to redact. This should be\nin the [Elasticsearch Query\nDSL](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html)\nformat.\n\n### `fields`\n\nThe fields to be redacted. If the fields are not at the root-level of the document,\ndeeply nested objects can be referred to using dotted notation.\n\nThis is an array value, so multiple lines can be specified.\n\n**NOTE:** All documents to be redacted must have all of the fields defined in the\n`fields` array or an error condition will result.\n\n### `message`\n\nThis value will replace whatever was in each of the `fields` specified. The default\nvalue is `REDACTED`. It can be any string value. Partial or sub-string replacements\nare not supported at this time.\n\n### `expected_docs`\n\nThis must match the exact number of hits that will result from executing `query`.\n\n**NOTE:** This value cannot exceed 10,000. This is the default limit that\nElasticsearch imposes for the maximum number of results from a single query. In\nthe event that you have a single query that produces more than 10,000 hits, and\nall need to be redacted, it is suggested that you find some other limiting\ncondition or filter to reduce the total hits below 10,000.\n\n### `restore_settings`\n\nWhen redacting data from indices in the `cold` or `frozen` tier, it is required\nto fully restore them from the snapshot repository first. If you wish to apply\nany specific settings to these indices to be restored, this is where to set them.\n\nThe primary use case for `restore_settings` is to force the indices to restore\nto the `data_warm` tier, or some other targeted data node, but it can be used\nto apply any other desired setting.\n\n### `forcemerge`\n\nThis is only used for redacting searchable snapshot indices in the `cold` or\n`frozen` tiers. Best practices is to force merge your indices to 1 segment per\nshard before taking a snapshot and mounting in the `cold` or `frozen` tier. This\ntool defaults to performing this force merge after redacting your data. But you\ncan set the force merge to `only_expunge_deletes`.\n\nIn Elasticsearch, a document, once indexed, is immutable. In order to redact a\ndocument, Elasticsearch effectively deletes the original document and reindexes\nthe modified document in its place.\n\n#### `max_num_segments`\n\nThe default value for this setting is `1`, meaning 1 segment per shard. This is\na rather \"expensive\" operation in terms of disk I/O, and if you are only removing\na small number of documents from an index or shard, it may not make much sense\nto perform a full force merge. To avoid this, use the `only_expunge_deletes`\nsetting.\n\n#### `only_expunge_deletes`\n\nThis is a boolean setting, either `true` or `false` (or `True` or `False`).\n\nIf `true`, the force merge will only delete the bits marked for deletion,\nwhich are the original documents that were deleted as part of the redaction\nprocess.\n\n## Running `es_pii_tool`\n\n### Command Line Execution\n\nThe script is run by executing `pii-tool` at the command-line. There are multiple\nlevels of configuration for `pii-tool`. The top level is for client connection to\nElasticsearch configuration.\n\nSubsequent commands include `show-all-options` and `file-based` (meaning\nredaction configuration derived from a YAML configuration file).\n\n#### Client configuration \n\n\n```\npii-tool --help\nUsage: pii-tool [OPTIONS] COMMAND [ARGS]...\n\n  Elastic PII Tool\n\nOptions:\n  --config PATH                   Path to configuration file.\n  --hosts TEXT                    Elasticsearch URL to connect to.\n  --cloud_id TEXT                 Elastic Cloud instance id\n  --api_token TEXT                The base64 encoded API Key token\n  --id TEXT                       API Key \"id\" value\n  --api_key TEXT                  API Key \"api_key\" value\n  --username TEXT                 Elasticsearch username\n  --password TEXT                 Elasticsearch password\n  --request_timeout FLOAT         Request timeout in seconds\n  --verify_certs / --no-verify_certs\n                                  Verify SSL/TLS certificate(s)\n  --ca_certs TEXT                 Path to CA certificate file or directory\n  --client_cert TEXT              Path to client certificate file\n  --client_key TEXT               Path to client key file\n  --loglevel [DEBUG|INFO|WARNING|ERROR|CRITICAL]\n                                  Log level\n  --logfile TEXT                  Log file\n  --logformat [default|json|ecs]  Log output format\n  -v, --version                   Show the version and exit.\n  -h, --help                      Show this message and exit.\n\nCommands:\n  file-based        Redact from YAML config file\n  show-all-options  Show all client configuration options\n```\n\n#### `show-all-options`\n\nAs the list of configuration options for just the client connection are quite\nlong, the default `--help` output is somewhat truncated. You can show all of the\nconfiguration options, as well as reveal their environment variable names by\nrunning: `pii-tool show-all-options`:\n\n```\nUsage: pii-tool show-all-options [OPTIONS]\n\n  ALL OPTIONS SHOWN\n\n  The full list of options available for configuring a connection at the command-line.\n\nOptions:\n  --config PATH                   Path to configuration file.  [env var: ESCLIENT_CONFIG]\n  --hosts TEXT                    Elasticsearch URL to connect to.  [env var: ESCLIENT_HOSTS]\n  --cloud_id TEXT                 Elastic Cloud instance id  [env var: ESCLIENT_CLOUD_ID]\n  --api_token TEXT                The base64 encoded API Key token  [env var: ESCLIENT_API_TOKEN]\n  --id TEXT                       API Key \"id\" value  [env var: ESCLIENT_ID]\n  --api_key TEXT                  API Key \"api_key\" value  [env var: ESCLIENT_API_KEY]\n  --username TEXT                 Elasticsearch username  [env var: ESCLIENT_USERNAME]\n  --password TEXT                 Elasticsearch password  [env var: ESCLIENT_PASSWORD]\n  --bearer_auth TEXT              Bearer authentication token  [env var: ESCLIENT_BEARER_AUTH]\n  --opaque_id TEXT                X-Opaque-Id HTTP header value  [env var: ESCLIENT_OPAQUE_ID]\n  --request_timeout FLOAT         Request timeout in seconds  [env var: ESCLIENT_REQUEST_TIMEOUT]\n  --http_compress / --no-http_compress\n                                  Enable HTTP compression  [env var: ESCLIENT_HTTP_COMPRESS]\n  --verify_certs / --no-verify_certs\n                                  Verify SSL/TLS certificate(s)  [env var: ESCLIENT_VERIFY_CERTS]\n  --ca_certs TEXT                 Path to CA certificate file or directory  [env var: ESCLIENT_CA_CERTS]\n  --client_cert TEXT              Path to client certificate file  [env var: ESCLIENT_CLIENT_CERT]\n  --client_key TEXT               Path to client key file  [env var: ESCLIENT_CLIENT_KEY]\n  --ssl_assert_hostname TEXT      Hostname or IP address to verify on the node's certificate.  [env var:\n                                  ESCLIENT_SSL_ASSERT_HOSTNAME]\n  --ssl_assert_fingerprint TEXT   SHA-256 fingerprint of the node's certificate. If this value is given then root-of-trust\n                                  verification isn't done and only the node's certificate fingerprint is verified.  [env var:\n                                  ESCLIENT_SSL_ASSERT_FINGERPRINT]\n  --ssl_version TEXT              Minimum acceptable TLS/SSL version  [env var: ESCLIENT_SSL_VERSION]\n  --master-only / --no-master-only\n                                  Only run if the single host provided is the elected master  [env var: ESCLIENT_MASTER_ONLY]\n  --skip_version_test / --no-skip_version_test\n                                  Elasticsearch version compatibility check  [env var: ESCLIENT_SKIP_VERSION_TEST]\n  --loglevel [DEBUG|INFO|WARNING|ERROR|CRITICAL]\n                                  Log level  [env var: ESCLIENT_LOGLEVEL]\n  --logfile TEXT                  Log file  [env var: ESCLIENT_LOGFILE]\n  --logformat [default|json|ecs]  Log output format  [env var: ESCLIENT_LOGFORMAT]\n  --blacklist TEXT                Named entities will not be logged  [env var: ESCLIENT_BLACKLIST]\n  -h, --help                      Show this message and exit.\n```\n\nYou will note that each of these has an `env var: ESCLIENT_` in this view. \n\n##### Environment Variables\n\nIf set as an environment variable, setting the associated command-line flag becomes\nunnecessary.\n\nFor example, if I had a client configuration YAML file at `/path/to/config.yaml`,\nand I set the `ESCLIENT_CONFIG` environment variable, e.g.\n\n```\nexport ESCLIENT_CONFIG=/path/to/config.yaml\n```\n\nThen I could execute `pii-tool` without needing to use the\n`--config /path/to/config.yaml` flag at the command line.\n\nThis is extremely useful with Docker-based execution!\n\nLearn more in the [`es_client` documentation](https://es-client.readthedocs.io/en/latest/envvars.html).\n\n#### `file-based`\n\nThe sub-command `file-based` is used *after* setting the client connection\nparameters:\n\n```\n$ pii-tool file-based --help\nUsage: run_script.py file-based [OPTIONS] REDACTIONS_FILE\n\n  Redact from YAML config file\n\nOptions:\n  --dry-run              Do not perform any changes.  [env var: PII_TOOL_DRY_RUN]\n  --tracking-index TEXT  Name for the tracking index.  [env var: PII_TOOL_TRACKING_INDEX; default: redactions-tracker]\n  -h, --help             Show this message and exit.\n```\n\nYou will note that there are environment variables here, too!\n\n### Docker Execution\n\nThe Docker image requires a volume map to `/.config` on the container (for now).\n\nThis is where any configuration files should go, including client YAML files and\nthe redactions file for use with the `file-based` command.\n\nThe command, presuming you have a file called `REDACTIONS_FILE.yaml` in `$(pwd)`:\n\n```shell\ndocker run \\\n  --rm \\  # Remove the container after execution\n  -it \\   # Not strictly necessary, but helpful as it is an interactive terminal\n  -v $(pwd)/:/.config \\  # Map the present-working directory to /.config\n  untergeek/es-pii-tool:${TAG} \\\n    --hosts https://127.0.0.1:9200 \\ \n    --username USERNAME \\\n    --password HIDDEN \\\n    file-based /.config/REDACTIONS_FILE.yaml\n```\n\nAny and all command-line flags are usable. But there's a better way...\n\n#### Environment variables\n\nAs mentioned previously, environment variables can make things really easy for\nDocker and Kubernetes:\n\n```shell\ndocker run \\\n  --rm \\  # Remove the container after execution\n  -it \\   # Not strictly necessary, but helpful as it is an interactive terminal\n  -v $(pwd)/:/.config \\  # Map the present-working directory to /.config\n  -e ESCLIENT_HOSTS=${ESCLIENT_HOSTS} \\ \n  -e ESCLIENT_USERNAME=${ESCLIENT_USERNAME} \\ \n  -e ESCLIENT_PASSWORD=${ESCLIENT_PASSWORD} \\ \n  untergeek/es-pii-tool:0.9.0 file-based /.config/REDACTIONS_FILE.yaml\n```\n\n## Testing\n\n### The `docker_test` directory\n\nHopefully you won't ever have to modify anything here. In order to run the native\ntests, you really *should* have Docker running locally, or at least accessible.\n\nIf not, you should plan on exporting the following environment variables:\n\n* `TEST_ES_SERVER` Should be set to the full URL of the Elasticsearch cluster\n* `TEST_USER` The Elasticsearch test username\n* `TEST_PASS` The Elasticsearch password for `TEST_USER`\n* `CA_CRT` Potentially optional, if your cluster is signed by a known signer\n* `TEST_ES_REPO` The name of the repository on `TEST_ES_SERVER`\n\nEach of these should be put in `.env` in the root directory of the project with\n`export` in front, e.g.\n\n```\nexport TEST_ES_SERVER=https://127.0.0.1:9200\n...\n```\n\nWhy in `.env`? Because `pytest` looks there for it.\n\nYour test cluster would also need to have at least a Trial X-Pack license, as\nsearchable snapshots requires this.\n\nNow, if this seems a bit daunting, it is. The `docker_test` scripts automate this\nfor you.\n\n#### Manual Setup\n\n```\ndocker_test/create.sh VERSION [SCENARIO]\n```\n\n`VERSION` needs to be a recognized image tag at\n`docker.elastic.co/elasticsearch/elasticsearch`.\n\n`SCENARIO` is technically optional, but is not optional for this project.\n\nIn order to test redacting from searchable snapshots from the `frozen` tier, you\nneed to have at least one `data_frozen` node. The scenario that does that is\n`frozen_node`, e.g.:\n\n```\n$ docker_test/create.sh 8.15.1 frozen_node\n\nUsing scenario: frozen_node\nUsing Elasticsearch version 8.15.1\n\nCreating 2 node(s)...\n\nStarting node 1...\nContainer ID: ecf1e6d45bbc8d0617f75f47101372fd723a1424b27f7c250d1c7553f096960c\nGetting Elasticsearch credentials from container es-pii-tool-test-0...\n- 18s elapsed (typically 15s - 25s)...\nTrial license started...\nSnapshot repository initialized...\n\nNode 1 started.\nNode 2 started.\n\nAll nodes ready to test!\n\nEnvironment variables are in .env\n```\n\nThis script will create the first node, and collect the credentials created during\ncreation. It will then start an X-Pack trial license and initialize a snapshot\nrepository. Then in the same subnet, it will spin up another node, with\n`node_role: [\"data_frozen\"]`. In effect, it creates exactly what testing the\n`pii_tool` needs.\n\n**NOTE:** It should take somewhere between 20 and 30 seconds to have both nodes\nup and running. The script keeps a lovely spinner running while it's waiting and\ntrying to capture the credentials so you can see that it's working. It should be done\nrather quickly after that.\n\n#### Manual Teardown\n\n```\n$ docker_test/destroy.sh\nStopping all containers...\nRemoving all containers...\nDeleting remaining files and directories\nCleanup complete.\n```\n\nThis cleans up all containers, env var files, and the \"repo\" directory used as\na shared fs snapshot repository. No extra flags needed.\n\n### Testing\n\nWith `docker_test` so integral to testing `es-pii-tool`, an effort was made to make\nthe setup and teardown of the Docker containers automatic.\n\n```\n$ pytest --docker_create true --docker_destroy true\n```\n\nIf `--docker_create true` and/or `--docker_destroy true` are omitted, the tests\nwill assume you already have a functional test environment and the environment\nvariables are configured in `.env`.\n\nThe output looks like this:\n\n```\n$ pytest --docker_create true --docker_destroy true\nRunning: \"/Users/buh/WORK/es-pii-tool/docker_test/create.sh 8.15.1 frozen_node\"\n\nUsing scenario: frozen_node\nUsing Elasticsearch version 8.15.1\n\nCreating 2 node(s)...\n\nStarting node 1...\nContainer ID: f4e006bc94cdc8fa0977d9ab5bae509250f6d5fd2e0b8de6a3bbc994c2ee684a\nGetting Elasticsearch credentials from container es-pii-tool-test-0...\n- 18s elapsed (typically 15s - 25s)...\nTrial license started...\nSnapshot repository initialized...\n\nNode 1 started.\nNode 2 started.\n\nAll nodes ready to test!\n\nEnvironment variables are in .env[-shell]\n===================================================== test session starts =====================================================\nplatform darwin -- Python 3.12.2, pytest-8.1.1, pluggy-1.5.0\nrootdir: /Users/buh/WORK/es-pii-tool\nconfigfile: pytest.ini\nplugins: cov-5.0.0, anyio-4.3.0, returns-0.22.0\ncollected 70 items\n\ntests/integration/test_cold.py .........................                                                                [ 35%]\ntests/integration/test_frozen.py .........................                                                              [ 71%]\ntests/integration/test_hot.py ....................                                                                      [100%]\n\n -- docker_test environment destroyed.\n\n\n=============================================== 70 passed in 148.31s (0:02:28) ================================================\n\n```\n\n#### Errors during testing\n\nWhile uncommon, occasionally a test will hang. While this could happen for a number\nof reasons, the common one is that Docker could not access a resource properly and\na process timed out. \n\nIf this happens, chances are good that it will look like other tests failed. This\nis because the primary test scenario runs 4 passes across a series of 3 indices.\nDepending on the circumstances, 2 of those indices might be frozen, cold, part of\na data_stream, etc.\n\nDon't panic if a test times out. You *will* need to manually destroy the Docker\nenvironment using `docker_test/destroy.sh` as the testing environment defaults to\n*not* destroying the test environment so you can inspect it, if necessary.\n\nAfter doing so, re-run the test. If it repeatedly fails, then something else may\nbe going on. This is also why being able to inspect the *not* automatically\ndestroyed Elasticsearch environment in Docker is a good thing.\n",
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "Redacting field data from your Elasticsearch indices and Searchable Snapshots",
    "version": "0.9.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/elastic/es-pii-tool/issues",
        "Homepage": "https://github.com/elastic/es-pii-tool"
    },
    "split_keywords": [
        "elasticsearch",
        " index",
        " pii",
        " redact"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "226ff9d4598b3704334fdf7c374f6ca5e1f79176af5fbb1b56a0c8c1c3ceebe3",
                "md5": "8a4b350d14d85e63ee298c4c8f404b01",
                "sha256": "2afd7d961b25d1b0037712f440cf6e52abe95c94dfdd26cd457999da8296dc09"
            },
            "downloads": -1,
            "filename": "es_pii_tool-0.9.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "8a4b350d14d85e63ee298c4c8f404b01",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 48741,
            "upload_time": "2024-10-02T00:03:27",
            "upload_time_iso_8601": "2024-10-02T00:03:27.919994Z",
            "url": "https://files.pythonhosted.org/packages/22/6f/f9d4598b3704334fdf7c374f6ca5e1f79176af5fbb1b56a0c8c1c3ceebe3/es_pii_tool-0.9.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e0545014f2eaf933c0433f8f3326da747ea1bc7ede6c42fb47cc432f8b67f7bb",
                "md5": "455a1d4c70685945fdcc7d7cc9937025",
                "sha256": "0e1b15e21fce2f012289bbf531760ea4d32c154bdb445a82b08a1759981e9b06"
            },
            "downloads": -1,
            "filename": "es_pii_tool-0.9.0.tar.gz",
            "has_sig": false,
            "md5_digest": "455a1d4c70685945fdcc7d7cc9937025",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 170345,
            "upload_time": "2024-10-02T00:03:30",
            "upload_time_iso_8601": "2024-10-02T00:03:30.142102Z",
            "url": "https://files.pythonhosted.org/packages/e0/54/5014f2eaf933c0433f8f3326da747ea1bc7ede6c42fb47cc432f8b67f7bb/es_pii_tool-0.9.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-10-02 00:03:30",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "elastic",
    "github_project": "es-pii-tool",
    "github_not_found": true,
    "lcname": "es-pii-tool"
}
        
Elapsed time: 0.38797s