<!-- TOC ignore:true -->
# reaction-metrics-exporter
> [!note] 💚 A lot of inspiration has been drawn from [`dmarc-metrics-exporter`](https://github.com/jgosmann/dmarc-metrics-exporter).
Export [OpenMetrics](https://prometheus.io/docs/specs/om/open_metrics_spec/) for [reaction](https://reaction.ppom.me/). The exporter continuously monitors and parses reaction's logs and state.
## Main metrics
- `reaction_match_count`: number of matches since last collection;
- `reaction_action_count`: number of actions since last collection;
- `reaction_pending_count`: current number of pending actions.
All metrics are labelled with `stream`, `filter` and matched patterns.
Action-related metrics have an additional `action` label.
For example, matches exported from [the SSH filter](https://reaction.ppom.me/filters/ssh.html) look like:
```
reaction_match_count{stream="ssh",filter="failedlogin",ip="X.X.X.X"}: N
```
`N` being the number of matches for this unique combination of labels.
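Action metrics follow the same scheme with the extra `action` label; for instance (the `ban` action name here is only an illustration):
```
reaction_action_count{stream="ssh",filter="failedlogin",action="ban",ip="X.X.X.X"}: N
```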
## Secondary metrics
- `reaction_exporter_build_info`: gives the version of the exporter;
- `reaction_status_info`: gives the version of reaction and its status;
- `reaction_commands_count`: count of start/stop commands run by reaction, labelled with command name, status and exit code;
- `reaction_stream_event_count`: count of stream events, labelled with stream name and event type, which is one of:
- `start`: stream started successfully;
- `exit`: stream exited;
- `exec`: stream is unexecutable;
- `read`: stream is unreadable;
  - `timeout`: stream didn't terminate in time on exit.
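As a sketch, such a metric could look like this (assuming the stream-name label is `stream`, as for the main metrics):
```
reaction_stream_event_count{stream="ssh",type="exit"}: 1
```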
## Side notes
> ℹ️ In the long term, `reaction` will have built-in metrics. Whether the proposed ones will be adopted depends on long-term relevance and performance.
> ⚠️ For very large numbers of matches and actions, your TSDB may grow too much. Read ["the patterns dilemma"](#the-patterns-dilemma) if in doubt.
<!-- TOC ignore:true -->
## Table of contents
<!-- TOC -->
- [Quick start](#quick-start)
- [Real-world setups](#real-world-setups)
- [Usage details](#usage-details)
- [The patterns dilemma](#the-patterns-dilemma)
- [Exploitation](#exploitation)
- [Details about design and usage of metrics](#details-about-design-and-usage-of-metrics)
- [Development setup](#development-setup)
<!-- /TOC -->
# Quick start
> [!caution] ⚠️ Do not use in production as-is; see [real-world setups](#real-world-setups).
## Prerequisites
### For running the exporter
- `python>=3.10` and `pip`;
- `reaction==2` (tested up to `2.2.0`);
- [`libsystemd`](https://www.freedesktop.org/software/systemd/man/latest/libsystemd.html);
- [`pkg-config`](https://www.freedesktop.org/wiki/Software/pkg-config/).
### For exploiting the metrics
- A [Prometheus](https://prometheus.io/)-compatible TSDB ([VictoriaMetrics](https://victoriametrics.com/) is recommended, see [below](#details-about-design-and-usage-of-metrics));
- [Grafana](https://grafana.com/oss/grafana/).
## Install
```bash
python3 -m pip install reaction-metrics-exporter
```
## Configure
Create a configuration file, *e.g.* `config.yml`:
```yaml
reaction:
  # as you would pass to `reaction test-config`
  config: /etc/reaction
  logs:
    # monitor logs for `reaction.service`
    systemd:
```
>>> [!tip] Using a log file?
```yaml
reaction:
  # ...
  logs:
    file: /var/log/reaction.log
```
>>>
## Run
```bash
python3 -m reaction_metrics_exporter -c config.yml start
```
Metrics are exposed at http://localhost:8080/metrics.
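To quickly check that the endpoint responds:
```bash
# print the first lines of the exposed metrics
curl -s http://localhost:8080/metrics | head
```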
> 💡 The exporter tolerates maintenance windows of up to 15 minutes, *i.e.* it saves the last log position read and the not-yet-exported metrics so it can resume from that point after a restart.
# Real-world setups
## Create an unprivileged user
For security reasons, the exporter should run as an unprivileged system user. This is a prerequisite for what follows.
This user should be able to read [journald](https://www.freedesktop.org/software/systemd/man/latest/systemd-journald.service.html) logs and to communicate with `reaction`'s socket.
To do so, first create a user and a group, then add the user to the `systemd-journal` group.
```bash
# creates group automatically
/sbin/adduser reaction --no-create-home --system
usermod -aG systemd-journal reaction
```
Then, open an editor to configure `reaction.service`:
```bash
systemctl edit reaction.service
```
Paste the following and save:
```systemd
[Service]
Group=reaction
# Files will be created with rwxrwxr-x permissions so the exporter
# can communicate with reaction through the socket
UMask=0002
```
Restart reaction:
```bash
systemctl daemon-reload
systemctl restart reaction
```
>>> [!tip] Check that it worked
```bash
sudo su reaction
reaction show
journalctl -feu reaction
```
>>>
## Running with systemd
It is recommended to install the exporter in a [virtualenv](https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/), *e.g.* in `/usr/share/reaction-metrics-exporter/venv`.
Save the [service file](./reaction-metrics.service) in `/etc/systemd/system`.
> 💡 Add your config file and adjust the `venv` path in the `ExecStart=` directive.
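As a rough sketch, the adjusted directive could look like this (the config file path below is an assumption; keep the rest of the shipped service file as-is):
```systemd
[Service]
# venv path as recommended above; the config path is only an example
ExecStart=/usr/share/reaction-metrics-exporter/venv/bin/python -m reaction_metrics_exporter -c /etc/reaction-metrics-exporter/config.yml start
```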
Enable and start the exporter:
```bash
systemctl daemon-reload
systemctl enable --now reaction-metrics.service
```
Follow the logs with:
```bash
journalctl -feu reaction-metrics.service
```
## Running with Docker
> ⚠️ Running with Docker adds complexity for the exporter; prefer `systemd`.
Start inside the [docker](./docker) directory.
>>> [!tip] Homemade is better
It is recommended to build your own image so that the version of `reaction` inside the container matches yours. In [`compose.yml`](./docker/compose.yml), adjust the `REACTION_VERSION` variable, and then:
```bash
docker compose build
```
>>>
Create a `.env` file:
```ini
UID=
GID=
JOURNAL_GID=
```
Values can be obtained from the output of `id reaction`.
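For example (the numeric IDs below are purely illustrative):
```bash
id reaction
# example output: uid=997(reaction) gid=997(reaction) groups=997(reaction),101(systemd-journal)
# which would give UID=997, GID=997 and JOURNAL_GID=101
```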
You may need to adjust the default mounts in [`compose.yml`](./docker/compose.yml). Expectations are:
- `reaction`'s configuration is mounted on `/etc/reaction`;
- `reaction`'s socket is mounted on `/run/reaction/reaction.sock`;
- `journald` file is mounted on `/var/log/journal`.
Use the [sample configuration file](./docker/config.yml), tweak it to your needs, and run the exporter:
```bash
docker compose up -d && docker compose logs -f
```
The exporter is mapped to the host's `8080` port by default.
> ⚠️ If you run the exporter before reaction, you'll need to restart the exporter so it gains access to the socket.
# Usage details
## Configuration
You can provide either a YAML or a JSON file. Although not recommended, you can run the exporter without a configuration file.
The default configuration is as follows:
```yaml
# only stdout is supported atm
loglevel: INFO
listen:
  address: 127.0.0.1
  port: 8080
reaction:
  config: /etc/reaction
  logs:
    systemd: reaction.service
  # same default as reaction
  socket: /run/reaction/reaction.sock
  # ignore actions for 60 seconds (in logtime, not realtime) after reaction start
  hold: 60
metrics:
  all: {}
  for: {}
  # all metrics with labels are exported by default
  export:
    matches:
      extra: true
    actions:
      extra: true
    pending:
      extra: true
    commands:
    streams:
```
## Commands
```
usage: python -m reaction_metrics_exporter [-h] [-c CONFIG] [-f {yaml,json}] {start,defaults,test-config,version}
positional arguments:
  {start,defaults,test-config,version}
                        mode of operation; see below

options:
  -h, --help            show this help message and exit
  -c, --config CONFIG   path to the configuration file (JSON or YAML)
  -f, --format {yaml,json}
                        format for dumping default configuration (default yaml)

command:
  start: continuously read logs, compute metrics and serve HTTP endpoint
  defaults: print the default configuration in json and exit
  test-config: validate and output configuration in json and exit
  version: print the exporter's version and exit
```
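For instance, to validate a configuration file or dump the defaults:
```bash
# validate a config file and print the resulting configuration
python3 -m reaction_metrics_exporter -c config.yml test-config
# print the default configuration and exit
python3 -m reaction_metrics_exporter defaults
```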
## Pre-treating matches
In some cases, you may want to transform matches prior to exporting, instead of relabelling. You can do so with [Jinja2](https://jinja.palletsprojects.com/en/stable/) expressions.
For example, for an `email` pattern, you could keep only the domain part in metrics. This can be achieved with:
```yaml
metrics:
  all:
    email: "{{ email.split('@') | last }}"
```
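With this template, a match such as `admin@example.com` is exported with `email="example.com"`; for instance (stream and filter names below are hypothetical):
```
reaction_match_count{stream="mail",filter="failedlogin",email="example.com"}: N
```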
You can also differentiate by stream and filter:
```yaml
metrics:
  for:
    ssh:
      failedlogin:
        ip: TEMPLATE_A
    traefik:
      aiBots:
        ip: TEMPLATE_B
```
## Enabling internals metrics
The Prometheus client library exposes some default metrics about the Python runtime, the garbage collector and so on, which I found of little use. You can nevertheless enable them:
```yaml
metrics:
  export:
    internals:
```
# The patterns dilemma
`reaction` matches often contain valuable information, such as IP addresses.
## How matches could become a problem
Quoting the [Prometheus docs](https://prometheus.io/docs/practices/naming/):
> CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.
In other words, each new IP address will **create a new time series** in the TSDB.
Prometheus used to claim it could handle [millions of time series](https://prometheus.io/docs/prometheus/1.8/storage/), and VictoriaMetrics claims it can handle [100 million active time series](https://docs.victoriametrics.com/victoriametrics/faq/#what-are-scalability-limits-of-victoriametrics). Besides, our time series usually have very few data points. We use OpenMetrics in a somewhat hackish way, so recommendations tailored for active and dense time series do not apply.
You should just test and check after a few months. In most cases, the retention period will kick in before you run into trouble.
You can still disable the export of some metrics or patterns.
## Disable metrics
```yaml
metrics:
  export:
    actions: false
    matches: false
    pending: false
```
## Disable patterns
For example, to remove `ip` from export for the `failedlogin` filter of the `ssh` stream:
```yaml
metrics:
  for:
    ssh:
      failedlogin:
        ip: false
```
If you use the pattern `ip` in multiple streams, you can avoid repetition by removing it globally:
```yaml
metrics:
  all:
    ip: false
```
# Exploitation
## Scraping
Sample Prometheus scrape config:
```yaml
# must start with `reaction`
- job_name: "reaction"
  scrape_interval: 60s
  # if behind HTTP basic auth
  basic_auth:
    username: "${REACTION_METRICS_USER}"
    password: "${REACTION_METRICS_PASSWORD}"
  static_configs:
    - targets:
        - subdomain1.<domain>
        - subdomain2.<domain>
        - [...]
  # only keep subdomains
  # recommended if <domain> is always the same,
  # to make tables more readable.
  relabel_configs:
    - source_labels: [__address__]
      regex: "(.*).<domain>"
      target_label: instance
      replacement: "$1"
```
## Visualisation
Two dashboards are provided:
- One for [Prometheus](./grafana/reaction-metrics-prometheus.json);
- One for [VictoriaMetrics](./grafana/reaction-metrics-victoriametrics.json).
The latter only has a few more panels.
The dashboards were built for multi-instance setups with `ip` matching, but can easily be adapted to other setups.
### Variables
All the queries are templated with dashboard variables, which can be useful to narrow down an investigation.
### Status

Health and versions of the exporter and of `reaction`.
### Stats

Cumulative sums of events for the selected range and for the last year.
You will find this pattern everywhere in the panels.
### Actions

Insights for actions triggered within the selected range. "Pending actions" are the ones printed by `reaction show`.
### Evolution of events

Basic plots, by instance. Note that they will always start from 0, so that the evolution within the range is clear.
### Filters

Matches are aggregated by filter and per instance, hopefully giving insights about which machine is targeted by what.
### Reaction's events
If everything goes well, these plots should be almost empty. They are probably more useful for alerts (*e.g.* when a stream exited, when a stop command failed, and so on).
### IP blame

These tables, one for the selected range and one for the last year, highlight *e.g.* IPs that have been matched many times.
Another example is an IP matched by different filters, which may call for further investigation. In that case you can template every panel with that specific IP, allowing basic information gathering (which instance(s), which filter(s), time of presence and so on).
### Lifetime

Its real usefulness is yet to be shown; I feel it lacks something.
Lifetime is the time between the last time and the first time a unique combination of labels has been seen.
In other words, it tells us whether an IP did a oneshot match or a long run.
On the left is the evolution of the average lifetime per instance **for the range**. It could help determine whether certain instances are targeted with insistence.
On the right is the distribution of lifetime for the range, per instance.
## Alerts
Basic queries that can be used in Grafana, Prometheus or [vmalert](https://docs.victoriametrics.com/victoriametrics/vmalert/) to warn Alertmanager. Given in Prometheus format.
```yaml
# Reaction is down
- alert: ReactionDown
  expr: reaction_status_info{status="0"} == 1
  for: 5m
# More than one match per second on an instance
- alert: ReactionHighMatches
  expr: rate(sum(reaction_match_count) by (instance)[5m:]) >= 1
  for: 5m
# A stream exited or cannot be read
- alert: ReactionStreamError
  expr: sum(reaction_stream_event_count{type!="start"}) by (instance, type) > 0
# A start/stop command failed
- alert: ReactionCommandError
  expr: sum(reaction_commands_count{status!="True"}) by (instance) > 0
# Detects an IP matching on one or several machines without
# being banned (adjust X and T to your needs)
- alert: ReactionIPUnbanned
  expr: sum(sum_over_time(reaction_match_count[T])) by (ip) > X
```
# Details about design and usage of metrics
## Using OpenMetrics is wrong but useful
The exporter does not quite follow Prometheus' philosophy. For example, it uses [gauges](https://prometheus.io/docs/concepts/metric_types/#gauge) for counting matches, actions and so on.
Let's take the example of matches. Normally, a gauge « represents a single numerical value that can arbitrarily go up and down ». In our case, this single numerical value gives **the number of matches since the last export**. Without extra labels, such as IPs, it would fit the model.
But when exporting pattern values, such as IPs, the result is a multitude of short-lived time series. As an example:
```
reaction_match_count{filter="aiBots",ip="121.229.156.18",stream="traefik"} 1.0
reaction_match_count{filter="aiBots",ip="23.21.179.120",stream="traefik"} 1.0
reaction_match_count{filter="failedlogin",ip="77.201.74.44",stream="ssh"} 3.0
```
These metrics will be ingested as three distinct time series and discarded on the exporter side. They will get a new data point only on the next hypothetical identical match.
Semantically, **we are exporting events**, which is [discouraged](https://prometheus.io/docs/introduction/faq/#how-to-feed-logs-into-prometheus). Events are usually processed with a log aggregator, *e.g.* [Grafana Loki](https://grafana.com/oss/loki/), or could be aggregated by a Prometheus middleware, *e.g.* [prom-aggregation-gateway](https://github.com/zapier/prom-aggregation-gateway). We chose to stick to OpenMetrics because the TSDB/Grafana setup is **simple and widespread**.
In early development, aggregation was done locally, *i.e.* the total count for each unique combination of stream, filter and matches was exported continuously, with restarts handled by a persistence file, but it made no sense: the HTTP body would grow indefinitely and each time series would get a new data point at each export even when no event had happened.
Above all other considerations, the exported metrics had to be exploitable through a comprehensive Grafana dashboard, and they are, although we don't have high-volume tests yet.
## Emulate monotonic counters from events
A natural expectation would be to visualize the evolution of matches, actions and so on. With our event-based export, the only way is to compute a cumulative sum of time series for the current time window, **starting from 0**.
In theory, **this is not possible with Prometheus**. [VictoriaMetrics](https://victoriametrics.com/), besides much higher performance, has a lot of useful built-in functions, including a running sum.
We suppose the query is executed in Grafana, which sets start, stop, range (stop - start) and interval (range / max data points, also known as step).
### Using VictoriaMetrics
If you're unfamiliar with MetricsQL, you can [read the beginning of the docs](https://docs.victoriametrics.com/victoriametrics/metricsql/). It has a lot in common with PromQL, with additional functions and syntactic sugar.
A naive approach would be:
```
running_sum(reaction_match_count)
```
That request will calculate, for each time series (*i.e.* each unique combination of stream, filter and matches), the cumulative sum over the range. As the number of time series is huge when exporting extra labels, it is useful to aggregate results, *e.g.* per instance:
```
sum(running_sum(reaction_match_count)) by (instance)
```
However, this request will give wrong results for large time windows. The above query is in fact converted (more or less) by VictoriaMetrics to:
```
sum(running_sum(default_rollup(reaction_match_count[step]))) by (instance)
```
`default_rollup` simply takes the last value of the [range vector](https://prometheus.io/docs/prometheus/latest/querying/basics/#range-vector-selectors), *i.e.* the result is equivalent to `reaction_match_count` sampled once per step (*e.g.* every 5 minutes). In other words, every event occurring in between is discarded.
To fix the query, we need to compute the cumulative sum **inside the step**. The final query will look like this:
```
sum(running_sum(sum_over_time(reaction_match_count))) by (instance)
```
`sum_over_time` will [automatically](https://docs.victoriametrics.com/victoriametrics/metricsql/#rollup-functions) use the step as a lookbehind window.
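If you prefer an explicit lookbehind window, a roughly equivalent form (assuming Grafana's `$__interval` matches the step) would be:
```
sum(running_sum(sum_over_time(reaction_match_count[$__interval]))) by (instance)
```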
> ⚠️ We cannot simply nest `running_sum`, because it needs rollup results: the inner `running_sum` would apply to `default_rollup` results, and the outer `running_sum` would then give crazy results, computing the cumulative sum of an already wrong cumulative sum.
Wrapping up, this is an efficient way to emulate Prometheus [counters](https://prometheus.io/docs/concepts/metric_types/#counter), in a given time window.
### Using Prometheus
Doing the same in Prometheus is a whole different ballgame. A question that may arise is: why don't we simply `sum_over_time` over the whole window (`$__range` in Grafana)?
```
sum_over_time(reaction_match_count[$__range])
```
Let's say our time window is 2 days (`2d`). The problem is that `sum_over_time` is a rollup (moving) function: the first value shown on the graph will be the cumulative sum over `[-4d, -2d]`. Values for the last 2 days will be somewhat smoothed by old values.
Ideally, we would like to ask Prometheus to have a fixed starting point (`-2d`), and a moving end point. This is really against the way Prometheus works, because the query range is made to be static, not dynamic.
After a lot of fruitless trials, I found a way to do it. It takes advantage of:
- Grafana exposing the `$__from` variable, which is the timestamp of the start of the range;
- The `time()` function which returns the timestamp currently being evaluated (it ranges from `start` to `end` with `step` increments).
The trick is as follows: first, build a time series made of timestamps:
```
vector(time()) > $__from/1000
```
Second, build "*rollup-resistant*" range-vectors:
```
(reaction_match_count and on() (vector(time()) > $__from/1000))[$__range:]
```
The [`and` operator](https://prometheus.io/docs/prometheus/latest/querying/operators/#logicalset-binary-operators) will drop every data point in the left part which is not in the right part. The `on()` clause has to do with [vector matching](https://prometheus.io/docs/prometheus/latest/querying/operators/#vector-matching); here it means that no label is required to match, *i.e.* every point with the same timestamp will match.
Finally we can compute the cumulative sum over `$__range`:
```
sum(sum_over_time((reaction_match_count and on() (vector(time()) > $__from/1000))[$__range:]))
```
The first point will be 0, exactly what we want. This is because, to compute the first point, Prometheus will start from `-2d`. Therefore, the outer sum will in theory be calculated with data from `]-4d, -2d]`. But our fixed time series only has data within `]-2d, now]`. In other words, Prometheus finds no values, giving 0. Then the window slides by `step`, and the cumulative sum over `[-2d, -2d+step]` gives the second value, and so on.
The drawback is these requests perform poorly on large ranges (like several weeks).
# Development setup
In addition to the prerequisites, you need [Poetry](https://python-poetry.org/).
```bash
# inside the cloned repository
poetry install
# run app
poetry run python -m reaction_metrics_exporter [...]
# run tests
poetry run pytest
```