# Colmet - Collecting metrics about jobs running in a distributed environment
## Introduction:
Colmet is a monitoring tool that collects metrics about jobs running in a
distributed environment, and is especially suited to gathering metrics on
clusters and grids. It currently provides several backends:
- Input backends:
    - taskstats: fetch task metrics from the Linux kernel
    - rapl: real-time power consumption metrics from Intel processors
    - perfhw: perf_event counters
    - jobproc: get information from /proc
    - ipmipower: get power metrics from IPMI
    - temperature: get temperatures from /sys/class/thermal
    - infiniband: get InfiniBand/Omni-Path network metrics
    - lustre: get Lustre filesystem stats
- Output backends:
    - elasticsearch: store the metrics in Elasticsearch indexes
    - hdf5: store the metrics in HDF5 files on the filesystem
    - stdout: display the metrics on the terminal
It uses ZeroMQ to transport the metrics across the network.
It is currently bound to the [OAR](http://oar.imag.fr) RJMS.
A Grafana [sample dashboard](./graph/grafana) is provided for the elasticsearch backend. Here are some snapshots:
![](./screenshot1.png)
![](./screenshot2.png)
## Installation:
### Requirements
- a Linux kernel that supports
    - Taskstats
    - intel_rapl (for the RAPL backend)
    - perf_event (for the perfhw backend)
    - ipmi_devintf (for the ipmi backend)
- Python version 2.7 or newer
    - python-zmq 2.2.0 or newer
    - python-tables 3.3.0 or newer
    - python-pyinotify 0.9.3-2 or newer
    - python-requests
- For the Elasticsearch output backend (recommended for sites with > 50 nodes)
    - An Elasticsearch server
    - A Grafana server (for visualization)
- For the RAPL input backend:
    - libpowercap, powercap-utils (https://github.com/powercap/powercap)
- For the infiniband backend:
    - `perfquery` command line tool
- For the ipmipower backend:
    - `ipmi-oem` command line tool (freeipmi) or another configurable command
### Installation
You can install, upgrade, or uninstall colmet with these commands:
```
$ pip install [--user] colmet
$ pip install [--user] --upgrade colmet
$ pip uninstall colmet
```
Or from git (latest development version):
```
$ pip install [--user] git+https://github.com/oar-team/colmet.git
```
Or if you have already pulled the sources:
```
$ pip install [--user] path/to/sources
```
### Usage:
For the nodes:
```
sudo colmet-node -vvv --zeromq-uri tcp://127.0.0.1:5556
```
For the collector:
```
# Simple local HDF5 file collect:
colmet-collector -vvv --zeromq-bind-uri tcp://127.0.0.1:5556 --hdf5-filepath /data/colmet.hdf5 --hdf5-complevel 9
```
```
# Collector with an Elasticsearch backend:
colmet-collector -vvv \
--zeromq-bind-uri tcp://192.168.0.1:5556 \
--buffer-size 5000 \
--sample-period 3 \
--elastic-host http://192.168.0.2:9200 \
--elastic-index-prefix colmet_dahu_ 2>>/var/log/colmet_err.log >> /var/log/colmet.log
```
You will see the number of counters retrieved in the debug log.
For more information, please refer to the help of these scripts (`--help`).
### Notes about backends
Some input backends need external libraries that must be compiled and installed beforehand:
```
# For the perfhw backend:
cd colmet/node/backends/lib_perf_hw/ && make && cp lib_perf_hw.so /usr/local/lib/
# For the rapl backend:
cd colmet/node/backends/lib_rapl/ && make && cp lib_rapl.so /usr/local/lib/
```
Here's a complete colmet-node start-up process, with the perfhw, rapl and other backends enabled:
```
export LIB_PERFHW_PATH=/usr/local/lib/lib_perf_hw.so
export LIB_RAPL_PATH=/applis/site/colmet/lib_rapl.so
colmet-node -vvv --zeromq-uri tcp://192.168.0.1:5556 \
--cpuset_rootpath /dev/cpuset/oar \
--enable-infiniband --omnipath \
--enable-lustre \
--enable-perfhw --perfhw-list instructions cache_misses page_faults cpu_cycles cache_references \
--enable-RAPL \
--enable-jobproc \
--enable-ipmipower >> /var/log/colmet.log 2>&1
```
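The `LIB_PERFHW_PATH` and `LIB_RAPL_PATH` variables tell colmet-node where to find the compiled helper libraries. As an illustrative sketch only (colmet's actual loading code may differ), such a path can be resolved and loaded with `ctypes`; the libc fallback below is purely so the snippet runs on machines without `lib_rapl.so`:

```python
import ctypes
import ctypes.util
import os

# Sketch: resolve a helper library from an environment variable, as the
# LIB_RAPL_PATH export above suggests. Falling back to libc is illustrative
# only, so the snippet stays runnable without lib_rapl.so installed.
lib_path = os.environ.get("LIB_RAPL_PATH") or ctypes.util.find_library("c")
lib = ctypes.CDLL(lib_path)
print(lib_path)
```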
#### RAPL - Running Average Power Limit (Intel)
RAPL is a feature of recent Intel processors that makes it possible to measure the power consumption of the CPU in real time.
Usage: start colmet-node with the option `--enable-RAPL`.
A file named `RAPL_mapping.[timestamp].csv` is created in the working directory. It establishes the correspondence between `counter_1`, `counter_2`, etc. in the collected data and the actual name of each metric, as well as the package and zone (core / uncore / dram) of the processor the metric refers to.
If a given counter is not supported by the hardware, the metric name will be "`counter_not_supported_by_hardware`" and `0` values will appear in the collected data; `-1` values in the collected data mean there is no counter mapped to the column.
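The mapping file can be used to relabel the generic counter columns. A minimal sketch in Python, assuming a hypothetical column layout for `RAPL_mapping.[timestamp].csv` (the columns of the real file may differ):

```python
import csv
import io

# Hypothetical sample of a RAPL mapping file; the column names and layout
# used here are an assumption, not colmet's documented format.
sample = """counter,metric_name,package,zone
counter_1,energy_uj,package-0,core
counter_2,counter_not_supported_by_hardware,package-0,uncore
"""

mapping = {}
for row in csv.DictReader(io.StringIO(sample)):
    # Counters the hardware does not support only ever report 0,
    # so they are skipped when relabeling columns.
    if row["metric_name"] != "counter_not_supported_by_hardware":
        mapping[row["counter"]] = row["metric_name"]

print(mapping)  # {'counter_1': 'energy_uj'}
```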
#### Perfhw
This backend provides metrics collected using the [perf_event_open](http://man7.org/linux/man-pages/man2/perf_event_open.2.html) interface.
Usage: start colmet-node with the option `--enable-perfhw`.
Optionally, choose the metrics you want (5 metrics max) with the option `--perfhw-list` followed by a space-separated list of metrics.
Example: `--enable-perfhw --perfhw-list instructions cpu_cycles cache_misses`
A file named `perfhw_mapping.[timestamp].csv` is created in the working directory. It establishes the correspondence between `counter_1`, `counter_2`, etc. in the collected data and the actual name of each metric.
Available metrics (refer to the perf_event_open documentation for their meaning):
```
cpu_cycles
instructions
cache_references
cache_misses
branch_instructions
branch_misses
bus_cycles
ref_cpu_cycles
cache_l1d
cache_ll
cache_dtlb
cache_itlb
cache_bpu
cache_node
cache_op_read
cache_op_prefetch
cache_result_access
cpu_clock
task_clock
page_faults
context_switches
cpu_migrations
page_faults_min
page_faults_maj
alignment_faults
emulation_faults
dummy
bpf_output
```
#### Temperature
This backend gets temperatures from `/sys/class/thermal/thermal_zone*/temp`.
Usage: start colmet-node with the option `--enable-temperature`.
A file named `temperature_mapping.[timestamp].csv` is created in the working directory. It establishes the correspondence between `counter_1`, `counter_2`, etc. in the collected data and the actual name of each metric.
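For reference, these sysfs files can be read directly with the Python standard library alone; a minimal reader, independent of colmet's own implementation, might look like:

```python
import glob
import os

def read_temperatures(base="/sys/class/thermal"):
    """Return {zone_name: degrees_celsius} for each thermal zone.

    The kernel exposes values in millidegrees Celsius, hence the /1000.
    Returns an empty dict on machines without thermal zones (e.g. some VMs).
    """
    temps = {}
    for zone in sorted(glob.glob(os.path.join(base, "thermal_zone*"))):
        try:
            with open(os.path.join(zone, "temp")) as f:
                temps[os.path.basename(zone)] = int(f.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # a zone can vanish or be unreadable; skip it
    return temps

print(read_temperatures())
```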
Colmet CHANGELOG
================
Version 0.6.11
--------------
Unreleased
Version 0.6.10
--------------
- Fixed missing exception handling in the elasticsearch backend (collector)
- ZMQ: Prefer SNDHWM and RCVHWM to HWM
- Fixed: taskstats data could block collection of other data when the cpuset is empty (node)
Version 0.6.9
-------------
- Fix for newer pyzmq versions
Version 0.6.8
-------------
- Added nvidia GPU support
Version 0.6.7
-------------
- bugfix: missing glob import in procstats
Version 0.6.6
-------------
- Added --no-check-certificates option for the elastic backend
- Added involved jobs and new metrics to jobprocstats
Version 0.6.4
-------------
- Added http auth support for elasticsearch backend
Version 0.6.3
-------------
Released on September 4th 2020
- Bugfixes into lustrestats and jobprocstats backend
Version 0.6.2
-------------
Released on September 3rd 2020
- Python package fix
Version 0.6.1
-------------
Released on September 3rd 2020
- New input backends: lustre, infiniband, temperature, rapl, perfhw, ipmipower, jobproc
- New output backend: elasticsearch
- Example Grafana dashboard for Elasticsearch backend
- Added "involved_jobs" value for metrics that are global to a node (job 0)
- Bugfix for "dictionary changed size during iteration"
Version 0.5.4
-------------
Released on January 19th 2018
- hdf5 extractor script for OAR RESTFUL API
- Added infiniband backend
- Added lustre backend
- Fixed cpuset_rootpath default always appended
Version 0.5.3
-------------
Released on April 29th 2015
- Removed an unnecessary lock from the collector to avoid colmet waiting forever
- Removed (async) zmq eventloop and added ``--sample-period`` to the collector.
- Fixed some bugs about hdf file
Version 0.5.2
-------------
Released on Apr 2nd 2015
- Fixed python syntax error
Version 0.5.1
-------------
Released on Apr 2nd 2015
- Fixed error about missing ``requirements.txt`` file in the sdist package
Version 0.5.0
-------------
Released on Apr 2nd 2015
- Don't run colmet as a daemon anymore
- Maintained compatibility with zmq 3.x/4.x
- Dropped ``--zeromq-swap`` (swap was dropped from zmq 3.x)
- Handled zmq name change from HWM to SNDHWM and RCVHWM
- Fixed requirements
- Dropped python 2.6 support
Version 0.4.0
-------------
- Saved metrics in new HDF5 file if colmet is reloaded in order to avoid HDF5 data corruption
- Handled HUP signal to reload ``colmet-collector``
- Removed ``hiwater_rss`` and ``hiwater_vm`` collected metrics.
Version 0.3.1
-------------
- New metrics ``hiwater_rss`` and ``hiwater_vm`` for taskstats
- Worked with pyinotify 0.8
- Added ``--disable-procstats`` option to disable procstats backend.
Version 0.3.0
-------------
- Divided colmet package into three parts
- colmet-node : Retrieve data from taskstats and procstats and send to
collectors with ZeroMQ
- colmet-collector : A collector that stores data received by ZeroMQ in a
hdf5 file
- colmet-common : Common colmet part.
- Added some parameters of ZeroMQ backend to prevent a memory overflow
- Simplified the command line interface
- Dropped rrd backend because it is not yet working
- Added ``--buffer-size`` option for collector to define the maximum number of
counters that colmet should queue in memory before pushing it to output
backend
- Handled SIGTERM and SIGINT to terminate colmet properly
Version 0.2.0
-------------
- Added options to enable hdf5 compression
- Support for multiple jobs by cgroup path scanning
- Used Inotify events for job list update
- Don't filter packets if no job_id range was specified, especially with zeromq
backend
- Waited for the cgroup_path folder creation before scanning the list of jobs
- Added procstat for node monitoring through a fictive job with 0 as identifier
- Used absolute times for measurements, not the delay between measurements, to
  avoid drift of the measurement time
- Added a workaround for when a new cgroup is created without any process in it
  (monitoring is suspended until a process is launched)
Version 0.0.1
-------------
- Conception