# CVMFS server scraper and prometheus exporter
This tool scrapes the public metadata sources from a set of stratum0 and stratum1 servers. It grabs:
- cvmfs/info/v1/repositories.json
Then, for every repo it finds (and is not told to ignore), it grabs:
- cvmfs/<repo>/.cvmfs_status.json
- cvmfs/<repo>/.cvmfspublished
## Installation
`pip install cvmfs-server-scraper`
## Usage
````python
#!/usr/bin/env python3
import logging
from cvmfsscraper import scrape, scrape_server, set_log_level
# server = scrape_server("aws-eu-west1.stratum1.cvmfs.eessi-infra.org")
set_log_level(logging.DEBUG)
servers = scrape(
    stratum0_servers=[
        "stratum0.tld",
    ],
    stratum1_servers=[
        "stratum1-no.tld",
        "stratum1-au.tld",
    ],
    repos=[],
    ignore_repos=[],
)

# Note that the order of servers is undefined.
print(servers[0])

for repo in servers[0].repositories:
    print(f"Repo: {repo.name}")
    print(f"Root size: {repo.root_size}")
    print(f"Revision: {repo.revision}")
    print(f"Revision timestamp: {repo.revision_timestamp}")
    print(f"Last snapshot: {repo.last_snapshot}")
````
Note that if you are using a Stratum1 server with S3 as its backend, you need to set `repos` explicitly, because the S3 backend does not serve a `cvmfs/info/v1/repositories.json` file. The GeoAPI status will also be `NOT_FOUND` for these servers.
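Explicitly listed repos can still be scraped without `repositories.json` because the per-repo metadata lives at well-known paths. A minimal sketch of the URLs involved (server and repo names are placeholders; this is not the package's own code):

```python
def repo_metadata_urls(server: str, repo: str) -> dict[str, str]:
    """Build the well-known metadata URLs for one repo on one server."""
    base = f"http://{server}/cvmfs/{repo}"
    return {
        "status": f"{base}/.cvmfs_status.json",
        "published": f"{base}/.cvmfspublished",
    }

urls = repo_metadata_urls("s3-stratum1.tld", "software.example.org")
```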
# Data structure
## Server
A server object, representing a specific server that has been scraped.
````python
servers = scrape(...)
server_one = servers[0]
````
### Name
#### Type: Attribute
`server.name`
#### Returns
The name of the server, usually its fully qualified domain name.
### GeoApi status
#### Type: Attribute
`server.geoapi_status`
#### Returns
A GeoAPI status enum object, defined in `constants.py`. The possible values are:
- OK (0: OK)
- LOCATION_ERROR (1: GeoApi gives wrong location)
- NO_RESPONSE (2: No response)
- NOT_FOUND (9: The server has no repository available so the GeoApi cannot be tested)
- NOT_YET_TESTED (99: The server has not yet been tested)
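For illustration, the values above can be modelled as a plain Python enum (a re-creation for this example, not an import of the package's actual class):

```python
from enum import Enum

class GeoAPIStatus(Enum):
    # Values mirror the list above.
    OK = 0
    LOCATION_ERROR = 1
    NO_RESPONSE = 2
    NOT_FOUND = 9
    NOT_YET_TESTED = 99

def geoapi_is_healthy(status: GeoAPIStatus) -> bool:
    """Only a correct GeoAPI answer counts as healthy."""
    return status is GeoAPIStatus.OK
```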
### Repositories
#### Type: Attribute
`server.repositories`
#### Returns
A list of repository objects, sorted by name. Empty if no repositories were scraped on the server.
### Ignored repositories
#### Type: Attribute
`server.ignored_repositories`
#### Returns
A list of repository names that the scraper will ignore.
### Forced repositories
#### Type: Attribute
`server.forced_repositories`
#### Returns
A list of repository names that will always be scraped on the server. If a repo name appears in both `ignored_repositories` and `forced_repositories`, it is scraped.
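The precedence rule can be sketched as follows (illustrative only; the package's internal logic may be organised differently):

```python
def should_scrape(repo: str, ignored: set[str], forced: set[str]) -> bool:
    # Forced repos win over ignored ones; everything else is
    # scraped unless explicitly ignored.
    if repo in forced:
        return True
    return repo not in ignored
```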
## Repository
A repository object, representing a single repository on a scraped server.
````python
servers = scrape(...)
repo_one = servers[0].repositories[0]
````
### Name
#### Type: Attribute
`repo_one.name`
#### Returns
The fully qualified name of the repository.
### Server
#### Type: Attribute
`repo_one.server`
#### Returns
The server object to which the repository belongs.
### Path
#### Type: Attribute
`repo_one.path`
#### Returns
The path of the repository on the server, which may differ from its name. To get a complete URL, one can do:
`url = "http://" + repo_one.server.name + repo_one.path`
### Status attributes
These attributes are populated from `.cvmfs_status.json`:
| Attribute | Value |
| --- | --- |
| last_gc | Timestamp of last garbage collection |
| last_snapshot | Timestamp of the last snapshot |
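A hypothetical `.cvmfs_status.json` payload with these two keys could be decoded like this (the sample values and their asctime-like format are invented for the example; real files come from the server):

```python
import json
from datetime import datetime

# Invented sample payload with the two keys from the table above.
raw = '{"last_snapshot": "Fri Jun 14 23:00:02 UTC 2024", "last_gc": "Fri Jun 14 03:00:11 UTC 2024"}'
status = json.loads(raw)

# Parse the asctime-like timestamp; %Z accepts the literal "UTC".
last_snapshot = datetime.strptime(status["last_snapshot"], "%a %b %d %H:%M:%S %Z %Y")
```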
Information from `.cvmfspublished` is also provided. For explanations of these keys, please see the official CVMFS [documentation](https://cvmfs.readthedocs.io/en/stable/cpt-details.html). The Field column in the table below gives the corresponding key in `.cvmfspublished`.
| Attribute | Field |
| --- | --- |
| alternative_name | A |
| full_name | N |
| is_garbage_collectable | G |
| metadata_cryptographic_hash | M |
| micro_cataogues | L |
| reflog_checksum_cryptographic_hash | Y |
| revision_timestamp | T |
| root_catalogue_ttl | D |
| root_cryptographic_hash | C |
| root_size | B |
| root_path_hash | R |
| signature | The end signature blob |
| signing_certificate_cryptographic_hash | X |
| tag_history_cryptographic_hash | H |
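A `.cvmfspublished` file is essentially a series of lines, each starting with a one-character key, with a `--` line separating the fields from the trailing signature blob. A minimal decoding sketch using the mapping above (not the package's own parser):

```python
# Key-to-attribute mapping, mirroring the table above.
FIELD_NAMES = {
    "A": "alternative_name",
    "N": "full_name",
    "G": "is_garbage_collectable",
    "M": "metadata_cryptographic_hash",
    "L": "micro_cataogues",
    "Y": "reflog_checksum_cryptographic_hash",
    "T": "revision_timestamp",
    "D": "root_catalogue_ttl",
    "C": "root_cryptographic_hash",
    "B": "root_size",
    "R": "root_path_hash",
    "X": "signing_certificate_cryptographic_hash",
    "H": "tag_history_cryptographic_hash",
}

def parse_cvmfspublished(text: str) -> dict[str, str]:
    """Map each one-letter field key to its value; stop at the signature."""
    fields = {}
    for line in text.splitlines():
        if line == "--":  # everything after this is the signature blob
            break
        if line:
            fields[FIELD_NAMES.get(line[0], line[0])] = line[1:]
    return fields

sample = "Nexample.org\nT1718407200\nB4096\n--\nsignature-bytes"
parsed = parse_cvmfspublished(sample)
```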