# Matricula Online Scraper
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/matricula-online-scraper?logo=python)
![GitHub License](https://img.shields.io/github/license/lsg551/matricula-online-scraper?logo=pypi)
![PyPI - Version](https://img.shields.io/pypi/v/matricula-online-scraper?logo=pypi)
> :warning: This tool is still under development and is NOT yet feature-complete. Expect breaking changes and bugs. Please report any issues.
[Matricula Online](https://data.matricula-online.eu/) is a website that hosts parish registers from various regions across Europe. This CLI tool allows you to fetch data from it and save the data to a file.
---
Our GitHub Workflow automatically scrapes a list of all parishes once a week and pushes it to [`cache/parishes`](https://github.com/lsg551/matricula-online-scraper/tree/cache/parishes). Download [`parishes.csv`](https://github.com/lsg551/matricula-online-scraper/raw/cache/parishes/parishes.csv.gz) ⚡️
[![Cache Parishes](https://github.com/lsg551/matricula-online-scraper/actions/workflows/cache-parishes.yml/badge.svg)](https://github.com/lsg551/matricula-online-scraper/actions/workflows/cache-parishes.yml)
![GitHub last commit (branch)](https://img.shields.io/github/last-commit/lsg551/matricula-online-scraper/cache%2Fparishes?path=parishes.csv.gz&label=last%20caching&cacheSeconds=43200)
---
Note that this tool does not format or clean the data in any way; the data is saved as-is to a file. I mention this because the original data is poorly formatted and contains many inconsistencies. It is up to the user to process the data further.
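As a rough illustration of such post-processing, the sketch below collapses repeated whitespace in every string value of a record. It is only a guess at what cleaning might be useful: the field names in the sample record (`name`, `region`, `count`) are made up for the example and are not the tool's actual output schema.

```python
import re

# Matches any run of whitespace (spaces, tabs, newlines).
_WHITESPACE = re.compile(r"\s+")


def normalize_value(value):
    """Collapse whitespace runs and trim strings; leave other types untouched."""
    if isinstance(value, str):
        return _WHITESPACE.sub(" ", value).strip()
    return value


def normalize_record(record: dict) -> dict:
    """Return a copy of the record with every string value normalized."""
    return {key: normalize_value(value) for key, value in record.items()}


# Hypothetical record; the real field names may differ.
sample = {"name": "  St.  Martini\n", "region": "Münster ", "count": 3}
cleaned = normalize_record(sample)
print(cleaned)
```

Which normalizations are appropriate depends on what you find in the scraped data, so treat this as a starting point rather than a recommended pipeline.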
## 🔧 Installation
Make sure you have a recent version of Python installed; the package metadata for version 0.5.0 requires Python `>=3.12,<4.0`. You can then install this script via `pip`:
```console
$ pip install --user matricula-online-scraper
```
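To check an interpreter before installing, a small sketch like the following may help. The bounds are taken from the `requires_python` field (`<4.0,>=3.12`) in the package metadata for version 0.5.0, so other releases may differ. The helper name `is_supported` is purely illustrative.

```python
import sys

# Bounds copied from the 0.5.0 package metadata: ">=3.12,<4.0".
MIN_VERSION = (3, 12)
MAX_MAJOR_EXCLUSIVE = 4


def is_supported(version=None) -> bool:
    """Return True if the (major, minor, ...) version satisfies the bounds."""
    if version is None:
        version = sys.version_info
    major, minor = version[0], version[1]
    return (major, minor) >= MIN_VERSION and major < MAX_MAJOR_EXCLUSIVE


print(is_supported())
```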
Alternatively, you can clone this repository and run the script with [Poetry](https://python-poetry.org).
## 💡 How To Use
```console
$ matricula-online-scraper --help
```
This prints the available commands and options, including documentation. The same goes for each subcommand, e.g. `matricula-online-scraper fetch --help`.
The `fetch` command is the primary command for fetching resources from Matricula Online. Its subcommands allow you to scrape different resources; run `matricula-online-scraper fetch --help` to see the available subcommands.
### Example 1:
Fetch all available locations and save them to a `.jsonl` file:
```console
$ matricula-online-scraper fetch locations ./output.jsonl
```
> :warning: This will fetch all parishes from Matricula Online, which may take a few minutes. The data changes only rarely, and frequent scraping puts unnecessary load on the server. Our GitHub Workflow therefore caches this data once a week and pushes it to [`cache/parishes`](https://github.com/lsg551/matricula-online-scraper/tree/cache/parishes). ⚡️ [Download CSV](https://github.com/lsg551/matricula-online-scraper/raw/cache/parishes/parishes.csv.gz) ⚡️
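If you only need the parish list, reading the cached file is usually enough. The following is a minimal sketch using only the Python standard library. It assumes the file is a gzip-compressed, UTF-8 encoded CSV with a header row, and it makes no assumption about the column names. The function names are illustrative.

```python
import csv
import gzip
import io
import urllib.request

# Download link for the cached parish list, as given above.
CACHE_URL = (
    "https://github.com/lsg551/matricula-online-scraper"
    "/raw/cache/parishes/parishes.csv.gz"
)


def parse_gzipped_csv(payload: bytes) -> list[dict[str, str]]:
    """Decompress a gzipped CSV and return its rows keyed by the header row."""
    text = gzip.decompress(payload).decode("utf-8")  # assumes UTF-8
    # newline="" leaves line endings alone, as the csv module expects.
    return list(csv.DictReader(io.StringIO(text, newline="")))


def download_parishes(url: str = CACHE_URL) -> list[dict[str, str]]:
    """Fetch the cached file over HTTP and parse it (performs a network request)."""
    with urllib.request.urlopen(url) as response:
        return parse_gzipped_csv(response.read())


# Example (not executed here because it needs network access):
# parishes = download_parishes()
# print(len(parishes), list(parishes[0]) if parishes else [])
```

Keeping the parsing step separate from the download makes it easy to reuse on a copy of the file you have already saved locally.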
### Example 2:
Fetch all available registers from one parish in Münster, Germany and save them to a `.jsonl` file:
```console
$ matricula-online-scraper fetch parish ./output.jsonl --urls https://data.matricula-online.eu/en/deutschland/muenster/muenster-st-martini/
```
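The resulting `.jsonl` file can then be loaded for further processing. Below is a minimal sketch, assuming the output follows the usual JSON Lines convention of one JSON value per line. The path `./output.jsonl` matches the examples above, while the helper name `iter_jsonl` is illustrative only and not part of the tool.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def iter_jsonl(path) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as error:
                # Report the offending line so malformed output is easy to find.
                raise ValueError(f"{path}:{line_number}: invalid JSON") from error


# Example (assumes ./output.jsonl was produced by one of the commands above):
# records = list(iter_jsonl("./output.jsonl"))
# print(len(records))
```

Because the function is a generator, large files can be processed record by record without loading everything into memory.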
## License & Contributing
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions, especially bug fixes. Please make sure to follow the [Contributing Guidelines](CONTRIBUTING.md).
Raw data
{
"_id": null,
"home_page": "https://github.com/lsg551/matricula-online-scraper",
"name": "matricula-online-scraper",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0,>=3.12",
"maintainer_email": null,
"keywords": "matricula-online, matricula, scraper, parish-registers",
"author": "Luis Schulte",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/1c/8b/cb2658fbf14950935935709f6065afb4f395960d786edfd69ac81d27eec4/matricula_online_scraper-0.5.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Command Line Interface tool for scraping Matricula Online https://data.matricula-online.eu.",
"version": "0.5.0",
"project_urls": {
"Homepage": "https://github.com/lsg551/matricula-online-scraper",
"Repository": "https://github.com/lsg551/matricula-online-scraper"
},
"split_keywords": [
"matricula-online",
" matricula",
" scraper",
" parish-registers"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "14159dc86b2c0530e67aed091e7cf9253b41f6dbada70c42b2554cb37a2e6b0d",
"md5": "bc9887866182b14c781df3cd03067fa2",
"sha256": "2d376628569ad5a07201d4824b02fa66e3685199f26718009c90536c7feff7b5"
},
"downloads": -1,
"filename": "matricula_online_scraper-0.5.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "bc9887866182b14c781df3cd03067fa2",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.12",
"size": 14665,
"upload_time": "2024-06-07T16:26:00",
"upload_time_iso_8601": "2024-06-07T16:26:00.807667Z",
"url": "https://files.pythonhosted.org/packages/14/15/9dc86b2c0530e67aed091e7cf9253b41f6dbada70c42b2554cb37a2e6b0d/matricula_online_scraper-0.5.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "1c8bcb2658fbf14950935935709f6065afb4f395960d786edfd69ac81d27eec4",
"md5": "4d665c76c1f1bc786a49314841659bb8",
"sha256": "e31ab30e158c10954a3a98925a1df743ceb6126f66ae01d4536367c5a7a7a184"
},
"downloads": -1,
"filename": "matricula_online_scraper-0.5.0.tar.gz",
"has_sig": false,
"md5_digest": "4d665c76c1f1bc786a49314841659bb8",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.12",
"size": 11248,
"upload_time": "2024-06-07T16:26:03",
"upload_time_iso_8601": "2024-06-07T16:26:03.407856Z",
"url": "https://files.pythonhosted.org/packages/1c/8b/cb2658fbf14950935935709f6065afb4f395960d786edfd69ac81d27eec4/matricula_online_scraper-0.5.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-06-07 16:26:03",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "lsg551",
"github_project": "matricula-online-scraper",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "matricula-online-scraper"
}