<h1 align="center">Deckard 🕵️‍♂️</h1>
<p align="center">Extract structured data from unstructured text: no AI, just regular expressions. 🔍</p>
> Deckard is a library of regular-expression patterns for extracting structured data (addresses, phone numbers, email addresses, etc.) and a small set of helper utilities that make using those patterns easier.

> Status: very early-stage project. Right now the repository contains mostly patterns for Poland. I am looking for contributors from around the world 🌍. Address formats, phone-number formats and other data representations differ by country, so the goal is to gather country-specific patterns for many regions.
## Key features ✨
- 🗂️ A collection of ready-to-use regex patterns organized by country (for example [`deckard/patterns/pl.py`](./deckard/patterns/pl.py)).
- 📦 Universal patterns (e.g. email) live in [`deckard/patterns/standard.py`](./deckard/patterns/standard.py).
- 🛠️ A small helper function `deckard.search` that combines multiple patterns and returns named-group matches ([`deckard/main.py`](./deckard/main.py)).
## Installation ⚙️
From PyPI:
```bash
pip install deckard
```
Editable install for local development (run from the root of a cloned repository):
```bash
pip install -e .
```
### For contributors: install dependencies with Poetry 🧑‍💻
This project uses Poetry to manage both runtime and development dependencies.
1. Install Poetry (see https://python-poetry.org for instructions).
2. From the project root run:
```bash
poetry install
```
This will create a virtual environment and install runtime and development dependencies (including `pytest`).
To run tests using Poetry:
```bash
poetry run pytest
```
Or start a shell in the created virtualenv and run tests directly:
```bash
poetry shell
pytest
```
## Quick usage 🧭
Example using the current public API:
```python
from deckard import search
from deckard.patterns import standard, pl

text = (
    "Hello, my email is spaceshaman@tuta.io and my phone number is "
    "+48 792 321 321 and my address is ul. Tesotowa 12/6A, 66-700 Bielsko-Biała."
)

result = search([standard.EMAIL, pl.MOBILE_PHONE, pl.ADDRESS], text)

# result.groupdict() will return a dict of named groups, for example:
# {
#     'email': 'spaceshaman@tuta.io',
#     'mobile_phone': '792 321 321',
#     'street': 'ul. Tesotowa',
#     'building': '12',
#     'apartment': '6A',
#     'zip_code': '66-700',
#     'city': 'Bielsko-Biała'
# }
```
The `search` helper composes the provided patterns into a single regex (using lookaheads) and returns the first match as a `regex.Match` object (or `None` if nothing matched).
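To make the lookahead idea concrete, here is a minimal sketch built on the standard-library `re` module. It is not Deckard's actual implementation: the helper names `combine_with_lookaheads` and `search_sketch`, as well as the two simplified patterns, are assumptions made purely for illustration.

```python
import re

# Simplified stand-in patterns (assumptions, not Deckard's real definitions).
EMAIL = r"(?P<email>[\w.+-]+@[\w-]+\.[\w.-]+)"
ZIP_CODE = r"(?P<zip_code>\d{2}-\d{3})"


def combine_with_lookaheads(patterns):
    """Wrap each pattern in a lookahead so all of them must match somewhere
    in the text, regardless of the order in which they appear."""
    combined = "".join(f"(?=.*?{pattern})" for pattern in patterns)
    # DOTALL lets `.` cross newlines, so fields may sit on different lines.
    return re.compile(combined, re.DOTALL)


def search_sketch(patterns, text):
    """Return the first match, or None if any pattern is missing."""
    return combine_with_lookaheads(patterns).search(text)


text = "Write to jan@example.com, 66-700 Testowo."
match = search_sketch([ZIP_CODE, EMAIL], text)
fields = match.groupdict() if match else {}
print(fields)
```

Because every lookahead starts from the same position, the order of patterns in the list does not have to follow the order of the fields in the text. In this sketch a single missing field makes the whole search return `None`; whether Deckard behaves the same way should be checked against `deckard/main.py`.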
## Repository layout
- [`deckard/`](./deckard/): library code
  - [`deckard/main.py`](./deckard/main.py): helper `search` function
  - [`deckard/patterns/standard.py`](./deckard/patterns/standard.py): universal patterns (e.g. `EMAIL`)
  - [`deckard/patterns/pl.py`](./deckard/patterns/pl.py): Poland-specific patterns (address, postal code, phone, etc.)
- [`tests/`](./tests/): unit tests

Examples of existing tests:
- [`tests/test_standard_patterns.py`](./tests/test_standard_patterns.py): test for `standard.EMAIL`
- [`tests/test_search_with_multiple_patterns.py`](./tests/test_search_with_multiple_patterns.py): integration tests combining `standard.EMAIL` with patterns from `pl.py`
- [`tests/pl/test_search_address_pl.py`](./tests/pl/test_search_address_pl.py): tests for Polish address patterns

Every new pattern must come with tests. Pull requests without tests will not be accepted.
## Contributing: how to add new patterns
1. Create a new file under [`deckard/patterns/`](./deckard/patterns/) named after the country code, e.g. `us.py`, `de.py`, `fr.py`.
2. Define constants (UPPERCASE) for each pattern, for example `MOBILE_PHONE`, `ADDRESS`, `ZIP_CODE`.
3. Add tests under `tests/`. Use the existing Polish tests (e.g. `tests/test_search_with_multiple_patterns.py`) as a template. Provide normal and edge-case examples.
4. In the PR description explain local rules (phone number format, postal code format, common street abbreviations, etc.).
5. PRs without tests will not be accepted.

Tips 💡:
- 🧾 Use clear, consistent named groups in regexes (`?P<name>`) so `groupdict()` returns a predictable structure.
- 📝 Document complex patterns with comments and example inputs if necessary.
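The steps and tips above could look roughly like the following sketch. Everything in it is hypothetical: the `ZIP_CODE` constant stands in for the contents of an imagined `deckard/patterns/us.py`, and the pattern, the `extract_zip` helper and the test cases are illustrative assumptions rather than code from the repository. It uses only the standard-library `re` module, with `re.VERBOSE` so the pattern can carry inline comments.

```python
import re

# Hypothetical module contents (e.g. a new deckard/patterns/us.py).
# The constant name, group name and pattern are illustrative assumptions.
ZIP_CODE = r"""
    \b(?P<zip_code>
        \d{5}          # five-digit base code, e.g. 90210
        (?:-\d{4})?    # optional ZIP+4 suffix, e.g. 90210-1234
    )\b
"""

# VERBOSE makes the regex engine ignore the whitespace and comments above.
ZIP_CODE_RE = re.compile(ZIP_CODE, re.VERBOSE)

# Normal and edge-case inputs paired with the expected `zip_code` group
# (None means "no match expected").
CASES = [
    ("Beverly Hills, CA 90210", "90210"),
    ("Send it to 90210-1234 today", "90210-1234"),
    ("Order number 1234 shipped", None),
    ("Serial 123456 is not a ZIP code", None),
]


def extract_zip(text):
    """Return the named group value, or None when nothing matches."""
    match = ZIP_CODE_RE.search(text)
    return match.groupdict()["zip_code"] if match else None


def test_zip_code_cases():
    for text, expected in CASES:
        assert extract_zip(text) == expected, text


test_zip_code_cases()
```

Keeping the expected values in a small table makes it easy to add further edge cases as local formatting rules are discovered.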
## Discussion and roadmap 🚧
The project is not yet final, and everything is open for discussion. Areas for contributors and discussion include:
- 📋 Defining a minimal set of patterns every country should provide (email, phone, address, postal code, national ID where applicable).
- 🔠 Standardizing group names (`street`, `building`, `apartment`, `zip_code`, `city`, `country`, `mobile_phone`, etc.).
- ⚖️ Tools for validation and normalization of extracted values.
- 🤖 Automating tests with sample documents in various languages.

If you want to help, open an issue or a PR. A short description of the local data format and one or two patterns with tests is a great place to start.
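As a starting point for the normalization discussion, one possible shape for such a helper is sketched below. Nothing like it exists in the library as far as this README states: the function name `normalize_fields` and its rules are assumptions, and the group names are simply the ones used in the examples above.

```python
import re


def normalize_fields(fields):
    """Return a tidied copy of a groupdict-style mapping.

    Hypothetical helper: phone numbers are reduced to digits, other
    values only have their whitespace collapsed.
    """
    normalized = {}
    for name, value in fields.items():
        if value is None:
            # Optional groups that did not take part in the match stay None.
            normalized[name] = None
        elif name == "mobile_phone":
            normalized[name] = re.sub(r"\D", "", value)  # keep digits only
        else:
            normalized[name] = " ".join(value.split())  # collapse whitespace
    return normalized


raw = {"mobile_phone": "792 321 321", "city": "  Bielsko-Biała ", "apartment": None}
print(normalize_fields(raw))
```

Country-specific rules (for example, how a national prefix should be handled) would still need to be agreed on per country module.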
## License 📄
This project is licensed under the MIT License. See the [LICENSE](./LICENSE) file for the full text.
---
Thanks for your interest, and please join the effort. Together we can build an international library of patterns to extract structured data from arbitrary text using robust regular expressions. 🚀
Raw data
{
"_id": null,
"home_page": "https://github.com/SpaceShaman/deckard",
"name": "deckard",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0,>=3.10",
"maintainer_email": null,
"keywords": "regex, data-extraction, regular-expression",
"author": "SpaceShaman",
"author_email": "spaceshaman@tuta.io",
"download_url": "https://files.pythonhosted.org/packages/ff/15/53003d36fd8d2d79ba0f29bdb751a8bf76e2f54dfcf6ba10e658bbbf9aae/deckard-0.1.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Extract structured data from unstructured text \u2014 no AI, just regular expressions. \ud83d\udd0d",
"version": "0.1.0",
"project_urls": {
"Documentation": "https://github.com/SpaceShaman/deckard",
"Homepage": "https://github.com/SpaceShaman/deckard",
"Repository": "https://github.com/SpaceShaman/deckard"
},
"split_keywords": [
"regex",
" data-extraction",
" regular-expression"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "8bb790e19c1738af0768c86254731c304ebe20446b40ed137f471038e5f2a413",
"md5": "4edee0ade70d10e3e6378498cf7827f6",
"sha256": "d7e616f7507dbedad0787cf21b291c3efc747f3dd24ad1bbc820bdbab4e369ee"
},
"downloads": -1,
"filename": "deckard-0.1.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "4edee0ade70d10e3e6378498cf7827f6",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.10",
"size": 6003,
"upload_time": "2025-08-17T10:37:22",
"upload_time_iso_8601": "2025-08-17T10:37:22.029628Z",
"url": "https://files.pythonhosted.org/packages/8b/b7/90e19c1738af0768c86254731c304ebe20446b40ed137f471038e5f2a413/deckard-0.1.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "ff1553003d36fd8d2d79ba0f29bdb751a8bf76e2f54dfcf6ba10e658bbbf9aae",
"md5": "8d175a41edbd27baf3efddfe09446bc1",
"sha256": "7fdcbac0ae6eef71123efe0efffed065ab66c471699bcf86f6195afbf7ea8d32"
},
"downloads": -1,
"filename": "deckard-0.1.0.tar.gz",
"has_sig": false,
"md5_digest": "8d175a41edbd27baf3efddfe09446bc1",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.10",
"size": 4836,
"upload_time": "2025-08-17T10:37:23",
"upload_time_iso_8601": "2025-08-17T10:37:23.234052Z",
"url": "https://files.pythonhosted.org/packages/ff/15/53003d36fd8d2d79ba0f29bdb751a8bf76e2f54dfcf6ba10e658bbbf9aae/deckard-0.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-17 10:37:23",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "SpaceShaman",
"github_project": "deckard",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "deckard"
}