# fek-extractor
Extract structured data from Greek Government Gazette (ΦΕΚ) PDFs.
It turns messy, two‑column government PDFs into machine‑readable **JSON/CSV** with FEK metadata and a clean map of **Άρθρα** (articles).
Built on `pdfminer.six`, with careful two‑column handling, header/footer filtering, Greek‑aware de‑hyphenation, and article detection.
---
## About this project
Greek Government Gazette (ΦΕΚ) documents look uniform at a glance, but their **typesetting and structure are anything but**. Even “clean” digital PDFs hide quirks that trip up generic parsers and off‑the‑shelf AI:
- **Multi‑column reading order** with full‑width “tails”, footers, and boilerplate that disrupt token flow.
- **Title vs. body separation** where headings, subtitles, and continuations interleave across pages.
- **Dense legal cross‑references and amendments**, with nested exceptions and renumbered clauses.
- **Inconsistent numbering and metadata**, plus occasional encoding artifacts and discretionary hyphens.
This project addresses those realities with a **layout‑aware, domain‑specific pipeline** that prioritizes *determinism* and *inspectability*:
- **Layout‑aware text reconstruction** — two‑column segmentation (k‑means + gutter valley), “tail” detection, header/footer filtering, and stable reading order.
- **Article‑structure recovery** — detects `Άρθρο N`, associates titles and bodies across page boundaries, and synthesizes a hierarchical TOC when possible.
- **Greek‑aware normalization** — de‑hyphenates safely (soft/discretionary hyphens, wrapped words) while preserving accents/case.
- **Domain heuristics + light NLP hooks** — FEK masthead parsing (series/issue/date), decision numbers, and simple patterns for subject/Θέμα; extension points for NER and reference extraction.
- **Transparent debugging** — page‑focused debug mode and optional metrics so you can see *why* a page parsed a certain way.
**Who it’s for:** legal‑tech teams, data engineers, and researchers who need **reproducible, explainable** FEK extraction that won’t crumble on edge cases.
**Outcome:** **structured, searchable, dependable** data for automation, analysis, and integration.
If your team needs tailored FEK pipelines or additional NLP components, **[AspectSoft](https://aspectsoft.gr)** can help.
---
## Features
- **FEK-aware text extraction**
- Two-column segmentation via k-means over x-coordinates with a gutter-valley heuristic.
- Per-page region classification & demotion — header, footer, column body, full-width tail, noise.
- “Tail” detection for full-width content (signatures, appendices, tables) below the columns.
- Header/footer cleanup tuned to FEK mastheads and page furniture.
- Deterministic reading order (by column → y → x); graceful single-column fallback.
- **Greek de-hyphenation**
- Removes soft/discretionary hyphens (U+00AD) and stitches wrapped words safely.
- Preserves accents/case; conservative rules to avoid over-merging.
- Handles common typography patterns (e.g., hyphen + space breaks).
- **Header parsing**
- Extracts FEK **series** (Α/Β/… including word→letter normalization), **issue number**, and **date** in both `DD.MM.YYYY` and ISO `YYYY-MM-DD` (sketched below, after this list).
- Best-effort detection of **decision numbers** (e.g., “Αριθ.”).
- Tolerant to spacing/diacritic/punctuation variants.
- **Article detection**
- Recognizes `Άρθρο N` (including letter suffixes like `14Α`) and captures **title + body**.
- Stitches articles across page boundaries; keeps original and normalized numbering.
- Produces a structured **articles map** for direct programmatic use.
- **TOC synthesis** *(optional)*
- Builds a hierarchical TOC where present:
**ΜΕΡΟΣ → ΤΙΤΛΟΣ → ΚΕΦΑΛΑΙΟ → ΤΜΗΜΑ → Άρθρα**.
- Emits clean JSON for navigation, QA, or UI rendering.
- **Metrics** *(opt-in via `--include-metrics`)*
- Lengths & counts (characters, words, lines) and median line length.
- Top words, character histogram, and pluggable regex matches (e.g., FEK citations, “Θέμα:”).
- **CLI & Python API**
- CLI: single file or directory recursion, JSON/CSV output, `--jobs N` for parallel processing, focused logging via `--debug [PAGE]`.
- API: `extract_pdf_info(path, include_metrics=True|False, ...)` returns a ready-to-use record.
- **Typed codebase & tests**
- Static typing (PEP 561), lint/format (ruff, black), type checks (mypy), and tests (pytest).
- Clear module boundaries (`io/`, `parsing/`, `metrics.py`, `cli.py`, `core.py`).
With this mix, FEK PDFs become consistent, navigable JSON/CSV with reliable metadata and article structure—ready for indexing, analytics, and automation.
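To make the header parsing concrete, here is a minimal sketch of the dotted-date normalization (illustrative only; the real parser in `parsing/headers.py` tolerates spacing, diacritic, and punctuation variants):

```python
import re

# Illustrative sketch: normalize a dotted FEK date (DD.MM.YYYY) to ISO (YYYY-MM-DD).
DOTTED_DATE = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b")

def dotted_to_iso(text: str) -> str | None:
    """Return the first dotted date in `text` as YYYY-MM-DD, or None."""
    m = DOTTED_DATE.search(text)
    if m is None:
        return None
    day, month, year = m.groups()
    return f"{year}-{int(month):02d}-{int(day):02d}"

print(dotted_to_iso("Αρ. Φύλλου 136, 17.07.2020"))  # 2020-07-17
```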
> ✨ Sample PDF for testing ships in `data/samples/gr-act-2020-4706-4706_2020.pdf`.
---
## Table of contents
- [Demo \& screenshots](#demo--screenshots)
- [Requirements](#requirements)
- [Install](#install)
- [Quickstart](#quickstart)
- [CLI usage](#cli-usage)
- [Python API](#python-api)
- [Output schema](#output-schema)
- [Technical deep dive](#technical-deep-dive)
- [Architecture](#architecture)
- [Examples](#examples)
- [Debug helpers](#debug-helpers)
- [Performance tips](#performance-tips)
- [Project layout](#project-layout)
- [Development](#development)
- [Contributing](#contributing)
- [License](#license)
- [Contact](#contact)
- [Acknowledgements](#acknowledgements)
---
## Demo & screenshots

---
## Requirements
- **Python 3.10+**
- **OS:** Linux, macOS, or Windows
- **Runtime dependency:** [`pdfminer.six`](https://github.com/pdfminer/pdfminer.six)
---
## Install
### From PyPI
```bash
pip install fek-extractor
```
### From source (editable)
```bash
git clone https://github.com/dmsfiris/fek-extractor.git
cd fek-extractor
python -m venv .venv
source .venv/bin/activate # Windows: .\.venv\Scripts\Activate.ps1
pip install -U pip
pip install -e . # library + CLI
# or full dev setup
pip install -e ".[dev]"
pre-commit install
```
### With pipx (isolated CLI)
```bash
pipx install fek-extractor
```
### Docker (no local Python needed)
```bash
docker run --rm -v "$PWD:/work" -w /work python:3.11-slim bash -lc "pip install fek-extractor && fek-extractor -i data/samples -o out.json"
```
---
## Quickstart
```bash
# JSON (default)
fek-extractor -i data/samples -o out.json -f json

# CSV
fek-extractor -i data/samples -o out.csv -f csv

# As a module (equivalent to the CLI)
python -m fek_extractor -i data/samples -o out.json
```
---
## CLI usage
```
usage: fek-extractor [-h] --input INPUT [--out OUT] [--format {json,csv}]
                     [--no-recursive] [--debug [PAGE]] [--jobs JOBS]
                     [--include-metrics] [--articles-only] [--toc-only]

Extract structured info from FEK/Greek-law PDFs.
```
**Options**
- `-i, --input PATH` (required) — PDF *file* or *directory*.
- `-o, --out PATH` (default: `out.json`) — Output path.
- `-f, --format {json,csv}` (default: `json`) — Output format.
- `--no-recursive` — When `--input` is a directory, do **not** recurse.
- `--debug [PAGE]` — Enable debug logging; optionally pass a **page number** (e.g. `--debug 39`) to focus per‑page debug.
- `--jobs JOBS` — Parallel workers when input is a **folder** (default 1).
- `--include-metrics` — Add metrics into each record (see below).
- `--articles-only` — Emit **only** the articles map as JSON (ignores `-f csv`).
- `--toc-only` — Emit **only** the synthesized Table of Contents as JSON.
---
## Python API
```python
from fek_extractor import extract_pdf_info

# Single PDF → record (dict)
record = extract_pdf_info("data/samples/gr-act-2020-4706-4706_2020.pdf", include_metrics=True)
print(record["filename"], record["pages"], record["articles_count"])

# Optional kwargs (subject to change):
#   debug=True
#   debug_pages=[39]   # focus page(s) for diagnostics
#   dehyphenate=True   # on by default
```
**Return type**: `dict[str, Any]` with the fields shown in [Output schema](#output-schema).
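For batch work against a folder, a minimal sketch using only the documented entry point (the `data/samples` path and output name are placeholders):

```python
import json
from pathlib import Path

from fek_extractor import extract_pdf_info

# One record per PDF, in a stable (sorted) order.
records = [
    extract_pdf_info(str(pdf), include_metrics=False)
    for pdf in sorted(Path("data/samples").glob("*.pdf"))
]

# ensure_ascii=False keeps Greek text readable in the output file.
Path("out.json").write_text(json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8")
```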
---
## Output schema
Each **record** (per PDF) typically contains:
| Field | Type | Notes |
|------|------|------|
| `path` | string | Absolute or relative input path |
| `filename` | string | File name only |
| `pages` | int | Page count |
| `fek_series` | string? | Single Greek letter (e.g. `Α`) if detected |
| `fek_number` | string? | Issue number if detected |
| `fek_date` | string? | Dotted date `DD.MM.YYYY` |
| `fek_date_iso` | string? | ISO date `YYYY-MM-DD` |
| `decision_number` | string? | From “Αριθ.” if found |
| `subject` | string? | Document subject/Θέμα (best‑effort) |
| `articles` | object | Map of **article number → article object** |
| `articles_count` | int | Convenience total |
| `first_5_lines` | array | First 5 text lines (debugging aid) |
| **Metrics** *(only when `--include-metrics`)* |||
| `length` | int | Characters in raw text |
| `num_lines` | int | Number of lines |
| `median_line_length` | int | Median non‑empty line length |
| `char_counts` | object | Char → count |
| `word_counts_top` | object | Top words |
| `chars`, `words` | int | Totals |
| `matches` | object | Regex matches (from `data/patterns/patterns.txt`) |
**Article object**
```jsonc
{
  "number": "13",                               // normalized article id (e.g., "13", "14Α")
  "title": "Οργανωτικές ρυθμίσεις",
  "body": "…full text…",
  // optional structural context when present:
  "part_letter": "Α", "part_title": "…",        // ΜΕΡΟΣ
  "title_letter": "I", "title_title": "…",      // ΤΙΤΛΟΣ
  "chapter_letter": "1", "chapter_title": "…",  // ΚΕΦΑΛΑΙΟ
  "section_letter": "Α", "section_title": "…"   // ΤΜΗΜΑ
}
```
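Based on this schema, the articles map can be walked directly; a small sketch:

```python
from fek_extractor import extract_pdf_info

record = extract_pdf_info("data/samples/gr-act-2020-4706-4706_2020.pdf")

# Keys are normalized article numbers ("13", "14Α", …); values are article objects.
for number, article in record["articles"].items():
    chapter = article.get("chapter_title")  # optional structural context
    suffix = f"  [{chapter}]" if chapter else ""
    print(f"Άρθρο {number}: {article.get('title', '')}{suffix}")
```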
---
## Technical deep dive
- **Reading order reconstruction**
Rebuilds logical lines from low‑level glyphs, sorts by column then by y/x to maintain human reading order.
- **Two‑column segmentation**
Uses k‑means clustering over x‑coords and gap valley search to find the column gutter; detects and demotes “tail” (full‑width) content below columns.
- **Greek‑aware normalization**
Removes soft hyphens, stitches wrapped words, preserves Greek capitalization/accents conservatively.
- **Header & masthead parsing**
Regex/heuristics for FEK line (series/issue/date), dotted and ISO date, and decision numbers (`Αριθ.`).
- **Article detection & stitching**
Recognizes `Άρθρο N` headings, associates titles/bodies across page boundaries, and builds a robust map.
- **TOC synthesis**
Extracts hierarchical headers (ΜΕΡΟΣ/ΤΙΤΛΟΣ/ΚΕΦΑΛΑΙΟ/ΤΜΗΜΑ) when present.
- **Metrics**
Character/word counts and frequency stats to help diagnose messy PDFs.
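As a feel for the two-column step, here is a toy version of the idea (deliberately simplified; the project's `io/pdf.py` does considerably more): cluster line x-midpoints into two groups with a 1-D two-means pass, then accept the split only if a clear gutter separates them.

```python
from statistics import mean

def find_gutter(x_mids: list[float], min_gap: float = 20.0) -> float | None:
    """Toy 1-D two-means: return the gutter x for a two-column page, else None."""
    c_left, c_right = min(x_mids), max(x_mids)  # centroids seeded at the extremes
    left, right = [], []
    for _ in range(10):  # a few Lloyd iterations suffice in one dimension
        left = [x for x in x_mids if abs(x - c_left) <= abs(x - c_right)]
        right = [x for x in x_mids if abs(x - c_left) > abs(x - c_right)]
        if not left or not right:
            return None  # everything collapsed into one cluster: single column
        c_left, c_right = mean(left), mean(right)
    gap = min(right) - max(left)
    # Only a genuine "valley" between the clusters indicates two real columns.
    return (max(left) + min(right)) / 2 if gap >= min_gap else None
```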
---
## Architecture
```
PDF → glyphs → lines → columns → normalized text
    → header parser → articles parser → {record}
    → (optional) metrics / TOC
    → JSON/CSV writer
```
Key modules (under `src/fek_extractor/`):
- `io/pdf.py` – low‑level extraction, column/tail logic
- `parsing/normalize.py` – de‑hyphenation & cleanup
- `parsing/headers.py` – FEK header parsing
- `parsing/articles.py` – article detection + body stitching
- `metrics.py` – optional stats
- `cli.py` – batch processing, JSON/CSV output
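To illustrate what the normalization stage does (a conservative sketch; `parsing/normalize.py` applies more careful, Greek-aware rules):

```python
import re

SOFT_HYPHEN = "\u00ad"  # discretionary hyphen left over from typesetting

def dehyphenate(text: str) -> str:
    """Toy de-hyphenation: drop soft hyphens, then stitch words wrapped across lines."""
    text = text.replace(SOFT_HYPHEN, "")
    # Merge "word-\ncontinuation" only when both sides are lowercase letters,
    # so genuine hyphenated compounds and headings are left alone.
    return re.sub(r"(?<=[a-zα-ωά-ώ])-\n(?=[a-zα-ωά-ώ])", "", text)

print(dehyphenate("κεφα-\nλαίου"))  # κεφαλαίου
```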
---
## Examples
```bash
# 1) All PDFs under a folder → JSON
fek-extractor -i ./data/samples -o out.json

# 2) Single PDF → CSV
fek-extractor -i ./data/samples/gr-act-2020-4706-4706_2020.pdf -o out.csv -f csv

# 3) Articles only (for a file)
fek-extractor -i ./data/samples/gr-act-2020-4706-4706_2020.pdf --articles-only -o articles.json

# 4) Table of Contents only (for a file)
fek-extractor -i ./data/samples/gr-act-2020-4706-4706_2020.pdf --toc-only -o toc.json

# 5) Process a directory in parallel with 4 workers, include metrics
fek-extractor -i ./data/samples --jobs 4 --include-metrics -o out.json
```
---
## Debug helpers
There is a small debug entrypoint to inspect **column extraction** and **page layout**:
```bash
python -m fek_extractor.debug --pdf data/samples/gr-act-2020-4706-4706_2020.pdf --page 39 --check-order
```
---
## Performance tips
- Prefer running with `--jobs N` on directories to parallelize across files.
- For very large gazettes, keep output as JSON first (CSV is slower with many nested keys).
- Pre‑process PDFs (deskew/OCR) if the source is scanned images.
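If you drive the Python API directly, file-level parallelism similar to `--jobs` can be sketched with the standard library (a sketch; the CLI's own worker pool may be organized differently):

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

from fek_extractor import extract_pdf_info

pdfs = [str(p) for p in sorted(Path("data/samples").glob("*.pdf"))]

# One process per worker; results come back in input order, like `--jobs 4`.
with ProcessPoolExecutor(max_workers=4) as pool:
    records = list(pool.map(extract_pdf_info, pdfs))
```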
---
## Project layout
```
src/fek_extractor/
  __main__.py            # supports `python -m fek_extractor`
  cli.py                 # CLI entrypoint
  core.py                # Orchestration
  io/                    # PDF I/O and exporters
  parsing/               # Normalization & parsing rules (articles, headers, dates, HTML)
  metrics.py             # Basic text metrics
  models.py              # Typed record/contexts
  utils/                 # Logging, HTML cleanup helpers

data/
  patterns/patterns.txt  # Regexes for extra matches
  samples/               # Sample FEK PDF (optional)

tests/                   # Unit/CLI/integration tests
docs/                    # MkDocs starter (optional)
```
---
## Development
```bash
# clone and set up
git clone https://github.com/dmsfiris/fek-extractor.git
cd fek-extractor
python -m venv .venv && source .venv/bin/activate
pip install -U pip
pip install -e ".[dev]"
pre-commit install

# run checks
ruff check .
black --check .
mypy src
pytest -q
```
---
## Contributing
Contributions are welcome! Please open an issue to discuss substantial changes first.
By contributing you agree to license your work under the project’s **Apache‑2.0** license.
---
## License
This project is licensed under **Apache License 2.0**. See [LICENSE](LICENSE).
If you prefer a copyleft model (keeping derivatives open), consider re‑licensing as **GPLv3/AGPLv3** or offering **dual‑licensing** (AGPL for community + commercial license via AspectSoft). See below for guidance.
### Picking a license (quick guide)
- **Max adoption, simple** → MIT or **Apache‑2.0** (Apache adds a patent grant and NOTICE).
- **Keep derivatives open** → **GPLv3** (apps), **AGPLv3** (network services).
- **File‑level copyleft with easier mixing** → **MPL‑2.0**.
- **Source‑available (not OSI)** → Business Source License (BUSL‑1.1), SSPL, Polyform (non‑commercial).
> For a permissive setup that still offers some protection, **Apache‑2.0** is a great default. If you want stronger reciprocity, choose **AGPLv3** or dual‑license.
**How to apply**
1. Add a `LICENSE` file (done).
2. Add a `NOTICE` file (done) and keep third‑party attributions.
3. Optionally add license headers to source files, e.g.:
```python
# Copyright (c) 2025 Your Name
# SPDX-License-Identifier: Apache-2.0
```
---
## Contact
- **Author:** Dimitrios S. Sfyris (AspectSoft)
- **Email:** info@aspectsoft.gr
- **LinkedIn:** https://www.linkedin.com/in/dimitrios-s-sfyris/
- **Get in touch:** If you need bespoke FEK parsing or similar layout‑aware NLP pipelines, reach out.
---
## Acknowledgements
- Built on top of [`pdfminer.six`](https://github.com/pdfminer/pdfminer.six).
- Includes heuristics tuned for FEK / Εφημερίδα της Κυβερνήσεως layouts.