fek-extractor 0.1.0 (PyPI)

Summary: Extract structured data from FEK PDFs.
Uploaded: 2025-09-10 14:34:05
Requires Python: >=3.10
Keywords: FEK, Greek, Government Gazette, NLP, PDF, information extraction
# fek-extractor
[![PyPI version](https://img.shields.io/pypi/v/fek-extractor.svg)](https://pypi.org/project/fek-extractor/)
[![Python](https://img.shields.io/badge/python-3.10%2B-blue.svg)](#requirements)
[![License: Apache-2.0](https://img.shields.io/badge/License-Apache--2.0-green.svg)](LICENSE)
[![CI](https://github.com/dmsfiris/fek-extractor/actions/workflows/tests.yml/badge.svg?branch=master)](https://github.com/dmsfiris/fek-extractor/actions/workflows/tests.yml)
[![code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

Extract structured data from Greek Government Gazette (ΦΕΚ) PDFs.

It turns messy, two‑column government PDFs into machine‑readable **JSON/CSV** with FEK metadata and a clean map of **Άρθρα** (articles).
Built on `pdfminer.six`, with careful two‑column handling, header/footer filtering, Greek‑aware de‑hyphenation, and article detection.

---
## About this project

Greek Government Gazette (ΦΕΚ) documents look uniform at a glance, but their **typesetting and structure are anything but**. Even “clean” digital PDFs hide quirks that trip up generic parsers and off‑the‑shelf AI:

- **Multi‑column reading order** with full‑width “tails”, footers, and boilerplate that disrupt token flow.
- **Title vs. body separation** where headings, subtitles, and continuations interleave across pages.
- **Dense legal cross‑references and amendments**, with nested exceptions and renumbered clauses.
- **Inconsistent numbering and metadata**, plus occasional encoding artifacts and discretionary hyphens.

This project addresses those realities with a **layout‑aware, domain‑specific pipeline** that prioritizes *determinism* and *inspectability*:

- **Layout‑aware text reconstruction** — two‑column segmentation (k‑means + gutter valley), “tail” detection, header/footer filtering, and stable reading order.
- **Article‑structure recovery** — detects `Άρθρο N`, associates titles and bodies across page boundaries, and synthesizes a hierarchical TOC when possible.
- **Greek‑aware normalization** — de‑hyphenates safely (soft/discretionary hyphens, wrapped words) while preserving accents/case.
- **Domain heuristics + light NLP hooks** — FEK masthead parsing (series/issue/date), decision numbers, and simple patterns for subject/Θέμα; extension points for NER and reference extraction.
- **Transparent debugging** — page‑focused debug mode and optional metrics so you can see *why* a page parsed a certain way.

**Who it’s for:** legal‑tech teams, data engineers, and researchers who need **reproducible, explainable** FEK extraction that won’t crumble on edge cases.
**Outcome:** **structured, searchable, dependable** data for automation, analysis, and integration.

If your team needs tailored FEK pipelines or additional NLP components, **[AspectSoft](https://aspectsoft.gr)** can help.

---

## Features

- **FEK-aware text extraction**
  - Two-column segmentation via k-means over x-coordinates with a gutter-valley heuristic.
  - Per-page region classification & demotion — header, footer, column body, full-width tail, noise.
  - “Tail” detection for full-width content (signatures, appendices, tables) below the columns.
  - Header/footer cleanup tuned to FEK mastheads and page furniture.
  - Deterministic reading order (by column → y → x); graceful single-column fallback.

- **Greek de-hyphenation**
  - Removes soft/discretionary hyphens (U+00AD) and stitches wrapped words safely.
  - Preserves accents/case; conservative rules to avoid over-merging.
  - Handles common typography patterns (e.g., hyphen + space breaks).

- **Header parsing**
  - Extracts FEK **series** (Α/Β/… including word→letter normalization), **issue number**, and **date** in both `DD.MM.YYYY` and ISO `YYYY-MM-DD`.
  - Best-effort detection of **decision numbers** (e.g., “Αριθ.”).
  - Tolerant to spacing/diacritic/punctuation variants.

- **Article detection**
  - Recognizes `Άρθρο N` (including letter suffixes like `14Α`) and captures **title + body**.
  - Stitches articles across page boundaries; keeps original and normalized numbering.
  - Produces a structured **articles map** for direct programmatic use.

- **TOC synthesis** *(optional)*
  - Builds a hierarchical TOC where present:
    **ΜΕΡΟΣ → ΤΙΤΛΟΣ → ΚΕΦΑΛΑΙΟ → ΤΜΗΜΑ → Άρθρα**.
  - Emits clean JSON for navigation, QA, or UI rendering.

- **Metrics** *(opt-in via `--include-metrics`)*
  - Lengths & counts (characters, words, lines) and median line length.
  - Top words, character histogram, and pluggable regex matches (e.g., FEK citations, “Θέμα:”).

- **CLI & Python API**
  - CLI: single file or directory recursion, JSON/CSV output, `--jobs N` for parallel processing, focused logging via `--debug [PAGE]`.
  - API: `extract_pdf_info(path, include_metrics=True|False, ...)` returns a ready-to-use record.

- **Typed codebase & tests**
  - Static typing (PEP 561), lint/format (ruff, black), type checks (mypy), and tests (pytest).
  - Clear module boundaries (`io/`, `parsing/`, `metrics/`, `cli.py`, `core.py`).

With this mix, FEK PDFs become consistent, navigable JSON/CSV with reliable metadata and article structure—ready for indexing, analytics, and automation.
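The conservative de-hyphenation described above can be illustrated with a minimal sketch. This is not the project's actual implementation (the real rules are more nuanced); it only shows the two moves named in the feature list, removing soft hyphens and stitching wrapped words:

```python
import re

SOFT_HYPHEN = "\u00ad"

def dehyphenate(text: str) -> str:
    """Remove soft hyphens and stitch words wrapped across line breaks.

    Conservative: only joins when both sides of the break look like
    lowercase Greek/Latin word fragments, to avoid over-merging.
    """
    # Drop discretionary (soft) hyphens outright.
    text = text.replace(SOFT_HYPHEN, "")
    # Join "παράδειγ-\nμα" → "παράδειγμα"; leaves digits and ranges alone.
    return re.sub(
        r"([a-zα-ωάέήίόύώϊϋΐΰ])-\n([a-zα-ωάέήίόύώϊϋΐΰ])", r"\1\2", text
    )
```

Tokens that merely contain a hyphen (dates, ranges, codes) pass through unchanged, which is the "conservative rules to avoid over-merging" behavior mentioned above.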


> ✨ Sample PDF for testing ships in `data/samples/gr-act-2020-4706-4706_2020.pdf`.

---

## Table of contents

- [Demo \& screenshots](#demo--screenshots)
- [Requirements](#requirements)
- [Install](#install)
- [Quickstart](#quickstart)
- [CLI usage](#cli-usage)
- [Python API](#python-api)
- [Output schema](#output-schema)
- [Technical deep dive](#technical-deep-dive)
- [Architecture](#architecture)
- [Examples](#examples)
- [Debug helpers](#debug-helpers)
- [Performance tips](#performance-tips)
- [Project layout](#project-layout)
- [Development](#development)
- [Contributing](#contributing)
- [License](#license)
- [Contact](#contact)

---

## Demo & screenshots

![FEK extractor — debug view (4514/2018, page 12)](docs/assets/4514-debug-page-12.jpg)

---

## Requirements

- **Python 3.10+**
- **OS:** Linux, macOS, or Windows
- **Runtime dependency:** [`pdfminer.six`](https://github.com/pdfminer/pdfminer.six)

---

## Install

### From PyPI

```bash
pip install fek-extractor
```

### From source (editable)

```bash
git clone https://github.com/dmsfiris/fek-extractor.git
cd fek-extractor
python -m venv .venv
source .venv/bin/activate # Windows: .\.venv\Scripts\Activate.ps1
pip install -U pip
pip install -e . # library + CLI
# or full dev setup
pip install -e ".[dev]"
pre-commit install
```

### With pipx (isolated CLI)

```bash
pipx install fek-extractor
```

### Docker (no local Python needed)

```bash
docker run --rm -v "$PWD:/work" -w /work python:3.11-slim bash -lc "pip install fek-extractor && fek-extractor -i data/samples -o out.json"
```

---

## Quickstart

```bash
# JSON (default)
fek-extractor -i data/samples -o out.json -f json

# CSV
fek-extractor -i data/samples -o out.csv -f csv

# As a module (equivalent to the CLI)
python -m fek_extractor -i data/samples -o out.json
```

---

## CLI usage

```
usage: fek-extractor [-h] --input INPUT [--out OUT] [--format {json,csv}]
 [--no-recursive] [--debug [PAGE]] [--jobs JOBS]
 [--include-metrics] [--articles-only] [--toc-only]

Extract structured info from FEK/Greek-law PDFs.
```

**Options**

- `-i, --input PATH` (required) — PDF *file* or *directory*.
- `-o, --out PATH` (default: `out.json`) — Output path.
- `-f, --format {json,csv}` (default: `json`) — Output format.
- `--no-recursive` — When `--input` is a directory, do **not** recurse.
- `--debug [PAGE]` — Enable debug logging; optionally pass a **page number**
 (e.g. `--debug 39`) to focus per‑page debug.
- `--jobs JOBS` — Parallel workers when input is a **folder** (default 1).
- `--include-metrics` — Add metrics into each record (see below).
- `--articles-only` — Emit **only** the articles map as JSON (ignores `-f csv`).
- `--toc-only` — Emit **only** the synthesized Table of Contents as JSON.

---

## Python API

```python
from fek_extractor import extract_pdf_info

# Single PDF → record (dict)
record = extract_pdf_info("data/samples/gr-act-2020-4706-4706_2020.pdf", include_metrics=True)
print(record["filename"], record["pages"], record["articles_count"])

# Optional kwargs (subject to change):
# debug=True
# debug_pages=[39] # focus page(s) for diagnostics
# dehyphenate=True # on by default
```

**Return type**: `dict[str, Any]` with the fields shown in [Output schema](#output-schema).
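The keys of the `articles` map come from heading detection along these lines. The pattern below is a simplified, hypothetical sketch (the real detector is more tolerant of spacing and typography), but it shows how letter suffixes like `14Α` survive into the normalized id:

```python
import re

# Simplified sketch of `Άρθρο N` heading detection; a trailing Greek
# capital letter (e.g. «14Α») is kept as part of the article id.
ARTICLE_HEADING = re.compile(r"^Άρθρο\s+(?P<num>\d+[Α-Ω]?)\s*$", re.MULTILINE)

def find_article_numbers(text: str) -> list[str]:
    """Return article ids in the order their headings appear."""
    return [m.group("num") for m in ARTICLE_HEADING.finditer(text)]
```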

---

## Output schema

Each **record** (per PDF) typically contains:

| Field | Type | Notes |
|------|------|------|
| `path` | string | Absolute or relative input path |
| `filename` | string | File name only |
| `pages` | int | Page count |
| `fek_series` | string? | Single Greek letter (e.g. `Α`) if detected |
| `fek_number` | string? | Issue number if detected |
| `fek_date` | string? | Dotted date `DD.MM.YYYY` |
| `fek_date_iso` | string? | ISO date `YYYY-MM-DD` |
| `decision_number` | string? | From “Αριθ.” if found |
| `subject` | string? | Document subject/Θέμα (best‑effort) |
| `articles` | object | Map of **article number → article object** |
| `articles_count` | int | Convenience total |
| `first_5_lines` | array | First few text lines (debugging aid) |
| **Metrics** *(only when `--include-metrics`)* |||
| `length` | int | Characters in raw text |
| `num_lines` | int | Number of lines |
| `median_line_length` | int | Median non‑empty line length |
| `char_counts` | object | Char → count |
| `word_counts_top` | object | Top words |
| `chars`, `words` | int | Totals |
| `matches` | object | Regex matches (from `data/patterns/patterns.txt`) |

**Article object**

```jsonc
{
 "number": "13", // normalized article id (e.g., "13", "14Α")
 "title": "Οργανωτικές ρυθμίσεις",
 "body": "…full text…",
 // optional structural context when present:
 "part_letter": "Α", "part_title": "…", // ΜΕΡΟΣ
 "title_letter": "I", "title_title": "…", // ΤΙΤΛΟΣ
 "chapter_letter": "1", "chapter_title": "…", // ΚΕΦΑΛΑΙΟ
 "section_letter": "Α", "section_title": "…" // ΤΜΗΜΑ
}
```
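The articles map can be consumed directly. An illustrative helper (assuming the field names above), which also shows one way to order normalized ids so that a letter suffix like `14Α` sorts after `14`:

```python
import re

def article_summaries(record: dict) -> list[str]:
    """List articles in numeric order; «14Α» sorts after «14»."""
    def order(num: str) -> tuple[int, str]:
        # Leading digits give the primary sort key; the full id breaks ties.
        return (int(re.match(r"\d+", num).group()), num)

    items = sorted(record.get("articles", {}).items(), key=lambda kv: order(kv[0]))
    return [f"Άρθρο {num}: {art.get('title', '')}" for num, art in items]
```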

---

## Technical deep dive

- **Reading order reconstruction**
 Rebuilds logical lines from low‑level glyphs, sorts by column then by y/x to maintain human reading order.
- **Two‑column segmentation**
 Uses k‑means clustering over x‑coords and gap valley search to find the column gutter; detects and demotes “tail” (full‑width) content below columns.
- **Greek‑aware normalization**
 Removes soft hyphens, stitches wrapped words, preserves Greek capitalization/accents conservatively.
- **Header & masthead parsing**
 Regex/heuristics for FEK line (series/issue/date), dotted and ISO date, and decision numbers (`Αριθ.`).
- **Article detection & stitching**
 Recognizes `Άρθρο N` headings, associates titles/bodies across page boundaries, and builds a robust map.
- **TOC synthesis**
 Extracts hierarchical headers (ΜΕΡΟΣ/ΤΙΤΛΟΣ/ΚΕΦΑΛΑΙΟ/ΤΜΗΜΑ) when present.
- **Metrics**
 Character/word counts and frequency stats to help diagnose messy PDFs.
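The column-splitting step can be sketched as a toy 1-D two-means over line x-centers. This is illustrative only; the real pipeline adds the gutter-valley check and falls back to single-column when no clear gutter exists:

```python
def split_columns(x_centers: list[float], iters: int = 20) -> float:
    """Toy 1-D two-means over line x-centers; returns an x position
    between the two column clusters (the gutter estimate)."""
    left, right = min(x_centers), max(x_centers)
    for _ in range(iters):
        mid = (left + right) / 2
        lefts = [x for x in x_centers if x <= mid]
        rights = [x for x in x_centers if x > mid]
        if not lefts or not rights:
            break  # degenerate page: treat as single-column
        # Move each centroid to the mean of its assigned points.
        left = sum(lefts) / len(lefts)
        right = sum(rights) / len(rights)
    return (left + right) / 2
```

Lines whose x-center falls left of the returned value belong to the first column; sorting within each column by y then x yields the deterministic reading order described above.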

---

## Architecture

```
PDF → glyphs → lines → columns → normalized text
 → header parser → articles parser → {record}
 → (optional) metrics / TOC
 → JSON/CSV writer
```

Key modules (under `src/fek_extractor/`):

- `io/pdf.py` – low‑level extraction, column/tail logic
- `parsing/normalize.py` – de‑hyphenation & cleanup
- `parsing/headers.py` – FEK header parsing
- `parsing/articles.py` – article detection + body stitching
- `metrics.py` – optional stats
- `cli.py` – batch processing, JSON/CSV output

---

## Examples

```bash
# 1) All PDFs under a folder → JSON
fek-extractor -i ./data/samples -o out.json

# 2) Single PDF → CSV
fek-extractor -i ./data/samples/gr-act-2020-4706-4706_2020.pdf -o out.csv -f csv

# 3) Articles only (for a file)
fek-extractor -i ./data/samples/gr-act-2020-4706-4706_2020.pdf --articles-only -o articles.json

# 4) Table of Contents only (for a file)
fek-extractor -i ./data/samples/gr-act-2020-4706-4706_2020.pdf --toc-only -o toc.json

# 5) Process a directory in parallel with 4 workers, include metrics
fek-extractor -i ./data/samples --jobs 4 --include-metrics -o out.json
```

---

## Debug helpers

There is a small debug entrypoint to inspect **column extraction** and **page layout**:

```bash
python -m fek_extractor.debug --pdf data/samples/gr-act-2020-4706-4706_2020.pdf --page 39 --check-order
```

---

## Performance tips

- Prefer running with `--jobs N` on directories to parallelize across files.
- For very large gazettes, keep output as JSON first (CSV is slower with many nested keys).
- Pre‑process PDFs (deskew/OCR) if the source is scanned images.

---

## Project layout

```
src/fek_extractor/
 __main__.py # supports `python -m fek_extractor`
 cli.py # CLI entrypoint
 core.py # Orchestration
 io/ # PDF I/O and exporters
 parsing/ # Normalization & parsing rules (articles, headers, dates, HTML)
 metrics.py # Basic text metrics
 models.py # Typed record/contexts
 utils/ # Logging, HTML cleanup helpers

data/
 patterns/patterns.txt # Regexes for extra matches
 samples/ # Sample FEK PDF (optional)

tests/ # Unit/CLI/integration tests
docs/ # MkDocs starter (optional)
```

---

## Development

```bash
# clone and set up
git clone https://github.com/dmsfiris/fek-extractor.git
cd fek-extractor
python -m venv .venv && source .venv/bin/activate
pip install -U pip
pip install -e ".[dev]"
pre-commit install

# run checks
ruff check .
black --check .
mypy src
pytest -q
```

---

## Contributing

Contributions are welcome! Please open an issue to discuss substantial changes first.
By contributing you agree to license your work under the project’s **Apache‑2.0** license.

---

## License

This project is licensed under **Apache License 2.0**. See [LICENSE](LICENSE).
If you prefer a copyleft model (keeping derivatives open), consider re‑licensing as **GPLv3/AGPLv3** or offering **dual‑licensing** (AGPL for community + commercial license via AspectSoft). See below for guidance.

### Picking a license (quick guide)

- **Max adoption, simple** → MIT or **Apache‑2.0** (Apache adds a patent grant and NOTICE).
- **Keep derivatives open** → **GPLv3** (apps), **AGPLv3** (network services).
- **File‑level copyleft with easier mixing** → **MPL‑2.0**.
- **Source‑available (not OSI)** → Business Source License (BUSL‑1.1), SSPL, Polyform (non‑commercial).

> For a project that still offers some protection, **Apache‑2.0** is a great default. If you want stronger reciprocity, choose **AGPLv3** or dual‑license.

**How to apply**

1. Add a `LICENSE` file (done).
2. Add a `NOTICE` file (done) and keep third‑party attributions.
3. Optionally add license headers to source files, e.g.:

```python
# Copyright (c) 2025 Your Name
# SPDX-License-Identifier: Apache-2.0
```

---

## Contact

- **Author:** Dimitrios S. Sfyris (AspectSoft)
- **Email:** info@aspectsoft.gr
- **LinkedIn:** https://www.linkedin.com/in/dimitrios-s-sfyris/
- **Get in touch:** If you need bespoke FEK parsing or similar layout‑aware NLP pipelines, reach out.

---

## Acknowledgements

- Built on top of [`pdfminer.six`](https://github.com/pdfminer/pdfminer.six).
- Includes heuristics tuned for FEK / Εφημερίδα της Κυβερνήσεως layouts.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "fek-extractor",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "FEK, Greek, Government Gazette, NLP, PDF, information extraction",
    "author": null,
    "author_email": "\"Dimitrios S. Sfyris\" <info@aspectsoft.gr>",
    "download_url": "https://files.pythonhosted.org/packages/f5/57/ee6785c13f7e6d1ea06431dcaeba40c154400f0ce2a5f857f6dd8039ce14/fek_extractor-0.1.0.tar.gz",
    "platform": null,
    "description": "# fek-extractor\r\n[![PyPI version](https://img.shields.io/pypi/v/fek-extractor.svg)](https://pypi.org/project/fek-extractor/)\r\n[![Python](https://img.shields.io/badge/python-3.10%2B-blue.svg)](#-requirements)\r\n[![License: Apache-2.0](https://img.shields.io/badge/License-Apache--2.0-green.svg)](LICENSE)\r\n[![CI](https://github.com/dmsfiris/fek-extractor/actions/workflows/tests.yml/badge.svg?branch=master)](https://github.com/dmsfiris/fek-extractor/actions/workflows/tests.yml)\r\n[![code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\r\n\r\nExtract structured data from Greek Government Gazette (\u03a6\u0395\u039a) PDFs.\r\n\r\nIt turns messy, two\u2011column government PDFs into machine\u2011readable **JSON/CSV** with FEK metadata and a clean map of **\u0386\u03c1\u03b8\u03c1\u03b1** (articles).\r\nBuilt on `pdfminer.six`, with careful two\u2011column handling, header/footer filtering, Greek\u2011aware de\u2011hyphenation, and article detection.\r\n\r\n---\r\n## About this project\r\n\r\nGreek Government Gazette (\u03a6\u0395\u039a) documents look uniform at a glance, but their **typesetting and structure are anything but**. Even \u201cclean\u201d digital PDFs hide quirks that trip up generic parsers and off\u2011the\u2011shelf AI:\r\n\r\n- **Multi\u2011column reading order** with full\u2011width \u201ctails\u201d, footers, and boilerplate that disrupt token flow.\r\n- **Title vs. 
body separation** where headings, subtitles, and continuations interleave across pages.\r\n- **Dense legal cross\u2011references and amendments**, with nested exceptions and renumbered clauses.\r\n- **Inconsistent numbering and metadata**, plus occasional encoding artifacts and discretionary hyphens.\r\n\r\nThis project addresses those realities with a **layout\u2011aware, domain\u2011specific pipeline** that prioritizes *determinism* and *inspectability*:\r\n\r\n- **Layout\u2011aware text reconstruction** \u2014 two\u2011column segmentation (k\u2011means + gutter valley), \u201ctail\u201d detection, header/footer filtering, and stable reading order.\r\n- **Article\u2011structure recovery** \u2014 detects `\u0386\u03c1\u03b8\u03c1\u03bf N`, associates titles and bodies across page boundaries, and synthesizes a hierarchical TOC when possible.\r\n- **Greek\u2011aware normalization** \u2014 de\u2011hyphenates safely (soft/discretionary hyphens, wrapped words) while preserving accents/case.\r\n- **Domain heuristics + light NLP hooks** \u2014 FEK masthead parsing (series/issue/date), decision numbers, and simple patterns for subject/\u0398\u03ad\u03bc\u03b1; extension points for NER and reference extraction.\r\n- **Transparent debugging** \u2014 page\u2011focused debug mode and optional metrics so you can see *why* a page parsed a certain way.\r\n\r\n**Who it\u2019s for:** legal\u2011tech teams, data engineers, and researchers who need **reproducible, explainable** FEK extraction that won\u2019t crumble on edge cases.\r\n**Outcome:** **structured, searchable, dependable** data for automation, analysis, and integration.\r\n\r\nIf your team needs tailored FEK pipelines or additional NLP components, **[AspectSoft](https://aspectsoft.gr)** can help.\r\n\r\n---\r\n\r\n## Features\r\n\r\n- **FEK-aware text extraction**\r\n  - Two-column segmentation via k-means over x-coordinates with a gutter-valley heuristic.\r\n  - Per-page region classification & demotion \u2014 header, 
footer, column body, full-width tail, noise.\r\n  - \u201cTail\u201d detection for full-width content (signatures, appendices, tables) below the columns.\r\n  - Header/footer cleanup tuned to FEK mastheads and page furniture.\r\n  - Deterministic reading order (by column \u2192 y \u2192 x); graceful single-column fallback.\r\n\r\n- **Greek de-hyphenation**\r\n  - Removes soft/discretionary hyphens (U+00AD) and stitches wrapped words safely.\r\n  - Preserves accents/case; conservative rules to avoid over-merging.\r\n  - Handles common typography patterns (e.g., hyphen + space breaks).\r\n\r\n- **Header parsing**\r\n  - Extracts FEK **series** (\u0391/\u0392/\u2026 including word\u2192letter normalization), **issue number**, and **date** in both `DD.MM.YYYY` and ISO `YYYY-MM-DD`.\r\n  - Best-effort detection of **decision numbers** (e.g., \u201c\u0391\u03c1\u03b9\u03b8.\u201d).\r\n  - Tolerant to spacing/diacritic/punctuation variants.\r\n\r\n- **Article detection**\r\n  - Recognizes `\u0386\u03c1\u03b8\u03c1\u03bf N` (including letter suffixes like `14\u0391`) and captures **title + body**.\r\n  - Stitches articles across page boundaries; keeps original and normalized numbering.\r\n  - Produces a structured **articles map** for direct programmatic use.\r\n\r\n- **TOC synthesis** *(optional)*\r\n  - Builds a hierarchical TOC where present:\r\n    **\u039c\u0395\u03a1\u039f\u03a3 \u2192 \u03a4\u0399\u03a4\u039b\u039f\u03a3 \u2192 \u039a\u0395\u03a6\u0391\u039b\u0391\u0399\u039f \u2192 \u03a4\u039c\u0397\u039c\u0391 \u2192 \u0386\u03c1\u03b8\u03c1\u03b1**.\r\n  - Emits clean JSON for navigation, QA, or UI rendering.\r\n\r\n- **Metrics** *(opt-in via `--include-metrics`)*\r\n  - Lengths & counts (characters, words, lines) and median line length.\r\n  - Top words, character histogram, and pluggable regex matches (e.g., FEK citations, \u201c\u0398\u03ad\u03bc\u03b1:\u201d).\r\n\r\n- **CLI & Python API**\r\n  - CLI: single file or directory recursion, JSON/CSV output, 
`--jobs N` for parallel processing, focused logging via `--debug [PAGE]`.\r\n  - API: `extract_pdf_info(path, include_metrics=True|False, ...)` returns a ready-to-use record.\r\n\r\n- **Typed codebase & tests**\r\n  - Static typing (PEP 561), lint/format (ruff, black), type checks (mypy), and tests (pytest).\r\n  - Clear module boundaries (`io/`, `parsing/`, `metrics/`, `cli.py`, `core.py`).\r\n\r\nWith this mix, FEK PDFs become consistent, navigable JSON/CSV with reliable metadata and article structure\u2014ready for indexing, analytics, and automation.\r\n\r\n\r\n> \u2728 Sample PDF for testing ships in `data/samples/gr-act-2020-4706-4706_2020.pdf`.\r\n\r\n---\r\n\r\n## Table of contents\r\n\r\n- [Demo \\& screenshots](#demo--screenshots)\r\n- [Requirements](#requirements)\r\n- [Install](#install)\r\n- [Quickstart](#quickstart)\r\n- [CLI usage](#cli-usage)\r\n- [Python API](#python-api)\r\n- [Output schema](#output-schema)\r\n- [Technical deep dive](#technical-deep-dive)\r\n- [Architecture](#architecture)\r\n- [Examples](#examples)\r\n- [Debug helpers](#debug-helpers)\r\n- [Performance tips](#performance-tips)\r\n- [Project layout](#project-layout)\r\n- [Development](#development)\r\n- [Contributing](#contributing)\r\n- [License](#license)\r\n- [Contact](#contact)\r\n\r\n---\r\n\r\n## Demo & screenshots\r\n\r\n![FEK extractor \u2014 debug view (4514/2018, page 12)](docs/assets/4514-debug-page-12.jpg)\r\n\r\n---\r\n\r\n## Requirements\r\n\r\n- **Python 3.10+**\r\n- **OS:** Linux, macOS, or Windows\r\n- **Runtime dependency:** [`pdfminer.six`](https://github.com/pdfminer/pdfminer.six)\r\n\r\n---\r\n\r\n## Install\r\n\r\n### From PyPI\r\n\r\n```bash\r\npip install fek-extractor\r\n```\r\n\r\n### From source (editable)\r\n\r\n```bash\r\ngit clone https://github.com/dmsfiris/fek-extractor.git\r\ncd fek-extractor\r\npython -m venv .venv\r\nsource .venv/bin/activate # Windows: .\\.venv\\Scripts\\Activate.ps1\r\npip install -U pip\r\npip install -e . 
# library + CLI\r\n# or full dev setup\r\npip install -e \".[dev]\"\r\npre-commit install\r\n```\r\n\r\n### With pipx (isolated CLI)\r\n\r\n```bash\r\npipx install fek-extractor\r\n```\r\n\r\n### Docker (no local Python needed)\r\n\r\n```bash\r\ndocker run --rm -v \"$PWD:/work\" -w /work python:3.11-slim bash -lc \"pip install fek-extractor && fek-extractor -i data/samples -o out.json\"\r\n```\r\n\r\n---\r\n\r\n## Quickstart\r\n\r\n```bash\r\n# JSON (default)\r\nfek-extractor -i data/samples -o out.json -f json\r\n\r\n# CSV\r\nfek-extractor -i data/samples -o out.csv -f csv\r\n\r\n# As a module (equivalent to the CLI)\r\npython -m fek_extractor -i data/samples -o out.json\r\n```\r\n\r\n---\r\n\r\n## CLI usage\r\n\r\n```\r\nusage: fek-extractor [-h] --input INPUT [--out OUT] [--format {json,csv}]\r\n [--no-recursive] [--debug [PAGE]] [--jobs JOBS]\r\n [--include-metrics] [--articles-only] [--toc-only]\r\n\r\nExtract structured info from FEK/Greek-law PDFs.\r\n```\r\n\r\n**Options**\r\n\r\n- `-i, --input PATH` (required) \u2014 PDF *file* or *directory*.\r\n- `-o, --out PATH` (default: `out.json`) \u2014 Output path.\r\n- `-f, --format {json,csv}` (default: `json`) \u2014 Output format.\r\n- `--no-recursive` \u2014 When `--input` is a directory, do **not** recurse.\r\n- `--debug [PAGE]` \u2014 Enable debug logging; optionally pass a **page number**\r\n (e.g. 
`--debug 39`) to focus per\u2011page debug.\r\n- `--jobs JOBS` \u2014 Parallel workers when input is a **folder** (default 1).\r\n- `--include-metrics` \u2014 Add metrics into each record (see below).\r\n- `--articles-only` \u2014 Emit **only** the articles map as JSON (ignores `-f csv`).\r\n- `--toc-only` \u2014 Emit **only** the synthesized Table of Contents as JSON.\r\n\r\n---\r\n\r\n## Python API\r\n\r\n```python\r\nfrom fek_extractor import extract_pdf_info\r\n\r\n# Single PDF \u2192 record (dict)\r\nrecord = extract_pdf_info(\"data/samples/gr-act-2020-4706-4706_2020.pdf\", include_metrics=True)\r\nprint(record[\"filename\"], record[\"pages\"], record[\"articles_count\"])\r\n\r\n# Optional kwargs (subject to change):\r\n# debug=True\r\n# debug_pages=[39] # focus page(s) for diagnostics\r\n# dehyphenate=True # on by default\r\n```\r\n\r\n**Return type**: `dict[str, Any]` with the fields shown in [Output schema](#-output-schema).\r\n\r\n---\r\n\r\n## Output schema\r\n\r\nEach **record** (per PDF) typically contains:\r\n\r\n| Field | Type | Notes |\r\n|------|------|------|\r\n| `path` | string | Absolute or relative input path |\r\n| `filename` | string | File name only |\r\n| `pages` | int | Page count |\r\n| `fek_series` | string? | Single Greek letter (e.g. `\u0391`) if detected |\r\n| `fek_number` | string? | Issue number if detected |\r\n| `fek_date` | string? | Dotted date `DD.MM.YYYY` |\r\n| `fek_date_iso` | string? | ISO date `YYYY-MM-DD` |\r\n| `decision_number` | string? | From \u201c\u0391\u03c1\u03b9\u03b8.\u201d if found |\r\n| `subject` | string? 
| Document subject/\u0398\u03ad\u03bc\u03b1 (best\u2011effort) |\r\n| `articles` | object | Map of **article number \u2192 article object** |\r\n| `articles_count` | int | Convenience total |\r\n| `first_5_lines` | array | First few text lines (debugging aid) |\r\n| **Metrics** *(only when `--include-metrics`)* |||\r\n| `length` | int | Characters in raw text |\r\n| `num_lines` | int | Number of lines |\r\n| `median_line_length` | int | Median non\u2011empty line length |\r\n| `char_counts` | object | Char \u2192 count |\r\n| `word_counts_top` | object | Top words |\r\n| `chars`, `words` | int | Totals |\r\n| `matches` | object | Regex matches (from `data/patterns/patterns.txt`) |\r\n\r\n**Article object**\r\n\r\n```jsonc\r\n{\r\n \"number\": \"13\", // normalized article id (e.g., \"13\", \"14\u0391\")\r\n \"title\": \"\u039f\u03c1\u03b3\u03b1\u03bd\u03c9\u03c4\u03b9\u03ba\u03ad\u03c2 \u03c1\u03c5\u03b8\u03bc\u03af\u03c3\u03b5\u03b9\u03c2\",\r\n \"body\": \"\u2026full text\u2026\",\r\n // optional structural context when present:\r\n \"part_letter\": \"\u0391\", \"part_title\": \"\u2026\", // \u039c\u0395\u03a1\u039f\u03a3\r\n \"title_letter\": \"I\", \"title_title\": \"\u2026\", // \u03a4\u0399\u03a4\u039b\u039f\u03a3\r\n \"chapter_letter\": \"1\", \"chapter_title\": \"\u2026\", // \u039a\u0395\u03a6\u0391\u039b\u0391\u0399\u039f\r\n \"section_letter\": \"\u0391\", \"section_title\": \"\u2026\" // \u03a4\u039c\u0397\u039c\u0391\r\n}\r\n```\r\n\r\n---\r\n\r\n## Technical deep dive\r\n\r\n- **Reading order reconstruction**\r\n Rebuilds logical lines from low\u2011level glyphs, sorts by column then by y/x to maintain human reading order.\r\n- **Two\u2011column segmentation**\r\n Uses k\u2011means clustering over x\u2011coords and gap valley search to find the column gutter; detects and demotes \u201ctail\u201d (full\u2011width) content below columns.\r\n- **Greek\u2011aware normalization**\r\n Removes soft hyphens, stitches wrapped words, preserves Greek 
capitalization/accents conservatively.\r\n- **Header & masthead parsing**\r\n Regex/heuristics for FEK line (series/issue/date), dotted and ISO date, and decision numbers (`\u0391\u03c1\u03b9\u03b8.`).\r\n- **Article detection & stitching**\r\n Recognizes `\u0386\u03c1\u03b8\u03c1\u03bf N` headings, associates titles/bodies across page boundaries, and builds a robust map.\r\n- **TOC synthesis**\r\n Extracts hierarchical headers (\u039c\u0395\u03a1\u039f\u03a3/\u03a4\u0399\u03a4\u039b\u039f\u03a3/\u039a\u0395\u03a6\u0391\u039b\u0391\u0399\u039f/\u03a4\u039c\u0397\u039c\u0391) when present.\r\n- **Metrics**\r\n Character/word counts and frequency stats to help diagnose messy PDFs.\r\n\r\n---\r\n\r\n## Architecture\r\n\r\n```\r\nPDF \u2192 glyphs \u2192 lines \u2192 columns \u2192 normalized text\r\n \u2192 header parser \u2192 articles parser \u2192 {record}\r\n \u2192 (optional) metrics / TOC\r\n \u2192 JSON/CSV writer\r\n```\r\n\r\nKey modules (under `src/fek_extractor/`):\r\n\r\n- `io/pdf.py` \u2013 low\u2011level extraction, column/tail logic\r\n- `parsing/normalize.py` \u2013 de\u2011hyphenation & cleanup\r\n- `parsing/headers.py` \u2013 FEK header parsing\r\n- `parsing/articles.py` \u2013 article detection + body stitching\r\n- `metrics.py` \u2013 optional stats\r\n- `cli.py` \u2013 batch processing, JSON/CSV output\r\n\r\n---\r\n\r\n## Examples\r\n\r\n```bash\r\n# 1) All PDFs under a folder \u2192 JSON\r\nfek-extractor -i ./data/samples -o out.json\r\n\r\n# 2) Single PDF \u2192 CSV\r\nfek-extractor -i ./data/samples/gr-act-2020-4706-4706_2020.pdf -o out.csv -f csv\r\n\r\n# 3) Articles only (for a file)\r\nfek-extractor -i ./data/samples/gr-act-2020-4706-4706_2020.pdf --articles-only -o articles.json\r\n\r\n# 4) Table of Contents only (for a file)\r\nfek-extractor -i ./data/samples/gr-act-2020-4706-4706_2020.pdf --toc-only -o toc.json\r\n\r\n# 5) Process a directory in parallel with 4 workers, include metrics\r\nfek-extractor -i ./data/samples --jobs 4 
--include-metrics -o out.json\r\n```\r\n\r\n---\r\n\r\n## Debug helpers\r\n\r\nThere is a small debug entrypoint to inspect **column extraction** and **page layout**:\r\n\r\n```bash\r\npython -m fek_extractor.debug --pdf data/samples/gr-act-2020-4706-4706_2020.pdf --page 39 --check-order\r\n```\r\n\r\n---\r\n\r\n## Performance tips\r\n\r\n- Prefer running with `--jobs N` on directories to parallelize across files.\r\n- For very large gazettes, keep output as JSON first (CSV is slower with many nested keys).\r\n- Pre\u2011process PDFs (deskew/OCR) if the source is scanned images.\r\n\r\n---\r\n\r\n## Project layout\r\n\r\n```\r\nsrc/fek_extractor/\r\n __main__.py # supports `python -m fek_extractor`\r\n cli.py # CLI entrypoint\r\n core.py # Orchestration\r\n io/ # PDF I/O and exporters\r\n parsing/ # Normalization & parsing rules (articles, headers, dates, HTML)\r\n metrics.py # Basic text metrics\r\n models.py # Typed record/contexts\r\n utils/ # Logging, HTML cleanup helpers\r\n\r\ndata/\r\n patterns/patterns.txt # Regexes for extra matches\r\n samples/ # Sample FEK PDF (optional)\r\n\r\ntests/ # Unit/CLI/integration tests\r\ndocs/ # MkDocs starter (optional)\r\n```\r\n\r\n---\r\n\r\n## Development\r\n\r\n```bash\r\n# clone and set up\r\ngit clone https://github.com/dmsfiris/fek-extractor.git\r\ncd fek-extractor\r\npython -m venv .venv && source .venv/bin/activate\r\npip install -U pip\r\npip install -e \".[dev]\"\r\npre-commit install\r\n\r\n# run checks\r\nruff check .\r\nblack --check .\r\nmypy src\r\npytest -q\r\n```\r\n\r\n---\r\n\r\n## Contributing\r\n\r\nContributions are welcome! Please open an issue to discuss substantial changes first.\r\nBy contributing you agree to license your work under the project\u2019s **Apache\u20112.0** license.\r\n\r\n---\r\n\r\n## License\r\n\r\nThis project is licensed under **Apache License 2.0**. 
See [LICENSE](LICENSE).
If you prefer a copyleft model (keeping derivatives open), consider re-licensing as **GPLv3/AGPLv3** or offering **dual-licensing** (AGPL for community + commercial license via AspectSoft). See below for guidance.

### Picking a license (quick guide)

- **Max adoption, simple** → MIT or **Apache-2.0** (Apache adds a patent grant and NOTICE).
- **Keep derivatives open** → **GPLv3** (apps), **AGPLv3** (network services).
- **File-level copyleft with easier mixing** → **MPL-2.0**.
- **Source-available (not OSI)** → Business Source License (BUSL-1.1), SSPL, Polyform (non-commercial).

> For a project that still offers some protection, **Apache-2.0** is a great default. If you want stronger reciprocity, choose **AGPLv3** or dual-license.

**How to apply**

1. Add a `LICENSE` file (done).
2. Add a `NOTICE` file (done) and keep third-party attributions.
3. Optionally add license headers to source files, e.g.:

```python
# Copyright (c) 2025 Your Name
# SPDX-License-Identifier: Apache-2.0
```

---

## Contact

- **Author:** Dimitrios S. Sfyris (AspectSoft)
- **Email:** info@aspectsoft.gr
- **LinkedIn:** https://www.linkedin.com/in/dimitrios-s-sfyris/
- **Get in touch:** If you need bespoke FEK parsing or similar layout-aware NLP pipelines, reach out.

---

## Acknowledgements

- Built on top of [`pdfminer.six`](https://github.com/pdfminer/pdfminer.six).
- Includes heuristics tuned for FEK / Εφημερίδα της Κυβερνήσεως layouts.
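
---

## Appendix: article-heading detection, sketched

The `Άρθρο N` recognition mentioned under Features can be illustrated with a minimal regex sketch. This is **not** the library's implementation: `ARTICLE_HEADING` and `find_article_numbers` are invented names for this example, and the real logic in `parsing/articles.py` also stitches titles and bodies across page boundaries.

```python
import re

# Match a line consisting only of an `Άρθρο N` heading (illustrative only).
ARTICLE_HEADING = re.compile(r"^\s*Άρθρο\s+(\d+)\s*$", re.MULTILINE)

def find_article_numbers(text: str) -> list[int]:
    """Return the article numbers of lines that look like `Άρθρο N` headings."""
    return [int(m.group(1)) for m in ARTICLE_HEADING.finditer(text)]

sample = "Άρθρο 1\nΠεδίο εφαρμογής...\nΆρθρο 2\nΟρισμοί..."
print(find_article_numbers(sample))  # [1, 2]
```

A single pattern like this is too brittle for real gazettes, which contain encoding artifacts and headings whose titles continue on the following line — hence the layered heuristics the library uses instead.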