omega-omg


- Name: omega-omg
- Version: 0.2.1
- Summary: Omega Object Matching Grammar (OmegaOMG): DSL and engine for high-performance object/entity matching.
- Upload time: 2025-08-09 01:23:36
- Author: OmegaOMG Authors
- Requires Python: >=3.9
- Keywords: dsl, matching, entity-resolution, nlp, regex, omega
- Requirements: lark, omega_match

# OmegaOMG: Omega Object Matching Grammar

<p align="center">
   <img src="https://raw.githubusercontent.com/scholarsmate/omega-omg/main/images/icon.png" alt="OmegaOMG Logo" width="180" />
</p>

<p align="center">
   <a href="https://github.com/scholarsmate/omega-omg/actions/workflows/ci.yml">
      <img alt="CI" src="https://github.com/scholarsmate/omega-omg/actions/workflows/ci.yml/badge.svg" />
   </a>
   <a href="https://codecov.io/gh/scholarsmate/omega-omg">
      <img alt="Coverage" src="https://codecov.io/gh/scholarsmate/omega-omg/branch/main/graph/badge.svg" />
   </a>
  
</p>

OmegaOMG is a domain-specific language (DSL) and runtime engine for defining and evaluating high‑performance object / entity matching rules against large byte-based inputs (“haystacks”). It leverages pre‑anchored longest, non‑overlapping pattern matches (via the [`OmegaMatch`](https://github.com/scholarsmate/omega-match) library), an optimized AST evaluation engine, and a modular entity resolution pipeline to produce clean, canonicalized, and enriched match streams.

## Key Features

- **Expressive DSL** (version `1.0`):
   - `version 1.0` header (mandatory)
   - `import "file.txt" as alias [with flags...]`
   - Pattern atoms: literals, escapes (`\d \s \w` etc.), anchors `^ $`, dot `.`, character classes `[...]`, list matches `[[alias]]`, optional filters `[[alias:startsWith("A")]]`, named captures `(?P<name> ...)`.
   - Operators: concatenation, alternation `|`, grouping `(...)`.
   - Quantifiers: bounded `{m}`, `{m,n}`, and `?` (no unbounded `*` / `+` – enforced at runtime).
   - Every rule must include at least one `ListMatch` anchor (validated).
   - Dotted rule names (e.g. `person.surname`) supported for parent/child entity models.
- **Import flags**: `ignore-case`, `ignore-punctuation`, `elide-whitespace`, `word-boundary`, `word-prefix`, `word-suffix`, `line-start`, `line-end` (forwarded to `omega_match`).
- **Pre‑anchored matching**: Delegates raw token list detection to `omega_match` with `longest_only` & `no_overlap` guarantees per alias.
- **Optimized AST evaluation**:
   - Offset‑indexed & binary searched ListMatch anchors
   - Greedy quantified ListMatch chaining
   - Caching for pattern parts, prefix length, listmatch presence, unbounded checks
   - Adaptive sampling of potential start offsets to dramatically reduce scan points
- **Entity Resolution Pipeline** (see `RESOLUTION.md`): Implements Steps 1‑6
   1. Validation & normalization
   2. Overlap resolution with deterministic tie‑breaking
   3. Tokenization + optional token filtering
   4. Horizontal canonicalization (parent deduplication)
   5. Vertical child resolution (child → parent referencing)
   6. Metadata enrichment (sentence / paragraph boundaries)
- **Resolver configuration**:
   - `resolver default uses exact ...` sets a default for rules
   - Per‑rule: `rule = ... uses resolver fuzzy(threshold="0.9") with ignore-case, optional-tokens("file.txt")`
   - Parent rules without children skip resolution for speed; parents with children receive an automatic lightweight `boundary-only` resolver if not explicitly configured.
- **Resolver methods**: Grammar accepts arbitrary resolver method identifiers; built-ins implemented are `exact` and `fuzzy(threshold=...)`. For parent canonicalization, unknown methods fall back to `exact`. For child resolution, use `exact` or `fuzzy` to guarantee matching; unknown methods may result in children being discarded. An internal `boundary-only` mode is used automatically for certain parent rules.
- **Highlighter utility**: Renders enriched matches to interactive HTML (`highlighter.py`) with rule toggles and keyboard navigation (`n` / `p`).
- **VS Code language integration**: See [OMG Language Support](https://github.com/scholarsmate/omega-omg-vscode) for syntax highlighting & IntelliSense.
- **Lean dependencies**: Runtime requires only `lark` and `omega_match`.

> For algorithmic details and performance rationale see: [`RESOLUTION.md`](RESOLUTION.md)
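
As an illustration of the horizontal canonicalization idea listed above, the following is a minimal sketch of reducing matched text to a normalized token bag so that surface variants compare equal. It is not the engine's implementation: the function name `token_bag`, its parameters, the whitespace tokenizer, and the order-insensitive comparison are all assumptions made for the example.

```python
import string
from collections import Counter


def token_bag(text, ignore_case=True, ignore_punctuation=True,
              optional_tokens=frozenset()):
    """Hypothetical helper: reduce a matched span to a token multiset.

    Mirrors the `ignore-case`, `ignore-punctuation`, and
    `optional-tokens(...)` resolver options in spirit only.
    """
    if ignore_case:
        text = text.lower()
    if ignore_punctuation:
        # Delete every ASCII punctuation character.
        text = text.translate(str.maketrans("", "", string.punctuation))
    # Assumption: tokens are whitespace-separated; optional tokens are
    # expected to be given in already-normalized (lowercase) form.
    tokens = [t for t in text.split() if t not in optional_tokens]
    return frozenset(Counter(tokens).items())


a = token_bag("Dr. John Smith", optional_tokens={"dr"})
b = token_bag("john SMITH")
print(a == b)  # True
```

Two parent matches whose bags compare equal would be treated as the same canonical entity under this sketch; see `RESOLUTION.md` for what the engine actually does.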

## Installation

Requires: Python 3.9+ (uses builtin generics such as `tuple[str, ...]`).

1. Clone this repository:

   ```powershell
   git clone https://github.com/scholarsmate/omega-omg.git
   cd omega-omg
   ```

2. Create and activate a Python virtual environment:

   a. Windows:

   ```powershell
   python3.exe -m venv .venv
   .\.venv\Scripts\Activate.ps1
   ```

   b. *nix and macOS:

   ```sh
   python3 -m venv .venv
   source ./.venv/bin/activate
   ```

3. Install runtime dependencies (and optionally dev tooling):

   ```powershell
   pip install -r requirements.txt
   # For contributors / tests / linting
   pip install -r requirements-dev.txt
   ```

4. (Optional) Run tests to verify environment:

   ```powershell
   pytest -q
   ```

## Usage

### 1. Define a DSL file

Create a `.omg` file with rules, e.g., `demo/demo.omg`:
```dsl
version 1.0

# Import match lists
import "name_prefix.txt" as prefix with word-boundary, ignore-case
import "names.txt" as given_name with word-boundary
import "surnames.txt" as surname with word-boundary
import "name_suffix.txt" as suffix with word-boundary
import "0000-9999.txt" as 4_digits with word-boundary
import "tlds.txt" as tld with word-boundary, ignore-case

# Configure the default resolver
resolver default uses exact with ignore-case, ignore-punctuation

# Top-level rule for matching a person's name
person = ( [[prefix]] \s{1,4} )? \
    [[given_name]] ( \s{1,4} [[given_name]] )? ( \s{1,4} \w | \s{1,4} \w "." )? \
    \s{1,4} [[surname]] \
    (\s{0,4} "," \s{1,4} [[suffix]])? \
    uses default resolver with optional-tokens("person-opt_tokens.txt")

# Dotted-rule references resolve to top-level person matches
person.prefix_surname = [[prefix]] \s{1,4} [[surname]] (\s{0,4} "," \s{1,4} [[suffix]])? \
    uses default resolver with optional-tokens("person-opt_tokens.txt")
person.surname = [[surname]] (\s{0,4} "," \s{1,4} [[suffix]])? \
    uses default resolver with optional-tokens("person-opt_tokens.txt")

# Rule for matching a phone number
phone = "(" \s{0,2} \d{3} \s{0,2} ")" \s{0,2} \d{3} "-" \s{0,2} [[4_digits]]

# Rule for matching email addresses with bounded quantifiers
# Pattern: username@domain.tld
# Username: 1-64 chars (alphanumeric, dots, hyphens, underscores)
# Domain: 1-253 chars total, each label 1-63 chars
email = [A-Za-z0-9._-]{1,64} "@" [A-Za-z0-9-]{1,63} ("." [A-Za-z0-9-]{1,63}){0,10} "." [[tld]]
```

### 2. Parse and evaluate in Python

```python
from dsl.omg_parser import parse_file
from dsl.omg_evaluator import RuleEvaluator

# Load DSL and input haystack
ast = parse_file("demo/demo.omg")
with open("demo/CIA_Briefings_of_Presidential_Candidates_1952-1992.txt", "rb") as f:
    haystack = f.read()

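# Evaluate a specific rule against the haystack
engine = RuleEvaluator(ast_root=ast, haystack=haystack)
matches = engine.evaluate_rule(ast.rules["person"])
for m in matches:
    # m.match is raw bytes; replace undecodable bytes instead of raising
    print(m.offset, m.match.decode("utf-8", errors="replace"))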
```

### 3. Command-Line Tool

A command-line interface is provided by `omg.py`.

```powershell
python omg.py --help
```

Common flags:

| Flag | Purpose |
|------|---------|
| `--show-stats` | Emit resolution statistics (input vs output, stage timings) |
| `--show-timing` | Show breakdown of file load, parse, evaluation, resolution |
| `--no-resolve` | Skip entity resolution; emit raw rule matches |
| `--pretty-print` | Emit a single JSON array instead of line-delimited JSON objects |
| `--log-level LEVEL` | Adjust logging (default WARNING) |
| `-o`, `--output FILE` | Write JSON output to `FILE` (UTF‑8, LF) |
| `--version` | Show component & DSL versions |

Version output example:
```
Version information:
   omega_match: <x.y.z>
   omg: 0.2.1
   DSL: 1.0
```

#### Demo: End-to-End Object Matching and Highlighting

The following demonstrates how to use the CLI tools to extract and visualize matches from a text file using a demo OMG rule set:

1. **Run the matcher and output results to JSON (line‑delimited):**

   ```powershell
   python omg.py --show-stats --show-timing --output matches.json .\demo\demo.omg .\demo\CIA_Briefings_of_Presidential_Candidates_1952-1992.txt
   ```
   This command will print timing and statistics to the terminal and write all matches to `matches.json` in UTF-8 with LF line endings.

2. **Render the matches as highlighted HTML:**

   ```powershell
   python highlighter.py .\demo\CIA_Briefings_of_Presidential_Candidates_1952-1992.txt matches.json CIA_demo.html
   ```
   This will generate an HTML file (`CIA_demo.html`) with all matched objects highlighted for easy review.

You can open the resulting HTML file in a browser to visually inspect the extracted matches.
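
For a quick sanity check before rendering, the line-delimited output can also be summarized in a few lines of Python. This is a sketch under assumptions: the field name `rule` and the sample records below are hypothetical and not taken from the documented output schema, so adjust the key to match what `matches.json` actually contains.

```python
import json
from collections import Counter


def count_by_key(lines, key="rule"):
    """Count JSON-lines records by one field.

    `key` defaults to "rule", which is an assumed field name.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        counts[record.get(key, "<missing>")] += 1
    return counts


# Made-up records standing in for the contents of matches.json.
sample = [
    '{"rule": "person", "offset": 10}',
    '{"rule": "phone", "offset": 42}',
    '{"rule": "person", "offset": 99}',
]
print(count_by_key(sample).most_common())  # [('person', 2), ('phone', 1)]
```

With a real file, pass the open handle instead: `with open("matches.json", encoding="utf-8") as f: count_by_key(f)`. Output written with `--pretty-print` is a single JSON array and would need `json.load` rather than line-by-line parsing.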

## Project Structure

```
omg.py               # CLI driver (evaluate + optional resolution + JSON output)
highlighter.py       # Convert line-delimited match JSON to interactive HTML
dsl/
   omg_grammar.lark   # Lark grammar definition for DSL v1.0
   omg_parser.py      # Parser + resolver clause extraction + version enforcement
   omg_ast.py         # Immutable AST node dataclasses
   omg_transformer.py # Grammar → AST transformer
   omg_evaluator.py   # Optimized rule evaluation engine
   omg_resolver.py    # Resolver façade (imports components below)
   resolver/          # Entity resolution submodules (overlap, horizontal, vertical, tokenizer, metadata)
demo/                # Example DSL + pattern lists + sample text
tests/               # Comprehensive pytest suite
RESOLUTION.md        # Detailed entity resolution algorithm spec
```

## DSL Constraints & Gotchas

- All rules must include at least one `[[alias]]` (ListMatch). Pure literal / regex‑like rules are rejected.
- Unbounded quantifiers (`*`, `+`) are disallowed; use `{0,n}` / `{1,n}` equivalents.
- Quantified `ListMatch` chains are greedily extended with adjacency (no gaps) and optional line boundary enforcement.
- Dotted (child) rules without an explicit resolver inherit the default; parents with children but no explicit resolver receive a lightweight `boundary-only` config to add structural metadata.
- Import paths in a DSL file are resolved relative to that DSL file when relative.
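
Because relative import paths are resolved against the DSL file's own directory, a missing list file is easy to overlook. The sketch below checks for that before running the engine. The helper names are hypothetical, and the regular expression is a simplification of the `import "file" as alias` syntax rather than the project's Lark grammar.

```python
import re
from pathlib import Path

# Simplified pattern for: import "file.txt" as alias [with flags...]
IMPORT_RE = re.compile(r'^\s*import\s+"([^"]+)"\s+as\s+(\w+)', re.MULTILINE)


def resolve_imports(dsl_path):
    """Map each import alias to an absolute path (hypothetical helper)."""
    dsl_path = Path(dsl_path)
    text = dsl_path.read_text(encoding="utf-8")
    resolved = {}
    for rel, alias in IMPORT_RE.findall(text):
        path = Path(rel)
        if not path.is_absolute():
            # Relative imports are anchored at the DSL file's directory.
            path = dsl_path.parent / path
        resolved[alias] = path.resolve()
    return resolved


def missing_imports(dsl_path):
    """Return the aliases whose list files do not exist on disk."""
    return sorted(alias for alias, path in resolve_imports(dsl_path).items()
                  if not path.is_file())
```

Calling `missing_imports("demo/demo.omg")` from the repository root should return an empty list when every imported list file is present.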

## Entity Resolution Summary

After raw AST evaluation, resolution (unless `--no-resolve`) applies:

1. Overlap removal, with ties broken in priority order: greater match length, then earlier offset, then shorter rule name, then lexicographic rule-name order.
2. Parent canonicalization by normalized token bag (flags + optional tokens removed).
3. Child rule validation: each child must map to exactly one canonical parent (else dropped).
4. Metadata enrichment: sentence & paragraph boundary offsets.

See `RESOLUTION.md` for full reasoning, complexity, and future extension recommendations.
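
The overlap tie-breaking order can be sketched as a sort followed by a greedy sweep. This is an illustrative reading of the priority list above (it assumes "length" means the longer match wins), not the engine's code. The `Match` tuple and `remove_overlaps` function are hypothetical, and the quadratic overlap check is chosen for brevity rather than speed.

```python
from typing import NamedTuple


class Match(NamedTuple):
    offset: int
    length: int
    rule: str


def remove_overlaps(matches):
    """Keep the highest-priority match in every overlapping group."""
    # Priority: longer match, earlier offset, shorter rule name, lexical name.
    ranked = sorted(matches,
                    key=lambda m: (-m.length, m.offset, len(m.rule), m.rule))
    kept = []
    for m in ranked:
        end = m.offset + m.length
        # Accept m only if it is disjoint from everything already kept.
        if all(end <= k.offset or m.offset >= k.offset + k.length
               for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m.offset)


matches = [
    Match(0, 10, "person"),
    Match(5, 7, "phone"),    # overlaps the longer "person" match
    Match(20, 4, "phone"),
    Match(20, 4, "email"),   # same span and name length; "email" sorts first
]
print(remove_overlaps(matches))
# [Match(offset=0, length=10, rule='person'), Match(offset=20, length=4, rule='email')]
```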

## Performance Notes

- Matching cost is reduced via adaptive anchor sampling and per‑alias offset maps.
- Regex-like escapes use pre‑compiled single‑byte patterns for speed.
- Caches (pattern part, prefix length, ListMatch presence, unbounded quantifier detection) materially cut repeated traversals.
- Resolution skips unnecessary work (e.g., no resolver for isolated parent rules).
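
To illustrate why offset-indexed, binary-searched anchors are cheap to query, here is a minimal sketch built on the standard-library `bisect` module. The `AnchorIndex` class and its methods are hypothetical and only show the lookup technique. It assumes at most one anchor per start offset for a given alias, which the `longest_only` and `no_overlap` guarantees make plausible.

```python
from bisect import bisect_left


class AnchorIndex:
    """Sorted (offset, length) anchors for one alias (hypothetical)."""

    def __init__(self, anchors):
        self._anchors = sorted(anchors)
        self._offsets = [offset for offset, _ in self._anchors]

    def at(self, offset):
        """Return the anchor starting exactly at `offset`, or None."""
        i = bisect_left(self._offsets, offset)
        if i < len(self._offsets) and self._offsets[i] == offset:
            return self._anchors[i]
        return None

    def next_from(self, offset):
        """Return the first anchor starting at or after `offset`, or None."""
        i = bisect_left(self._offsets, offset)
        return self._anchors[i] if i < len(self._anchors) else None


index = AnchorIndex([(40, 5), (3, 4), (17, 6)])
print(index.at(17))         # (17, 6)
print(index.at(18))         # None
print(index.next_from(18))  # (40, 5)
```

Each lookup costs O(log n) in the number of anchors, so an evaluator can jump straight to candidate positions instead of scanning every byte of the haystack.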

## Development

Linting and tests with coverage (optional but recommended):

```powershell
ruff check .
pylint dsl omg.py highlighter.py
pytest --cov
```

Type checking:
```powershell
mypy dsl
```

Releasing (example):
```powershell
python -m build
twine upload dist/*
```

## Troubleshooting

| Issue | Likely Cause | Fix |
|-------|--------------|-----|
| `ValueError: Rule 'x' must include at least one list match` | Rule lacks `[[alias]]` | Add an import + list match anchor |
| `Unsupported OMG DSL version` | DSL file version mismatch | Set the file header to `version 1.0`, or update the engine's version constant |
| No matches produced | Missing import flags (e.g. `word-boundary`) or list file path issue | Verify list file contents & flags |
| Child rules disappear | Unresolved parent reference | Ensure corresponding parent rule matches same span |
| HTML missing colors for a rule | Rule produced zero matches | Confirm JSON lines include that rule |

## Roadmap (Planned / Potential)

- Plugin resolver strategy interface (custom similarity algorithms)
- Parallel rule evaluation for very large haystacks
- Configurable overlap priority strategies
- More built-in resolver methods beyond `exact` and `fuzzy`
- Richer IDE tooling (hover docs, go‑to definition)

## Contributing

1. Fork the repo and create a feature branch.
2. Write tests under `tests/` for new features or bug fixes.
3. Run `pytest` to ensure all tests pass.
   ```powershell
   pytest
   ```
4. Submit a pull request.

## License

The OmegaOMG project is licensed under the [Apache License 2.0](LICENSE).

OmegaOMG is **not** an official Apache Software Foundation (ASF) project.

---

Questions or ideas? Open an issue or start a discussion – contributions and feedback are welcome.

            
