# OmegaOMG: Omega Object Matching Grammar
<p align="center">
<img src="https://raw.githubusercontent.com/scholarsmate/omega-omg/main/images/icon.png" alt="OmegaOMG Logo" width="180" />
</p>
<p align="center">
<a href="https://github.com/scholarsmate/omega-omg/actions/workflows/ci.yml">
<img alt="CI" src="https://github.com/scholarsmate/omega-omg/actions/workflows/ci.yml/badge.svg" />
</a>
<a href="https://codecov.io/gh/scholarsmate/omega-omg">
<img alt="Coverage" src="https://codecov.io/gh/scholarsmate/omega-omg/branch/main/graph/badge.svg" />
</a>
</p>
OmegaOMG is a domain-specific language (DSL) and runtime engine for defining and evaluating high‑performance object / entity matching rules against large byte-based inputs (“haystacks”). It leverages pre‑anchored longest, non‑overlapping pattern matches (via the [`OmegaMatch`](https://github.com/scholarsmate/omega-match) library), an optimized AST evaluation engine, and a modular entity resolution pipeline to produce clean, canonicalized, and enriched match streams.
## Key Features
- **Expressive DSL** (version `1.0`):
  - `version 1.0` header (mandatory)
  - `import "file.txt" as alias [with flags...]`
  - Pattern atoms: literals, escapes (`\d \s \w` etc.), anchors `^ $`, dot `.`, character classes `[...]`, list matches `[[alias]]`, optional filters `[[alias:startsWith("A")]]`, named captures `(?P<name> ...)`.
  - Operators: concatenation, alternation `|`, grouping `(...)`.
  - Quantifiers: bounded `{m}`, `{m,n}`, and `?` (no unbounded `*` / `+` – enforced at runtime).
  - Every rule must include at least one `ListMatch` anchor (validated).
  - Dotted rule names (e.g. `person.surname`) supported for parent/child entity models.
- **Import flags**: `ignore-case`, `ignore-punctuation`, `elide-whitespace`, `word-boundary`, `word-prefix`, `word-suffix`, `line-start`, `line-end` (forwarded to `omega_match`).
- **Pre‑anchored matching**: Delegates raw token list detection to `omega_match` with `longest_only` & `no_overlap` guarantees per alias.
- **Optimized AST evaluation**:
  - Offset‑indexed & binary searched ListMatch anchors
  - Greedy quantified ListMatch chaining
  - Caching for pattern parts, prefix length, ListMatch presence, unbounded checks
  - Adaptive sampling of potential start offsets to dramatically reduce scan points
- **Entity Resolution Pipeline** (see `RESOLUTION.md`): Implements Steps 1‑6
  1. Validation & normalization
  2. Overlap resolution with deterministic tie‑breaking
  3. Tokenization + optional token filtering
  4. Horizontal canonicalization (parent deduplication)
  5. Vertical child resolution (child → parent referencing)
  6. Metadata enrichment (sentence / paragraph boundaries)
- **Resolver configuration**:
  - `resolver default uses exact ...` sets a default for rules
  - Per‑rule: `rule = ... uses resolver fuzzy(threshold="0.9") with ignore-case, optional-tokens("file.txt")`
  - Parent rules without children skip resolution for speed; parents with children receive an automatic lightweight `boundary-only` resolver if not explicitly configured.
- **Resolver methods**: Grammar accepts arbitrary resolver method identifiers; built-ins implemented are `exact` and `fuzzy(threshold=...)`. For parent canonicalization, unknown methods fall back to `exact`. For child resolution, use `exact` or `fuzzy` to guarantee matching; unknown methods may result in children being discarded. An internal `boundary-only` mode is used automatically for certain parent rules.
- **Highlighter utility**: Renders enriched matches to interactive HTML (`highlighter.py`) with rule toggles and keyboard navigation (`n` / `p`).
- **VS Code language integration**: See [OMG Language Support](https://github.com/scholarsmate/omega-omg-vscode) for syntax highlighting & IntelliSense.
- **Lean dependencies**: Runtime requires only `lark` and `omega_match`.
> For algorithmic details and performance rationale see: [`RESOLUTION.md`](RESOLUTION.md)
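
To give a feel for what a `fuzzy(threshold=...)` comparison means, here is a minimal, self-contained sketch. It is not OmegaOMG's implementation: the similarity metric (Python's `difflib.SequenceMatcher` ratio) and the helper name `is_fuzzy_match` are assumptions made purely for illustration; see `RESOLUTION.md` for the actual algorithm.

```python
from difflib import SequenceMatcher


def is_fuzzy_match(child_tokens, parent_tokens, threshold=0.9):
    """Hypothetical helper: True when two token lists are similar enough.

    Similarity here is difflib's ratio over the space-joined, lowercased
    tokens. The real resolver may use a different metric.
    """
    child = " ".join(child_tokens).lower()
    parent = " ".join(parent_tokens).lower()
    return SequenceMatcher(None, child, parent).ratio() >= threshold


# An identical spelling passes any threshold up to 1.0.
print(is_fuzzy_match(["Eisenhower"], ["Eisenhower"]))
# A near-miss spelling only passes when the threshold is relaxed.
print(is_fuzzy_match(["Eisenhower"], ["Eisenhauer"], threshold=0.75))
```

A higher threshold behaves more like `exact`; a lower one tolerates more spelling variation at the cost of more false merges.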
## Installation
Requires: Python 3.9+ (uses builtin generics like `tuple[str, ...]`).
1. Clone this repository:
```powershell
git clone https://github.com/scholarsmate/omega-omg.git
cd omega-omg
```
2. Create and activate a Python virtual environment:
a. Windows:
```powershell
python3.exe -m venv .venv
.\.venv\Scripts\Activate.ps1
```
b. *nix and macOS:
```sh
python3 -m venv .venv
source ./.venv/bin/activate
```
3. Install runtime dependencies (and optionally dev tooling):
```powershell
pip install -r requirements.txt
# For contributors / tests / linting
pip install -r requirements-dev.txt
```
4. (Optional) Run tests to verify environment:
```powershell
pytest -q
```
## Usage
### 1. Define a DSL file
Create a `.omg` file with rules, e.g., `demo/demo.omg`:
```dsl
version 1.0
# Import match lists
import "name_prefix.txt" as prefix with word-boundary, ignore-case
import "names.txt" as given_name with word-boundary
import "surnames.txt" as surname with word-boundary
import "name_suffix.txt" as suffix with word-boundary
import "0000-9999.txt" as 4_digits with word-boundary
import "tlds.txt" as tld with word-boundary, ignore-case
# Configure the default resolver
resolver default uses exact with ignore-case, ignore-punctuation
# Top-level rule for matching a person's name
person = ( [[prefix]] \s{1,4} )? \
[[given_name]] ( \s{1,4} [[given_name]] )? ( \s{1,4} \w | \s{1,4} \w "." )? \
\s{1,4} [[surname]] \
(\s{0,4} "," \s{1,4} [[suffix]])? \
uses default resolver with optional-tokens("person-opt_tokens.txt")
# Dotted-rule references resolve to top-level person matches
person.prefix_surname = [[prefix]] \s{1,4} [[surname]] (\s{0,4} "," \s{1,4} [[suffix]])? \
uses default resolver with optional-tokens("person-opt_tokens.txt")
person.surname = [[surname]] (\s{0,4} "," \s{1,4} [[suffix]])? \
uses default resolver with optional-tokens("person-opt_tokens.txt")
# Rule for matching a phone number
phone = "(" \s{0,2} \d{3} \s{0,2} ")" \s{0,2} \d{3} "-" \s{0,2} [[4_digits]]
# Rule for matching email addresses with bounded quantifiers
# Pattern: username@domain.tld
# Username: 1-64 chars (alphanumeric, dots, hyphens, underscores)
# Domain: 1-253 chars total, each label 1-63 chars
email = [A-Za-z0-9._-]{1,64} "@" [A-Za-z0-9-]{1,63} ("." [A-Za-z0-9-]{1,63}){0,10} "." [[tld]]
```
### 2. Parse and evaluate in Python
```python
from dsl.omg_parser import parse_file
from dsl.omg_evaluator import RuleEvaluator

# Load DSL and input haystack
ast = parse_file("demo/demo.omg")
with open("demo/CIA_Briefings_of_Presidential_Candidates_1952-1992.txt", "rb") as f:
    haystack = f.read()

# Evaluate a specific rule
engine = RuleEvaluator(ast_root=ast, haystack=haystack)
matches = engine.evaluate_rule(ast.rules["person"])
for m in matches:
    print(m.offset, m.match.decode())
```
### 3. Command-Line Tool
A command-line interface is provided by `omg.py`.
```powershell
python omg.py --help
```
Common flags:
| Flag | Purpose |
|------|---------|
| `--show-stats` | Emit resolution statistics (input vs output, stage timings) |
| `--show-timing` | Show breakdown of file load, parse, evaluation, resolution |
| `--no-resolve` | Skip entity resolution; emit raw rule matches |
| `--pretty-print` | Emit a single JSON array instead of line-delimited JSON objects |
| `--log-level LEVEL` | Adjust logging (default WARNING) |
| `-o file.json` | Write JSON output to file (UTF‑8, LF) |
| `--version` | Show component & DSL versions |
Version output example:
```
Version information:
  omega_match: <x.y.z>
  omg: 0.2.0
  DSL: 1.0
```
#### Demo: End-to-End Object Matching and Highlighting
The following demonstrates how to use the CLI tools to extract and visualize matches from a text file using a demo OMG rule set:
1. **Run the matcher and output results to JSON (line‑delimited):**
```powershell
python omg.py --show-stats --show-timing --output matches.json .\demo\demo.omg .\demo\CIA_Briefings_of_Presidential_Candidates_1952-1992.txt
```
This command will print timing and statistics to the terminal and write all matches to `matches.json` in UTF-8 with LF line endings.
2. **Render the matches as highlighted HTML:**
```powershell
python highlighter.py .\demo\CIA_Briefings_of_Presidential_Candidates_1952-1992.txt matches.json CIA_demo.html
```
This will generate an HTML file (`CIA_demo.html`) with all matched objects highlighted for easy review.
You can open the resulting HTML file in a browser to visually inspect the extracted matches.
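
For a quick look at the output without the highlighter, the line-delimited JSON can be read with the standard library alone. The sketch below is illustrative: the field name `rule` used for grouping is a hypothetical key, so inspect a line of your own `matches.json` and substitute the real key.

```python
import json
from collections import Counter


def load_matches(lines):
    """Parse line-delimited JSON: one object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def count_by(matches, key):
    """Count match objects by the value stored under `key`."""
    return Counter(m.get(key, "<missing>") for m in matches)


# With a real file:
#     with open("matches.json", encoding="utf-8") as f:
#         matches = load_matches(f)
# The sample below is made-up data with a hypothetical "rule" field.
sample = [
    '{"rule": "person", "offset": 10}',
    "",
    '{"rule": "phone", "offset": 42}',
    '{"rule": "person", "offset": 99}',
]
matches = load_matches(sample)
print(count_by(matches, "rule"))
```

Note that this reader applies only to the default line-delimited output; with `--pretty-print` the file is a single JSON array and `json.load` on the whole file is the right call.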
## Project Structure
```
omg.py                  # CLI driver (evaluate + optional resolution + JSON output)
highlighter.py          # Convert line-delimited match JSON to interactive HTML
dsl/
  omg_grammar.lark      # Lark grammar definition for DSL v1.0
  omg_parser.py         # Parser + resolver clause extraction + version enforcement
  omg_ast.py            # Immutable AST node dataclasses
  omg_transformer.py    # Grammar → AST transformer
  omg_evaluator.py      # Optimized rule evaluation engine
  omg_resolver.py       # Resolver façade (imports components below)
  resolver/             # Entity resolution submodules (overlap, horizontal, vertical, tokenizer, metadata)
demo/                   # Example DSL + pattern lists + sample text
tests/                  # Comprehensive pytest suite
RESOLUTION.md           # Detailed entity resolution algorithm spec
```
## DSL Constraints & Gotchas
- All rules must include at least one `[[alias]]` (ListMatch). Pure literal / regex‑like rules are rejected.
- Unbounded quantifiers (`*`, `+`) are disallowed; use `{0,n}` / `{1,n}` equivalents.
- Quantified `ListMatch` chains are greedily extended with adjacency (no gaps) and optional line boundary enforcement.
- Dotted (child) rules without an explicit resolver inherit the default; parents with children but no explicit resolver receive a lightweight `boundary-only` config to add structural metadata.
- Import paths in a DSL file are resolved relative to that DSL file when relative.
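
The last point is a common source of "list file not found" confusion, so here is a small sketch of the rule as stated. The helper name `resolve_import` is hypothetical and this is not the parser's code; it only mirrors the documented behavior that relative import paths are anchored at the DSL file's directory rather than the current working directory.

```python
from pathlib import Path


def resolve_import(dsl_file, import_path):
    """Hypothetical helper: where an `import "..."` path points on disk."""
    candidate = Path(import_path)
    if candidate.is_absolute():
        # Absolute paths are used as given.
        return candidate
    # Relative paths are anchored at the DSL file's own directory.
    return Path(dsl_file).parent / candidate


print(resolve_import("demo/demo.omg", "names.txt"))
```

So `import "names.txt"` inside `demo/demo.omg` refers to `demo/names.txt` regardless of where `omg.py` is launched from.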
## Entity Resolution Summary
After raw AST evaluation, resolution (unless `--no-resolve`) applies:
1. Overlap removal, preferring in order: longer match, then earlier offset, then shorter rule name, then lexical order of rule name.
2. Parent canonicalization by normalized token bag (flags + optional tokens removed).
3. Child rule validation: each child must map to exactly one canonical parent (else dropped).
4. Metadata enrichment: sentence & paragraph boundary offsets.
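
The overlap priority in step 1 can be read as a sort key. The sketch below is an illustration under stated assumptions, not the resolver's code: the `Match` tuple, the greedy keep-if-no-conflict loop, and the reading of "shorter rule name" as character length are all assumptions.

```python
from typing import NamedTuple


class Match(NamedTuple):
    """Hypothetical match record: byte offset, byte length, rule name."""
    offset: int
    length: int
    rule: str


def priority(m):
    # Longer first, then earlier offset, then shorter rule name, then lexical.
    return (-m.length, m.offset, len(m.rule), m.rule)


def resolve_overlaps(matches):
    """Greedily keep the best-ranked matches that do not overlap each other."""
    kept = []
    for m in sorted(matches, key=priority):
        end = m.offset + m.length
        if all(end <= k.offset or m.offset >= k.offset + k.length for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m.offset)


candidates = [
    Match(0, 10, "person"),
    Match(5, 5, "person.surname"),  # inside the longer person match
    Match(20, 4, "phone"),
    Match(20, 4, "email"),          # same span as phone; name breaks the tie
]
print(resolve_overlaps(candidates))
```

Because every comparison falls back to the rule name, two runs over the same input always keep the same matches, which is the point of deterministic tie-breaking.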
See `RESOLUTION.md` for full reasoning, complexity, and future extension recommendations.
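
As a rough illustration of parent canonicalization by normalized token bag, the sketch below groups surface forms that reduce to the same tokens. It is a simplification built on assumptions: the function names are hypothetical, tokenization is plain whitespace splitting, and only the `ignore-case` and `ignore-punctuation` flags plus an optional-token set are modeled.

```python
import string
from collections import defaultdict

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def canonical_key(text, optional_tokens=frozenset(),
                  ignore_case=True, ignore_punctuation=True):
    """Hypothetical helper: reduce a mention to a sorted tuple of tokens."""
    if ignore_case:
        text = text.lower()
    if ignore_punctuation:
        text = text.translate(_PUNCT_TABLE)
    # optional_tokens are assumed to be already normalized the same way.
    tokens = [t for t in text.split() if t not in optional_tokens]
    return tuple(sorted(tokens))  # sorted tuple keeps duplicates, i.e. a bag


def group_parents(mentions, optional_tokens=frozenset()):
    """Group mentions that share a canonical key."""
    groups = defaultdict(list)
    for mention in mentions:
        groups[canonical_key(mention, optional_tokens)].append(mention)
    return dict(groups)


optional = frozenset({"dr", "jr", "mr"})
mentions = ["Dr. John Smith", "JOHN SMITH, Jr.", "Mr. Allen Dulles"]
print(group_parents(mentions, optional))
```

Here the first two mentions collapse onto the key `("john", "smith")`, so they would be treated as one canonical parent, while the third stays separate.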
## Performance Notes
- Matching cost reduced via adaptive anchor sampling and per‑alias offset maps.
- Regex-like escapes use pre‑compiled single‑byte patterns for speed.
- Caches (pattern part, prefix length, ListMatch presence, unbounded quantifier detection) materially cut repeated traversals.
- Resolution skips unnecessary work (e.g., no resolver for isolated parent rules).
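
To show why a per‑alias offset map helps, here is a minimal sketch of an offset index queried by binary search. The `OffsetIndex` class and the `(offset, length)` tuples are hypothetical and do not reflect the evaluator's internal data structures; the sketch only relies on the documented guarantee that an alias has no overlapping matches, so each offset holds at most one match.

```python
from bisect import bisect_left


class OffsetIndex:
    """Hypothetical sorted index of one alias's (offset, length) matches."""

    def __init__(self, matches):
        self._matches = sorted(matches)
        self._offsets = [offset for offset, _ in self._matches]

    def at(self, offset):
        """Return the match starting exactly at `offset`, or None."""
        i = bisect_left(self._offsets, offset)
        if i < len(self._offsets) and self._offsets[i] == offset:
            return self._matches[i]
        return None

    def next_from(self, offset):
        """Return the first match starting at or after `offset`, or None."""
        i = bisect_left(self._offsets, offset)
        return self._matches[i] if i < len(self._matches) else None


surname_index = OffsetIndex([(40, 5), (3, 4), (17, 6)])
print(surname_index.at(17))
print(surname_index.next_from(18))
```

Both lookups cost O(log n) in the number of anchors, so the evaluator can jump between candidate positions instead of testing every byte of the haystack.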
## Development
Formatting / linting (optional but recommended):
```powershell
ruff check .
pylint dsl omg.py highlighter.py
pytest --cov
```
Type checking:
```powershell
mypy dsl
```
Releasing (example):
```powershell
python -m build
twine upload dist/*
```
## Troubleshooting
| Issue | Likely Cause | Fix |
|-------|--------------|-----|
| `ValueError: Rule 'x' must include at least one list match` | Rule lacks `[[alias]]` | Add an import + list match anchor |
| `Unsupported OMG DSL version` | DSL file version mismatch | Set the file header to `version 1.0`, or update the engine's supported-version constant |
| No matches produced | Missing import flags (e.g. `word-boundary`) or list file path issue | Verify list file contents & flags |
| Child rules disappear | Unresolved parent reference | Ensure corresponding parent rule matches same span |
| HTML missing colors for a rule | Rule produced zero matches | Confirm JSON lines include that rule |
## Roadmap (Planned / Potential)
- Plugin resolver strategy interface (custom similarity algorithms)
- Parallel rule evaluation for very large haystacks
- Configurable overlap priority strategies
- More built-in resolver methods beyond the currently implemented `exact` and `fuzzy` (for example, `contains`)
- Richer IDE tooling (hover docs, go‑to definition)
## Contributing
1. Fork the repo and create a feature branch.
2. Write tests under `tests/` for new features or bug fixes.
3. Run `pytest` to ensure all tests pass.
```powershell
pytest
```
4. Submit a pull request.
## License
The OmegaOMG project is licensed under the [Apache License 2.0](LICENSE).
OmegaOMG is **not** an official Apache Software Foundation (ASF) project.
---
Questions or ideas? Open an issue or start a discussion – contributions and feedback are welcome.