seanox-ai-nlp


- Name: seanox-ai-nlp
- Version: 1.3.0.1
- Summary: Lightweight NLP components for semantic processing of domain-specific content.
- Upload time: 2025-10-09 11:17:39
- Author: Seanox Software Solutions
- Requires Python: >=3.10
- License: Apache-2.0
- Keywords: nlp, data annotation, data generation, domain-specific, entity extraction, fine-tuning, information extraction, information retrieval, measurement extraction, measurement units, preprocessing, pretraining data, retrieval optimization, semantic labeling, semantic processing, semantic retrieval, sentence generator, structured data, synthetic data, synthetic text, template engine, text processing, training data augmentation

<p>
  <a href="https://github.com/seanox/seanox-ai-nlp/pulls"
      title="Development"
    ><img src="https://img.shields.io/badge/development-active-green?style=for-the-badge"
  ></a>  
  <a href="https://github.com/seanox/seanox-ai-nlp/issues"
    ><img src="https://img.shields.io/badge/maintenance-active-green?style=for-the-badge"
  ></a>
  <a href="https://seanox.com/contact"
    ><img src="https://img.shields.io/badge/support-active-green?style=for-the-badge"
  ></a>
</p>

# Description
Structured data in technical domains (e.g. engineering, meteorology) often
contain specialized terminology, measurement units, parameter specifications,
and symbolic values. Similarity methods based solely on embeddings handle these
elements poorly, because embeddings offer limited semantic resolution for them.

This package follows a hybrid approach, in which rule-based processing,
NLP-based filtering, and embeddings can be combined so that domain-specific
entities are identified and organized across multiple levels of abstraction,
enabling interpretable and reproducible retrieval workflows.

The package integrates lightweight components into existing NLP pipelines. These
components are designed to work without relying on large language models (LLMs)
and to structure relevant data using deterministic and auditable mechanisms.
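
The hybrid idea can be illustrated with a minimal sketch that uses only the standard library. It is not the package's implementation: the `MEASUREMENT` pattern, the helper names, and the bag-of-words similarity (a toy stand-in for real embeddings) are assumptions made for illustration. A deterministic rule-based step first filters candidates by exact measurement match, and only the remaining candidates are ranked by similarity.

```python
import math
import re
from collections import Counter

# Hypothetical pattern: a number followed by a unit from a small fixed list.
MEASUREMENT = re.compile(r"(-?\d+(?:\.\d+)?)\s*(km/h|hPa|km|mph)\b")


def measurements(text):
    """Return the set of (value, unit) pairs found by the rule-based step."""
    return {(float(value), unit) for value, unit in MEASUREMENT.findall(text)}


def bag_of_words(text):
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity of two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents):
    """Filter by exact measurement match first, then rank by similarity."""
    wanted = measurements(query)
    candidates = [doc for doc in documents if wanted <= measurements(doc)]
    query_vector = bag_of_words(query)
    return sorted(
        candidates,
        key=lambda doc: cosine(query_vector, bag_of_words(doc)),
        reverse=True,
    )


documents = [
    "Cruising speed of the aircraft is 900 km/h.",
    "Cruising speed of the aircraft is 90 km/h.",
    "The storm reached a pressure of 900 hPa.",
]
print(retrieve("aircraft cruising at 900 km/h", documents))
```

The second and third documents look similar to the query at the word level but are excluded by the rule-based filter, which is the kind of distinction that similarity alone tends to blur.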

__Additional modules are planned to support structured query generation,
including:__

- __Semantic Logic Composer__: Parses natural-language input and produces a
  logical structure enriched with extracted entities. This structure can be used
  as a basis for formats such as SQL, JSON or YAML.

## Structured NLP Workflow

The following figure illustrates the core motivation and design focus of this
package. It outlines the typical stages of a structured NLP pipeline and
highlights the specific components where this package provides support.

![Retrieval Process](https://raw.githubusercontent.com/seanox/seanox-ai-nlp/refs/heads/master/assets/retrieval-process.svg)

This conceptual overview serves as a foundation for understanding the individual
components, which are detailed under Packages & Modules.

# Licence Agreement
Seanox Software Solutions is an open-source project, hereinafter referred to as
__Seanox__.

This software is licensed under the __Apache License, Version 2.0__.

__Copyright (C) 2025 Seanox Software Solutions__

Licensed under the Apache License, Version 2.0 (the "License"); you may not use
this file except in compliance with the License. You may obtain a copy of the
License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed
under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

# System Requirements
- Python 3.10 or higher

# Installation & Setup
```
pip install seanox-ai-nlp
```
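
To confirm the setup afterwards, a small check along the following lines may help. It is a sketch that relies only on the standard library; the helper names `installed_version` and `python_is_supported` are hypothetical and not part of the package.

```python
import sys
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def python_is_supported(version_info=sys.version_info):
    """Check the interpreter against the documented minimum of Python 3.10."""
    return tuple(version_info[:2]) >= (3, 10)


print("Python supported:", python_is_supported())
print("seanox-ai-nlp:", installed_version("seanox-ai-nlp") or "not installed")
```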

# Packages & Modules

## [units](https://github.com/seanox/seanox-ai-nlp/blob/master/seanox_ai_nlp/units/README.md)
The __units__ module applies rule-based, deterministic pattern recognition to
identify numerical expressions and measurement units in text. It is designed for
integration into lightweight NLP pipelines and does not rely on large language
models (LLMs). Its language-agnostic architecture and flexible formatting
support a broad range of use cases, including general, semi-technical and
semi-academic content.

The module can be integrated with tools such as spaCy’s `EntityRuler`, enabling
annotation, filtering, and token alignment workflows. It produces structured
output suitable for downstream analysis, without performing semantic
interpretation.

### Features
- __Pattern-based extraction__  
  Identifies constructs like _5 km_, _-20 &deg;C_, or _1000 hPa_ using regular
  expressions and token patterns -- no training required.
- __Language-independent architecture__  
  Operates at token and character level; applicable across multilingual content.
- __Support for compound expressions__  
  Recognizes unit combinations (_km/h, kWh/m&sup2;, g/cm&sup3;_) and numerical
  constructs involving signs and operators: _&plusmn;, &times;, &middot;,
  :, /, ^, –_ and more.
- __Integration-ready output__  
  Returns structured entities compatible with tools like spaCy’s EntityRuler.
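
The principle behind pattern-based extraction can be sketched with the standard `re` module. This is a simplified illustration under stated assumptions, not the module's actual patterns or output format: the unit list, the `MEASURE` label, and the dictionary keys are hypothetical.

```python
import re

# Hypothetical, deliberately small unit list; longer units come first so that
# "km/h" is preferred over "km".
UNIT = r"(?:km/h|kWh/m²|g/cm³|hPa|mph|km|°C)"
NUMBER = r"[-+±]?\d+(?:[.,]\d+)?"
PATTERN = re.compile(rf"(?<![\w.])({NUMBER})\s?({UNIT})(?!\w)")


def find_measurements(text):
    """Yield one dict per match, including character offsets into the text."""
    for match in PATTERN.finditer(text):
        yield {
            "label": "MEASURE",
            "start": match.start(),
            "end": match.end(),
            "text": match.group(0),
            "value": match.group(1),
            "unit": match.group(2),
        }


text = "The cruising speed of the Boeing 747 is approximately 900 km/h (559 mph)."
for entity in find_measurements(text):
    print(entity)
```

Because each result carries character offsets, it could be aligned with tokens in a downstream pipeline. The number _747_ is not reported, since no unit follows it.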

### Quickstart
```python
from seanox_ai_nlp.units import units

text = "The cruising speed of the Boeing 747 is approximately 900 km/h (559 mph)."

# Print every entity that units() finds in the text
for entity in units(text):
    print(entity)
```

- [Usage](https://github.com/seanox/seanox-ai-nlp/blob/master/seanox_ai_nlp/units/README.md#usage)
- [Integration in NLP Workflows](https://github.com/seanox/seanox-ai-nlp/blob/master/seanox_ai_nlp/units/README.md#integration-in-nlp-workflows)
- [Downstream Processing with pandas](https://github.com/seanox/seanox-ai-nlp/blob/master/seanox_ai_nlp/units/README.md#downstream-processing-with-pandas)

## [synthetics](https://github.com/seanox/seanox-ai-nlp/blob/master/seanox_ai_nlp/synthetics/README.md)
The __synthetics__ module generates annotated natural language from structured
input data -- such as records from databases or knowledge graphs. It uses
template-based, rule-driven methods to produce controlled and annotated
sentences. Designed for deterministic NLP pipelines, it avoids large language
models (LLMs) and supports reproducible generation.

### Features
- __Template-Based Text Generation__  
  Produces natural-language output from structured input using YAML-defined
  Jinja2 templates. Template selection is context-sensitive.
- __Stochastic Variation__  
  Filters such as __random_set__, __random_range__, and
  __random_range_join_phrase__ introduce lexical and syntactic diversity from
  identical data structures.
- __Domain-Specific Annotation__  
  Annotates entities with structured markers for precise extraction and control.
- __Rule-Based Span Detection__  
  Identifies semantic spans using regular expressions, independent of
  tokenization or parsing.
- __Interpretation-Free Generation__  
  Output is deterministic and reproducible; no semantic analysis is performed.
- __NLP Pipeline Compatibility__  
  The __Synthetic__ object includes raw and annotated text, entity spans and
  regex-based semantic spans. Compatible with spaCy-style frameworks for
  fine-tuning, evaluation, and augmentation.
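
How template-based generation, seeded variation, and annotation markers can fit together is sketched below. The sketch is an assumption-laden illustration, not the module's implementation: it uses `string.Template` from the standard library as a stand-in for YAML-defined Jinja2 templates, and the `[[LABEL:value]]` marker syntax, the templates, and the result keys are hypothetical.

```python
import random
import re
from string import Template

# Hypothetical templates with hypothetical [[LABEL:value]] annotation markers.
TEMPLATES = [
    Template("[[PLANET:$name]] has a diameter of [[DIAMETER:$diameter km]]."),
    Template("With [[DIAMETER:$diameter km]] in diameter, [[PLANET:$name]] orbits the Sun."),
]
MARKER = re.compile(r"\[\[(\w+):(.*?)\]\]")


def generate(record, rng):
    """Render one randomly chosen template and split it into text plus spans."""
    annotated = rng.choice(TEMPLATES).substitute(record)
    parts, entities, position, cursor = [], [], 0, 0
    for match in MARKER.finditer(annotated):
        # Copy the plain text between the previous marker and this one
        plain = annotated[cursor:match.start()]
        parts.append(plain)
        position += len(plain)
        # Record the entity span relative to the marker-free text
        label, value = match.groups()
        entities.append((position, position + len(value), label))
        parts.append(value)
        position += len(value)
        cursor = match.end()
    parts.append(annotated[cursor:])
    return {"text": "".join(parts), "annotation": annotated, "entities": entities}


rng = random.Random(42)  # a fixed seed keeps the random choice reproducible
result = generate({"name": "Mars", "diameter": 6779}, rng)
print(result["text"])
print(result["entities"])
```

Passing a seeded random generator is one way to reconcile variation with reproducibility: the same seed and the same record select the same template on every run.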

### Quickstart
```python
import json

from seanox_ai_nlp.synthetics import synthetics

# Load the structured input data from the JSON file
with open("synthetics-planets_en.json", encoding="utf-8") as file:
    records = json.load(file)

# Generate and print one result per record
for record in records:
    synthetic = synthetics(".", "synthetics_en_annotate.yaml", record)
    print(synthetic)
```

- [Usage](https://github.com/seanox/seanox-ai-nlp/blob/master/seanox_ai_nlp/synthetics/README.md#usage)
- [Integration in NLP Workflows](https://github.com/seanox/seanox-ai-nlp/blob/master/seanox_ai_nlp/synthetics/README.md#integration-in-nlp-workflows)
- [Downstream Processing with pandas](https://github.com/seanox/seanox-ai-nlp/blob/master/seanox_ai_nlp/synthetics/README.md#downstream-processing-with-pandas)

# Changes

## 1.3.0.1 20251009
BF: Release: Unwanted content in distribution (seanox_ai_nlp.whl / seanox_ai_nlp.gz)

## 1.3.0 20251001
BF: Python: Corrections/optimizations of dependencies  
BF: synthetics: Correction for empty templates / missing segments  
BF: synthetics: Consistent use of the parameter pattern for RegEx in spans  
CR: Python: Increased the requirement to Python 3.10 or higher  
CR: synthetics: Added schema and validation for template YAML  
CR: synthetics: Added custom filters for template rendering  
CR: synthetics: Template section span - regex added support for labels

[Read more](https://raw.githubusercontent.com/seanox/seanox-ai-nlp/refs/heads/master/CHANGES)

# Contact
[Issues](https://github.com/seanox/seanox-ai-nlp/issues)  
[Requests](https://github.com/seanox/seanox-ai-nlp/pulls)  
[Mail](https://seanox.com/contact)

            
