py4jrush

Name	py4jrush
Version	1.0.11
home_page	https://github.com/jianlins/py4jrush
Summary	A fast implementation of RuSH (Rule-based sentence Segmenter using Hashing).
upload_time	2025-09-15 15:26:13
author	Jianlin
requires_python	>=3.8
license	MIT License Copyright (c) 2020 Jianlin Shi Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
keywords	ner regex
requirements	loguru setuptools py4j install-jdk

            
# py4jrush

py4jrush is the Python interface to
[RuSH](https://github.com/jianlins/RuSH) (**Ru**le-based sentence
**S**egmenter using **H**ashing), which was originally developed in Java.
In contrast to the original PyRuSH, this version is implemented through py4j.

RuSH is an efficient, reliable, and easily adaptable rule-based sentence
segmentation solution. It is specifically designed to handle the
telegraphic text found in clinical notes. It leverages a nested hash
table to process rules simultaneously, which reduces the impact
of rule-base growth on execution time and eliminates the effect of
rule order on accuracy.

If you wish to cite RuSH in a publication, please use:

Jianlin Shi; Danielle Mowery; Kristina M. Doing-Harris; John F.
Hurdle. RuSH: a Rule-based Segmentation Tool Using Hashing for Extremely
Accurate Sentence Segmentation of Clinical Text. AMIA Annu Symp Proc.
2016: 1587.

The full text can be found
[here](https://knowledge.amia.org/amia-63300-1.3360278/t005-1.3362920/f005-1.3362921/2495498-1.3363244/2495498-1.3363247?timeStamp=1479743941616).

## Installation

When you run `pip install py4jrush`, the installer will automatically check for Java JDK 8. If JDK 8 is not found, it will use the [install-jdk](https://pypi.org/project/install-jdk/) Python package to download and install JDK 8 for you. No manual Java setup is required.

```bash
pip install py4jrush
```

## Development and Release Process

This project uses automated workflows for building, testing, and releasing. Here's how it works:

### Automated Release Workflow

The project has an automated release workflow that:
1. **Automatically bumps version numbers** using semantic versioning
2. **Builds and tests** the package across multiple environments  
3. **Publishes to PyPI** with proper validation
4. **Creates GitHub releases** with built artifacts

### How to Release

#### Option 1: Manual Workflow Trigger (Recommended)
1. Go to the **Actions** tab in GitHub
2. Select **Build and Publish Python Package** workflow
3. Click **Run workflow** and configure:
   - **Publish to PyPI**: `true` (to actually publish)
   - **Create GitHub release**: `true` (to create a release)
   - **Version bump type**: Choose `patch`, `minor`, or `major`

#### Option 2: GitHub Release Creation
Create a new release in the GitHub UI, and the workflow will automatically trigger.

### Version Bumping Strategy
- **patch** (1.0.10 → 1.0.11): Bug fixes, small changes
- **minor** (1.0.10 → 1.1.0): New features, backwards compatible
- **major** (1.0.10 → 2.0.0): Breaking changes
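
The bump rules above can be sketched in a few lines of Python. This is only an illustration of semantic version bumping, not the code the release workflow actually runs; the function name `bump_version` is an assumption made for this example.

```python
def bump_version(version: str, part: str) -> str:
    """Return `version` (MAJOR.MINOR.PATCH) with the requested part bumped.

    Hypothetical helper for illustration; not part of py4jrush or its workflow.
    """
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        # Breaking change: reset minor and patch.
        return f"{major + 1}.0.0"
    if part == "minor":
        # Backwards-compatible feature: reset patch.
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        # Bug fix or small change.
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown bump type: {part!r}")


for bump_type in ("patch", "minor", "major"):
    print(bump_type, bump_version("1.0.10", bump_type))
```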

### Workflow Features
- ✅ **Automatic version detection** from current VERSION file
- ✅ **Smart version bumping** prevents conflicts with existing releases
- ✅ **Multi-environment testing** (Python 3.9, 3.10, 3.11 on Ubuntu, Windows, macOS)
- ✅ **PyPI validation** with twine before publishing
- ✅ **Automatic VERSION file updates** after successful release 
- ✅ **GitHub release creation** with built artifacts

## How to use

A standalone `RuSH` class is available for direct use in your code.
Since 1.0.4, PyRuSH has adopted the spaCy 3.x API to initialize a component.

### Basic Usage

```python
from py4jrush import RuSH

# Clinical text with line breaks embedded inside sentences
input_str = (
    "The patient was admitted on 03/26/08\n and was started on IV antibiotics elevation"
    ", was also counseled to minimizing the cigarette smoking. The patient had edema\n\n"
    "\n of his bilateral lower extremities. The hospital consult was also obtained to "
    "address edema issue question was related to his liver hepatitis C. Hospital consult"
    " was obtained. This included an ultrasound of his abdomen, which showed just mild "
    "cirrhosis. "
)

# Load the segmentation rules from a TSV file
rush = RuSH('../conf/rush_rules.tsv')

# Each returned span carries begin/end character offsets into input_str
sentences = rush.segToSentenceSpans(input_str)
for sentence in sentences:
    print("Sentence({0}-{1}):\t>{2}<".format(sentence.begin, sentence.end, input_str[sentence.begin:sentence.end]))
```
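
Because the spans are character offsets into the original string, a sentence can still contain the line breaks of the source text. The sketch below shows one way to turn spans into clean sentence strings. The `Span` dataclass is a stand-in defined only for this example so that it runs without Java; it assumes nothing about the real span objects beyond the `begin` and `end` attributes used above. `spans_to_sentences` is likewise a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Span:
    """Stand-in for a sentence span; only `begin` and `end` are assumed."""
    begin: int
    end: int


def spans_to_sentences(text: str, spans: Iterable[Span]) -> List[str]:
    """Slice `text` by each span and collapse internal whitespace runs."""
    return [" ".join(text[span.begin:span.end].split()) for span in spans]


text = "The patient had edema\n\n\n of his legs. Consult obtained."
boundary = text.index(" Consult")  # end of the first sentence
spans = [Span(0, boundary), Span(boundary + 1, len(text))]
print(spans_to_sentences(text, spans))
```

The same helper should work on the result of `rush.segToSentenceSpans(input_str)`, since it only reads `begin` and `end`.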

### Maximum Sentence Length Control

The RuSH class now supports a `max_sentence_length` parameter that automatically splits sentences exceeding a specified character limit. This is particularly useful for downstream NLP tasks that have input length constraints.

```python
from py4jrush import RuSH

# Initialize RuSH with maximum sentence length of 100 characters
rush = RuSH('../conf/rush_rules.tsv', max_sentence_length=100, enable_logger=True)

# Long sentence that will be automatically split
long_text = "This is a very long clinical sentence that contains multiple medical concepts and exceeds the maximum length limit, so it will be intelligently split into smaller segments at appropriate boundaries like whitespace or punctuation marks to ensure each resulting sentence stays within the specified character limit."

sentences = rush.segToSentenceSpans(long_text)
for i, sentence in enumerate(sentences):
    segment = long_text[sentence.begin:sentence.end]
    print(f"Sentence {i} (length {len(segment)}): {segment}")
```

#### Splitting Strategy

When `max_sentence_length` is specified, RuSH uses an intelligent splitting strategy:

1. **Whitespace preferred**: Splits at word boundaries when possible
2. **Punctuation fallback**: Splits at punctuation marks (`,`, `;`, `.`, etc.) when no whitespace is available  
3. **Force split**: Hard split at character limit when no good split points exist
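
The three-step fallback above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code py4jrush runs: the function name `split_long_span`, the punctuation set, and the choice to cut just after the last candidate character inside the window are all assumptions.

```python
from typing import List, Tuple

PUNCTUATION = ",;.:!?"  # assumed set; the actual set used by py4jrush may differ


def split_long_span(text: str, begin: int, end: int, max_len: int) -> List[Tuple[int, int]]:
    """Split text[begin:end] into (begin, end) pieces of at most max_len characters."""
    if max_len < 1:
        raise ValueError("max_len must be positive")
    pieces = []
    while end - begin > max_len:
        window_end = begin + max_len
        cut = -1
        # 1. Whitespace preferred: cut just after the last whitespace in the window.
        for i in range(window_end - 1, begin, -1):
            if text[i].isspace():
                cut = i + 1
                break
        # 2. Punctuation fallback: cut just after the last punctuation mark.
        if cut == -1:
            for i in range(window_end - 1, begin, -1):
                if text[i] in PUNCTUATION:
                    cut = i + 1
                    break
        # 3. Force split: no good split point, cut at the character limit.
        if cut == -1:
            cut = window_end
        pieces.append((begin, cut))
        begin = cut
    pieces.append((begin, end))
    return pieces


sample = "aaaa bbbb cccc"
for piece_begin, piece_end in split_long_span(sample, 0, len(sample), 10):
    print(piece_begin, piece_end, repr(sample[piece_begin:piece_end]))
```

Every piece is a contiguous slice of the input, so concatenating the pieces reproduces the original span exactly.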

#### Parameters

- `max_sentence_length` (Optional[int]): Maximum allowed sentence length in characters. If `None` (default), no length-based splitting is performed.
- `enable_logger` (bool): When `True`, enables detailed logging of the splitting process.

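#### Example Output

The output takes the following form; the exact split points and reported lengths depend on the rules and the input:
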
```
Sentence 0 (length 98): This is a very long clinical sentence that contains multiple medical concepts and exceeds the 
Sentence 1 (length 97): maximum length limit, so it will be intelligently split into smaller segments at appropriate 
Sentence 2 (length 87): boundaries like whitespace or punctuation marks to ensure each resulting sentence 
Sentence 3 (length 58): stays within the specified character limit.
```

### Constructor Parameters

The RuSH class constructor accepts the following parameters:

```python
RuSH(rules='', min_sent_chars=5, enable_logger=False, py4jar=None, rushjar=None, java_path='java', max_sentence_length=None)
```

- **`rules`** (Union[str, List]): Path to segmentation rules file or list of rule strings
- **`min_sent_chars`** (int): Minimum sentence length in characters (default: 5)
- **`enable_logger`** (bool): Whether to enable logging (default: False)
- **`py4jar`** (Optional[str]): Path to py4j JAR file (auto-detected if None)
- **`rushjar`** (Optional[str]): Path to RuSH JAR file (auto-detected if None)  
- **`java_path`** (str): Path to Java executable (default: 'java')
- **`max_sentence_length`** (Optional[int]): Maximum sentence length in characters (default: None - no splitting)
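
As an example of supplying `java_path` explicitly, the sketch below derives it from the `JAVA_HOME` environment variable and otherwise keeps the documented default `'java'`. The helper `resolve_java_path` is hypothetical and not part of py4jrush; only the `java_path` parameter comes from the signature above.

```python
import os
from typing import Mapping, Optional


def resolve_java_path(env: Optional[Mapping[str, str]] = None) -> str:
    """Return a value suitable for the `java_path` argument (hypothetical helper)."""
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        # Keep the documented default, which is looked up on PATH.
        return "java"
    executable = "java.exe" if os.name == "nt" else "java"
    return os.path.join(java_home, "bin", executable)


print(resolve_java_path())

# Usage, shown as a comment because it needs py4jrush, Java, and a rules file:
# from py4jrush import RuSH
# rush = RuSH('../conf/rush_rules.tsv', java_path=resolve_java_path())
```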
