# sentok
**Sentok** is a fast, configurable Python package for splitting paragraphs into sentences. It offers customizable thresholds for adaptive sentence segmentation and is built on top of pandas for performance and easy adjustment. It can return either a list of sentences or a DataFrame with probability columns.
## Features
- **High Performance**: Efficient handling of large texts.
- **Dynamic Configuration**: Customizable parameters and regular expressions.
- **Simple Logic**: Easy to understand and extend.
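To make the idea of a customizable threshold concrete, here is a minimal, standard-library sketch of threshold-based sentence splitting. It is an illustration only and is not sentok's implementation: the `WEIGHTS` table, the three features, and the functions `boundary_score` and `split_sentences` are all hypothetical names chosen for this example.

```python
# Illustrative only: these weights and features are assumptions, not sentok's.
WEIGHTS = {
    "next_is_upper": 0.4,  # the following word starts with an uppercase letter
    "prev_not_short": 0.4,  # the word before the mark has more than 3 letters
    "prev_no_inner_dot": 0.2,  # no inner periods, unlike "e.g." or "U.S."
}


def boundary_score(prev_word: str, next_word: str) -> float:
    """Score how likely it is that a sentence ends after prev_word."""
    core = prev_word.rstrip(".!?")
    score = 0.0
    if next_word[:1].isupper():
        score += WEIGHTS["next_is_upper"]
    if len(core) > 3:
        score += WEIGHTS["prev_not_short"]
    if "." not in core:
        score += WEIGHTS["prev_no_inner_dot"]
    return score


def split_sentences(text: str, threshold: float = 0.65) -> list[str]:
    """Split text wherever a candidate boundary scores at least threshold."""
    words = text.split()
    sentences, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        # Only words ending in terminal punctuation are candidate boundaries.
        is_candidate = i + 1 < len(words) and word[-1] in ".!?"
        if is_candidate and boundary_score(word, words[i + 1]) >= threshold:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

In this sketch, lowering the threshold accepts weaker boundaries (for example, the period after a short abbreviation such as "Dr."), while raising it keeps only the boundaries that most features agree on.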
## Installation
### Via pip
To install the latest release from PyPI, use:
```bash
pip install sentok
```
Or, to install the latest version directly from the GitHub repository:
```bash
pip install git+https://github.com/kothiyarajesh/sentok.git
```
### Building from Source
1. Clone the repository:
   ```bash
   git clone https://github.com/kothiyarajesh/sentok.git
   ```
2. Navigate to the project directory:
   ```bash
   cd sentok
   ```
3. Install the package:
   ```bash
   python setup.py install
   ```
## Usage
### Python Script
Here’s a simple example of how to use the `sentok` library in a Python script:
```python
import sentok
# Display current weights used by the tokenizer
# Uncomment the following line to view the current weights in use:
# print(sentok.get_weights())
# Adjust weights only if necessary for specific use cases
# For example, updating the set of start characters:
# sentok.set_weights({'start_chars': list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')})
# Sample text for sentence tokenization
text = """Natural language processing (NLP) is a captivating domain that merges computer science, artificial intelligence, and linguistics. It empowers computers to comprehend, interpret, and produce human language in a manner that is both useful and insightful. NLP finds application in various fields, such as text analysis, speech recognition, and machine translation. For example, advanced language models like GPT-3 have showcased exceptional skills in generating text that resembles human writing and in answering queries. As technology progresses, NLP continues to advance, enhancing its precision and expanding its scope of applications."""
# Tokenize the sample text into sentences with a threshold of 0.64
# (the default is 0.65). Adjust the threshold as needed based on your text's quality.
sentences = sentok.sent_tokenize(text, 0.64)
# Print each extracted sentence
for sentence in sentences:
    print('->', sentence)
# Print the total number of sentences extracted
print('Total Sentences:', len(sentences))
# Obtain a DataFrame with tokenization features for further analysis or model training:
df = sentok.get_sent_tokenize_df(text)
print(df)
```
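When tuning the threshold, it can help to see how the sentence count changes across several values. The helper below is a small sketch for that purpose; `sentence_counts` is a hypothetical name, not part of sentok. It accepts any tokenizer callable that takes `(text, threshold)` and returns a list, which matches how `sentok.sent_tokenize` is called in the example above.

```python
from typing import Callable, Iterable


def sentence_counts(
    tokenize: Callable[[str, float], list],
    text: str,
    thresholds: Iterable[float],
) -> dict:
    """Map each threshold to the number of sentences the tokenizer returns."""
    return {t: len(tokenize(text, t)) for t in thresholds}
```

With sentok installed, a call such as `sentence_counts(sentok.sent_tokenize, text, [0.5, 0.65, 0.8])` would return one count per threshold, which gives a quick way to spot values that over-split or under-split your text.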
## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
Raw data
{
"_id": null,
"home_page": "https://github.com/kothiyarajesh/sentok",
"name": "sentok",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "sentence tokenization, natural language processing, text segmentation",
"author": "Rajesh Kothiya",
"author_email": "rkahir2222@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/e4/d0/f9c7853ae615d8c7377a6c63f273ff896128b1328e8ff926a82cece16fec/sentok-0.1.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "A fast and dynamic python package for converting paragraphs into sentences with customizable thresholds for adaptive sentence segmentation.",
"version": "0.1.2",
"project_urls": {
"Bug Reports": "https://github.com/kothiyarajesh/sentok/issues",
"Homepage": "https://github.com/kothiyarajesh/sentok",
"Source": "https://github.com/kothiyarajesh/sentok"
},
"split_keywords": [
"sentence tokenization",
" natural language processing",
" text segmentation"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "bb8d218c4e1f2879ad17e874bba42b77c3c9f02076052c273efe2b61201e3d72",
"md5": "945f3dbe9300fec150fa59bb2c803069",
"sha256": "8b279f5be870199d89e3488c7487606ff2ef752c21d217308fc254c9a36c5e3c"
},
"downloads": -1,
"filename": "sentok-0.1.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "945f3dbe9300fec150fa59bb2c803069",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 8223,
"upload_time": "2024-08-29T06:54:22",
"upload_time_iso_8601": "2024-08-29T06:54:22.451409Z",
"url": "https://files.pythonhosted.org/packages/bb/8d/218c4e1f2879ad17e874bba42b77c3c9f02076052c273efe2b61201e3d72/sentok-0.1.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "e4d0f9c7853ae615d8c7377a6c63f273ff896128b1328e8ff926a82cece16fec",
"md5": "6c9849ef98e7d33bcd96d0dcae86d412",
"sha256": "4e1a2e3e2a645e6c8194cd86387bc550c8f5980cf067f8a15dea9ba2ae95af46"
},
"downloads": -1,
"filename": "sentok-0.1.2.tar.gz",
"has_sig": false,
"md5_digest": "6c9849ef98e7d33bcd96d0dcae86d412",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 8072,
"upload_time": "2024-08-29T06:54:24",
"upload_time_iso_8601": "2024-08-29T06:54:24.311713Z",
"url": "https://files.pythonhosted.org/packages/e4/d0/f9c7853ae615d8c7377a6c63f273ff896128b1328e8ff926a82cece16fec/sentok-0.1.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-08-29 06:54:24",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "kothiyarajesh",
"github_project": "sentok",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "numpy",
"specs": [
[
"==",
"1.23.0"
]
]
},
{
"name": "pandas",
"specs": [
[
"==",
"2.0.3"
]
]
}
],
"lcname": "sentok"
}