# Jina Segmenter
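A text segmentation tool built on the Jina AI API. It splits long text into suitably sized chunks while keeping each chunk semantically coherent.
## Features
- Semantics-aware text segmentation that keeps related content together
- Automatic calculation and tuning of chunk sizes
- Configurable maximum chunk size
- Token count returned for every chunk
- Simple, easy-to-use API
## Installation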
```bash
pip install jina-segmenter
```
## Usage
First, set your Jina AI API key:
```python
import os
os.environ['JINA_API_KEY'] = 'your_jina_api_key'
```
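A missing or blank key is easy to overlook until the first request fails. The sketch below checks for the key up front. The helper name `require_api_key` is hypothetical and not part of `jina-segmenter`; it only assumes the `JINA_API_KEY` variable named above.

```python
import os


def require_api_key(env=os.environ, name="JINA_API_KEY"):
    """Return the API key from `env`, or raise if it is missing or blank.

    Hypothetical helper for illustration; not provided by jina-segmenter.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; set it before calling segment_text")
    return key


# Demonstrated with an explicit mapping so the example does not depend on
# the real environment.
print(require_api_key({"JINA_API_KEY": "your_jina_api_key"}))
```

Calling `require_api_key()` with no arguments reads from the real process environment instead.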
Then call the segmentation function:
```python
from jina_segmenter import segment_text

text = "Your long text..."
chunks = segment_text(text)  # default maximum chunk size is 1500 tokens

# Inspect the resulting chunks
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} (tokens: {chunk['tokens']}):")
    print(chunk['text'])
    print("-" * 30)
```
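Because each chunk is a dict with `text` and `tokens` keys, the result can be summarized without further API calls. The following is a minimal sketch; `summarize_chunks` is a hypothetical helper, and the sample list is hand-written data shaped like the output shown above, not real API output.

```python
def summarize_chunks(chunks):
    """Return the chunk count, total tokens, and largest chunk size.

    Assumes every chunk is a dict with an integer 'tokens' entry, as in the
    usage example; hypothetical helper, not part of jina-segmenter.
    """
    token_counts = [chunk["tokens"] for chunk in chunks]
    return {
        "count": len(token_counts),
        "total_tokens": sum(token_counts),
        "max_tokens": max(token_counts, default=0),  # 0 for an empty list
    }


# Hand-written sample in the same shape as segment_text's result.
sample = [
    {"text": "First segment.", "tokens": 4},
    {"text": "Second, longer segment.", "tokens": 6},
]
print(summarize_chunks(sample))
# {'count': 2, 'total_tokens': 10, 'max_tokens': 6}
```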
You can also set a custom maximum chunk size:
```python
chunks = segment_text(text, max_chunk_size=1000)
```
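This README does not say whether `max_chunk_size` is a strict upper bound, so it may be worth verifying the returned token counts yourself. The sketch below flags chunks above a limit; `oversized_chunks` is a hypothetical helper, and the sample data is invented for illustration.

```python
def oversized_chunks(chunks, max_chunk_size):
    """Return (position, tokens) pairs for chunks above the limit.

    Positions are 1-based. Hypothetical helper, not part of jina-segmenter;
    it assumes each chunk is a dict with an integer 'tokens' entry.
    """
    return [
        (position, chunk["tokens"])
        for position, chunk in enumerate(chunks, 1)
        if chunk["tokens"] > max_chunk_size
    ]


# Invented sample in the same shape as segment_text's result.
sample_chunks = [
    {"text": "a", "tokens": 800},
    {"text": "b", "tokens": 1200},
    {"text": "c", "tokens": 1000},
]
print(oversized_chunks(sample_chunks, max_chunk_size=1000))
# [(2, 1200)]
```

A chunk exactly at the limit is treated as acceptable here; use `>=` instead if you want a stricter check.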
## Getting an API Key
1. Visit [Jina AI](https://jina.ai/)
2. Sign up and log in to your account
3. Create a new API key in the console
## Dependencies
- Python >= 3.6
- requests >= 2.31.0
## License
MIT License
## Author
WSY (wangshuyue@gmail.com)