jina-segmenter


- Name: jina-segmenter
- Version: 0.1.0
- Home page: https://github.com/wsy/jina-segmenter
- Summary: A text segmentation tool using Jina AI API
- Upload time: 2024-12-04 01:15:37
- Maintainer: None
- Docs URL: None
- Author: WSY
- Requires Python: >=3.6
- License: None
- Keywords: jina, text segmentation, nlp, ai
- Requirements: No requirements were recorded.

# Jina Segmenter

An intelligent text segmentation tool built on the Jina AI API. It splits long text into appropriately sized chunks while preserving semantic integrity.

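## Features

- Intelligent text segmentation that preserves semantic integrity
- Automatic calculation and optimization of chunk size
- Customizable maximum chunk size
- Returns the token count of each chunk
- Simple, easy-to-use API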

## Installation

```bash
pip install jina-segmenter
```

## Usage

First, set your Jina AI API key:

```python
import os
os.environ['JINA_API_KEY'] = 'your_jina_api_key'
```
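
Hard-coding the key in source makes it easy to leak, so it can help to read it from the environment and fail early when it is missing. The following is a minimal sketch that assumes only that the package reads `JINA_API_KEY` from the environment, as shown above. The helper `require_api_key` is hypothetical and not part of jina-segmenter.

```python
import os


def require_api_key(env=None):
    """Return the Jina API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; it is not part of jina-segmenter.
    """
    env = os.environ if env is None else env
    key = env.get("JINA_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "JINA_API_KEY is not set; export it before calling segment_text"
        )
    return key
```

Calling `require_api_key()` once at startup surfaces a missing key before any text is sent for segmentation.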

Then you can call the segmentation function:

```python
from jina_segmenter import segment_text

text = "Your long text..."
chunks = segment_text(text)  # the default maximum chunk size is 1500 tokens

# Inspect the resulting chunks
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} (tokens: {chunk['tokens']}):")
    print(chunk['text'])
    print("-" * 30)
```
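
Because each chunk carries its own token count, simple statistics are easy to derive. The sketch below assumes only the chunk shape used in the loop above (a list of dicts with `text` and `tokens` keys). The helper `summarize_chunks` and the sample data are hypothetical, and the sample is hand-written rather than real API output.

```python
def summarize_chunks(chunks):
    """Return (chunk count, total tokens, token count of the largest chunk)."""
    token_counts = [chunk["tokens"] for chunk in chunks]
    return len(token_counts), sum(token_counts), max(token_counts, default=0)


# Hand-written sample in the shape shown above; real data would come
# from segment_text(text).
sample = [
    {"text": "First paragraph.", "tokens": 4},
    {"text": "Second, longer paragraph.", "tokens": 6},
]
count, total, largest = summarize_chunks(sample)
print(f"{count} chunks, {total} tokens in total, largest chunk has {largest} tokens")
```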

You can also set a custom maximum chunk size:

```python
chunks = segment_text(text, max_chunk_size=1000)
```
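
This page does not state whether the limit is strictly enforced for every chunk, so a quick check on the returned token counts can be useful. The sketch below assumes the same chunk shape as above. `find_oversized` is a hypothetical helper, and the sample data is hand-written.

```python
def find_oversized(chunks, max_chunk_size):
    """Return the 1-based positions of chunks whose token count exceeds the limit."""
    return [
        position
        for position, chunk in enumerate(chunks, 1)
        if chunk["tokens"] > max_chunk_size
    ]


# Hand-written sample; real data would come from
# segment_text(text, max_chunk_size=1000).
sample = [{"text": "a", "tokens": 900}, {"text": "b", "tokens": 1200}]
print(find_oversized(sample, max_chunk_size=1000))  # prints [2]
```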

## Getting an API Key

1. Visit [Jina AI](https://jina.ai/)
2. Sign up and log in to your account
3. Create a new API key in the console

## Dependencies

- Python >= 3.6
- requests >= 2.31.0

## License

MIT License

## Author

WSY (wangshuyue@gmail.com)

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/wsy/jina-segmenter",
    "name": "jina-segmenter",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": null,
    "keywords": "jina, text segmentation, nlp, ai",
    "author": "WSY",
    "author_email": "wangshuyue@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/d9/83/7ea286948794bb049a73cbc4da798bed685236537c9ba782ae0052139333/jina_segmenter-0.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "A text segmentation tool using Jina AI API",
    "version": "0.1.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/wsy/jina-segmenter/issues",
        "Homepage": "https://github.com/wsy/jina-segmenter"
    },
    "split_keywords": [
        "jina",
        " text segmentation",
        " nlp",
        " ai"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "6c68b821c602296cec211b9f296089c80e86060e36d42e2f42cb0700b982354b",
                "md5": "4138e965422845e0472f1c09234ec73b",
                "sha256": "84603e4b9f3fd4c8b24f6a626605474f1d9d1ff022caad482a4fb502a0b733f8"
            },
            "downloads": -1,
            "filename": "jina_segmenter-0.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "4138e965422845e0472f1c09234ec73b",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 4437,
            "upload_time": "2024-12-04T01:15:35",
            "upload_time_iso_8601": "2024-12-04T01:15:35.210788Z",
            "url": "https://files.pythonhosted.org/packages/6c/68/b821c602296cec211b9f296089c80e86060e36d42e2f42cb0700b982354b/jina_segmenter-0.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "d9837ea286948794bb049a73cbc4da798bed685236537c9ba782ae0052139333",
                "md5": "7d3501ddcd77365814b87bd7ed1bed3a",
                "sha256": "e6ed6fd18ad9c8b4d939cc9fcac899df8261ebc92c0b4f2e6bed2f73fad2e97c"
            },
            "downloads": -1,
            "filename": "jina_segmenter-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "7d3501ddcd77365814b87bd7ed1bed3a",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 3887,
            "upload_time": "2024-12-04T01:15:37",
            "upload_time_iso_8601": "2024-12-04T01:15:37.250221Z",
            "url": "https://files.pythonhosted.org/packages/d9/83/7ea286948794bb049a73cbc4da798bed685236537c9ba782ae0052139333/jina_segmenter-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-04 01:15:37",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "wsy",
    "github_project": "jina-segmenter",
    "github_not_found": true,
    "lcname": "jina-segmenter"
}
        