split-markdown4gpt


Name: split-markdown4gpt
Version: 1.0.9
Home page: https://github.com/twardoch/split-markdown4gpt
Summary: A Python tool for splitting large Markdown files into smaller sections based on a specified token limit. This is particularly useful for processing large Markdown files with GPT models, as it allows the models to handle the data in manageable chunks.
Upload time: 2023-06-19 18:19:14
Author: Adam Twardoch
Requires Python: >=3.10
License: Apache-2.0
Keywords: python, nlp, markdown, natural-language-processing, text-analysis, openai, text-summarization, summarization, text-processing, gpt, data-preprocessing, mistletoe, split-text, text-tokenization, openai-gpt, gpt-3, gpt-4, gpt-35-turbo, gpt-35-turbo-16k, markdown-processing
Requirements: No requirements were recorded.
# split_markdown4gpt

`split_markdown4gpt` is a Python tool designed to split large Markdown files into smaller sections based on a specified token limit. This is particularly useful for processing large Markdown files with GPT models, as it allows the models to handle the data in manageable chunks.

_**Version 1.0.9** (2023-06-19)_

## Installation

You can install `split_markdown4gpt` via pip:

```bash
pip install split_markdown4gpt
```

## CLI Usage

After installation, you can use the `mdsplit4gpt` command to split a Markdown file. Here's the basic syntax:

```bash
mdsplit4gpt path_to_your_file.md --model gpt-3.5-turbo --limit 4096 --separator "=== SPLIT ==="
```

This command will split the Markdown file at `path_to_your_file.md` into sections, each containing no more than 4096 tokens (as counted by the `gpt-3.5-turbo` model). The sections will be separated by `=== SPLIT ===`.
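Because the CLI writes all sections to standard output as one stream, you may want to save each section to its own file. Here is a minimal post-processing sketch using `awk`, assuming the separator is emitted on its own line; the `section_N.md` filenames are an illustrative choice, not something the tool produces:

```shell
# Split mdsplit4gpt output into one file per section.
# Assumes "=== SPLIT ===" appears on a line by itself between sections.
mdsplit4gpt path_to_your_file.md --model gpt-3.5-turbo --limit 4096 \
  --separator "=== SPLIT ===" \
  | awk 'BEGIN { n = 1 }
         /^=== SPLIT ===$/ { n++; next }
         { print > ("section_" n ".md") }'
```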

All CLI options:

```
NAME
    mdsplit4gpt - Splits a Markdown file into sections according to GPT token size limits.

SYNOPSIS
    mdsplit4gpt MD_PATH <flags>

DESCRIPTION
    This tool loads a Markdown file, and splits its content into sections
    that are within the specified token size limit using the desired GPT tokenizing model. The resulting
    sections are then concatenated using the specified separator and returned as a single string.

POSITIONAL ARGUMENTS
    MD_PATH
        Type: Union
        The path of the source Markdown file to be split.

FLAGS
    -m, --model=MODEL
        Type: str
        Default: 'gpt-3.5-turbo'
        The GPT tokenizer model to use for calculating token sizes. Defaults to "gpt-3.5-turbo".
    -l, --limit=LIMIT
        Type: Optional[int]
        Default: None
        The maximum number of GPT tokens allowed per section. Defaults to the model's maximum tokens.
    -s, --separator=SEPARATOR
        Type: str
        Default: '=== SPLIT ==='
        The string used to separate sections in the output. Defaults to "=== SPLIT ===".
```

## Python Usage

You can also use `split_markdown4gpt` in your Python code. Here's a basic example:

```python
from split_markdown4gpt import split

sections = split("path_to_your_file.md", model="gpt-3.5-turbo", limit=4096)
for section in sections:
    print(section)
```

This code does the same thing as the CLI command above, but in Python.
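If you capture the CLI output instead (a single string with sections joined by the separator), the individual sections can be recovered with a plain string split. A stdlib-only sketch; the `joined` sample below is illustrative, not real tool output:

```python
# Recover individual sections from mdsplit4gpt's joined CLI output.
SEPARATOR = "=== SPLIT ==="

joined = "First section text\n=== SPLIT ===\nSecond section text"

# Split on the separator and drop surrounding whitespace and empty chunks.
sections = [part.strip() for part in joined.split(SEPARATOR) if part.strip()]

print(len(sections))  # 2
```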

- **[See the API documentation](https://twardoch.github.io/split-markdown4gpt/API.html)** for more advanced usage

## How it Works

`split_markdown4gpt` works by tokenizing the input Markdown file using the specified GPT model's tokenizer (default is `gpt-3.5-turbo`). It then splits the file into sections, each containing no more than the specified token limit.

The splitting process respects the structure of the Markdown file. It will not split a section (as defined by Markdown headings) across multiple output sections unless the section itself is longer than the token limit. In that case, it will split the section at the sentence level.
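The heading-based packing step can be illustrated with a simplified sketch. This is not the library's implementation: it uses a naive whitespace token count in place of tiktoken, splits on ATX headings only, and omits the sentence-level fallback:

```python
import re

def count_tokens(text: str) -> int:
    # Stand-in for a real GPT tokenizer (the tool itself uses tiktoken).
    return len(text.split())

def pack_sections(markdown: str, limit: int) -> list[str]:
    # Break the document before each ATX heading, then greedily pack
    # consecutive blocks into chunks that stay within the token limit.
    blocks = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks: list[str] = []
    current = ""
    for block in blocks:
        if not block:
            continue
        if current and count_tokens(current + block) > limit:
            chunks.append(current)
            current = block
        else:
            current += block
    if current:
        chunks.append(current)
    return chunks
```

Note that when a single heading section alone exceeds the limit, the real tool falls back to splitting at sentence boundaries (via `syntok`); that case is left out here for brevity.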

The tool uses several libraries to accomplish this:

- `tiktoken` for tokenizing the text according to the GPT model's rules.
- `fire` for creating the CLI.
- `frontmatter` for parsing the Markdown file's front matter (metadata at the start of the file).
- `mistletoe` for parsing the Markdown file into a syntax tree.
- `syntok` for splitting the text into sentences.
- `regex` and `PyYAML` for various utility functions.

### Use Cases

`split_markdown4gpt` is particularly useful in scenarios where you need to process large Markdown files with GPT models. For instance:

- **Text Generation**: If you're using a GPT model to generate text based on a large Markdown file, you can use `split_markdown4gpt` to split the file into manageable sections. This allows the GPT model to process the file in chunks, preventing token overflow errors.

- **Data Preprocessing**: In machine learning projects, you often need to preprocess your data before feeding it into your model. If your data is in the form of large Markdown files, `split_markdown4gpt` can help you split these files into smaller sections based on the token limit of your model.

- **Document Analysis**: If you're analyzing large Markdown documents (e.g., extracting keywords, summarizing content), you can use `split_markdown4gpt` to break down the documents into smaller sections. This makes the analysis more manageable and efficient.

## Changelog

- v1.0.7: Switched the model for each section from namedtuple to dict
- v1.0.6: Fixes
- v1.0.0: Initial release

## Contributing

Contributions to `split_markdown4gpt` are welcome! Please open an issue or submit a pull request on the [GitHub repository](https://github.com/twardoch/split-markdown4gpt).

## License

- Copyright (c) 2023 [Adam Twardoch](./AUTHORS.md)
- Written with assistance from ChatGPT
- Licensed under the [Apache License 2.0](./LICENSE.txt)


            
