markitdown-pdf-separators

- Name: markitdown-pdf-separators
- Version: 0.4.1
- Summary: MarkItDown with PDF page separators - convert PDFs to Markdown with page boundary markers
- Upload time: 2025-07-22 14:44:59
- Requires Python: >=3.10
- Keywords: converter, document-conversion, markdown, page-separators, pdf

# MarkItDown with PDF Page Separators

> [!IMPORTANT]
> **MarkItDown with PDF Page Separators** is a Python package and command-line utility for converting various files to Markdown. It adds two PDF features: page separators and header/footer removal.
>
> This is a fork of the original [MarkItDown](https://github.com/microsoft/markitdown) project by Microsoft.

## 🆕 New Features

### PDF Page Separators
Convert PDFs to Markdown with clear page boundaries using the `add_page_separators` parameter:

```python
from markitdown import MarkItDown

md = MarkItDown(enable_plugins=False)

# With page separators (new feature!)
result = md.convert("document.pdf", add_page_separators=True)
# Output includes "---" between pages

# Without page separators (default behavior)
result = md.convert("document.pdf", add_page_separators=False)
# Output is continuous text
```

### PDF Header/Footer Removal
Remove headers and footers from PDFs using the `remove_headers_footers` parameter:

```python
from markitdown import MarkItDown

md = MarkItDown(enable_plugins=False)

# Remove headers and footers (automatically enables page separators)
result = md.convert("document.pdf", remove_headers_footers=True)
# Output excludes common headers/footers like page numbers, copyright notices, etc.

# Combine both features explicitly
result = md.convert(
    "document.pdf",
    add_page_separators=True,
    remove_headers_footers=True,
)
# Clean output with page separators and no headers/footers
```

## Installation

From PyPI:

```bash
# Quotes keep shells such as zsh from expanding the square brackets
pip install 'markitdown-pdf-separators[all]'
```

For header/footer removal functionality:

```bash
pip install 'markitdown-pdf-separators[pdf-clean]'
```

From source:

```bash
git clone https://github.com/Staceypy/markitdown-pdf-separators.git
cd markitdown-pdf-separators
pip install -e .
```

## Usage

### Command-Line

```bash
# Basic conversion
markitdown path-to-file.pdf > document.md

# With page separators (if supported by your version)
markitdown path-to-file.pdf --add-page-separators > document.md
```

### Python API

```python
from markitdown import MarkItDown

# Initialize
md = MarkItDown(enable_plugins=False)

# Convert various file types
result = md.convert("test.xlsx")
print(result.markdown)

# Convert PDF with page separators
result = md.convert("document.pdf", add_page_separators=True)
print(result.markdown)

# Convert PDF with header/footer removal
result = md.convert("document.pdf", remove_headers_footers=True)
print(result.markdown)

# Convert PDF with both features
result = md.convert(
    "document.pdf",
    add_page_separators=True,
    remove_headers_footers=True,
)
print(result.markdown)
```

## Supported File Types

- **PDF** (with page separators and header/footer removal) ✨
- Word documents (.docx)
- Excel spreadsheets (.xlsx, .xls)
- PowerPoint presentations (.pptx)
- HTML files
- Plain text files
- Images (with OCR)
- Audio files (with transcription)
- And many more...

## PDF Features

### Page Separators (`add_page_separators`)
- **Parameter**: `add_page_separators=True/False` (default: `False`)
- Extracts text page by page from PDFs
- Adds `---` (Markdown horizontal rule) between pages
- Maintains document structure and readability
- Works with multi-page documents
- Useful for maintaining page boundaries in the output
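
The page-by-page behaviour described above can be pictured with a small sketch. This is not the package's code: `join_pages` and `PAGE_SEPARATOR` are hypothetical names, and the spacing around `---` is an assumption taken from the example output in this README.

```python
# Assumed separator: a Markdown horizontal rule with one blank line on each side.
PAGE_SEPARATOR = "\n\n---\n\n"


def join_pages(pages: list[str]) -> str:
    """Join per-page text into one Markdown string with a rule between pages."""
    # Strip each page so the rule is surrounded by exactly one blank line.
    return PAGE_SEPARATOR.join(page.strip() for page in pages)


if __name__ == "__main__":
    print(join_pages(["Page 1 text\n", "  Page 2 text"]))
```

Joining with a fixed separator string keeps page boundaries easy to find again later, at the cost of ambiguity if a page already contains a `---` line of its own.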

### Header/Footer Removal (`remove_headers_footers`)
- **Parameter**: `remove_headers_footers=True/False` (default: `False`)
- **Note**: Automatically enables page separators when this feature is used
- Removes common headers and footers automatically
- Detects and removes up to 2 lines from the beginning and end of each page
- Identifies duplicate content across pages (headers/footers that repeat)
- Removes page numbers, copyright notices, ELI links, and other boilerplate text
- Works with most standard document formats
- Preserves main content while cleaning up formatting
- Requires the PyMuPDF dependency (`pip install 'markitdown-pdf-separators[pdf-clean]'`)
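
To make the heuristic in the list above concrete, here is a minimal sketch of one way to drop repeated lines near page edges. It is an illustration under stated assumptions, not the package's implementation (which relies on PyMuPDF): `strip_repeated_edges`, `max_lines`, and `min_pages` are hypothetical names, and digit normalisation is an assumed trick for catching page numbers.

```python
import re
from collections import Counter


def _key(line: str) -> str:
    # Normalise digits so "Page 1" and "Page 2" count as the same line.
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_edges(pages: list[str], max_lines: int = 2, min_pages: int = 2) -> list[str]:
    """Remove up to `max_lines` repeated lines from the top and bottom of each page."""
    # Blank lines are dropped for simplicity; a real tool would preserve them.
    split = [[ln for ln in page.splitlines() if ln.strip()] for page in pages]

    # Count on how many pages each normalised edge line appears.
    counts: Counter[str] = Counter()
    for lines in split:
        tail_start = max(len(lines) - max_lines, 0)
        edge = lines[:max_lines] + lines[tail_start:]
        counts.update({_key(ln) for ln in edge})
    repeated = {key for key, n in counts.items() if n >= min_pages}

    cleaned = []
    for lines in split:
        start, end = 0, len(lines)
        # Trim repeated lines from the top of the page.
        while start < min(max_lines, end) and _key(lines[start]) in repeated:
            start += 1
        # Trim repeated lines from the bottom of the page.
        removed = 0
        while end > start and removed < max_lines and _key(lines[end - 1]) in repeated:
            end -= 1
            removed += 1
        cleaned.append("\n".join(lines[start:end]))
    return cleaned
```

Trimming stops at the first edge line that is not repeated, so a body line that happens to recur deeper in the page is left alone.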

### Using Both Features Together
You can combine both features for clean, well-structured output:

```python
result = md.convert(
    "document.pdf",
    add_page_separators=True,
    remove_headers_footers=True,
)
```

This will:
1. Add page separators (`---`) between each page
2. Remove headers and footers from each page
3. Produce clean, readable Markdown output

### Performance
- Optimized for efficiency with large PDFs
- Minimal overhead compared to standard conversion
- Memory-efficient processing

### Example Output
```markdown
Page 1 content here (without headers/footers)...

---

Page 2 content here (without headers/footers)...

---

Page 3 content here (without headers/footers)...
```
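
Output in this shape can be split back into per-page chunks, for example to index pages separately. The sketch below assumes each separator is a line containing only `---`, as in the example above; `split_pages` is a hypothetical helper, not part of the package.

```python
import re

# A line holding only "---", optionally padded with spaces or tabs (assumed form).
_SEPARATOR_LINE = re.compile(r"^[ \t]*---[ \t]*$", flags=re.MULTILINE)


def split_pages(markdown: str) -> list[str]:
    """Split converted Markdown into a list of per-page strings."""
    return [chunk.strip() for chunk in _SEPARATOR_LINE.split(markdown)]


if __name__ == "__main__":
    text = "Page 1 content\n\n---\n\nPage 2 content"
    for number, page in enumerate(split_pages(text), start=1):
        print(number, page)
```

A page whose own content contains a bare `---` line (a horizontal rule or a setext heading underline) would be split too, so the page count from this helper is only approximate for such documents.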

## Development

This project is based on the original [MarkItDown](https://github.com/microsoft/markitdown) by Microsoft, with added PDF page separator and header/footer removal functionality.

### Key Changes:
- Added PDF page separator support
- Added PDF header/footer removal support
- Optimized performance for large documents
- Backward-compatible API

## License

MIT License - see LICENSE file for details.

## Acknowledgments

- Original MarkItDown project by Microsoft
- Based on work by Adam Fourney and contributors

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos is subject to and must follow
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos is subject to those third parties' policies.
            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "markitdown-pdf-separators",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "converter, document-conversion, markdown, page-separators, pdf",
    "author": null,
    "author_email": "Yu Pei <yu.pei00q@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/4a/5a/6f4f70f91acac1c6e5f234801d43866b1b14b7995e3247f9721ad26ca79a/markitdown_pdf_separators-0.4.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "MarkItDown with PDF page separators - convert PDFs to Markdown with page boundary markers",
    "version": "0.4.1",
    "project_urls": {
        "Documentation": "https://github.com/Staceypy/markitdown-pdf-separators#readme",
        "Issues": "https://github.com/Staceypy/markitdown-pdf-separators/issues",
        "Repository": "https://github.com/Staceypy/markitdown-pdf-separators",
        "Source": "https://github.com/Staceypy/markitdown-pdf-separators"
    },
    "split_keywords": [
        "converter",
        " document-conversion",
        " markdown",
        " page-separators",
        " pdf"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "de95abc59132ea7002a860c7d6b29c6913971a9b2f41c6bd8a1f1608071ee5b3",
                "md5": "7dd2e7169dff1cde01d0d021f85944c1",
                "sha256": "3ac3855a22930e511cecbfb78ca127199353abf27abce402625d2d92642e60d4"
            },
            "downloads": -1,
            "filename": "markitdown_pdf_separators-0.4.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "7dd2e7169dff1cde01d0d021f85944c1",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 63558,
            "upload_time": "2025-07-22T14:44:57",
            "upload_time_iso_8601": "2025-07-22T14:44:57.475951Z",
            "url": "https://files.pythonhosted.org/packages/de/95/abc59132ea7002a860c7d6b29c6913971a9b2f41c6bd8a1f1608071ee5b3/markitdown_pdf_separators-0.4.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "4a5a6f4f70f91acac1c6e5f234801d43866b1b14b7995e3247f9721ad26ca79a",
                "md5": "f82f56e50def882503b9b158d397cb7d",
                "sha256": "6f804c1dc39fe5531c84b221f90f14cbdb917cc8ddaef27726c77714f7fdb01d"
            },
            "downloads": -1,
            "filename": "markitdown_pdf_separators-0.4.1.tar.gz",
            "has_sig": false,
            "md5_digest": "f82f56e50def882503b9b158d397cb7d",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 45115,
            "upload_time": "2025-07-22T14:44:59",
            "upload_time_iso_8601": "2025-07-22T14:44:59.866427Z",
            "url": "https://files.pythonhosted.org/packages/4a/5a/6f4f70f91acac1c6e5f234801d43866b1b14b7995e3247f9721ad26ca79a/markitdown_pdf_separators-0.4.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-22 14:44:59",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Staceypy",
    "github_project": "markitdown-pdf-separators#readme",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "markitdown-pdf-separators"
}
        