# MarkItDown with PDF Page Separators
> [!IMPORTANT]
> **MarkItDown with PDF Page Separators** is a Python package and command-line utility for converting various files to Markdown, with the addition of PDF page separator and header/footer removal functionality.
>
> This is a fork of Microsoft's original [MarkItDown](https://github.com/microsoft/markitdown) project.
## 🆕 New Features
### PDF Page Separators
Convert PDFs to Markdown with clear page boundaries using the `add_page_separators` parameter:
```python
from markitdown import MarkItDown
md = MarkItDown(enable_plugins=False)
# With page separators (new feature!)
result = md.convert("document.pdf", add_page_separators=True)
# Output includes "---" between pages
# Without page separators (default behavior)
result = md.convert("document.pdf", add_page_separators=False)
# Output is continuous text
```
### PDF Header/Footer Removal
Remove headers and footers from PDFs using the `remove_headers_footers` parameter:
```python
from markitdown import MarkItDown
md = MarkItDown(enable_plugins=False)
# Remove headers and footers (automatically enables page separators)
result = md.convert("document.pdf", remove_headers_footers=True)
# Output excludes common headers/footers like page numbers, copyright notices, etc.
# Combine both features explicitly
result = md.convert("document.pdf",
                    add_page_separators=True,
                    remove_headers_footers=True)
# Clean output with page separators and no headers/footers
```
## Installation
From PyPI:
```bash
# Quote the extras so shells such as zsh do not treat the brackets as a glob
pip install "markitdown-pdf-separators[all]"
```
For header/footer removal functionality:
```bash
# Quote the extras so shells such as zsh do not treat the brackets as a glob
pip install "markitdown-pdf-separators[pdf-clean]"
```
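Header/footer removal relies on PyMuPDF, so it can help to check that the dependency is present before passing `remove_headers_footers=True`. The following is a minimal sketch, not part of this package: the helper name `has_pymupdf` is hypothetical, and it assumes PyMuPDF is importable under one of its usual module names, `pymupdf` or `fitz`.

```python
import importlib.util


def has_pymupdf() -> bool:
    """Return True if a PyMuPDF module appears to be installed.

    Hypothetical helper for illustration; it only checks whether a module
    named `pymupdf` (recent releases) or `fitz` (older releases) can be found.
    """
    return any(
        importlib.util.find_spec(name) is not None
        for name in ("pymupdf", "fitz")
    )


if __name__ == "__main__":
    print("PyMuPDF available:", has_pymupdf())
```

If the check returns `False`, installing the `pdf-clean` extra shown above should provide the dependency.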
From source:
```bash
git clone https://github.com/Staceypy/markitdown-pdf-separators.git
cd markitdown-pdf-separators
pip install -e .
```
## Usage
### Command-Line
```bash
# Basic conversion
markitdown path-to-file.pdf > document.md
# With page separators (if supported by your version)
markitdown path-to-file.pdf --add-page-separators > document.md
```
### Python API
```python
from markitdown import MarkItDown
# Initialize
md = MarkItDown(enable_plugins=False)
# Convert various file types
result = md.convert("test.xlsx")
print(result.markdown)
# Convert PDF with page separators
result = md.convert("document.pdf", add_page_separators=True)
print(result.markdown)
# Convert PDF with header/footer removal
result = md.convert("document.pdf", remove_headers_footers=True)
print(result.markdown)
# Convert PDF with both features
result = md.convert("document.pdf",
                    add_page_separators=True,
                    remove_headers_footers=True)
print(result.markdown)
```
## Supported File Types
- **PDF** (with page separators and header/footer removal) ✨
- Word documents (.docx)
- Excel spreadsheets (.xlsx, .xls)
- PowerPoint presentations (.pptx)
- HTML files
- Plain text files
- Images (with OCR)
- Audio files (with transcription)
- And many more...
## PDF Features
### Page Separators (`add_page_separators`)
- **Parameter**: `add_page_separators=True/False` (default: `False`)
- Extracts text page by page from PDFs
- Adds `---` (Markdown horizontal rule) between pages
- Maintains document structure and readability
- Works with multi-page documents
- Useful for maintaining page boundaries in the output
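
Because pages are delimited by `---`, the converted text can be split back into per-page chunks for downstream processing. This is a minimal sketch under stated assumptions: the helper `split_pages` is hypothetical (not part of this package), and it treats every line consisting only of `---` as a page boundary, so a horizontal rule that appears inside a page's own content would also cause a split.

```python
import re

# A page break is a line that contains only "---" (optionally padded with spaces/tabs).
_PAGE_BREAK = re.compile(r"^[ \t]*---[ \t]*$", flags=re.MULTILINE)


def split_pages(markdown: str) -> list[str]:
    """Split separator-delimited Markdown into a list of per-page strings."""
    return [page.strip() for page in _PAGE_BREAK.split(markdown)]


# Stand-in for `result.markdown` from a conversion with add_page_separators=True.
sample = "Page 1 text\n\n---\n\nPage 2 text\n\n---\n\nPage 3 text"
pages = split_pages(sample)
print(len(pages))  # 3
```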
### Header/Footer Removal (`remove_headers_footers`)
- **Parameter**: `remove_headers_footers=True/False` (default: `False`)
- **Note**: Automatically enables page separators when this feature is used
- Removes common headers and footers automatically
- Detects and removes up to 2 lines from the beginning and end of each page
- Identifies duplicate content across pages (headers/footers that repeat)
- Removes page numbers, copyright notices, ELI links, and other boilerplate text
- Works with most standard document formats
- Preserves main content while cleaning up formatting
- Requires PyMuPDF dependency (`pip install markitdown-pdf-separators[pdf-clean]`)
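
The idea of dropping repeated lines near page edges can be illustrated on plain text. The sketch below is not the package's implementation (which relies on PyMuPDF); the function `strip_repeated_edges` and its parameters are hypothetical. It assumes a header or footer is a line that sits within the first or last `max_lines` non-blank lines of a page and appears in that position on at least `min_pages` pages. It would not catch page numbers that differ from page to page.

```python
from collections import Counter


def strip_repeated_edges(pages: list[str], max_lines: int = 2, min_pages: int = 2) -> list[str]:
    """Remove lines near the top/bottom of each page that repeat across pages.

    Blank lines are dropped for simplicity, so paragraph spacing is not preserved.
    """
    split = [[ln.strip() for ln in page.splitlines() if ln.strip()] for page in pages]

    def is_edge(index: int, total: int) -> bool:
        # True for the first and last `max_lines` lines of a page.
        return index < max_lines or index >= total - max_lines

    # Count on how many pages each distinct line occurs in an edge position.
    counts: Counter[str] = Counter()
    for lines in split:
        counts.update({ln for i, ln in enumerate(lines) if is_edge(i, len(lines))})

    cleaned = []
    for lines in split:
        kept = [
            ln for i, ln in enumerate(lines)
            if not (is_edge(i, len(lines)) and counts[ln] >= min_pages)
        ]
        cleaned.append("\n".join(kept))
    return cleaned


example_pages = [
    "ACME Report\nIntro paragraph.\nMore text.\n(c) 2025 ACME",
    "ACME Report\nSecond page body.\nDetails here.\n(c) 2025 ACME",
]
cleaned_pages = strip_repeated_edges(example_pages)
print(cleaned_pages[0])
```

Counting each line once per page (via a set) keeps a line that merely repeats within a single page from being mistaken for a header.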
### Using Both Features Together
You can combine both features for clean, well-structured output:
```python
result = md.convert("document.pdf",
                    add_page_separators=True,
                    remove_headers_footers=True)
```
This will:
1. Add page separators (`---`) between each page
2. Remove headers and footers from each page
3. Produce clean, readable Markdown output
### Performance
- Optimized for efficiency with large PDFs
- Minimal overhead compared to standard conversion
- Memory-efficient processing
### Example Output
```markdown
Page 1 content here (without headers/footers)...
---
Page 2 content here (without headers/footers)...
---
Page 3 content here (without headers/footers)...
```
## Development
This project is based on the original [MarkItDown](https://github.com/microsoft/markitdown) by Microsoft, with added PDF page separator and header/footer removal functionality.
### Key Changes
- Added PDF page separator support
- Added PDF header/footer removal support
- Optimized performance for large documents
- Backward-compatible API
## License
MIT License - see LICENSE file for details.
## Acknowledgments
- Original MarkItDown project by Microsoft
- Based on work by Adam Fourney and contributors
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos is subject to and must follow
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos is subject to those third parties' policies.
Raw data
{
"_id": null,
"home_page": null,
"name": "markitdown-pdf-separators",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": null,
"keywords": "converter, document-conversion, markdown, page-separators, pdf",
"author": null,
"author_email": "Yu Pei <yu.pei00q@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/4a/5a/6f4f70f91acac1c6e5f234801d43866b1b14b7995e3247f9721ad26ca79a/markitdown_pdf_separators-0.4.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "MarkItDown with PDF page separators - convert PDFs to Markdown with page boundary markers",
"version": "0.4.1",
"project_urls": {
"Documentation": "https://github.com/Staceypy/markitdown-pdf-separators#readme",
"Issues": "https://github.com/Staceypy/markitdown-pdf-separators/issues",
"Repository": "https://github.com/Staceypy/markitdown-pdf-separators",
"Source": "https://github.com/Staceypy/markitdown-pdf-separators"
},
"split_keywords": [
"converter",
" document-conversion",
" markdown",
" page-separators",
" pdf"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "de95abc59132ea7002a860c7d6b29c6913971a9b2f41c6bd8a1f1608071ee5b3",
"md5": "7dd2e7169dff1cde01d0d021f85944c1",
"sha256": "3ac3855a22930e511cecbfb78ca127199353abf27abce402625d2d92642e60d4"
},
"downloads": -1,
"filename": "markitdown_pdf_separators-0.4.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "7dd2e7169dff1cde01d0d021f85944c1",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 63558,
"upload_time": "2025-07-22T14:44:57",
"upload_time_iso_8601": "2025-07-22T14:44:57.475951Z",
"url": "https://files.pythonhosted.org/packages/de/95/abc59132ea7002a860c7d6b29c6913971a9b2f41c6bd8a1f1608071ee5b3/markitdown_pdf_separators-0.4.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "4a5a6f4f70f91acac1c6e5f234801d43866b1b14b7995e3247f9721ad26ca79a",
"md5": "f82f56e50def882503b9b158d397cb7d",
"sha256": "6f804c1dc39fe5531c84b221f90f14cbdb917cc8ddaef27726c77714f7fdb01d"
},
"downloads": -1,
"filename": "markitdown_pdf_separators-0.4.1.tar.gz",
"has_sig": false,
"md5_digest": "f82f56e50def882503b9b158d397cb7d",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 45115,
"upload_time": "2025-07-22T14:44:59",
"upload_time_iso_8601": "2025-07-22T14:44:59.866427Z",
"url": "https://files.pythonhosted.org/packages/4a/5a/6f4f70f91acac1c6e5f234801d43866b1b14b7995e3247f9721ad26ca79a/markitdown_pdf_separators-0.4.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-22 14:44:59",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Staceypy",
"github_project": "markitdown-pdf-separators#readme",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "markitdown-pdf-separators"
}