# markdowncleaner
A simple Python tool for cleaning and formatting markdown documents. The default configuration provides regex patterns aimed at academic papers that have been converted from PDF to markdown.
## Description
`markdowncleaner` helps you clean up markdown files by removing unwanted content such as:
- References, bibliographies, and citations
- Footnotes and endnotes
- Copyright notices and legal disclaimers
- Acknowledgements and funding information
- Author information and contact details
- Specific patterns like DOIs, URLs, and email addresses
- Short lines and excessive whitespace
This tool is particularly useful for processing academic papers, books, or any markdown document that needs formatting cleanup.
## Installation
```bash
pip install markdowncleaner
```
## Usage
### Basic Usage
```python
from markdowncleaner import MarkdownCleaner
from pathlib import Path
# Create a cleaner with default patterns
cleaner = MarkdownCleaner()
# Clean a markdown file
result_path = cleaner.clean_markdown_file(Path("input.md"))
# Clean a markdown string
text = "# Title\nSome content here. [1]\n\nReferences\n1. Citation"
cleaned_text = cleaner.clean_markdown_string(text)
print(cleaned_text)
```
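To clean many files at once, the string API shown above can be wrapped in a small loop. The following is a minimal sketch and not part of the package: `clean_directory` is a hypothetical helper, and it assumes only that the object passed in offers the `clean_markdown_string` method used above.

```python
from pathlib import Path


def clean_directory(cleaner, src_dir: Path, dst_dir: Path) -> list:
    """Clean every .md file in src_dir and write the results to dst_dir.

    `cleaner` is any object with a clean_markdown_string(text) method,
    for example a MarkdownCleaner instance. This helper is hypothetical
    and not part of markdowncleaner itself.
    """
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(src_dir.glob("*.md")):
        text = src.read_text(encoding="utf-8")
        cleaned = cleaner.clean_markdown_string(text)
        out_path = dst_dir / src.name
        out_path.write_text(cleaned, encoding="utf-8")
        written.append(out_path)
    return written


# Usage (assumes markdowncleaner is installed; directory names are examples):
# from markdowncleaner import MarkdownCleaner
# clean_directory(MarkdownCleaner(), Path("papers"), Path("papers_clean"))
```

Writing to a separate output directory keeps the original conversions untouched, which makes it easy to compare results after changing options or patterns.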
### Customizing Cleaning Options
```python
from markdowncleaner import MarkdownCleaner, CleanerOptions
# Create custom options
options = CleanerOptions()
options.remove_short_lines = True
options.min_line_length = 50  # custom minimum line length (default: 70)
options.remove_footnotes_in_text = True
options.contract_empty_lines = True
# Initialize cleaner with custom options
cleaner = MarkdownCleaner(options=options)
# Use the cleaner as before
```
### Custom Cleaning Patterns
You can also provide custom cleaning patterns:
```python
from markdowncleaner import MarkdownCleaner, CleaningPatterns
from pathlib import Path
# Load custom patterns from a YAML file
custom_patterns = CleaningPatterns.from_yaml(Path("my_patterns.yaml"))
# Initialize cleaner with custom patterns
cleaner = MarkdownCleaner(patterns=custom_patterns)
```
## Configuration
The default cleaning patterns are defined in `default_cleaning_patterns.yaml` and include:
- **Sections to Remove**: Acknowledgements, References, Bibliography, etc.
- **Bad Inline Patterns**: Citations, figure references, etc.
- **Bad Lines Patterns**: Copyright notices, DOIs, URLs, etc.
- **Footnote Patterns**: Footnote references in text that fit the pattern '.1'
- **Replacements**: Various character replacements for PDF parsing errors
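To make the five categories above more concrete, here is a rough sketch of how patterns of each kind could be applied with Python's `re` module. It is an illustration only: the regexes, the `apply_patterns` function, and the order of the steps are assumptions made for this example, and the package's real defaults in `default_cleaning_patterns.yaml` may look quite different.

```python
import re

# Illustrative patterns only; not the package's actual defaults.
SECTION_HEADINGS = re.compile(
    r"^#*\s*(References|Bibliography|Acknowledgements)\s*$", re.IGNORECASE
)
BAD_LINE = re.compile(r"(doi\.org/|https?://|©|Copyright)", re.IGNORECASE)
BAD_INLINE = re.compile(r"\s*\[\d+(?:,\s*\d+)*\]")  # citations like [1] or [2, 3]
# Digits directly after a letter plus punctuation, as in "claim.1"
FOOTNOTE = re.compile(r"(?<=[A-Za-z][.,;!?])\d+(?=\s|$)")
REPLACEMENTS = {"ﬁ": "fi", "ﬂ": "fl"}  # ligatures often left by PDF extraction


def apply_patterns(text: str) -> str:
    kept = []
    skipping = False  # True while inside a section that is being removed
    for line in text.splitlines():
        if SECTION_HEADINGS.match(line):
            skipping = True
            continue
        if skipping and line.startswith("#"):
            skipping = False  # any other heading ends the removed section
        if skipping or BAD_LINE.search(line):
            continue
        line = BAD_INLINE.sub("", line)
        line = FOOTNOTE.sub("", line)
        for bad, good in REPLACEMENTS.items():
            line = line.replace(bad, good)
        kept.append(line)
    return "\n".join(kept)
```

In this sketch, whole sections and whole lines are dropped first, so the inline substitutions only run on text that will actually be kept.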
## Options
- `remove_short_lines`: Remove lines shorter than `min_line_length` (default: 70 characters)
- `remove_whole_lines`: Remove lines matching specific patterns
- `remove_sections`: Remove entire sections based on section headings
- `remove_footnotes_in_text`: Remove footnote references
- `replace_within_lines`: Replace specific patterns within lines
- `remove_within_lines`: Remove specific patterns within lines
- `contract_empty_lines`: Normalize whitespace by contracting runs of empty lines
- `crimp_linebreaks`: Improve line break formatting
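As a rough picture of what two of these options do, the sketch below reimplements the ideas behind `remove_short_lines` and `contract_empty_lines` in plain Python. It is not the package's code: both functions are written for this example, and keeping blank lines and heading lines in the short-line filter is a choice made here that may not match the package's behavior.

```python
def remove_short_lines(text: str, min_line_length: int = 70) -> str:
    """Drop lines shorter than min_line_length.

    Assumption for this sketch: blank lines and markdown headings are kept.
    """
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "" or stripped.startswith("#"):
            kept.append(line)
        elif len(stripped) >= min_line_length:
            kept.append(line)
    return "\n".join(kept)


def contract_empty_lines(text: str) -> str:
    """Collapse each run of blank lines into a single blank line."""
    out = []
    for line in text.splitlines():
        is_blank = line.strip() == ""
        if is_blank and out and out[-1] == "":
            continue  # skip repeated blank lines
        out.append("" if is_blank else line)
    return "\n".join(out)
```

Running the short-line filter before contracting empty lines matters in this sketch, because removing short lines can leave several blank lines next to each other.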
## License
MIT License
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Raw data
{
"_id": null,
"home_page": null,
"name": "markdowncleaner",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "markdown, cleaning, formatting, text processing",
"author": null,
"author_email": "Johannes Himmelreich <jrhimmel@syr.edu>",
"download_url": "https://files.pythonhosted.org/packages/5f/7a/5eb97d1fdbbc3c2563a280be9e308594fc54f19adc203abb0c6eb9453dfb/markdowncleaner-0.1.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "A tool for cleaning and formatting markdown documents",
"version": "0.1.1",
"project_urls": {
"Issues": "https://github.com/josk0/markdowncleaner/issues",
"Repository": "https://github.com/josk0/markdowncleaner"
},
"split_keywords": [
"markdown",
" cleaning",
" formatting",
" text processing"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "15c6583b85138a63aed8599c01fd465b82cddc538881db84717b4ecc3e7c1f79",
"md5": "4c96a7a8d4e4d4b7a59985ccc861e148",
"sha256": "e7fb8b821ca70d43ed1ccdb3ece7c3592db2296ddb09a637c15be734fd1a1a26"
},
"downloads": -1,
"filename": "markdowncleaner-0.1.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "4c96a7a8d4e4d4b7a59985ccc861e148",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 13081,
"upload_time": "2025-03-03T05:17:12",
"upload_time_iso_8601": "2025-03-03T05:17:12.239542Z",
"url": "https://files.pythonhosted.org/packages/15/c6/583b85138a63aed8599c01fd465b82cddc538881db84717b4ecc3e7c1f79/markdowncleaner-0.1.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "5f7a5eb97d1fdbbc3c2563a280be9e308594fc54f19adc203abb0c6eb9453dfb",
"md5": "5c33ff27c159fa5964b241d73888dc24",
"sha256": "6ae75972d9ca606deea039e161054bcc4b595e0ce0b2915ff0d61367a2dce43d"
},
"downloads": -1,
"filename": "markdowncleaner-0.1.1.tar.gz",
"has_sig": false,
"md5_digest": "5c33ff27c159fa5964b241d73888dc24",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 14879,
"upload_time": "2025-03-03T05:17:13",
"upload_time_iso_8601": "2025-03-03T05:17:13.533924Z",
"url": "https://files.pythonhosted.org/packages/5f/7a/5eb97d1fdbbc3c2563a280be9e308594fc54f19adc203abb0c6eb9453dfb/markdowncleaner-0.1.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-03-03 05:17:13",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "josk0",
"github_project": "markdowncleaner",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "markdowncleaner"
}