markdowncleaner


- Name: markdowncleaner
- Version: 0.1.1
- Summary: A tool for cleaning and formatting markdown documents
- Upload time: 2025-03-03 05:17:13
- Requires Python: >=3.8
- License: MIT
- Keywords: markdown, cleaning, formatting, text processing
- Requirements: No requirements were recorded.

# markdowncleaner

A simple Python tool for cleaning and formatting markdown documents. The default configuration uses regex patterns tuned for academic papers that have been converted from PDF to markdown.

## Description

`markdowncleaner` helps you clean up markdown files by removing unwanted content such as:
- References, bibliographies, and citations
- Footnotes and endnotes
- Copyright notices and legal disclaimers
- Acknowledgements and funding information
- Author information and contact details
- Specific patterns like DOIs, URLs, and email addresses
- Short lines and excessive whitespace

This tool is particularly useful for processing academic papers, books, or any markdown document that needs formatting cleanup.

## Installation

```bash
pip install markdowncleaner
```

## Usage

### Basic Usage

```python
from markdowncleaner import MarkdownCleaner
from pathlib import Path

# Create a cleaner with default patterns
cleaner = MarkdownCleaner()

# Clean a markdown file
result_path = cleaner.clean_markdown_file(Path("input.md"))

# Clean a markdown string
text = "# Title\nSome content here. [1]\n\nReferences\n1. Citation"
cleaned_text = cleaner.clean_markdown_string(text)
print(cleaned_text)
```
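
To clean several files in one pass, the `clean_markdown_file` call shown above can be wrapped in a loop. The helper below is a minimal sketch: `clean_directory` is a hypothetical name, not part of the package, and it assumes only that the cleaner object exposes `clean_markdown_file(path)` as in the example above.

```python
from pathlib import Path


def clean_directory(cleaner, folder):
    """Clean every .md file directly inside `folder` (hypothetical helper).

    `cleaner` is any object with a clean_markdown_file(path) method,
    such as the MarkdownCleaner instance created above.
    """
    # Collect the paths first so files written during cleaning are not picked up.
    paths = sorted(Path(folder).glob("*.md"))
    return [cleaner.clean_markdown_file(path) for path in paths]
```

Called as `clean_directory(MarkdownCleaner(), "papers/")`, it returns whatever `clean_markdown_file` returns for each file, in alphabetical order of the inputs.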

### Customizing Cleaning Options

```python
from markdowncleaner import MarkdownCleaner, CleanerOptions

# Create custom options
options = CleanerOptions()
options.remove_short_lines = True
options.min_line_length = 50  # custom minimum line length
options.remove_footnotes_in_text = True
options.contract_empty_lines = True

# Initialize cleaner with custom options
cleaner = MarkdownCleaner(options=options)

# Use the cleaner as before
```

### Custom Cleaning Patterns

You can also provide custom cleaning patterns:

```python
from markdowncleaner import MarkdownCleaner, CleaningPatterns
from pathlib import Path

# Load custom patterns from a YAML file
custom_patterns = CleaningPatterns.from_yaml(Path("my_patterns.yaml"))

# Initialize cleaner with custom patterns
cleaner = MarkdownCleaner(patterns=custom_patterns)
```

## Configuration

The default cleaning patterns are defined in `default_cleaning_patterns.yaml` and include:

- **Sections to Remove**: Acknowledgements, References, Bibliography, etc.
- **Bad Inline Patterns**: Citations, figure references, etc.
- **Bad Lines Patterns**: Copyright notices, DOIs, URLs, etc.
- **Footnote Patterns**: Footnote references in running text, such as a footnote number placed directly after a period (`.1`)
- **Replacements**: Various character replacements for PDF parsing errors
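
To make the five categories concrete, here is a rough sketch of how patterns of each kind could be applied using only the standard library. It is not the package's implementation: every regex, the `apply_patterns` name, and the order of the steps are assumptions chosen for illustration, and the real patterns live in the YAML file mentioned above.

```python
import re

# Hypothetical patterns for illustration only.
SECTION_HEADINGS = re.compile(
    r"^#*\s*(References|Bibliography|Acknowledgements)\s*$", re.IGNORECASE
)
BAD_INLINE = [re.compile(r"\s*\[\d+\]")]  # numeric citations like [1]
BAD_LINES = [re.compile(r"doi\.org|©|Copyright", re.IGNORECASE)]
# One to three digits right after punctuation that follows a non-digit, e.g. "holds.2"
FOOTNOTE = re.compile(r"(?<=[^\d\s][.,;])\d{1,3}(?=\s|$)")
REPLACEMENTS = {"\ufb01": "fi", "\ufb02": "fl"}  # ligatures left by PDF extraction


def apply_patterns(text):
    """Apply the five pattern categories line by line (illustrative sketch)."""
    kept = []
    skipping = False  # True while inside a section that should be removed
    for line in text.split("\n"):
        if SECTION_HEADINGS.match(line):
            skipping = True
            continue
        if skipping and line.startswith("#"):
            skipping = False  # the next heading ends the removed section
        if skipping:
            continue
        if any(p.search(line) for p in BAD_LINES):
            continue  # drop the whole line
        for p in BAD_INLINE:
            line = p.sub("", line)
        line = FOOTNOTE.sub("", line)
        for old, new in REPLACEMENTS.items():
            line = line.replace(old, new)
        kept.append(line)
    return "\n".join(kept)
```

The lookbehind in `FOOTNOTE` requires a non-digit before the punctuation so that decimals such as `3.14` are left alone; a production pattern set would need more such guards.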

## Options

- `remove_short_lines`: Remove lines shorter than `min_line_length` (default: 70 characters)
- `remove_whole_lines`: Remove lines matching specific patterns
- `remove_sections`: Remove entire sections based on section headings
- `remove_footnotes_in_text`: Remove footnote references
- `replace_within_lines`: Replace specific patterns within lines
- `remove_within_lines`: Remove specific patterns within lines
- `contract_empty_lines`: Normalize whitespace by contracting consecutive empty lines
- `crimp_linebreaks`: Improve line break formatting
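
As an illustration of what two of the options listed above describe, the sketch below implements short-line removal and empty-line contraction with the standard library. The function names mirror the option names for readability, but the bodies are assumptions, not the package's code; in particular, keeping blank lines and `#` headings during short-line removal is a choice made for this sketch.

```python
import re


def remove_short_lines(text, min_line_length=70):
    """Drop non-empty lines shorter than min_line_length.

    Assumption for this sketch: blank lines and '#' headings are kept so the
    document structure survives; the package may treat them differently.
    """
    kept = []
    for line in text.split("\n"):
        stripped = line.strip()
        if (
            stripped == ""
            or stripped.startswith("#")
            or len(stripped) >= min_line_length
        ):
            kept.append(line)
    return "\n".join(kept)


def contract_empty_lines(text):
    """Collapse any run of blank lines into a single blank line."""
    return re.sub(r"\n\s*\n+", "\n\n", text)
```

Running `contract_empty_lines` after `remove_short_lines` tidies up the gaps that dropped lines leave behind.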

## License

MIT License

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "markdowncleaner",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "markdown, cleaning, formatting, text processing",
    "author": null,
    "author_email": "Johannes Himmelreich <jrhimmel@syr.edu>",
    "download_url": "https://files.pythonhosted.org/packages/5f/7a/5eb97d1fdbbc3c2563a280be9e308594fc54f19adc203abb0c6eb9453dfb/markdowncleaner-0.1.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A tool for cleaning and formatting markdown documents",
    "version": "0.1.1",
    "project_urls": {
        "Issues": "https://github.com/josk0/markdowncleaner/issues",
        "Repository": "https://github.com/josk0/markdowncleaner"
    },
    "split_keywords": [
        "markdown",
        " cleaning",
        " formatting",
        " text processing"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "15c6583b85138a63aed8599c01fd465b82cddc538881db84717b4ecc3e7c1f79",
                "md5": "4c96a7a8d4e4d4b7a59985ccc861e148",
                "sha256": "e7fb8b821ca70d43ed1ccdb3ece7c3592db2296ddb09a637c15be734fd1a1a26"
            },
            "downloads": -1,
            "filename": "markdowncleaner-0.1.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "4c96a7a8d4e4d4b7a59985ccc861e148",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 13081,
            "upload_time": "2025-03-03T05:17:12",
            "upload_time_iso_8601": "2025-03-03T05:17:12.239542Z",
            "url": "https://files.pythonhosted.org/packages/15/c6/583b85138a63aed8599c01fd465b82cddc538881db84717b4ecc3e7c1f79/markdowncleaner-0.1.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "5f7a5eb97d1fdbbc3c2563a280be9e308594fc54f19adc203abb0c6eb9453dfb",
                "md5": "5c33ff27c159fa5964b241d73888dc24",
                "sha256": "6ae75972d9ca606deea039e161054bcc4b595e0ce0b2915ff0d61367a2dce43d"
            },
            "downloads": -1,
            "filename": "markdowncleaner-0.1.1.tar.gz",
            "has_sig": false,
            "md5_digest": "5c33ff27c159fa5964b241d73888dc24",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 14879,
            "upload_time": "2025-03-03T05:17:13",
            "upload_time_iso_8601": "2025-03-03T05:17:13.533924Z",
            "url": "https://files.pythonhosted.org/packages/5f/7a/5eb97d1fdbbc3c2563a280be9e308594fc54f19adc203abb0c6eb9453dfb/markdowncleaner-0.1.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-03-03 05:17:13",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "josk0",
    "github_project": "markdowncleaner",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "markdowncleaner"
}