html2cleantext

Name	html2cleantext
Version	0.1.3
home_page	https://github.com/Shawn-Imran/html2cleantext
Summary	Convert HTML to clean, structured Markdown or plain text
upload_time	2025-09-02 12:22:12
maintainer	None
docs_url	None
author	Md Al Mahmud Imran
requires_python	>=3.7
license	MIT
keywords	html markdown text cleaning boilerplate nlp
requirements	beautifulsoup4 lxml markdownify readability-lxml langdetect requests
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

# html2cleantext

Convert HTML to clean, structured Markdown or plain text. html2cleantext extracts the readable content of a web page, removes boilerplate, and applies language-aware normalization.

## Features

- 🧹 **Smart Cleaning**: Automatically removes navigation, footers, ads, and other boilerplate
- 📝 **Flexible Output**: Convert to Markdown or plain text
- 🌍 **Language-Aware**: Special support for Bengali and English with automatic language detection
- 🔗 **Link Control**: Choose to keep or remove links and images
- 🚀 **Multiple Input Sources**: Process HTML strings, files, or URLs
- ⚡ **CLI & Python API**: Use from command line or integrate into your Python projects
- 📦 **Minimal Dependencies**: Modern, lightweight dependency stack

## Installation

```bash
pip install html2cleantext
```

Or install from source:

```bash
git clone https://github.com/Shawn-Imran/html2cleantext.git
cd html2cleantext
pip install -e .
```

## Quick Start

### Python API

```python
import html2cleantext

# From HTML string
html = "<h1>Hello World</h1><p>This is a test.</p>"
markdown = html2cleantext.to_markdown(html)
text = html2cleantext.to_text(html)

# From file
markdown = html2cleantext.to_markdown("page.html")

# From URL
markdown = html2cleantext.to_markdown("https://example.com")

# With options
clean_text = html2cleantext.to_text(
    html,
    keep_links=False,
    keep_images=False,
    remove_boilerplate=True
)
```

### Command Line Interface

```bash
# Convert to Markdown (default)
html2cleantext input.html

# Convert to plain text
html2cleantext input.html --mode text

# From URL
html2cleantext https://example.com --output clean.md

# Remove links and images
html2cleantext input.html --no-links --no-images

# Keep all content (no boilerplate removal)
html2cleantext input.html --no-remove_boilerplate
```

## API Reference

### Core Functions

#### `to_markdown(html_input, **options)`

Convert HTML to clean Markdown format.

**Parameters:**
- `html_input` (str|Path): HTML string, file path, or URL
- `keep_links` (bool): Preserve links (default: True)
- `keep_images` (bool): Preserve images (default: True)
- `remove_boilerplate` (bool): Remove boilerplate content (default: True)
- `normalize_lang` (bool): Apply language normalization (default: True)
- `language` (str, optional): Language code for normalization (auto-detected if None)

**Returns:** Clean Markdown text (str)
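
Because `html_input` accepts three kinds of value, something has to decide which kind it received. The rule the package actually uses is not documented here; the sketch below shows one plausible heuristic using only the standard library. `classify_input` is a hypothetical helper, not part of html2cleantext.

```python
from pathlib import Path
from urllib.parse import urlparse


def classify_input(value):
    """Guess whether value is a URL, a file path, or raw HTML (hypothetical helper)."""
    if isinstance(value, Path):
        return "file"
    text = str(value).strip()
    # Assumption: markup contains an angle bracket; URLs and typical paths do not.
    if "<" in text:
        return "html"
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    # Anything else is treated as a path on disk.
    return "file"


print(classify_input("<p>Hello</p>"))         # html
print(classify_input("https://example.com"))  # url
print(classify_input("page.html"))            # file
```

Checking for markup first keeps a raw HTML string that happens to mention a URL from being fetched over the network.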

#### `to_text(html_input, **options)`

Convert HTML to clean plain text format.

**Parameters:**
- Same as `to_markdown()`, except for two defaults:
  - `keep_links` (bool): Preserve links (default: False)
  - `keep_images` (bool): Preserve images (default: False)

**Returns:** Clean plain text (str)

### CLI Options

```
positional arguments:
  input                 HTML input: file path, URL, or raw HTML string

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --mode {markdown,text}, -m {markdown,text}
                        Output format (default: markdown)
  --output OUTPUT, -o OUTPUT
                        Output file path (default: stdout)
  --keep-links          Preserve links in the output
  --no-links            Remove links from the output
  --keep-images         Preserve images in the output
  --no-images           Remove images from the output
  --remove_boilerplate   Remove navigation, footers, and boilerplate content
  --no-remove_boilerplate
                        Keep all content including navigation and footers
  --language LANGUAGE, -l LANGUAGE
                        Language code for normalization
  --no-normalize        Skip language-specific normalization
  --verbose, -v         Enable verbose logging
```
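
The paired flags above (`--keep-links`/`--no-links`, `--keep-images`/`--no-images`) could be wired up with `argparse` roughly as follows. This is a sketch, not the package's actual CLI code: `build_parser` and `resolve` are hypothetical names, and it assumes the CLI falls back to the same per-mode defaults documented for `to_markdown()` and `to_text()`.

```python
import argparse


def build_parser():
    """Build a parser with the paired flags from the help text (assumed wiring)."""
    parser = argparse.ArgumentParser(prog="html2cleantext")
    parser.add_argument("input")
    parser.add_argument("--mode", "-m", choices=["markdown", "text"], default="markdown")
    # Each pair writes to one destination; None means "not set on the command line".
    parser.add_argument("--keep-links", dest="keep_links", action="store_true", default=None)
    parser.add_argument("--no-links", dest="keep_links", action="store_false", default=None)
    parser.add_argument("--keep-images", dest="keep_images", action="store_true", default=None)
    parser.add_argument("--no-images", dest="keep_images", action="store_false", default=None)
    return parser


def resolve(args):
    """Fill unset flags with the per-mode defaults (assumed to match the Python API)."""
    mode_default = args.mode == "markdown"  # Markdown keeps links/images, text drops them
    keep_links = mode_default if args.keep_links is None else args.keep_links
    keep_images = mode_default if args.keep_images is None else args.keep_images
    return keep_links, keep_images


args = build_parser().parse_args(["input.html", "--mode", "text", "--keep-images"])
print(resolve(args))  # (False, True)
```

Using `None` as the "unset" value lets an explicit flag override the mode default in either direction.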

## Examples

### Basic Usage

```python
import html2cleantext

# Simple HTML to Markdown
html = """
<html>
<head><title>Test Page</title></head>
<body>
    <nav>Navigation menu</nav>
    <main>
        <h1>Main Title</h1>
        <p>This is the main content with a <a href="https://example.com">link</a>.</p>
        <ul>
            <li>Item 1</li>
            <li>Item 2</li>
        </ul>
    </main>
    <footer>Footer content</footer>
</body>
</html>
"""

result = html2cleantext.to_markdown(html)
print(result)
```

Output:
```markdown
# Main Title

This is the main content with a [link](https://example.com).

* Item 1
* Item 2
```
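
The `<nav>` and `<footer>` text is missing from this output because boilerplate removal is on by default. The package relies on readability-lxml or manual rules for that step; purely as an illustration, a tag-based rule can be sketched with the standard library. `MainTextExtractor`, `extract_main_text`, and the `SKIPPED_TAGS` set are hypothetical and not taken from the package.

```python
from html.parser import HTMLParser

# Assumed set of boilerplate containers; the real rule set may differ.
SKIPPED_TAGS = {"nav", "footer", "header", "aside", "script", "style", "title"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside a skipped tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many skipped tags are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the text chunks found outside skipped tags, in document order."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = "<nav>Menu</nav><main><h1>Main Title</h1><p>Body text.</p></main><footer>Footer</footer>"
print(extract_main_text(sample))  # ['Main Title', 'Body text.']
```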

### Advanced Usage

```python
from pathlib import Path

import html2cleantext

# Process a Bengali webpage
bengali_html = "<p>এই একটি বাংলা বাক্য।</p>"
clean_text = html2cleantext.to_text(
    bengali_html,
    language='bn',
    normalize_lang=True
)

# Batch processing: write a .md file next to each .html file
for html_file in Path(".").glob("*.html"):
    markdown_file = html_file.with_suffix(".md")
    markdown = html2cleantext.to_markdown(html_file)
    # Explicit UTF-8 so non-ASCII text is written the same way on every platform
    markdown_file.write_text(markdown, encoding="utf-8")
```

### Command Line Examples

```bash
# Basic conversion
html2cleantext index.html > clean.md

# Process URL and save to file
html2cleantext https://news.example.com/article --output article.md

# Plain text with no links/images
html2cleantext complex.html --mode text --no-links --no-images

# Preserve all content (no cleaning)
html2cleantext raw.html --no-remove_boilerplate --output raw_content.md

# Bengali content with specific language
html2cleantext bengali.html --language bn --mode text
```

## Language Support

html2cleantext provides enhanced support for:

- **English**: Smart quote normalization, punctuation cleanup
- **Bengali**: Unicode normalization, punctuation handling
- **Auto-detection**: Automatically detects language when not specified

To support another language, extend the normalization functions.
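
The package's normalization functions are not shown in this README, so the following is only a sketch of how per-language normalization might be organized. `NORMALIZERS`, `normalize`, and both normalizer functions are hypothetical names; the rules (ASCII quotes for English, NFC for Bengali) are simplified stand-ins for the behavior listed above.

```python
import unicodedata

# Map typographic quotes to ASCII equivalents (an assumed English rule).
SMART_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def normalize_english(text):
    """Replace curly quotes with straight ones."""
    return text.translate(SMART_QUOTES)


def normalize_bengali(text):
    """Compose split vowel signs, e.g. U+09C7 + U+09BE becomes U+09CB under NFC."""
    return unicodedata.normalize("NFC", text)


# Hypothetical registry: supporting another language means adding one entry.
NORMALIZERS = {"en": normalize_english, "bn": normalize_bengali}


def normalize(text, language):
    """Apply the normalizer for language, or return text unchanged if none exists."""
    return NORMALIZERS.get(language, lambda t: t)(text)


print(normalize("\u201cHello\u201d", "en"))  # "Hello"
```

A registry keyed by language code keeps the dispatch in one place, and unknown codes pass through untouched.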

## Architecture

The package follows a clean pipeline architecture:

1. **Input Processing**: Handles HTML strings, files, or URLs
2. **HTML Parsing**: Uses BeautifulSoup with lxml parser
3. **Cleaning**: Removes scripts, styles, and unwanted attributes
4. **Boilerplate Removal**: Strips navigation, footers, ads using readability-lxml or manual rules
5. **Language Detection**: Auto-detects content language
6. **Conversion**: Converts to Markdown using markdownify or extracts plain text
7. **Normalization**: Applies language-specific text cleanup
8. **Output**: Returns clean text or writes to file
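
The steps above can be pictured as a list of functions applied in order, with options such as `remove_boilerplate` deciding which stages run. The sketch below is a deliberately simplified stand-in: it uses regular expressions where the package uses BeautifulSoup, readability-lxml, and markdownify, and every function name in it is hypothetical.

```python
import re


def strip_scripts(html):
    """Cleaning stage: drop <script> and <style> elements with their contents."""
    return re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)


def drop_boilerplate(html):
    """Boilerplate stage: drop <nav> and <footer> elements (a simplified manual rule)."""
    return re.sub(r"(?is)<(nav|footer)\b.*?</\1\s*>", "", html)


def to_plain_text(html):
    """Conversion stage: remove remaining tags and collapse whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", without_tags).strip()


def run_pipeline(html, remove_boilerplate=True):
    """Run the stages in order, skipping the optional one when it is switched off."""
    stages = [strip_scripts]
    if remove_boilerplate:
        stages.append(drop_boilerplate)
    stages.append(to_plain_text)
    for stage in stages:
        html = stage(html)
    return html


page = "<nav>Menu</nav><p>Hello <b>world</b></p><script>x()</script>"
print(run_pipeline(page))                            # Hello world
print(run_pipeline(page, remove_boilerplate=False))  # Menu Hello world
```

Regular expressions are not a reliable way to parse real-world HTML; they are used here only to keep the example free of third-party dependencies.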

## Dependencies

- `beautifulsoup4` - HTML parsing
- `lxml` - Fast XML/HTML parser
- `markdownify` - HTML to Markdown conversion
- `readability-lxml` - Content extraction and boilerplate removal
- `langdetect` - Language detection
- `requests` - HTTP requests for URL fetching

## Contributing

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.

### Development Setup

```bash
git clone https://github.com/Shawn-Imran/html2cleantext.git
cd html2cleantext
pip install -e ".[dev]"  # Install with development dependencies (quoted so shells like zsh don't expand the brackets)
# OR
pip install -e .  # Install package only
pip install -r requirements-dev.txt  # Install dev dependencies separately
```

### Running Tests

```bash
python -m pytest tests/
```

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Changelog

### v0.1.0
- Initial release
- Core HTML to Markdown/text conversion
- Boilerplate removal using readability-lxml
- Language-aware normalization for Bengali and English
- Command-line interface
- Support for HTML strings, files, and URLs

Raw data

{
    "_id": null,
    "home_page": "https://github.com/Shawn-Imran/html2cleantext",
    "name": "html2cleantext",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "html, markdown, text, cleaning, boilerplate, nlp",
    "author": "Md Al Mahmud Imran",
    "author_email": "Md Al Mahmud Imran <md.almahmudimran@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/f4/91/3b25d84ad7cf41fe35d0cfac3fa22bdde7063dd947222e8d10e847300b2a/html2cleantext-0.1.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Convert HTML to clean, structured Markdown or plain text",
    "version": "0.1.3",
    "project_urls": {
        "Bug Reports": "https://github.com/Shawn-Imran/html2cleantext/issues",
        "Homepage": "https://github.com/Shawn-Imran/html2cleantext",
        "Source": "https://github.com/Shawn-Imran/html2cleantext"
    },
    "split_keywords": [
        "html",
        " markdown",
        " text",
        " cleaning",
        " boilerplate",
        " nlp"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "49be75c6b1d6e6623d79ef53206ca8249a5f82426267c19cbf9d96633422e23f",
                "md5": "cbcba708a7e4ae0165270c3b1b7735c5",
                "sha256": "0d8b9e5cfb56f29543af2525ca18376ac14c37439ace4d66012ccff1062ac13e"
            },
            "downloads": -1,
            "filename": "html2cleantext-0.1.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "cbcba708a7e4ae0165270c3b1b7735c5",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 14067,
            "upload_time": "2025-09-02T12:22:09",
            "upload_time_iso_8601": "2025-09-02T12:22:09.137834Z",
            "url": "https://files.pythonhosted.org/packages/49/be/75c6b1d6e6623d79ef53206ca8249a5f82426267c19cbf9d96633422e23f/html2cleantext-0.1.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "f4913b25d84ad7cf41fe35d0cfac3fa22bdde7063dd947222e8d10e847300b2a",
                "md5": "8ac09d02073f771d760f47bbfbc1aa92",
                "sha256": "837dedf3740c8954fb8276f54132cd662364d41b8bb636df44dbd0f79fb1ccbf"
            },
            "downloads": -1,
            "filename": "html2cleantext-0.1.3.tar.gz",
            "has_sig": false,
            "md5_digest": "8ac09d02073f771d760f47bbfbc1aa92",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 21931,
            "upload_time": "2025-09-02T12:22:12",
            "upload_time_iso_8601": "2025-09-02T12:22:12.636719Z",
            "url": "https://files.pythonhosted.org/packages/f4/91/3b25d84ad7cf41fe35d0cfac3fa22bdde7063dd947222e8d10e847300b2a/html2cleantext-0.1.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-09-02 12:22:12",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Shawn-Imran",
    "github_project": "html2cleantext",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "beautifulsoup4",
            "specs": [
                [
                    ">=",
                    "4.9.0"
                ]
            ]
        },
        {
            "name": "lxml",
            "specs": [
                [
                    ">=",
                    "4.6.0"
                ]
            ]
        },
        {
            "name": "markdownify",
            "specs": [
                [
                    ">=",
                    "0.11.0"
                ]
            ]
        },
        {
            "name": "readability-lxml",
            "specs": [
                [
                    ">=",
                    "0.8.0"
                ]
            ]
        },
        {
            "name": "langdetect",
            "specs": [
                [
                    ">=",
                    "1.0.9"
                ]
            ]
        },
        {
            "name": "requests",
            "specs": [
                [
                    ">=",
                    "2.25.0"
                ]
            ]
        }
    ],
    "lcname": "html2cleantext"
}
