cewlio

Name	cewlio
Version	1.2.2
home_page	None
Summary	Custom word list generator for web content
upload_time	2025-07-10 21:10:13
maintainer	Kumar Ashwin
docs_url	None
author	None
requires_python	>=3.12
license	MIT
keywords	wordlist web scraping html parsing email extraction metadata
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

# CeWLio 🕵️‍♂️✨

[![AI-Assisted Development](https://img.shields.io/badge/AI--Assisted-Development-blue?style=for-the-badge&logo=openai&logoColor=white)](https://github.com/0xCardinal/cewlio)
[![Python](https://img.shields.io/badge/Python-3.12+-blue?style=for-the-badge&logo=python&logoColor=white)](https://www.python.org/)
[![License](https://img.shields.io/badge/License-MIT-green?style=for-the-badge)](LICENSE)
[![Tests](https://img.shields.io/badge/Tests-Passed-brightgreen?style=for-the-badge)](CONTRIBUTING.md#testing)

**CeWLio** is a powerful, Python-based Custom Word List Generator inspired by the original [CeWL](https://digi.ninja/projects/cewl.php) by Robin Wood. While CeWL is excellent for static HTML content, CeWLio brings modern web scraping capabilities to handle today's JavaScript-heavy websites. It crawls web pages, executes JavaScript, and extracts:

- 📚 Unique words (with advanced filtering)
- 📧 Email addresses  
- 🏷️ Metadata (description, keywords, author)

Perfect for penetration testers, security researchers, and anyone needing high-quality, site-specific wordlists!

> **🤖 AI-Assisted Development**: This project was created with the help of AI tools, but solves real-world problems in web scraping and word list generation. Every line of code has been carefully reviewed, tested, and optimized for production use.

---

## 🚀 Features

- **JavaScript-Aware Extraction:** Uses a headless browser to render pages and extracts content after JavaScript execution.
- **Modern Web Support:** Handles Single Page Applications (SPAs), infinite scroll, lazy loading, and dynamic content that traditional scrapers miss.
- **Advanced Word Processing:**
  - Minimum/maximum word length filtering
  - Lowercase conversion
  - Alphanumeric or alpha-only words
  - Umlaut conversion (ä→ae, ö→oe, ü→ue, ß→ss)
  - Word frequency counting
- **Word Grouping:** Generate multi-word phrases (e.g., 2-grams, 3-grams)
- **Email & Metadata Extraction:** Find emails from content and mailto links, extract meta tags
- **Flexible Output:** Save words, emails, and metadata to separate files or stdout
- **Professional CLI:** All features accessible via command-line interface with CeWL-compatible flags
- **Silent Operation:** Runs quietly by default, with optional debug output
- **Comprehensive Testing:** 38 tests, all passing
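
The word-processing options listed above can be pictured with a small standard-library sketch. This is not CeWLio's internal code: the names `UMLAUT_MAP`, `normalize`, and `count_words` are hypothetical, the uppercase umlaut mappings are an assumption, and the tokenizing regex is only one plausible way to split text into alphabetic words.

```python
import re
from collections import Counter

# Lowercase replacements come from the feature list above.
# The uppercase entries are an assumption of this sketch.
UMLAUT_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})

def normalize(text, lowercase=False, convert_umlauts=False):
    """Apply the optional umlaut and lowercase conversions to raw text."""
    if convert_umlauts:
        text = text.translate(UMLAUT_MAP)
    if lowercase:
        text = text.lower()
    return text

def count_words(text, **options):
    """Count alphabetic words after normalization (hypothetical helper)."""
    # [^\W\d_]+ matches runs of Unicode letters only.
    words = re.findall(r"[^\W\d_]+", normalize(text, **options))
    return Counter(words)

counts = count_words("Größe und Größe über Straße",
                     lowercase=True, convert_umlauts=True)
print(counts.most_common(1))
```

Converting umlauts before tokenizing keeps words such as `Straße` in one piece instead of splitting them at the replaced character.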

---

## 🛠️ Installation

### From PyPI (Recommended)
```bash
pip install cewlio
```

### From Source
```bash
git clone https://github.com/0xCardinal/cewlio
cd cewlio
pip install -e .
```

### Dependencies
- Python 3.12+
- Playwright (for browser automation)
- BeautifulSoup4 (for HTML parsing)
- Requests (for HTTP handling)

**Note:** After installing Playwright, you only need to install the chromium-headless-shell browser:
```bash
playwright install chromium-headless-shell
```

---

## ⚡ Quick Start

### Basic Usage
```bash
# Extract words from a website (silent by default)
cewlio https://example.com

# Save words to a file
cewlio https://example.com --output wordlist.txt

# Include emails in stdout output
cewlio https://example.com -e

# Include metadata in stdout output
cewlio https://example.com -a

# Save emails and metadata to files
cewlio https://example.com --email_file emails.txt --meta_file meta.txt
```

### More Examples

**Generate word groups with counts:**
```bash
cewlio https://example.com --groups 3 -c --output phrases.txt
```
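
As a rough illustration of what a word group is, the sketch below builds groups of consecutive words with a sliding window and counts them. It is an assumption about the general technique, not CeWLio's implementation; `word_groups` is a hypothetical helper and the sample sentence is made up.

```python
from collections import Counter

def word_groups(words, size):
    """Return a Counter of space-joined groups of `size` consecutive words."""
    if size < 1:
        raise ValueError("size must be at least 1")
    # Slide a window of `size` words across the list.
    groups = (" ".join(words[i:i + size])
              for i in range(len(words) - size + 1))
    return Counter(groups)

tokens = "the quick brown fox jumps over the quick brown dog".split()
for phrase, count in word_groups(tokens, 3).most_common(2):
    print(phrase, count)
```

A list shorter than the group size simply yields no groups in this sketch.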

**Custom word filtering:**
```bash
cewlio https://example.com -m 4 --max-length 12 --lowercase --convert-umlauts
```
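
The length and number filters can be sketched in a few lines of plain Python. The function `filter_words` and its parameters are hypothetical and only mirror the flag names; in particular, this sketch splits a token like `admin2024` at the digits when numbers are excluded, and the README does not say whether CeWLio splits or drops such tokens.

```python
import re

def filter_words(text, min_length=3, max_length=None, with_numbers=False):
    """Return words that pass the length and character-class filters."""
    # Letters only by default; letters and digits when with_numbers is set.
    pattern = r"[^\W_]+" if with_numbers else r"[^\W\d_]+"
    kept = []
    for word in re.findall(pattern, text):
        if len(word) < min_length:
            continue
        if max_length is not None and len(word) > max_length:
            continue
        kept.append(word)
    return kept

sample = "Login as admin2024 or root via ssh"
print(filter_words(sample, min_length=4, max_length=12))
print(filter_words(sample, min_length=4, max_length=12, with_numbers=True))
```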

**Handle JavaScript-heavy sites:**
```bash
cewlio https://example.com -w 5 --visible
```

**Extract only emails and metadata (no words):**
```bash
cewlio https://example.com -e -a
```

**Extract only emails (no words):**
```bash
cewlio https://example.com -e
```
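
For readers curious how emails can be found both in page text and in `mailto:` links, here is a minimal standard-library sketch. It does not use CeWLio or BeautifulSoup; `EMAIL_RE`, `MailtoCollector`, and `extract_emails` are hypothetical names, and the regex is a pragmatic approximation rather than a full address grammar.

```python
import re
from html.parser import HTMLParser
from urllib.parse import unquote

# Pragmatic pattern; it will not match every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

class MailtoCollector(HTMLParser):
    """Collect visible text and addresses found in mailto: links."""

    def __init__(self):
        super().__init__()
        self.emails = set()
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.lower().startswith("mailto:"):
            # Drop the scheme and any ?subject=... query part.
            address = unquote(href[len("mailto:"):].split("?", 1)[0])
            self.emails.update(EMAIL_RE.findall(address))

    def handle_data(self, data):
        self.text_parts.append(data)

def extract_emails(html):
    parser = MailtoCollector()
    parser.feed(html)
    parser.close()
    # Also scan the page text for plain-text addresses.
    parser.emails.update(EMAIL_RE.findall(" ".join(parser.text_parts)))
    return parser.emails

page = ('<p>Contact test@example.com</p>'
        '<a href="mailto:sales@example.com?subject=Hi">Sales</a>')
print(sorted(extract_emails(page)))
```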

**Extract only metadata (no words):**
```bash
cewlio https://example.com -a
```
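
The metadata mentioned in this README (description, keywords, author) lives in `<meta>` tags. The sketch below shows one way to read those tags with the standard library only; it is an illustration under that assumption, not CeWLio's own parser, and `WANTED`, `MetaCollector`, and `extract_metadata` are hypothetical names.

```python
from html.parser import HTMLParser

# Meta tag names taken from the feature list in this README.
WANTED = {"description", "keywords", "author"}

class MetaCollector(HTMLParser):
    """Collect the content of selected <meta name="..."> tags."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = (attr.get("content") or "").strip()
        if name in WANTED and content:
            self.metadata[name] = content

def extract_metadata(html):
    parser = MetaCollector()
    parser.feed(html)
    parser.close()
    return parser.metadata

page = ('<head><meta name="description" content="Demo page">'
        '<meta name="Author" content="Jane Doe" />'
        '<meta name="viewport" content="width=device-width"></head>')
print(extract_metadata(page))
```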

**Save emails to file (no words to stdout):**
```bash
cewlio https://example.com --email_file emails.txt
```

---

## 🎛️ Command-Line Options

| Option | Description | Default |
|--------|-------------|---------|
| `url` | URL to process | Required |
| `--version` | Show version and exit | - |
| `--output` | Output file for words | stdout |
| `-e, --email` | Include email addresses in stdout output | False |
| `--email_file` | Output file for email addresses | - |
| `-a, --meta` | Include metadata in stdout output | False |
| `--meta_file` | Output file for metadata | - |
| `-m, --min_word_length` | Minimum word length | 3 |
| `--max-length` | Maximum word length | No limit |
| `--lowercase` | Convert words to lowercase | False |
| `--with-numbers` | Include words with numbers | False |
| `--convert-umlauts` | Convert umlaut characters | False |
| `-c, --count` | Show word counts | False |
| `--groups` | Generate word groups of specified size | - |
| `-w, --wait` | Wait time for JavaScript execution (seconds) | 0 |
| `--visible` | Show browser window | False |
| `--timeout` | Browser timeout (milliseconds) | 30000 |
| `--debug` | Show debug/summary output | False |
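
To make the table concrete, here is a hedged `argparse` sketch that mirrors a subset of the options and defaults listed above. It is not CeWLio's actual parser: the program name `cewlio-sketch`, the chosen argument types (for example an integer `--wait`), and the help strings are assumptions.

```python
import argparse

def build_parser():
    """Build a parser mirroring some rows of the options table (sketch only)."""
    parser = argparse.ArgumentParser(prog="cewlio-sketch")
    parser.add_argument("url", help="URL to process")
    parser.add_argument("--output", help="Output file for words")
    parser.add_argument("-e", "--email", action="store_true")
    parser.add_argument("-a", "--meta", action="store_true")
    parser.add_argument("-m", "--min_word_length", type=int, default=3)
    parser.add_argument("--max-length", type=int, default=None)
    parser.add_argument("--lowercase", action="store_true")
    parser.add_argument("-c", "--count", action="store_true")
    parser.add_argument("--groups", type=int, default=None)
    # The table gives seconds; an integer type is assumed here.
    parser.add_argument("-w", "--wait", type=int, default=0)
    parser.add_argument("--timeout", type=int, default=30000)
    return parser

args = build_parser().parse_args(
    ["https://example.com", "-m", "4", "--max-length", "12", "-c"]
)
print(args.min_word_length, args.max_length, args.count, args.timeout)
```

Note that `argparse` exposes `--max-length` as `args.max_length`, while `--min_word_length` keeps its underscores unchanged.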

---

## 📚 API Usage

### Basic Python Usage
```python
from cewlio import CeWLio

# Create instance with custom settings
cewlio = CeWLio(
    min_word_length=4,
    max_word_length=12,
    lowercase=True,
    convert_umlauts=True
)

# Process HTML content
html_content = "<p>Hello world! Contact us at test@example.com</p>"
cewlio.process_html(html_content)

# Access results
print("Words:", list(cewlio.words.keys()))
print("Emails:", list(cewlio.emails))
print("Metadata:", list(cewlio.metadata))
```
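
Once results are collected, they often need to be written out as a wordlist. The helper below is a hypothetical sketch that assumes `cewlio.words` behaves like a mapping from word to an integer count; the README only shows `.keys()`, so the count values and the `word, count` line format are assumptions.

```python
def format_wordlist(words, show_count=False):
    """Return output lines, most frequent first, ties broken alphabetically."""
    ordered = sorted(words.items(), key=lambda item: (-item[1], item[0]))
    if show_count:
        return [f"{word}, {count}" for word, count in ordered]
    return [word for word, _ in ordered]

# Stand-in data; with CeWLio this might be `cewlio.words` (assumed mapping).
sample_words = {"beta": 2, "alpha": 2, "gamma": 5}
print("\n".join(format_wordlist(sample_words, show_count=True)))
```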

### Process URLs
```python
import asyncio
from cewlio import CeWLio, process_url_with_cewlio

async def main():
    cewlio = CeWLio()
    success = await process_url_with_cewlio(
        url="https://example.com",
        cewlio_instance=cewlio,
        wait_time=5,
        headless=True
    )
    
    if success:
        print(f"Found {len(cewlio.words)} words")
        print(f"Found {len(cewlio.emails)} emails")

asyncio.run(main())
```
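
When several pages need processing, the same coroutine can be run for each URL with a cap on concurrency. The sketch below is generic `asyncio` code, not part of CeWLio: `process_many` and `fake_worker` are hypothetical, and a real worker would wrap `process_url_with_cewlio` with the keyword arguments shown above.

```python
import asyncio

async def process_many(urls, worker, max_concurrent=3):
    """Run `worker(url)` for each URL with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(url):
        async with semaphore:
            try:
                return url, await worker(url)
            except Exception:
                # Treat any failure as an unsuccessful URL in this sketch.
                return url, False

    return dict(await asyncio.gather(*(guarded(u) for u in urls)))

async def fake_worker(url):
    # Stand-in for a real call; "succeeds" for https URLs only.
    await asyncio.sleep(0)
    return url.startswith("https://")

results = asyncio.run(
    process_many(["https://example.com", "http://example.org"], fake_worker)
)
print(results)
```

Limiting concurrency matters here because each real call would drive a headless browser, which is comparatively heavy.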

---

## 🧪 Testing

The project includes a comprehensive test suite with 38 tests covering all functionality:

- ✅ Core functionality tests (15 tests)
- ✅ HTML extraction tests (3 tests)  
- ✅ URL processing tests (2 tests)
- ✅ Integration tests (3 tests)
- ✅ CLI argument validation tests (5 tests)
- ✅ Edge case tests (10 tests)

**Total: 38 tests with 100% success rate**

For detailed testing information and development setup, see [CONTRIBUTING.md](CONTRIBUTING.md).

---

## 🐛 Troubleshooting

### Common Issues

**"No module named 'playwright'"**
```bash
pip install playwright
playwright install chromium-headless-shell
```
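
Before running the tool, a quick check can confirm that the Python-side dependencies are importable. This sketch uses only `importlib`; `missing_modules` is a hypothetical helper, and it cannot tell whether the Chromium browser binary itself has been installed.

```python
import importlib.util

def missing_modules(names=("playwright", "bs4", "requests")):
    """Return the top-level module names that cannot be found."""
    # bs4 is the import name of the BeautifulSoup4 package.
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules()
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All Python dependencies are importable.")
```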

**JavaScript-heavy sites not loading properly**
```bash
# Increase wait time for JavaScript execution
cewlio https://example.com -w 10
```

**Browser timeout errors**
```bash
# Increase timeout and wait time
cewlio https://example.com --timeout 60000 -w 5
```

---

## 🤝 Contributing

We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for detailed guidelines on:

- 🚀 Getting started with development
- 📝 Code style and formatting guidelines
- 🧪 Testing requirements and procedures
- 🔄 Submitting pull requests
- 🐛 Reporting issues
- 💡 Feature requests

Quick start:
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests for new functionality
5. Submit a pull request

For detailed development setup and guidelines, see [CONTRIBUTING.md](CONTRIBUTING.md).

---

## 🙏 Credits

- Inspired by [CeWL](https://digi.ninja/projects/cewl.php) by Robin Wood
- Built with [Playwright](https://playwright.dev/) for browser automation
- Uses [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/) for HTML parsing

---

## 📞 Support

- 🐛 **Issues:** [GitHub Issues](https://github.com/0xCardinal/cewlio/issues)
- 📖 **Documentation:** [GitHub Wiki](https://github.com/0xCardinal/cewlio/wiki)
- 💬 **Discussions:** [GitHub Discussions](https://github.com/0xCardinal/cewlio/discussions)

---

**Made with ❤️ for the security community**

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "cewlio",
    "maintainer": "Kumar Ashwin",
    "docs_url": null,
    "requires_python": ">=3.12",
    "maintainer_email": null,
    "keywords": "wordlist, web scraping, html parsing, email extraction, metadata",
    "author": null,
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/ac/6d/00d6053d81dbccb6f23c02f163ad81fe0e1fa89d117617db4fea287b2675/cewlio-1.2.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Custom word list generator for web content",
    "version": "1.2.2",
    "project_urls": {
        "Bug Tracker": "https://github.com/0xCardinal/cewlio/issues",
        "Documentation": "https://github.com/0xCardinal/cewlio#readme",
        "Homepage": "https://github.com/0xCardinal/cewlio",
        "Repository": "https://github.com/0xCardinal/cewlio"
    },
    "split_keywords": [
        "wordlist",
        " web scraping",
        " html parsing",
        " email extraction",
        " metadata"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "e05844637b1083c3815b948bf0b0f8b8c3faefc8e9be775778f6c6119371aee8",
                "md5": "6d7fabc0f6984817d0735a835767728c",
                "sha256": "fb99baa7b44caaa3b6e600b25c4d6476d953e1ea65a9bf23c52db968e485c650"
            },
            "downloads": -1,
            "filename": "cewlio-1.2.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "6d7fabc0f6984817d0735a835767728c",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.12",
            "size": 37897,
            "upload_time": "2025-07-10T21:10:11",
            "upload_time_iso_8601": "2025-07-10T21:10:11.521335Z",
            "url": "https://files.pythonhosted.org/packages/e0/58/44637b1083c3815b948bf0b0f8b8c3faefc8e9be775778f6c6119371aee8/cewlio-1.2.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "ac6d00d6053d81dbccb6f23c02f163ad81fe0e1fa89d117617db4fea287b2675",
                "md5": "7107776569cf0cb0fa49e647470276ec",
                "sha256": "4f6a73264b34efac349da587bd52213b86db08873d4715516963dccf6041972c"
            },
            "downloads": -1,
            "filename": "cewlio-1.2.2.tar.gz",
            "has_sig": false,
            "md5_digest": "7107776569cf0cb0fa49e647470276ec",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.12",
            "size": 37335,
            "upload_time": "2025-07-10T21:10:13",
            "upload_time_iso_8601": "2025-07-10T21:10:13.877970Z",
            "url": "https://files.pythonhosted.org/packages/ac/6d/00d6053d81dbccb6f23c02f163ad81fe0e1fa89d117617db4fea287b2675/cewlio-1.2.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-10 21:10:13",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "0xCardinal",
    "github_project": "cewlio",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "cewlio"
}
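
The `sha256` values under `digests` above can be compared against a downloaded file. The following is a minimal sketch using `hashlib`; `sha256_of` is a hypothetical helper, and the file name in the commented usage is simply the sdist name from the raw data, assumed to be in the current directory.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 16):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Usage, assuming the sdist was downloaded next to this script:
# expected = "4f6a73264b34efac349da587bd52213b86db08873d4715516963dccf6041972c"
# print(sha256_of("cewlio-1.2.2.tar.gz") == expected)
```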
