# CeWLio 🕵️‍♂️✨
**CeWLio** is a powerful, Python-based Custom Word List Generator inspired by the original [CeWL](https://digi.ninja/projects/cewl.php) by Robin Wood. While CeWL is excellent for static HTML content, CeWLio brings modern web scraping capabilities to handle today's JavaScript-heavy websites. It crawls web pages, executes JavaScript, and extracts:
- 📚 Unique words (with advanced filtering)
- 📧 Email addresses
- 🏷️ Metadata (description, keywords, author)
Perfect for penetration testers, security researchers, and anyone needing high-quality, site-specific wordlists!
> **🤖 AI-Assisted Development**: This project was created with the help of AI tools, but solves real-world problems in web scraping and word list generation. Every line of code has been carefully reviewed, tested, and optimized for production use.
---
## 🚀 Features
- **JavaScript-Aware Extraction:** Uses a headless browser to render pages and extracts content after JavaScript execution.
- **Modern Web Support:** Handles Single Page Applications (SPAs), infinite scroll, lazy loading, and dynamic content that traditional scrapers miss.
- **Advanced Word Processing:**
  - Minimum/maximum word length filtering
  - Lowercase conversion
  - Alphanumeric or alpha-only words
  - Umlaut conversion (ä→ae, ö→oe, ü→ue, ß→ss)
  - Word frequency counting
- **Word Grouping:** Generate multi-word phrases (e.g., 2-grams, 3-grams)
- **Email & Metadata Extraction:** Find emails from content and mailto links, extract meta tags
- **Flexible Output:** Save words, emails, and metadata to separate files or stdout
- **Professional CLI:** All features accessible via command-line interface with CeWL-compatible flags
- **Silent Operation:** Runs quietly by default, with optional debug output
- **Comprehensive Testing:** 100% test coverage
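To make the email and metadata extraction feature concrete, here is a minimal sketch using only the Python standard library. It is not CeWLio's implementation: the class `EmailMetaParser`, the function `extract_emails_and_meta`, and the simplified email pattern are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Simplified pattern for illustration only; not CeWLio's actual regex.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class EmailMetaParser(HTMLParser):
    """Hypothetical helper: collects emails and selected meta tags."""

    META_NAMES = {"description", "keywords", "author"}

    def __init__(self):
        super().__init__()
        self.emails = set()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            href = attrs.get("href") or ""
            if href.lower().startswith("mailto:"):
                # Drop the scheme and any ?subject=... query part.
                address = href[len("mailto:"):].split("?", 1)[0]
                if address:
                    self.emails.add(address)
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            content = attrs.get("content")
            if name in self.META_NAMES and content:
                self.metadata[name] = content

    def handle_data(self, data):
        # Emails that appear in visible text.
        self.emails.update(EMAIL_RE.findall(data))


def extract_emails_and_meta(html):
    parser = EmailMetaParser()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.emails, parser.metadata
```

Keeping emails in a set deduplicates addresses that appear both in page text and in a `mailto:` link.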
---
## 🛠️ Installation
### From PyPI (Recommended)
```bash
pip install cewlio
```
### From Source
```bash
git clone https://github.com/0xCardinal/cewlio
cd cewlio
pip install -e .
```
### Dependencies
- Python 3.12+
- Playwright (for browser automation)
- BeautifulSoup4 (for HTML parsing)
- Requests (for HTTP handling)
**Note:** After installing Playwright, you only need to install the chromium-headless-shell browser:
```bash
playwright install chromium-headless-shell
```
---
## ⚡ Quick Start
### Basic Usage
```bash
# Extract words from a website (silent by default)
cewlio https://example.com
# Save words to a file
cewlio https://example.com --output wordlist.txt
# Include emails in stdout output
cewlio https://example.com -e
# Include metadata in stdout output
cewlio https://example.com -a
# Save emails and metadata to files
cewlio https://example.com --email_file emails.txt --meta_file meta.txt
```
### More Examples
**Generate word groups with counts:**
```bash
cewlio https://example.com --groups 3 -c --output phrases.txt
```
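For readers unfamiliar with word groups, the idea behind `--groups 3 -c` can be sketched as counting every run of three consecutive words. The function `word_groups` below is a hypothetical illustration, not CeWLio's code, and the tool's exact tokenization and output format may differ.

```python
from collections import Counter


def word_groups(words, size):
    """Return a Counter of consecutive `size`-word phrases (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    groups = (
        " ".join(words[i:i + size])
        for i in range(len(words) - size + 1)
    )
    return Counter(groups)


words = "the quick brown fox jumps over the quick brown dog".split()
for phrase, count in word_groups(words, 3).most_common(2):
    print(f"{phrase}, {count}")
```

A word list shorter than `size` simply yields no groups.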
**Custom word filtering:**
```bash
cewlio https://example.com -m 4 --max-length 12 --lowercase --convert-umlauts
```
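The filtering flags above roughly correspond to the steps in the following sketch. This is an assumption about how such filtering could work, not CeWLio's actual logic: `filter_words` is a hypothetical name, the uppercase umlaut mappings go beyond what the README lists, and dropping tokens that contain digits is one possible reading of `--with-numbers`.

```python
import re

# The README lists ä, ö, ü, ß; the uppercase entries are an added assumption.
UMLAUT_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})


def filter_words(text, min_length=3, max_length=None,
                 lowercase=False, convert_umlauts=False, with_numbers=False):
    """Tokenize `text` and apply length, case, and umlaut filters."""
    if convert_umlauts:
        text = text.translate(UMLAUT_MAP)
    if lowercase:
        text = text.lower()
    # [^\W_]+ matches runs of Unicode letters and digits.
    tokens = re.findall(r"[^\W_]+", text)
    if not with_numbers:
        # Assumption: tokens containing any digit are dropped entirely.
        tokens = [t for t in tokens if not any(c.isdigit() for c in t)]
    return [
        t for t in tokens
        if len(t) >= min_length
        and (max_length is None or len(t) <= max_length)
    ]


print(filter_words("Die Größe über 9000 web2py", min_length=4,
                   max_length=12, lowercase=True, convert_umlauts=True))
```

Umlauts are converted before the length check, so "Größe" counts as seven characters ("groesse") rather than five.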
**Handle JavaScript-heavy sites:**
```bash
cewlio https://example.com -w 5 --visible
```
**Extract only emails and metadata (no words):**
```bash
cewlio https://example.com -e -a
```
**Extract only emails (no words):**
```bash
cewlio https://example.com -e
```
**Extract only metadata (no words):**
```bash
cewlio https://example.com -a
```
**Save emails to file (no words to stdout):**
```bash
cewlio https://example.com --email_file emails.txt
```
---
## 🎛️ Command-Line Options
| Option | Description | Default |
|--------|-------------|---------|
| `url` | URL to process | Required |
| `--version` | Show version and exit | - |
| `--output` | Output file for words | stdout |
| `-e, --email` | Include email addresses in stdout output | False |
| `--email_file` | Output file for email addresses | - |
| `-a, --meta` | Include metadata in stdout output | False |
| `--meta_file` | Output file for metadata | - |
| `-m, --min_word_length` | Minimum word length | 3 |
| `--max-length` | Maximum word length | No limit |
| `--lowercase` | Convert words to lowercase | False |
| `--with-numbers` | Include words with numbers | False |
| `--convert-umlauts` | Convert umlaut characters | False |
| `-c, --count` | Show word counts | False |
| `--groups` | Generate word groups of specified size | - |
| `-w, --wait` | Wait time for JavaScript execution (seconds) | 0 |
| `--visible` | Show browser window | False |
| `--timeout` | Browser timeout (milliseconds) | 30000 |
| `--debug` | Show debug/summary output | False |
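When driving the CLI from a script, it can help to assemble the argument list programmatically. The sketch below uses only flags from the table above; `build_cewlio_command` is a hypothetical helper, not part of the package.

```python
import shlex


def build_cewlio_command(url, output=None, min_word_length=None,
                         lowercase=False, count=False, groups=None,
                         wait=None, timeout=None):
    """Assemble an argv list from flags documented in the options table."""
    cmd = ["cewlio", url]
    if output is not None:
        cmd += ["--output", output]
    if min_word_length is not None:
        cmd += ["-m", str(min_word_length)]
    if lowercase:
        cmd.append("--lowercase")
    if count:
        cmd.append("-c")
    if groups is not None:
        cmd += ["--groups", str(groups)]
    if wait is not None:
        cmd += ["-w", str(wait)]
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    return cmd


cmd = build_cewlio_command("https://example.com", output="wordlist.txt",
                           min_word_length=4, lowercase=True, wait=5)
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would execute it, assuming `cewlio` is installed and on `PATH`. Using a list rather than a shell string avoids quoting problems with URLs.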
---
## 📚 API Usage
### Basic Python Usage
```python
from cewlio import CeWLio
# Create instance with custom settings
cewlio = CeWLio(
    min_word_length=4,
    max_word_length=12,
    lowercase=True,
    convert_umlauts=True
)
# Process HTML content
html_content = "<p>Hello world! Contact us at test@example.com</p>"
cewlio.process_html(html_content)
# Access results
print("Words:", list(cewlio.words.keys()))
print("Emails:", list(cewlio.emails))
print("Metadata:", list(cewlio.metadata))
```
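The example above reads `cewlio.words` through `.keys()`, which suggests a mapping. Assuming its values are occurrence counts (not confirmed here), results could be ranked by frequency as in this sketch. `ranked_words` is a hypothetical helper, and the sample dictionary stands in for real output.

```python
def ranked_words(word_counts, top=None):
    """Sort (word, count) pairs by descending count, then alphabetically."""
    ranked = sorted(word_counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top]  # top=None keeps every entry


# Stand-in for a word -> count mapping such as cewlio.words (assumed shape).
sample = {"hello": 3, "world": 3, "contact": 1}
for word, count in ranked_words(sample, top=2):
    print(f"{word}, {count}")
```

Sorting ties alphabetically keeps the output stable between runs.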
### Process URLs
```python
import asyncio
from cewlio import CeWLio, process_url_with_cewlio
async def main():
    cewlio = CeWLio()
    # Render the page in a headless browser, waiting 5 seconds for JavaScript
    success = await process_url_with_cewlio(
        url="https://example.com",
        cewlio_instance=cewlio,
        wait_time=5,
        headless=True
    )
    if success:
        print(f"Found {len(cewlio.words)} words")
        print(f"Found {len(cewlio.emails)} emails")

asyncio.run(main())
```
---
## 🧪 Testing
The project includes a comprehensive test suite with 38 tests covering all functionality:
- ✅ Core functionality tests (15 tests)
- ✅ HTML extraction tests (3 tests)
- ✅ URL processing tests (2 tests)
- ✅ Integration tests (3 tests)
- ✅ CLI argument validation tests (5 tests)
- ✅ Edge case tests (10 tests)
**Total: 38 tests with 100% success rate**
For detailed testing information and development setup, see [CONTRIBUTING.md](CONTRIBUTING.md).
---
## 🐛 Troubleshooting
### Common Issues
**"No module named 'playwright'"**
```bash
pip install playwright
playwright install chromium-headless-shell
```
**JavaScript-heavy sites not loading properly**
```bash
# Increase wait time for JavaScript execution
cewlio https://example.com -w 10
```
**Browser timeout errors**
```bash
# Increase timeout and wait time
cewlio https://example.com --timeout 60000 -w 5
```
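When using the Python API, the same "increase the wait time" advice can be automated. The sketch below retries with progressively longer waits. `retry_with_longer_waits` and `fake_process` are hypothetical names, and the sketch assumes the processing coroutine returns a falsy value on failure rather than raising.

```python
import asyncio


async def retry_with_longer_waits(process, wait_times=(0, 5, 10)):
    """Call `process(wait_time)` until it returns a truthy result.

    Returns the wait time that succeeded, or None if every attempt failed.
    """
    for wait_time in wait_times:
        if await process(wait_time):
            return wait_time
    return None


async def fake_process(wait_time):
    # Stand-in for a real page load: "succeeds" only with enough wait.
    return wait_time >= 5


print(asyncio.run(retry_with_longer_waits(fake_process)))
```

With the real API, `process` could be a small wrapper such as `lambda w: process_url_with_cewlio(url=..., cewlio_instance=cewlio, wait_time=w, headless=True)`, using the keyword arguments shown in the API Usage section.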
---
## 🤝 Contributing
We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for detailed guidelines on:
- 🚀 Getting started with development
- 📝 Code style and formatting guidelines
- 🧪 Testing requirements and procedures
- 🔄 Submitting pull requests
- 🐛 Reporting issues
- 💡 Feature requests
Quick start:
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests for new functionality
5. Submit a pull request
For detailed development setup and guidelines, see [CONTRIBUTING.md](CONTRIBUTING.md).
---
## 🙏 Credits
- Inspired by [CeWL](https://digi.ninja/projects/cewl.php) by Robin Wood
- Built with [Playwright](https://playwright.dev/) for browser automation
- Uses [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/) for HTML parsing
---
## 📞 Support
- 🐛 **Issues:** [GitHub Issues](https://github.com/0xCardinal/cewlio/issues)
- 📖 **Documentation:** [GitHub Wiki](https://github.com/0xCardinal/cewlio/wiki)
- 💬 **Discussions:** [GitHub Discussions](https://github.com/0xCardinal/cewlio/discussions)
---
**Made with ❤️ for the security community**
Raw data
{
"_id": null,
"home_page": null,
"name": "cewlio",
"maintainer": "Kumar Ashwin",
"docs_url": null,
"requires_python": ">=3.12",
"maintainer_email": null,
"keywords": "wordlist, web scraping, html parsing, email extraction, metadata",
"author": null,
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/ac/6d/00d6053d81dbccb6f23c02f163ad81fe0e1fa89d117617db4fea287b2675/cewlio-1.2.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Custom word list generator for web content",
"version": "1.2.2",
"project_urls": {
"Bug Tracker": "https://github.com/0xCardinal/cewlio/issues",
"Documentation": "https://github.com/0xCardinal/cewlio#readme",
"Homepage": "https://github.com/0xCardinal/cewlio",
"Repository": "https://github.com/0xCardinal/cewlio"
},
"split_keywords": [
"wordlist",
" web scraping",
" html parsing",
" email extraction",
" metadata"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "e05844637b1083c3815b948bf0b0f8b8c3faefc8e9be775778f6c6119371aee8",
"md5": "6d7fabc0f6984817d0735a835767728c",
"sha256": "fb99baa7b44caaa3b6e600b25c4d6476d953e1ea65a9bf23c52db968e485c650"
},
"downloads": -1,
"filename": "cewlio-1.2.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "6d7fabc0f6984817d0735a835767728c",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.12",
"size": 37897,
"upload_time": "2025-07-10T21:10:11",
"upload_time_iso_8601": "2025-07-10T21:10:11.521335Z",
"url": "https://files.pythonhosted.org/packages/e0/58/44637b1083c3815b948bf0b0f8b8c3faefc8e9be775778f6c6119371aee8/cewlio-1.2.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "ac6d00d6053d81dbccb6f23c02f163ad81fe0e1fa89d117617db4fea287b2675",
"md5": "7107776569cf0cb0fa49e647470276ec",
"sha256": "4f6a73264b34efac349da587bd52213b86db08873d4715516963dccf6041972c"
},
"downloads": -1,
"filename": "cewlio-1.2.2.tar.gz",
"has_sig": false,
"md5_digest": "7107776569cf0cb0fa49e647470276ec",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.12",
"size": 37335,
"upload_time": "2025-07-10T21:10:13",
"upload_time_iso_8601": "2025-07-10T21:10:13.877970Z",
"url": "https://files.pythonhosted.org/packages/ac/6d/00d6053d81dbccb6f23c02f163ad81fe0e1fa89d117617db4fea287b2675/cewlio-1.2.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-10 21:10:13",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "0xCardinal",
"github_project": "cewlio",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "cewlio"
}