===========================================================================================
piedomains: AI-powered domain content classification
===========================================================================================

.. image:: https://github.com/themains/piedomains/actions/workflows/python-publish.yml/badge.svg
    :target: https://github.com/themains/piedomains/actions/workflows/python-publish.yml
.. image:: https://img.shields.io/pypi/v/piedomains.svg
    :target: https://pypi.python.org/pypi/piedomains
.. image:: https://readthedocs.org/projects/piedomains/badge/?version=latest
    :target: http://piedomains.readthedocs.io/en/latest/?badge=latest

**piedomains** predicts website content categories using AI analysis of domain names, text content, and homepage screenshots. Classify domains as news, shopping, adult content, education, etc. with high accuracy.
🚀 **Quickstart**
-------------------
Install the package from PyPI with ``pip install piedomains``, then classify domains in a few lines:

.. code-block:: python

    from piedomains import DomainClassifier

    classifier = DomainClassifier()

    # Classify current content
    result = classifier.classify(["cnn.com", "amazon.com", "wikipedia.org"])
    print(result[['domain', 'pred_label', 'pred_prob']])

    # Example output (probabilities are illustrative):
    #           domain pred_label  pred_prob
    # 0        cnn.com       news   0.876543
    # 1     amazon.com   shopping   0.923456
    # 2  wikipedia.org  education   0.891234

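
The ``pred_prob`` column in the output above can be used to set aside low-confidence predictions for manual review. The following is a minimal sketch that uses plain dictionaries as stand-ins for the result rows, so it runs without network access. The ``0.9`` cutoff and the ``split_by_confidence`` helper are illustrative assumptions, not part of piedomains.

.. code-block:: python

    # Stand-in rows mirroring the illustrative Quickstart output.
    rows = [
        {"domain": "cnn.com", "pred_label": "news", "pred_prob": 0.876543},
        {"domain": "amazon.com", "pred_label": "shopping", "pred_prob": 0.923456},
        {"domain": "wikipedia.org", "pred_label": "education", "pred_prob": 0.891234},
    ]

    def split_by_confidence(rows, threshold=0.9):
        """Return (confident, uncertain) lists of rows split on pred_prob."""
        confident = [r for r in rows if r["pred_prob"] >= threshold]
        uncertain = [r for r in rows if r["pred_prob"] < threshold]
        return confident, uncertain

    confident, uncertain = split_by_confidence(rows)
    print([r["domain"] for r in confident])
    print([r["domain"] for r in uncertain])

With the pandas DataFrame returned by ``classify``, the equivalent filter would likely be ``result[result['pred_prob'] >= 0.9]``.
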
📊 **Key Features**
--------------------
- **High Accuracy**: Combines text analysis and visual screenshots for 85-95% accuracy, depending on content type and method
- **Historical Analysis**: Classify websites from any point in time using archive.org
- **Fast & Scalable**: Batch processing with caching for 1000s of domains
- **Easy Integration**: Modern Python API with pandas output
- **41 Categories**: From news/finance to adult/gambling content
⚡ **Usage Examples**
---------------------
**Basic Classification**

.. code-block:: python

    from piedomains import DomainClassifier

    classifier = DomainClassifier()

    # Combined analysis (most accurate)
    result = classifier.classify(["github.com", "reddit.com"])

    # Text-only (faster)
    result = classifier.classify_by_text(["news.google.com"])

    # Images-only (good for visual content)
    result = classifier.classify_by_images(["instagram.com"])

**Historical Analysis**

.. code-block:: python

    from piedomains import DomainClassifier

    classifier = DomainClassifier()

    # Compare how Facebook was classified in 2010 vs. today
    old_facebook = classifier.classify(["facebook.com"], archive_date="20100101")
    new_facebook = classifier.classify(["facebook.com"])

    print(f"2010: {old_facebook.iloc[0]['pred_label']}")
    print(f"Today: {new_facebook.iloc[0]['pred_label']}")

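
The ``archive_date`` value in the example above appears to follow the ``YYYYMMDD`` pattern used in archive.org snapshot timestamps. Assuming that format, a date string can be built from a ``datetime.date`` rather than typed by hand. ``to_archive_date`` is a hypothetical helper, not part of piedomains.

.. code-block:: python

    from datetime import date

    def to_archive_date(d: date) -> str:
        """Format a date as YYYYMMDD (the assumed archive_date format)."""
        return d.strftime("%Y%m%d")

    print(to_archive_date(date(2010, 1, 1)))  # 20100101
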
**Batch Processing**

.. code-block:: python

    from piedomains import DomainClassifier

    classifier = DomainClassifier()

    # Process large lists efficiently
    domains = ["site1.com", "site2.com"]  # extend to 1000s of domains
    results = classifier.classify_batch(
        domains,
        method="text",       # text|images|combined
        batch_size=50,       # Process 50 at a time
        show_progress=True,  # Progress bar
    )

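
To illustrate what a ``batch_size`` of 50 means in practice, here is a minimal sketch of splitting a domain list into fixed-size chunks. The ``chunked`` helper is hypothetical and is not necessarily how piedomains implements ``classify_batch``; it only shows the general idea.

.. code-block:: python

    from typing import Iterator, List

    def chunked(items: List[str], size: int) -> Iterator[List[str]]:
        """Yield consecutive slices of at most `size` items."""
        if size <= 0:
            raise ValueError("size must be positive")
        for start in range(0, len(items), size):
            yield items[start:start + size]

    domains = [f"site{i}.com" for i in range(1, 121)]  # 120 placeholder domains
    batch_sizes = [len(batch) for batch in chunked(domains, 50)]
    print(batch_sizes)  # [50, 50, 20]
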
🏷️ **Supported Categories**
------------------------------
News, Finance, Shopping, Education, Government, Adult Content, Gambling, Social Networks, Search Engines, and 32 more categories based on the Shallalist taxonomy.
📈 **Performance**
-------------------
- **Speed**: ~10-50 domains/minute (depends on method and network)
- **Accuracy**: 85-95% depending on content type and method
- **Memory**: <500MB for batch processing
- **Caching**: Automatic content caching for faster re-runs
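
As a rough worked example of the throughput figure above, the sketch below converts the stated 10-50 domains/minute range into an estimated wall-clock time for a run. The ``estimate_minutes`` helper is illustrative only; actual speed depends on the method and the network, as noted.

.. code-block:: python

    def estimate_minutes(n_domains: int, slow_rate: float = 10, fast_rate: float = 50):
        """Return (best_case, worst_case) minutes for the stated rate range."""
        return n_domains / fast_rate, n_domains / slow_rate

    best, worst = estimate_minutes(1000)
    print(f"1000 domains: roughly {best:.0f} to {worst:.0f} minutes")
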
🔧 **Installation**
--------------------
**Requirements**: Python 3.9 through 3.13 (``>=3.9,<3.14``)

.. code-block:: bash

    # Basic installation
    pip install piedomains

    # For development
    git clone https://github.com/themains/piedomains
    cd piedomains
    pip install -e .

🔄 **Migration from v0.2.x**
-----------------------------
**Old API** (still supported):

.. code-block:: python

    from piedomains import domain
    result = domain.pred_shalla_cat_with_text(["example.com"])

**New API** (recommended):

.. code-block:: python

    from piedomains import DomainClassifier
    classifier = DomainClassifier()
    result = classifier.classify_by_text(["example.com"])

📖 **Documentation**
---------------------
- **API Reference**: https://piedomains.readthedocs.io
- **Examples**: ``/examples`` directory
- **Notebooks**: ``/piedomains/notebooks`` (training & analysis)
🤝 **Contributing**
--------------------

.. code-block:: bash

    # Set up development environment
    git clone https://github.com/themains/piedomains
    cd piedomains
    pip install -e ".[dev]"

    # Run tests
    pytest piedomains/tests/ -v

    # Run linting
    flake8 piedomains/

📄 **License**
---------------
MIT License - see LICENSE file.
📚 **Citation**
----------------
If you use piedomains in research, please cite:

.. code-block:: bibtex

    @software{piedomains,
      title={piedomains: AI-powered domain content classification},
      author={Chintalapati, Rajashekar and Sood, Gaurav},
      year={2024},
      url={https://github.com/themains/piedomains}
    }

---
**Legacy Documentation**
========================
For legacy API documentation, see LEGACY_API.rst