piedomains


:Name: piedomains
:Version: 0.3.10
:Summary: Predict categories based on domain names and their content
:Upload time: 2025-09-02 15:06:34
:Requires Python: <3.14,>=3.9
:License: MIT License
:Keywords: domain classification, website categorization, machine learning, content analysis, web scraping, computer vision

===========================================================================================
piedomains: AI-powered domain content classification
===========================================================================================

.. image:: https://github.com/themains/piedomains/actions/workflows/python-publish.yml/badge.svg
    :target: https://github.com/themains/piedomains/actions/workflows/python-publish.yml
.. image:: https://img.shields.io/pypi/v/piedomains.svg
    :target: https://pypi.python.org/pypi/piedomains
.. image:: https://readthedocs.org/projects/piedomains/badge/?version=latest
    :target: http://piedomains.readthedocs.io/en/latest/?badge=latest

**piedomains** predicts website content categories using AI analysis of domain names, text content, and homepage screenshots. Classify domains as news, shopping, adult content, education, etc. with high accuracy.

🚀 **Quickstart**
-------------------

Install the package from the shell, then classify domains in a few lines of Python:

.. code-block:: python

    # Install first (run in a shell, not in Python):
    #     pip install piedomains

    from piedomains import DomainClassifier
    classifier = DomainClassifier()

    # Classify current content
    result = classifier.classify(["cnn.com", "amazon.com", "wikipedia.org"])
    print(result[['domain', 'pred_label', 'pred_prob']])

    # Example output (values are illustrative):
    #           domain pred_label  pred_prob
    # 0        cnn.com       news   0.876543
    # 1     amazon.com   shopping   0.923456
    # 2  wikipedia.org  education   0.891234
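
Low-confidence predictions can be set aside for manual review. The sketch below assumes the result has been converted to a list of dicts (for instance with pandas' ``DataFrame.to_dict("records")``) carrying the ``domain``, ``pred_label``, and ``pred_prob`` columns from the quickstart. The helper ``split_by_confidence`` and the 0.9 threshold are illustrative and not part of piedomains.

.. code-block:: python

    def split_by_confidence(rows, threshold=0.9):
        """Split prediction rows into (confident, uncertain) lists."""
        confident = [r for r in rows if r["pred_prob"] >= threshold]
        uncertain = [r for r in rows if r["pred_prob"] < threshold]
        return confident, uncertain


    # Illustrative rows mirroring the quickstart's example output.
    rows = [
        {"domain": "cnn.com", "pred_label": "news", "pred_prob": 0.876543},
        {"domain": "amazon.com", "pred_label": "shopping", "pred_prob": 0.923456},
        {"domain": "wikipedia.org", "pred_label": "education", "pred_prob": 0.891234},
    ]

    confident, uncertain = split_by_confidence(rows, threshold=0.9)
    print([r["domain"] for r in confident])
    print([r["domain"] for r in uncertain])

A suitable threshold depends on how costly a wrong label is for the task at hand.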

📊 **Key Features**
--------------------

- **High Accuracy**: Combines text analysis and homepage screenshots; reported accuracy is 85-95% depending on content type and method
- **Historical Analysis**: Classify websites as they appeared at any date for which archive.org holds a snapshot
- **Fast & Scalable**: Batch processing with caching for 1000s of domains
- **Easy Integration**: Modern Python API with pandas output
- **41 Categories**: From news/finance to adult/gambling content

⚡ **Usage Examples**
---------------------

**Basic Classification**

.. code-block:: python

    from piedomains import DomainClassifier
    
    classifier = DomainClassifier()
    
    # Combined analysis (most accurate)
    result = classifier.classify(["github.com", "reddit.com"])
    
    # Text-only (faster)
    result = classifier.classify_by_text(["news.google.com"])
    
    # Images-only (good for visual content)  
    result = classifier.classify_by_images(["instagram.com"])

**Historical Analysis**

.. code-block:: python

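    from piedomains import DomainClassifier

    classifier = DomainClassifier()

    # Compare how Facebook was classified in 2010 with how it is classified today.
    # The archive date is passed here in YYYYMMDD form.
    old_facebook = classifier.classify(["facebook.com"], archive_date="20100101")
    new_facebook = classifier.classify(["facebook.com"])

    print(f"2010: {old_facebook.iloc[0]['pred_label']}")
    print(f"Today: {new_facebook.iloc[0]['pred_label']}")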
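
To track a site across several years, a list of archive dates can be generated up front. This is a minimal sketch that assumes ``archive_date`` takes a ``YYYYMMDD`` string, as the ``"20100101"`` example suggests; ``yearly_archive_dates`` is a hypothetical helper, not a piedomains function.

.. code-block:: python

    from datetime import date


    def yearly_archive_dates(start_year, end_year, month=1, day=1):
        """Return one YYYYMMDD string per year, inclusive of both ends."""
        return [
            date(year, month, day).strftime("%Y%m%d")
            for year in range(start_year, end_year + 1)
        ]


    snapshots = yearly_archive_dates(2010, 2013)
    print(snapshots)
    # Each string could then be passed as archive_date, for example:
    # classifier.classify(["facebook.com"], archive_date=snapshots[0])

Whether a snapshot exists for a given date depends on archive.org's coverage of the site.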

**Batch Processing**

.. code-block:: python

    # Process large lists efficiently (reuses the classifier created above)
    domains = ["site1.com", "site2.com"]  # extend to 1000s of domains
    results = classifier.classify_batch(
        domains,
        method="text",           # text|images|combined
        batch_size=50,           # Process 50 at a time
        show_progress=True       # Progress bar
    )
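
Large input lists often mix full URLs, stray whitespace, and duplicates. The sketch below cleans such a list before it is handed to ``classify_batch``. It assumes bare hostnames are the expected input, as in the examples in this README; ``normalize_domains`` is a hypothetical helper built on the standard library, not part of piedomains.

.. code-block:: python

    from urllib.parse import urlsplit


    def normalize_domains(raw_entries):
        """Return lowercase hostnames, de-duplicated, in first-seen order."""
        seen = set()
        cleaned = []
        for entry in raw_entries:
            entry = entry.strip()
            if not entry:
                continue
            # urlsplit only fills in the hostname when a scheme is present.
            if "://" not in entry:
                entry = "http://" + entry
            host = urlsplit(entry).hostname
            if host and host not in seen:
                seen.add(host)
                cleaned.append(host)
        return cleaned


    domains = normalize_domains(
        ["https://Site1.com/page", "site1.com", " site2.com ", ""]
    )
    print(domains)

De-duplicating first avoids fetching and classifying the same site more than once.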

🏷️ **Supported Categories**
------------------------------

News, Finance, Shopping, Education, Government, Adult Content, Gambling, Social Networks, Search Engines, and 32 more categories based on the Shallalist taxonomy.

📈 **Performance**
-------------------

- **Speed**: ~10-50 domains/minute (depends on method and network)
- **Accuracy**: 85-95% depending on content type and method
- **Memory**: <500MB for batch processing
- **Caching**: Automatic content caching for faster re-runs
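
The throughput range above translates directly into a rough runtime estimate. The sketch below is plain arithmetic on the quoted 10-50 domains/minute figures; ``estimate_minutes`` is an illustrative helper, and real runtimes will vary with method, network, and caching.

.. code-block:: python

    def estimate_minutes(n_domains, slow_rate=10, fast_rate=50):
        """Return (best_case, worst_case) minutes for n_domains."""
        return n_domains / fast_rate, n_domains / slow_rate


    best, worst = estimate_minutes(1000)
    print(f"1000 domains: roughly {best:.0f} to {worst:.0f} minutes")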

🔧 **Installation**
--------------------

**Requirements**: Python 3.9 through 3.13 (the package metadata declares ``<3.14,>=3.9``)

.. code-block:: bash

    # Basic installation
    pip install piedomains
    
    # For development
    git clone https://github.com/themains/piedomains
    cd piedomains
    pip install -e .
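
The package metadata for this release declares ``requires_python`` as ``<3.14,>=3.9``. A quick way to check the running interpreter against that range before installing is sketched below; ``is_supported`` is an illustrative helper, not something piedomains provides.

.. code-block:: python

    import sys


    def is_supported(version_info=sys.version_info):
        """True when the interpreter satisfies >=3.9,<3.14."""
        return (3, 9) <= tuple(version_info[:2]) < (3, 14)


    print(is_supported())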

🔄 **Migration from v0.2.x**
-----------------------------

**Old API** (still supported):

.. code-block:: python

    from piedomains import domain
    result = domain.pred_shalla_cat_with_text(["example.com"])

**New API** (recommended):

.. code-block:: python

    from piedomains import DomainClassifier
    classifier = DomainClassifier()
    result = classifier.classify_by_text(["example.com"])

📖 **Documentation**
---------------------

- **API Reference**: https://piedomains.readthedocs.io
- **Examples**: ``/examples`` directory
- **Notebooks**: ``/piedomains/notebooks`` (training & analysis)

🤝 **Contributing**
--------------------

.. code-block:: bash

    # Set up development environment
    git clone https://github.com/themains/piedomains
    cd piedomains
    pip install -e ".[dev]"
    
    # Run tests
    pytest piedomains/tests/ -v
    
    # Run linting
    flake8 piedomains/

📄 **License**
---------------

MIT License - see LICENSE file.

📚 **Citation**
----------------

If you use piedomains in research, please cite:

.. code-block:: bibtex

    @software{piedomains,
      title={piedomains: AI-powered domain content classification},
      author={Chintalapati, Rajashekar and Sood, Gaurav},
      year={2024},
      url={https://github.com/themains/piedomains}
    }

----

**Legacy Documentation**
========================

For legacy API documentation, see ``LEGACY_API.rst``.

            
