hibber-url-analyzer

- **Name**: hibber-url-analyzer
- **Version**: 1.0.0
- **Summary**: A tool for analyzing, categorizing, and reporting on URLs from browsing history or other sources
- **Home page**: https://github.com/url-analyzer/url-analyzer
- **Author**: URL Analyzer Team
- **Requires Python**: >=3.8
- **License**: MIT
- **Keywords**: url, analysis, web, browsing, history, categorization, reporting
- **Uploaded**: 2025-08-07 03:27:35
- **Requirements**: No requirements were recorded.
# URL Analyzer

A powerful tool for analyzing, categorizing, and reporting on URLs from browsing history or other sources.

![URL Analyzer](https://img.shields.io/badge/URL-Analyzer-blue)
![Python](https://img.shields.io/badge/Python-3.8%2B-brightgreen)
![License](https://img.shields.io/badge/License-MIT-yellow)

## Overview

URL Analyzer helps you understand web browsing patterns, identify potentially sensitive websites, and generate comprehensive reports with visualizations. It provides powerful classification, analysis, and reporting capabilities for URL data.

### Key Features

- **URL Classification**: Automatically categorize URLs using pattern matching and custom rules
- **Batch Processing**: Process multiple files with parallel execution for efficiency
- **Live Scanning**: Fetch additional information about uncategorized URLs
- **Content Summarization**: Generate AI-powered summaries of web page content
- **Interactive Reports**: Create rich HTML reports with charts, tables, and visualizations
- **Multiple Report Templates**: Choose from various templates for different use cases
- **Data Export**: Export analyzed data to CSV, JSON, or Excel formats
- **Customizable Rules**: Define custom classification rules for your specific needs
- **Memory Optimization**: Process large files efficiently with chunked processing

## Installation

### Requirements

- Python 3.8 or higher
- Required dependencies (installed automatically)

### Installation Steps

#### From PyPI (Recommended)

Install the latest stable version from PyPI:

```bash
pip install hibber-url-analyzer
```

#### Optional Dependencies

For additional features, install optional dependencies:

```bash
# For all features (quotes keep shells such as zsh from expanding the brackets)
pip install "hibber-url-analyzer[all]"

# For specific features
pip install "hibber-url-analyzer[url_fetching,html_parsing,visualization]"
```

Available optional dependency groups:
- `url_fetching`: For fetching URL content
- `html_parsing`: For parsing HTML content
- `domain_extraction`: For advanced domain analysis
- `visualization`: For enhanced charts and graphs
- `geolocation`: For IP geolocation features
- `pdf_export`: For PDF report generation
- `progress`: For progress bars
- `excel`: For Excel file support
- `terminal_ui`: For rich terminal interface
- `system`: For system monitoring
- `ml_analysis`: For machine learning analysis features
- `dev`: For development tools

#### From Source

For development or latest features:

```bash
git clone https://github.com/yourusername/url-analyzer.git
cd url-analyzer
pip install -e .
```

#### Environment Variables (Optional)

Set up environment variables for enhanced features:

- `GEMINI_API_KEY`: For content summarization features
- `URL_ANALYZER_CONFIG_PATH`: Custom path to configuration file
- `URL_ANALYZER_CACHE_PATH`: Custom path to cache file
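
A common pattern for such variables is an environment override with a built-in fallback. The sketch below is illustrative only: the variable name comes from the list above, but the default path is an assumption, not the package's actual default.

```python
import os
from pathlib import Path

# Assumed default location, for illustration only.
DEFAULT_CONFIG = Path.home() / ".url_analyzer" / "config.json"

def resolve_config_path() -> Path:
    """Honor URL_ANALYZER_CONFIG_PATH if set, else fall back to the default."""
    override = os.environ.get("URL_ANALYZER_CONFIG_PATH")
    return Path(override) if override else DEFAULT_CONFIG
```

Setting the variable before launching the tool (e.g. `export URL_ANALYZER_CONFIG_PATH=/etc/url_analyzer/config.json`) would then redirect configuration lookup.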

## Quick Start

### Analyzing a Single File

To analyze a CSV or Excel file containing URLs:

```bash
python -m url_analyzer analyze --path "path/to/file.csv"
```

This will:
1. Process the file and classify URLs
2. Generate an HTML report
3. Open the report in your default web browser

### Required File Format

Your input file should be a CSV or Excel file with at least one column named `Domain_name` containing the URLs to analyze. Additional columns that enhance the analysis include:

- `Client_Name`: Name of the device or user
- `MAC_address`: MAC address of the device
- `Access_time`: Timestamp of when the URL was accessed
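
To make the expected schema concrete, here is a minimal sketch that writes a valid input file with the standard library (the column names come from the list above; the row values are made up for illustration):

```python
import csv

rows = [
    {"Domain_name": "news.google.com", "Client_Name": "laptop-01",
     "MAC_address": "AA:BB:CC:DD:EE:FF", "Access_time": "2025-08-07 09:15:00"},
    {"Domain_name": "cdn.example.com", "Client_Name": "laptop-01",
     "MAC_address": "AA:BB:CC:DD:EE:FF", "Access_time": "2025-08-07 09:16:12"},
]

# Only Domain_name is required; the other columns enrich the analysis.
with open("sample_input.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["Domain_name", "Client_Name", "MAC_address", "Access_time"]
    )
    writer.writeheader()
    writer.writerows(rows)
```

The resulting `sample_input.csv` can be passed directly to `python -m url_analyzer analyze --path sample_input.csv`.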

### Basic Command Options

The basic command can be enhanced with various options:

- `--live-scan`: Fetch additional information about uncategorized URLs
- `--summarize`: Generate AI-powered summaries of web page content (requires `--live-scan`)
- `--aggregate`: Generate a single report for multiple files
- `--output PATH`: Specify output directory for reports
- `--no-open`: Don't automatically open the report in browser

Example with options:
```bash
python -m url_analyzer analyze --path "path/to/file.csv" --live-scan --summarize --output "reports"
```

## Advanced Features

### Batch Processing

To process multiple files at once:

```bash
python -m url_analyzer analyze --path "path/to/directory" --aggregate
```

This will process all CSV and Excel files in the directory and generate a single aggregated report.

Batch processing options:
- `--job-id ID`: Assign a unique identifier to the batch job
- `--max-workers N`: Set maximum number of parallel workers (default: 20)
- `--checkpoint-interval N`: Minutes between saving checkpoints (default: 5)
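
Conceptually, `--max-workers` caps the size of a worker pool that processes files in parallel. This is a sketch of that idea, not the package's actual implementation; `classify_file` is a hypothetical stand-in for the per-file analysis step:

```python
from concurrent.futures import ThreadPoolExecutor

def classify_file(path: str) -> dict:
    # Stand-in for the real per-file analysis; returns a summary dict.
    return {"path": path, "urls_classified": 42}

files = [f"logs/device_{i}.csv" for i in range(5)]

# Equivalent in spirit to running with --max-workers 4.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(classify_file, files))
```

`pool.map` preserves input order, so results line up with the file list even though the work runs concurrently.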

### Report Templates

URL Analyzer provides several built-in report templates for different use cases:

```bash
python -m url_analyzer templates
```

This will list all available templates with descriptions.

To use a specific template:

```bash
python -m url_analyzer analyze --path "path/to/file.csv" --template "template_name"
```

#### Available Templates

- **Default**: Comprehensive template with all report sections and visualizations
- **Minimal**: Simplified template with a clean, streamlined design
- **Data Analytics**: Data-focused template with additional metrics and insights
- **Security Focus**: Security-oriented template with security metrics and recommendations
- **Executive Summary**: High-level overview for executives with key metrics and insights
- **Print Friendly**: Optimized for printing with simplified layout and print-specific styling

### Exporting Data

Export analyzed data to different formats:

```bash
python -m url_analyzer export --path "path/to/file.csv" --format json
```

Supported formats: `csv`, `json`, `excel`
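
The CSV-to-JSON conversion is essentially a rows-to-records transformation. A minimal standard-library sketch of what the JSON export produces (the exact column set in real output may differ):

```python
import csv
import io
import json

# Small in-memory stand-in for an analyzed results file.
analyzed_csv = """Domain_name,Category
news.google.com,News
doubleclick.net,Advertising
"""

# Each CSV row becomes one JSON object keyed by the header names.
rows = list(csv.DictReader(io.StringIO(analyzed_csv)))
as_json = json.dumps(rows, indent=2)
print(as_json)
```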

### Configuration

Configure the application settings:

```bash
python -m url_analyzer configure
```

This opens an interactive configuration interface where you can:
- View current settings
- Add or modify URL classification patterns
- Configure API settings
- Set scan parameters

## Understanding the Report

The HTML report provides a comprehensive view of your URL data:

### Dashboard Section
- Doughnut chart showing URL categories
- Statistics including total URLs, sensitive URLs, etc.

### Domain & Time Insights
- Top 10 visited domains chart
- Traffic flow analysis (Sankey diagram)
- Activity heatmap by day and hour
- Daily URL access trend

### Detailed URL Data
- Interactive table with all URLs
- Filtering by client name or MAC address
- Export options (CSV, JSON)
- Dark/light theme toggle

## URL Classification

URLs are classified into several categories:

- **Sensitive**: URLs matching patterns for social media, adult content, etc.
- **User-Generated**: URLs containing user content patterns like profiles, forums, etc.
- **Advertising**: Ad-related domains and services
- **Analytics**: Tracking and analytics services
- **CDN**: Content delivery networks
- **Corporate**: Business and corporate sites
- **Uncategorized**: URLs that don't match any defined patterns
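
Pattern-based classification of this kind can be sketched as an ordered table of regexes checked until one matches. The category names come from the list above; the regexes here are invented examples, not the tool's real rule set:

```python
import re

# Illustrative pattern table; first match wins, so order implies priority.
PATTERNS = [
    (re.compile(r"(facebook|twitter|instagram)\."), "Sensitive"),
    (re.compile(r"(doubleclick|adservice)\."), "Advertising"),
    (re.compile(r"(google-analytics|segment)\."), "Analytics"),
    (re.compile(r"(cloudfront|akamai|fastly)\."), "CDN"),
]

def classify(domain: str) -> str:
    for pattern, category in PATTERNS:
        if pattern.search(domain):
            return category
    return "Uncategorized"
```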

## Custom URL Classification Rules

You can define custom classification rules in the configuration file (`config.json`):

```json
{
  "custom_rules": [
    {
      "pattern": "news\\.google\\.com",
      "category": "News",
      "priority": 120,
      "is_sensitive": false,
      "description": "Google News"
    }
  ]
}
```

For more information on custom rules, see the [Custom Rules Documentation](docs/custom_rules.md).
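
To show how the fields interact, here is a hedged sketch of loading rules in the schema above and matching by descending `priority` (the actual tie-breaking and matching semantics are documented in the custom rules guide and may differ):

```python
import json
import re
from typing import Optional

# The same config fragment shown above, embedded as a string.
config = json.loads(r'''
{
  "custom_rules": [
    {
      "pattern": "news\\.google\\.com",
      "category": "News",
      "priority": 120,
      "is_sensitive": false,
      "description": "Google News"
    }
  ]
}
''')

def match_custom_rule(domain: str, rules: list) -> Optional[str]:
    # Higher-priority rules are tried first; ties fall back to list order.
    for rule in sorted(rules, key=lambda r: r["priority"], reverse=True):
        if re.search(rule["pattern"], domain):
            return rule["category"]
    return None
```

Note the double backslashes in the JSON: `\\.` escapes the dot so it matches a literal `.` rather than any character.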

## Troubleshooting

### Common Issues

1. **Missing dependencies**:
   - Ensure you've installed all required dependencies with `pip install -r requirements.txt`
   - For advanced features, make sure optional dependencies are installed

2. **File format issues**:
   - Verify your input file has a `Domain_name` column
   - Check for encoding issues in CSV files

3. **Performance with large files**:
   - Use the `--max-workers` option to adjust parallel processing
   - Consider splitting very large files into smaller chunks

4. **API key issues**:
   - Set the `GEMINI_API_KEY` environment variable for summarization features
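
The chunked processing mentioned under Memory Optimization (and the file-splitting suggestion above) boil down to never holding a full file in memory. A standard-library sketch of that idea, not the package's actual reader:

```python
import csv
from itertools import islice
from typing import Iterator, List

def read_in_chunks(path: str, chunk_size: int = 10_000) -> Iterator[List[dict]]:
    """Yield lists of up to chunk_size rows without loading the whole file."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk
```

Each chunk can then be classified and discarded before the next is read, keeping memory use roughly constant regardless of file size.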

### Getting Help

If you encounter issues not covered here:
1. Check the logs for detailed error messages
2. Consult the [documentation](docs/)
3. Submit an issue on the project repository

## Command Reference

### Main Commands

- `analyze`: Process files and generate reports
- `configure`: Manage application configuration
- `export`: Export data to different formats
- `report`: Generate reports from previously analyzed data
- `templates`: List available report templates

### Global Options

- `--verbose`, `-v`: Increase output verbosity
- `--quiet`, `-q`: Suppress non-error output
- `--log-file PATH`: Specify log file path

For a complete list of commands and options, run:
```
python -m url_analyzer --help
```

## Documentation

- [User Instructions](docs/user_instructions.md): Detailed user guide
- [Custom Rules](docs/custom_rules.md): Guide to creating custom classification rules
- [Batch Processing](docs/batch_processing.md): Advanced batch processing options
- [Configuration Guide](docs/configuration.md): Detailed configuration options

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments

- Thanks to all contributors who have helped improve this project
- Special thanks to the open-source libraries that made this project possible

            
