# URL Analyzer
A powerful tool for analyzing, categorizing, and reporting on URLs from browsing history or other sources.

## Overview
URL Analyzer helps you understand web browsing patterns, identify potentially sensitive websites, and generate comprehensive reports with visualizations. It combines flexible classification, in-depth analysis, and rich reporting for URL data.
### Key Features
- **URL Classification**: Automatically categorize URLs using pattern matching and custom rules
- **Batch Processing**: Process multiple files with parallel execution for efficiency
- **Live Scanning**: Fetch additional information about uncategorized URLs
- **Content Summarization**: Generate AI-powered summaries of web page content
- **Interactive Reports**: Create rich HTML reports with charts, tables, and visualizations
- **Multiple Report Templates**: Choose from various templates for different use cases
- **Data Export**: Export analyzed data to CSV, JSON, or Excel formats
- **Customizable Rules**: Define custom classification rules for your specific needs
- **Memory Optimization**: Process large files efficiently with chunked processing
## Installation
### Requirements
- Python 3.8 or higher
- Required dependencies (installed automatically)
### Installation Steps
#### From PyPI (Recommended)
Install the latest stable version from PyPI:
```bash
pip install hibber-url-analyzer
```
#### Optional Dependencies
For additional features, install optional dependencies:
```bash
# For all features
pip install hibber-url-analyzer[all]
# For specific features
pip install hibber-url-analyzer[url_fetching,html_parsing,visualization]
```
Available optional dependency groups:
- `url_fetching`: For fetching URL content
- `html_parsing`: For parsing HTML content
- `domain_extraction`: For advanced domain analysis
- `visualization`: For enhanced charts and graphs
- `geolocation`: For IP geolocation features
- `pdf_export`: For PDF report generation
- `progress`: For progress bars
- `excel`: For Excel file support
- `terminal_ui`: For rich terminal interface
- `system`: For system monitoring
- `ml_analysis`: For machine learning analysis features
- `dev`: For development tools
#### From Source
For development or latest features:
```bash
git clone https://github.com/Hibber/url-analyzer.git
cd url-analyzer
pip install -e .
```
#### Environment Variables (Optional)
Set up environment variables for enhanced features:
- `GEMINI_API_KEY`: For content summarization features
- `URL_ANALYZER_CONFIG_PATH`: Custom path to configuration file
- `URL_ANALYZER_CACHE_PATH`: Custom path to cache file
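A minimal sketch of how these variables might be read from Python. The variable names come from the list above; the fallback values shown here are illustrative assumptions, not the tool's documented defaults:

```python
import os

# GEMINI_API_KEY has no sensible fallback: when unset, summarization is unavailable.
gemini_key = os.environ.get("GEMINI_API_KEY")

# Fallback paths below are assumptions for illustration only.
config_path = os.environ.get("URL_ANALYZER_CONFIG_PATH", "config.json")
cache_path = os.environ.get("URL_ANALYZER_CACHE_PATH", ".url_analyzer_cache")
```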
## Quick Start
### Analyzing a Single File
To analyze a CSV or Excel file containing URLs:
```bash
python -m url_analyzer analyze --path "path/to/file.csv"
```
This will:
1. Process the file and classify URLs
2. Generate an HTML report
3. Open the report in your default web browser
### Required File Format
Your input file should be a CSV or Excel file with at least one column named `Domain_name` containing the URLs to analyze. Additional columns that enhance the analysis include:
- `Client_Name`: Name of the device or user
- `MAC_address`: MAC address of the device
- `Access_time`: Timestamp of when the URL was accessed
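To illustrate the expected layout, the snippet below writes a small input file with the required `Domain_name` column plus the optional columns listed above (the domain and device values are made up for the example):

```python
import csv

# Sample rows: Domain_name is required; the other columns are optional extras.
rows = [
    {"Domain_name": "news.google.com", "Client_Name": "laptop-01",
     "MAC_address": "AA:BB:CC:DD:EE:FF", "Access_time": "2025-01-15 09:30:00"},
    {"Domain_name": "github.com", "Client_Name": "laptop-01",
     "MAC_address": "AA:BB:CC:DD:EE:FF", "Access_time": "2025-01-15 09:31:12"},
]

with open("sample_history.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["Domain_name", "Client_Name", "MAC_address", "Access_time"])
    writer.writeheader()
    writer.writerows(rows)
```

The resulting `sample_history.csv` can be passed directly to `python -m url_analyzer analyze --path sample_history.csv`.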
### Basic Command Options
The basic command can be enhanced with various options:
- `--live-scan`: Fetch additional information about uncategorized URLs
- `--summarize`: Generate AI-powered summaries of web page content (requires `--live-scan`)
- `--aggregate`: Generate a single report for multiple files
- `--output PATH`: Specify output directory for reports
- `--no-open`: Don't automatically open the report in browser
Example with options:
```bash
python -m url_analyzer analyze --path "path/to/file.csv" --live-scan --summarize --output "reports"
```
## Advanced Features
### Batch Processing
To process multiple files at once:
```bash
python -m url_analyzer analyze --path "path/to/directory" --aggregate
```
This will process all CSV and Excel files in the directory and generate a single aggregated report.
Batch processing options:
- `--job-id ID`: Assign a unique identifier to the batch job
- `--max-workers N`: Set maximum number of parallel workers (default: 20)
- `--checkpoint-interval N`: Minutes between saving checkpoints (default: 5)
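A rough sketch of how a `--max-workers`-style option maps onto parallel file processing. This is an assumption about the general technique, not the tool's actual internals, and `process_file` is a hypothetical placeholder:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def process_file(path: Path) -> str:
    # Placeholder for the real per-file work (classify URLs, build report data).
    return path.name

def run_batch(directory: str, max_workers: int = 20) -> list[str]:
    # Collect the CSV and Excel files, mirroring the batch behavior described above.
    files = sorted(p for p in Path(directory).iterdir()
                   if p.suffix.lower() in (".csv", ".xlsx"))
    # Process them in parallel, bounded by max_workers.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_file, files))
```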
### Report Templates
URL Analyzer provides several built-in report templates for different use cases:
```bash
python -m url_analyzer templates
```
This will list all available templates with descriptions.
To use a specific template:
```bash
python -m url_analyzer analyze --path "path/to/file.csv" --template "template_name"
```
#### Available Templates
- **Default**: Comprehensive template with all report sections and visualizations
- **Minimal**: Simplified template with a clean, streamlined design
- **Data Analytics**: Data-focused template with additional metrics and insights
- **Security Focus**: Security-oriented template with security metrics and recommendations
- **Executive Summary**: High-level overview for executives with key metrics and insights
- **Print Friendly**: Optimized for printing with simplified layout and print-specific styling
### Exporting Data
Export analyzed data to different formats:
```bash
python -m url_analyzer export --path "path/to/file.csv" --format json
```
Supported formats: `csv`, `json`, `excel`
### Configuration
Configure the application settings:
```bash
python -m url_analyzer configure
```
This opens an interactive configuration interface where you can:
- View current settings
- Add or modify URL classification patterns
- Configure API settings
- Set scan parameters
## Understanding the Report
The HTML report provides a comprehensive view of your URL data:
### Dashboard Section
- Doughnut chart showing URL categories
- Statistics including total URLs, sensitive URLs, etc.
### Domain & Time Insights
- Top 10 visited domains chart
- Traffic flow analysis (Sankey diagram)
- Activity heatmap by day and hour
- Daily URL access trend
### Detailed URL Data
- Interactive table with all URLs
- Filtering by client name or MAC address
- Export options (CSV, JSON)
- Dark/light theme toggle
## URL Classification
URLs are classified into several categories:
- **Sensitive**: URLs matching patterns for social media, adult content, etc.
- **User-Generated**: URLs containing user content patterns like profiles, forums, etc.
- **Advertising**: Ad-related domains and services
- **Analytics**: Tracking and analytics services
- **CDN**: Content delivery networks
- **Corporate**: Business and corporate sites
- **Uncategorized**: URLs that don't match any defined patterns
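The classification approach can be sketched as a first-match lookup over regex patterns. The patterns below are invented stand-ins for illustration; the real tool draws its patterns from configuration:

```python
import re

# Hypothetical pattern table mirroring the categories above.
CATEGORY_PATTERNS = [
    ("Sensitive", re.compile(r"(facebook|twitter|instagram)\.com")),
    ("Advertising", re.compile(r"(doubleclick|adservice)\.")),
    ("Analytics", re.compile(r"(google-analytics|segment)\.")),
    ("CDN", re.compile(r"(cloudfront\.net|akamai)")),
]

def classify(url: str) -> str:
    # Return the first category whose pattern matches; fall back to Uncategorized.
    for category, pattern in CATEGORY_PATTERNS:
        if pattern.search(url):
            return category
    return "Uncategorized"
```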
## Custom URL Classification Rules
You can define custom classification rules in the configuration file (`config.json`):
```json
{
  "custom_rules": [
    {
      "pattern": "news\\.google\\.com",
      "category": "News",
      "priority": 120,
      "is_sensitive": false,
      "description": "Google News"
    }
  ]
}
```
For more information on custom rules, see the [Custom Rules Documentation](docs/custom_rules.md).
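As a rough sketch of how such rules could be applied (an assumption about the mechanism, not the tool's actual code): rules are tried in descending `priority` order and the first regex match wins.

```python
import re

def classify_with_rules(url: str, rules: list[dict]) -> tuple[str, bool]:
    # Try higher-priority rules first; first matching pattern decides the category.
    for rule in sorted(rules, key=lambda r: r["priority"], reverse=True):
        if re.search(rule["pattern"], url):
            return rule["category"], rule["is_sensitive"]
    return "Uncategorized", False

# The rule below mirrors the config.json example above.
rules = [{"pattern": r"news\.google\.com", "category": "News",
          "priority": 120, "is_sensitive": False, "description": "Google News"}]
```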
## Troubleshooting
### Common Issues
1. **Missing dependencies**:
- Ensure the package installed cleanly (`pip install hibber-url-analyzer`, or `pip install -e .` from a source checkout)
- For advanced features, make sure optional dependencies are installed
2. **File format issues**:
- Verify your input file has a `Domain_name` column
- Check for encoding issues in CSV files
3. **Performance with large files**:
- Use the `--max-workers` option to adjust parallel processing
- Consider splitting very large files into smaller chunks
4. **API key issues**:
- Set the `GEMINI_API_KEY` environment variable for summarization features
### Getting Help
If you encounter issues not covered here:
1. Check the logs for detailed error messages
2. Consult the [documentation](docs/)
3. Submit an issue on the project repository
## Command Reference
### Main Commands
- `analyze`: Process files and generate reports
- `configure`: Manage application configuration
- `export`: Export data to different formats
- `report`: Generate reports from previously analyzed data
- `templates`: List available report templates
### Global Options
- `--verbose`, `-v`: Increase output verbosity
- `--quiet`, `-q`: Suppress non-error output
- `--log-file PATH`: Specify log file path
For a complete list of commands and options, run:
```bash
python -m url_analyzer --help
```
## Documentation
- [User Instructions](docs/user_instructions.md): Detailed user guide
- [Custom Rules](docs/custom_rules.md): Guide to creating custom classification rules
- [Batch Processing](docs/batch_processing.md): Advanced batch processing options
- [Configuration Guide](docs/configuration.md): Detailed configuration options
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments
- Thanks to all contributors who have helped improve this project
- Special thanks to the open-source libraries that made this project possible