<p align="center"><img src=".github/images/logo.png" alt="web2vec" title="web2vec"/></p>
<h1 align="center">
⚔️ Web2Vec: Website to Vector Representation Library ⚔️
</h1>
<p align="center">
<img alt="PyPI - Python Version" src="https://img.shields.io/pypi/pyversions/web2vec.svg">
<img alt="PyPI - Downloads" src="https://img.shields.io/pypi/dm/web2vec.svg" href="https://pepy.tech/project/web2vec">
<a href="https://repology.org/project/python:web2vec/versions">
<img src="https://repology.org/badge/tiny-repos/python:web2vec.svg" alt="Packaging status">
</a>
<img alt="Downloads" src="https://pepy.tech/badge/web2vec">
<img alt="GitHub license" src="https://img.shields.io/github/license/damianfraszczak/web2vec.svg" href="https://github.com/damianfraszczak/web2vec/blob/master/LICENSE">
<img alt="Documentation Status" src="https://readthedocs.org/projects/web2vec/badge/?version=latest" href="https://web2vec.readthedocs.io/en/latest/?badge=latest">
</p>
<p align="center">
<a href="https://github.com/damianfraszczak/web2vec?tab=readme-ov-file#why-web2vec">✨ Why Web2Vec?</a>
<a href="https://github.com/damianfraszczak/web2vec?tab=readme-ov-file#features">📦 Features</a>
<a href="https://github.com/damianfraszczak/web2vec/blob/master/docs/files/QUICK_START.md">🚀 Quick Start</a>
<a href="https://github.com/damianfraszczak/web2vec?tab=readme-ov-file#integration-and-configuration">🧑💻 Installation and configuration</a>
<a href="https://web2vec.readthedocs.io/">📮 Documentation</a>
<a href="https://github.com/damianfraszczak/web2vec/blob/master/docs/files/jupyter">📓 Jupyter Notebook examples</a>
<a href="LICENSE">🔑 License</a>
</p>
Web2Vec is a comprehensive library for converting websites into vectors of parameters. It provides ready-to-use Scrapy-based web crawlers, making it accessible to less experienced researchers, and is valuable for website analysis tasks including SEO, disinformation detection, and phishing identification.
Website analysis is crucial in various fields, such as SEO, where it helps improve website ranking, and in security, where it aids in identifying phishing sites. By building datasets based on known safe and malicious websites, Web2Vec facilitates the collection and analysis of their parameters, making it an ideal solution for these tasks.
The goal of Web2Vec is to offer a comprehensive repository for implementing a broad spectrum of website-processing methods. Many tools are already available, but learning and using them can be time-consuming, and new features appear continually, making it difficult to keep up with the latest techniques. Web2Vec aims to bridge this gap by providing a complete solution for website analysis. This repository facilitates the collection and analysis of extensive information about websites, supporting both academic research and industry applications.
* **Free software:** MIT license
* **Documentation:** https://web2vec.readthedocs.io/en/latest/
* **Python versions:** 3.9 | 3.10 | 3.11
* **Tested OS:** Windows, Ubuntu, Fedora and CentOS. **However, that does not mean it will not work on others.**
* **All-in-One Solution:** Web2Vec collects a wide range of information about websites in one place.
* **Efficiency and Expertise:** Building a similar solution independently would be very time-consuming and require specialized knowledge. Web2Vec not only integrates with available APIs but also scrapes results from services like Google Index using Selenium.
* **Open Source Advantage:** Publishing this tool as open source simplifies many studies and lets researchers and industry professionals focus on more advanced tasks.
* **Continuous Improvement:** New techniques will be added successively, ensuring continuous growth in this area.
## Features
- Crawler Implementation: Easily crawl specified websites with customizable depth and allowed domains.
- Network Analysis: Build and analyze networks of connected websites.
- Parameter Extraction: Extract a wide range of features for detailed analysis; each provider returns a Python dataclass, which keeps the code maintainable and makes adding new parameters easier. Extracted feature groups include:
  - HTML Content
  - DNS
  - HTTP Response
  - SSL Certificate
  - URL-related geographical location
  - URL Lexical Analysis
  - WHOIS Integration
  - Google Index
  - Open Page Rank
  - Open Phish
  - Phish Tank
  - Similar Web
  - URL Haus
By using this library, you can easily collect and analyze almost 200 parameters to describe a website comprehensively.
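Because every extractor returns a plain dataclass, turning a result into a flat numeric vector for downstream models is straightforward. A minimal sketch, assuming you only want numeric and boolean fields (the `features_to_vector` helper is illustrative, not a library API):

```python
from dataclasses import asdict

def features_to_vector(features) -> list:
    """Flatten a feature dataclass into numbers (hypothetical helper).

    Booleans become 0/1; strings, nested lists, and dicts are skipped.
    """
    vector = []
    for value in asdict(features).values():
        if isinstance(value, bool):  # check bool before int (bool is an int subclass)
            vector.append(int(value))
        elif isinstance(value, (int, float)):
            vector.append(value)
    return vector
```

Nested fields such as `found_forms` are skipped here; they typically need dedicated encoding.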
### HTML Content parameters
```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class HtmlBodyFeatures:
    contains_forms: bool
    contains_obfuscated_scripts: bool
    contains_suspicious_keywords: bool
    body_length: int
    num_titles: int
    num_images: int
    num_links: int
    script_length: int
    special_characters: int
    script_to_special_chars_ratio: float
    script_to_body_ratio: float
    body_to_special_char_ratio: float
    iframe_redirection: int
    mouse_over_effect: int
    right_click_disabled: int
    num_scripts_http: int
    num_styles_http: int
    num_iframes_http: int
    num_external_scripts: int
    num_external_styles: int
    num_external_iframes: int
    num_meta_tags: int
    num_forms: int
    num_forms_post: int
    num_forms_get: int
    num_forms_external_action: int
    num_hidden_elements: int
    num_safe_anchors: int
    num_media_http: int
    num_media_external: int
    num_email_forms: int
    num_internal_links: int
    favicon_url: Optional[str]
    logo_url: Optional[str]
    found_forms: List[Dict[str, Any]] = field(default_factory=list)
    found_images: List[Dict[str, Any]] = field(default_factory=list)
    found_anchors: List[Dict[str, Any]] = field(default_factory=list)
    found_media: List[Dict[str, Any]] = field(default_factory=list)
    copyright: Optional[str] = None
```
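For orientation, several of these counts can be reproduced directly with BeautifulSoup (already a Web2Vec dependency). This is a rough sketch of the idea, not the library's internal implementation:

```python
from bs4 import BeautifulSoup

def count_basic_html_features(html: str) -> dict:
    """Illustrative subset of HtmlBodyFeatures-style counts."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "contains_forms": soup.find("form") is not None,
        "num_images": len(soup.find_all("img")),
        "num_links": len(soup.find_all("a")),
        "num_forms": len(soup.find_all("form")),
        "num_meta_tags": len(soup.find_all("meta")),
        "body_length": len(soup.get_text()),
    }
```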
### DNS parameters
```python
@dataclass
class DNSRecordFeatures:
    record_type: str
    ttl: int
    values: List[str]
```
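These records map naturally onto dnspython (a Web2Vec dependency). A sketch of how such features can be gathered, assuming dnspython >= 2.0 (`collect_dns_records` is illustrative, not the library's function):

```python
import dns.exception
import dns.resolver  # dnspython

def collect_dns_records(domain: str, record_types=("A", "AAAA", "MX", "NS")):
    """Collect (record_type, ttl, values) tuples shaped like DNSRecordFeatures."""
    records = []
    for rtype in record_types:
        try:
            answer = dns.resolver.resolve(domain, rtype)
        except dns.exception.DNSException:
            continue  # no record of this type, timeout, or NXDOMAIN
        records.append((rtype, answer.rrset.ttl, [r.to_text() for r in answer]))
    return records
```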
### HTTP Response parameters
```python
@dataclass
class HttpResponseFeatures:
    redirects: bool
    redirect_count: int
    contains_forms: bool
    contains_obfuscated_scripts: bool
    contains_suspicious_keywords: bool
    uses_https: bool
    missing_x_frame_options: bool
    missing_x_xss_protection: bool
    missing_content_security_policy: bool
    missing_strict_transport_security: bool
    missing_x_content_type_options: bool
    is_live: bool
    server_version: Optional[str] = None
    body_length: int = 0
    num_titles: int = 0
    num_images: int = 0
    num_links: int = 0
    script_length: int = 0
    special_characters: int = 0
    script_to_special_chars_ratio: float = 0.0
    script_to_body_ratio: float = 0.0
    body_to_special_char_ratio: float = 0.0
```
### SSL Certificate parameters
```python
@dataclass
class CertificateFeatures:
    subject: Dict[str, Any]
    issuer: Dict[str, Any]
    not_before: datetime
    not_after: datetime
    is_valid: bool
    validity_message: str
    is_trusted: bool
    trust_message: str
```
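The validity window can be checked with nothing but the standard library. A hedged sketch of the underlying idea, not the library's `CertificateFeatures` code:

```python
import socket
import ssl
from datetime import datetime

def certificate_is_valid_now(hostname: str, port: int = 443) -> bool:
    """Fetch the peer certificate and test whether now is inside its validity window."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    fmt = "%b %d %H:%M:%S %Y %Z"  # e.g. "Jun  1 12:00:00 2024 GMT"
    not_before = datetime.strptime(cert["notBefore"], fmt)
    not_after = datetime.strptime(cert["notAfter"], fmt)
    return not_before <= datetime.utcnow() <= not_after
```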
### URL-related geographical location
```python
@dataclass
class URLGeoFeatures:
    url: str
    country_code: str
    asn: int
```
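Lookups of this kind are typically backed by MaxMind databases via geoip2 (a Web2Vec dependency). A sketch, assuming you have downloaded the GeoLite2 Country and ASN database files (the paths below are examples):

```python
import geoip2.database

def lookup_geo(ip: str):
    """Resolve country code and ASN for an IP from local GeoLite2 databases."""
    with geoip2.database.Reader("GeoLite2-Country.mmdb") as country_db:
        country_code = country_db.country(ip).country.iso_code
    with geoip2.database.Reader("GeoLite2-ASN.mmdb") as asn_db:
        asn = asn_db.asn(ip).autonomous_system_number
    return country_code, asn
```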
### URL Lexical Analysis
```python
@dataclass
class URLLexicalFeatures:
    count_dot_url: int
    count_dash_url: int
    count_underscore_url: int
    count_slash_url: int
    count_question_url: int
    count_equals_url: int
    count_at_url: int
    count_ampersand_url: int
    count_exclamation_url: int
    count_space_url: int
    count_tilde_url: int
    count_comma_url: int
    count_plus_url: int
    count_asterisk_url: int
    count_hash_url: int
    count_dollar_url: int
    count_percent_url: int
    url_length: int
    tld_amount_url: int
    count_dot_domain: int
    count_dash_domain: int
    count_underscore_domain: int
    count_slash_domain: int
    count_question_domain: int
    count_equals_domain: int
    count_at_domain: int
    count_ampersand_domain: int
    count_exclamation_domain: int
    count_space_domain: int
    count_tilde_domain: int
    count_comma_domain: int
    count_plus_domain: int
    count_asterisk_domain: int
    count_hash_domain: int
    count_dollar_domain: int
    count_percent_domain: int
    domain_length: int
    vowel_count_domain: int
    domain_in_ip_format: bool
    domain_contains_keywords: bool
    count_dot_directory: int
    count_dash_directory: int
    count_underscore_directory: int
    count_slash_directory: int
    count_question_directory: int
    count_equals_directory: int
    count_at_directory: int
    count_ampersand_directory: int
    count_exclamation_directory: int
    count_space_directory: int
    count_tilde_directory: int
    count_comma_directory: int
    count_plus_directory: int
    count_asterisk_directory: int
    count_hash_directory: int
    count_dollar_directory: int
    count_percent_directory: int
    directory_length: int
    count_dot_parameters: int
    count_dash_parameters: int
    count_underscore_parameters: int
    count_slash_parameters: int
    count_question_parameters: int
    count_equals_parameters: int
    count_at_parameters: int
    count_ampersand_parameters: int
    count_exclamation_parameters: int
    count_space_parameters: int
    count_tilde_parameters: int
    count_comma_parameters: int
    count_plus_parameters: int
    count_asterisk_parameters: int
    count_hash_parameters: int
    count_dollar_parameters: int
    count_percent_parameters: int
    parameters_length: int
    tld_presence_in_arguments: int
    number_of_parameters: int
    email_present_in_url: bool
    domain_entropy: float
    url_depth: int
    uses_shortening_service: Optional[str]
    is_ip: bool = False
```
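All of these are plain string statistics, so they are cheap to compute. A small sketch of a few of them using only the standard library (not the library's implementation):

```python
import math
from collections import Counter
from urllib.parse import urlparse

def basic_lexical_features(url: str) -> dict:
    """A handful of URLLexicalFeatures-style counts."""
    parsed = urlparse(url)
    domain = parsed.netloc
    entropy = 0.0
    if domain:
        counts = Counter(domain)
        entropy = -sum(
            (n / len(domain)) * math.log2(n / len(domain))
            for n in counts.values()
        )
    return {
        "count_dot_url": url.count("."),
        "count_dash_url": url.count("-"),
        "url_length": len(url),
        "domain_length": len(domain),
        "domain_entropy": entropy,
        "url_depth": len([p for p in parsed.path.split("/") if p]),
    }
```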
### WHOIS Integration
```python
@dataclass
class WhoisFeatures:
    domain_name: List[str]
    registrar: Optional[str]
    whois_server: Optional[str]
    referral_url: Optional[str]
    updated_date: Optional[datetime]
    creation_date: Optional[datetime]
    expiration_date: Optional[datetime]
    name_servers: List[str]
    status: List[str]
    emails: List[str]
    dnssec: Optional[str]
    name: Optional[str]
    org: Optional[str]
    address: Optional[str]
    city: Optional[str]
    state: Optional[str]
    zipcode: Optional[str]
    country: Optional[str]
    raw: Dict = field(default_factory=dict)
```
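The fields mirror what the python-whois package (a Web2Vec dependency) returns. A sketch of a direct lookup; note that python-whois may return either a single datetime or a list for date fields, so normalization is needed:

```python
import whois  # the python-whois package

def fetch_whois_summary(domain: str) -> dict:
    """Pull a few WhoisFeatures-style fields from a raw WHOIS lookup."""
    record = whois.whois(domain)

    def first(value):
        # python-whois returns a list when the registry reports several dates
        return value[0] if isinstance(value, list) else value

    return {
        "registrar": record.registrar,
        "creation_date": first(record.creation_date),
        "expiration_date": first(record.expiration_date),
    }
```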
### Google Index
```python
@dataclass
class GoogleIndexFeatures:
    is_indexed: Optional[bool]
    position: Optional[int] = None
```
### Open Page Rank
```python
@dataclass
class OpenPageRankFeatures:
    domain: str
    page_rank_decimal: Optional[float]
    updated_date: Optional[str]
```
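Open PageRank exposes a free REST API, keyed via `WEB2VEC_OPEN_PAGE_RANK_API_KEY` (see the configuration section below). A hedged sketch of a direct call with requests (endpoint and header per the public API docs; requests may need to be installed separately):

```python
import requests

def fetch_open_page_rank(domain: str, api_key: str):
    """Query the Open PageRank API for a single domain."""
    response = requests.get(
        "https://openpagerank.com/api/v1.0/getPageRank",
        params={"domains[]": domain},
        headers={"API-OPR": api_key},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["response"][0].get("page_rank_decimal")
```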
### Open Phish
```python
@dataclass
class OpenPhishFeatures:
    is_phishing: bool
```
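OpenPhish publishes its free feed as a plain text list of URLs, so a membership check is simple. A sketch (exact-match only; the feed updates frequently):

```python
import requests

def is_in_openphish_feed(url: str) -> bool:
    """Check a URL against the free OpenPhish feed (exact match)."""
    feed = requests.get("https://openphish.com/feed.txt", timeout=10)
    feed.raise_for_status()
    return url in set(feed.text.splitlines())
```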
### Phish Tank
```python
@dataclass
class PhishTankFeatures:
    phish_id: str
    url: str
    phish_detail_url: str
    submission_time: str
    verified: str
    verification_time: str
    online: str
    target: str
```
### Similar Web
```python
@dataclass
class SimilarWebFeatures:
    Version: int
    SiteName: str
    Description: str
    TopCountryShares: List[TopCountryShare]
    Title: str
    Engagements: Engagements
    EstimatedMonthlyVisits: List[EstimatedMonthlyVisit]
    GlobalRank: int
    CountryRank: int
    CountryCode: str
    CategoryRank: str
    Category: str
    LargeScreenshot: str
    TrafficSources: TrafficSource
    TopKeywords: List[TopKeyword]
    RawData: dict = field(default_factory=dict)
```
### URL Haus
```python
@dataclass
class URLHausFeatures:
    id: str
    date_added: str
    url: str
    url_status: str
    last_online: str
    threat: str
    tags: str
    urlhaus_link: str
    reporter: str
```
## Why Web2Vec?
While many scripts and solutions exist that perform some of the tasks offered by Web2Vec, none provide a complete all-in-one package. Web2Vec not only offers comprehensive functionality but also ensures maintainability and ease of use.
## Integration and Configuration
Web2Vec focuses on integration with free services, leveraging their APIs or scraping their responses. Configuration is handled via Python settings, making it easily configurable through traditional methods (environment variables, configuration files, etc.). Its integration with dedicated phishing detection services makes it a robust tool for building phishing detection datasets.
## How to use
The library can be installed using pip:
```bash
pip install web2vec
```
## Code usage
### Configuration
Configure the library using environment variables or configuration files.
```shell
export WEB2VEC_CRAWLER_SPIDER_DEPTH_LIMIT=2
export WEB2VEC_DEFAULT_OUTPUT_PATH=/home/admin/crawler/output
export WEB2VEC_OPEN_PAGE_RANK_API_KEY=XXXXX
```
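The resolved settings are available at runtime on the package's config object, e.g. the output directory used by the crawler example below:

```python
import web2vec as w2v

# Values resolved from WEB2VEC_* environment variables (or defaults)
print(w2v.config.crawler_output_path)
```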
### Crawling websites and extracting parameters
```python
import os
from scrapy.crawler import CrawlerProcess
import web2vec as w2v
process = CrawlerProcess(
    settings={
        "FEEDS": {
            os.path.join(w2v.config.crawler_output_path, "output.json"): {
                "format": "json",
                "encoding": "utf8",
            }
        },
        "DEPTH_LIMIT": 1,
        "LOG_LEVEL": "INFO",
    }
)

process.crawl(
    w2v.Web2VecSpider,
    start_urls=["http://quotes.toscrape.com/"],  # pages to process
    allowed_domains=["quotes.toscrape.com"],  # domains to process for links
    extractors=w2v.ALL_EXTRACTORS,  # extractors to use
)
process.start()
```
As a result, each processed page is stored in a separate file as JSON with the following keys:
- url - processed url
- title - website title extracted from HTML
- html - HTTP response text attribute
- response_headers - HTTP response headers
- status_code - HTTP response status code
- extractors - dictionary with extractors results
Sample content:
```json
{
  "url": "http://quotes.toscrape.com/",
  "title": "Quotes to Scrape",
  "html": "HTML body, removed too big to show",
  "response_headers": {
    "b'Content-Length'": "[b'11054']",
    "b'Date'": "[b'Tue, 23 Jul 2024 06:05:10 GMT']",
    "b'Content-Type'": "[b'text/html; charset=utf-8']"
  },
  "status_code": 200,
  "extractors": [
    {
      "name": "DNSFeatures",
      "result": {
        "domain": "quotes.toscrape.com",
        "records": [
          {
            "record_type": "A",
            "ttl": 225,
            "values": ["35.211.122.109"]
          },
          {
            "record_type": "CNAME",
            "ttl": 225,
            "values": ["ingress.prod-01.gcp.infra.zyte.group."]
          }
        ]
      }
    }
  ]
}
```
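Each output file shares this schema, so post-processing a crawl is a matter of loading the JSON and indexing the extractor results by name. A sketch, assuming the Scrapy feed configured above ended up at `output/output.json`:

```python
import json

# Load the crawler feed (a JSON array of page records) and
# index each page's extractor results by extractor name
with open("output/output.json", encoding="utf8") as fh:
    pages = json.load(fh)

for page in pages:
    results = {e["name"]: e["result"] for e in page["extractors"]}
    print(page["url"], page["status_code"], results.get("DNSFeatures"))
```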
### Website analysis
Websites can also be analyzed without the crawling process by using extractors directly. For example, to get data from SimilarWeb for a given domain, just call the appropriate function:
```python
from web2vec.extractors.external_api.similar_web_features import (
    get_similar_web_features,
)
domain_to_check = "down.pcclear.com"
entry = get_similar_web_features(domain_to_check)
print(entry)
```
All extractor modules are also exported from the main package, so you can import it and invoke them directly.
```python
import web2vec as w2v
domain_to_check = "down.pcclear.com"
entry = w2v.get_similar_web_features(domain_to_check)
print(entry)
```
## Contributing
For contribution guidelines, refer to the [CONTRIBUTING.md](.github/CONTRIBUTING.md) file.
We are a welcoming community... just follow the [Code of Conduct](.github/CODE_OF_CONDUCT.md).
## Maintainers
Project maintainers are:
- Damian Frąszczak
- Edyta Frąszczak