# Scrappify
Scrappify is a powerful yet simple website scraping and downloading tool. It allows you to easily **scrape links, download files, filter by file types, extract patterns (like emails or phone numbers), and perform deep crawling** — all from Python or the command line.
---
## Features
* Download entire websites
* Extract links, emails, phone numbers, or custom regex patterns
* Filter downloads by file type (images, documents, scripts, etc.)
* Fast downloads with configurable workers
* Cross-domain crawling support
* Command-line interface (CLI) and Python API
---
## Installation
```bash
pip install scrappify
```
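A quick way to confirm the install worked is to call the CLI's discovery flags (both are documented under Command Line Usage below):

```bash
# Print the built-in pattern names and file-type categories
scrappify --list-patterns
scrappify --list-types
```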
---
## Python Usage
### Basic Usage
```python
from scrappify import url, scrap, download

# Download entire website
url_download = url("https://example.com")
downloaded_files = download(url_download, output_dir="my_site")
print(f"Downloaded {len(downloaded_files)} files")

# Get all links from a page
links = scrap(url_download)
print(f"Found {len(links)} links")
```
### File Type Filtering
```python
from scrappify import download
from scrappify.patterns import file_type

# Download only JavaScript files
js_files = download("https://example.com", file_type="js", output_dir="js_files")

# Download images using the image category
images = download("https://example.com", file_type=file_type['image'], output_dir="images")

# Download multiple specific file types
docs_and_images = download("https://example.com", file_type=["pdf", "jpg", "png"])
```
### Pattern Searching
```python
from scrappify import download
from scrappify.patterns import pattern

# Find emails in all downloaded files
email_results = download("https://example.com", pattern=pattern['email'])

# Find phone numbers in HTML files only
phone_results = download("https://example.com", file_type="html", pattern=pattern['phone'])

# Custom regex pattern (US Social Security number format)
custom_pattern = r'\b\d{3}-\d{2}-\d{4}\b'
ssn_results = download("https://example.com", pattern=custom_pattern)

# Combine a file type filter with a pattern search
results = download("https://example.com", file_type="js", pattern=pattern['url'])
```
### Advanced Scraping
```python
from scrappify import scrap, download

# Deep crawling (multiple levels)
deep_links = scrap("https://example.com", depth=3)
print(f"Found {len(deep_links)} links across 3 levels")

# Download with more parallel workers
fast_download = download("https://example.com", max_workers=20, output_dir="fast_download")

# Cross-domain scraping (disable the same-domain restriction)
all_links = scrap("https://example.com", same_domain_only=False)
```
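The crawling and downloading pieces compose. Below is a minimal sketch that crawls two levels deep and then downloads each discovered page into one folder; it assumes `scrap` returns an iterable of plain URL strings, which the examples above don't state explicitly.

```python
from scrappify import scrap, download

# Crawl two levels deep, then fetch each discovered page.
# Assumes scrap() yields URL strings; adjust if it returns objects.
links = scrap("https://example.com", depth=2)
for link in links:
    download(link, output_dir="crawled_pages")
```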
### Programmatic Pattern Extraction
```python
from scrappify.core.utils import search_pattern_in_file

# Search for an email-address pattern in a specific file
results = search_pattern_in_file("downloaded_file.html", r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
for result in results:
    print(f"Email found: {result['match']} at line {result['line']}")
```
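To scan every downloaded file rather than a single one, `search_pattern_in_file` can be applied across an output directory. This sketch assumes the result dicts carry the `match` and `line` keys shown above, and reuses the built-in `pattern` table:

```python
from pathlib import Path

from scrappify.core.utils import search_pattern_in_file
from scrappify.patterns import pattern

# Report every email match in the HTML files under a download directory
for path in Path("my_site").rglob("*.html"):
    for result in search_pattern_in_file(str(path), pattern['email']):
        print(f"{path}: {result['match']} (line {result['line']})")
```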
---
## Command Line Usage
```bash
# Download entire website
scrappify https://example.com -o my_site

# Download only PDF files
scrappify https://example.com -t pdf -o documents

# Download images and search for emails
scrappify https://example.com -t image -p email -o images_with_emails

# Deep crawl (3 levels) and download everything
scrappify https://example.com -d 3 -o deep_site

# Use a custom regex pattern
scrappify https://example.com -p '\b\d{3}-\d{2}-\d{4}\b' -o ssn_search

# List available patterns
scrappify --list-patterns

# List available file types
scrappify --list-types

# High-performance download with 20 workers
scrappify https://example.com -w 20 -o fast_download
```
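The same flags compose in ordinary shell loops for batch runs. A small sketch, assuming a hypothetical `sites.txt` with one URL per line:

```bash
# Pull the PDFs from every site listed in sites.txt, one folder per site
while IFS= read -r site; do
  name=${site#*//}   # strip the scheme
  name=${name%%/*}   # keep only the host part
  scrappify "$site" -t pdf -o "pdfs_$name"
done < sites.txt
```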
### Complex Examples
```bash
# Download all JavaScript and CSS files, search for URLs
scrappify https://example.com -t javascript -t css -p url -o assets_with_urls

# Download documents and images, search for prices
scrappify https://example.com -t document -t image -p price -o priced_content

# Deep crawl with a custom pattern
scrappify https://example.com -d 2 -p '#[a-zA-Z0-9_]+' -o hashtags
```
---
## Available Options
### File Types
* `image` → png, jpg, gif, svg, etc.
* `document` → pdf, docx, txt, etc.
* `javascript`, `css`, `html`
* Custom extensions supported (e.g., `zip`, `mp4`)
### Patterns
* `email` → find emails
* `phone` → detect phone numbers
* `url` → extract URLs
* `price` → detect price patterns
* Custom regex patterns supported
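Both tables are importable from `scrappify.patterns` (as in the Python examples above), so you can enumerate them without the CLI. A minimal sketch, assuming both are plain dict-like mappings:

```python
from scrappify.patterns import file_type, pattern

# Python-side equivalent of `scrappify --list-types` / `--list-patterns`
print("File type categories:", ", ".join(sorted(file_type)))
print("Built-in patterns:", ", ".join(sorted(pattern)))
```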
---
## License
MIT License © 2025 MrFidal