scrappify 0.0.1

- Summary: A powerful web scraping and downloading utility
- Homepage: https://github.com/ByteBreach/scrappify
- Author: hackinglab
- Requires Python: >=3.6
- Uploaded: 2025-09-13 18:00:12
- Keywords: scraping, web scraping, website downloader, crawler, web crawler, data extraction, regex, hackinglab, mrfidal, email extractor, file downloader, python scraping, automation

---
# Scrappify

Scrappify is a powerful yet simple website scraping and downloading tool. It allows you to easily **scrape links, download files, filter by file types, extract patterns (like emails or phone numbers), and perform deep crawling** — all from Python or the command line.

---

## Features

* Download entire websites
* Extract links, emails, phone numbers, or custom regex patterns
* Filter downloads by file type (images, documents, scripts, etc.)
* Fast downloads with configurable workers
* Cross-domain crawling support
* Command-line interface (CLI) and Python API

---

## Installation

```bash
pip install scrappify
```

---

## Python Usage

### Basic Usage

```python
from scrappify import url, scrap, download

# Download entire website
url_download = url("https://example.com")
downloaded_files = download(url_download, output_dir="my_site")
print(f"Downloaded {len(downloaded_files)} files")

# Get all links from a page
links = scrap(url_download)
print(f"Found {len(links)} links")
```
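
The example above passes `download` the value returned by `url()`, while later sections pass a plain URL string; both forms appear in this README. Assuming `download` returns the list of saved file paths (consistent with the `len(downloaded_files)` call above), a hypothetical post-processing step might look like this:

```python
from pathlib import Path

# Assumes downloaded_files is a list of local file paths returned by download()
for path in downloaded_files:
    p = Path(path)
    print(f"{p.suffix or '(no extension)'}: {p.name}")
```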

### File Type Filtering

```python
from scrappify import url, download
from scrappify.patterns import file_type

# Download only JavaScript files
js_files = download("https://example.com", file_type="js", output_dir="js_files")

# Download images using category
images = download("https://example.com", file_type=file_type['image'], output_dir="images")

# Download multiple specific file types
docs_and_images = download("https://example.com", file_type=["pdf", "jpg", "png"])
```

### Pattern Searching

```python
from scrappify import url, download
from scrappify.patterns import pattern

# Find emails in all downloaded files
email_results = download("https://example.com", pattern=pattern['email'])

# Find phone numbers in HTML files only
phone_results = download("https://example.com", file_type="html", pattern=pattern['phone'])

# Custom regex pattern
custom_pattern = r'\b\d{3}-\d{2}-\d{4}\b'  # SSN pattern
ssn_results = download("https://example.com", pattern=custom_pattern)

# Combine file type and pattern
results = download("https://example.com", file_type="js", pattern=pattern['url'])
```

### Advanced Scraping

```python
from scrappify import url, scrap, download

# Deep crawling (multiple levels)
deep_links = scrap("https://example.com", depth=3)
print(f"Found {len(deep_links)} links across 3 levels")

# Download with increased workers
fast_download = download("https://example.com", max_workers=20, output_dir="fast_download")

# Cross-domain downloading (disable same-domain restriction)
all_links = scrap("https://example.com", same_domain_only=False)
```

### Programmatic Pattern Extraction

```python
from scrappify.core.utils import search_pattern_in_file

# Search pattern in specific file
results = search_pattern_in_file("downloaded_file.html", r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
for result in results:
    print(f"Email found: {result['match']} at line {result['line']}")
```
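
For orientation, here is a minimal sketch of what a helper like `search_pattern_in_file` plausibly does, assuming it scans a file line by line and returns dicts with `match` and `line` keys, as the loop above implies (the actual implementation lives in `scrappify.core.utils` and may differ):

```python
import re

def search_pattern_in_file_sketch(filepath, regex):
    """Minimal stand-in: scan a text file line by line for a regex.

    Returns a list of {'match': str, 'line': int} dicts, matching the
    result shape the example above iterates over.
    """
    compiled = re.compile(regex)
    results = []
    with open(filepath, "r", encoding="utf-8", errors="ignore") as f:
        for lineno, text in enumerate(f, start=1):
            for m in compiled.finditer(text):
                results.append({"match": m.group(0), "line": lineno})
    return results
```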

---

## Command Line Usage

```bash
# Download entire website
scrappify https://example.com -o my_site

# Download only PDF files
scrappify https://example.com -t pdf -o documents

# Download images and search for emails
scrappify https://example.com -t image -p email -o images_with_emails

# Deep crawl (3 levels) and download everything
scrappify https://example.com -d 3 -o deep_site

# Use custom regex pattern
scrappify https://example.com -p '\b\d{3}-\d{2}-\d{4}\b' -o ssn_search

# List available patterns
scrappify --list-patterns

# List available file types
scrappify --list-types

# High-performance download with 20 workers
scrappify https://example.com -w 20 -o fast_download
```

### Complex Examples

```bash
# Download all JavaScript and CSS files, search for URLs
scrappify https://example.com -t javascript -t css -p url -o assets_with_urls

# Download documents and images, search for prices
scrappify https://example.com -t document -t image -p price -o priced_content

# Deep crawl with custom pattern
scrappify https://example.com -d 2 -p '#[a-zA-Z0-9_]+' -o hashtags
```

---

## Available Options

### File Types

* `image` → png, jpg, gif, svg, etc.
* `document` → pdf, docx, txt, etc.
* `javascript`, `css`, `html`
* Custom extensions supported (e.g., `zip`, `mp4`)
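
The File Type Filtering section above imports a `file_type` mapping from `scrappify.patterns` and indexes it with `file_type['image']`. Assuming it is a plain dict of category names to extension lists (not verified against the package internals), you could inspect the available categories directly:

```python
from scrappify.patterns import file_type

# Assumes file_type maps category names (e.g. 'image') to extension lists
for category, extensions in file_type.items():
    print(f"{category}: {', '.join(extensions)}")
```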

### Patterns

* `email` → find emails
* `phone` → detect phone numbers
* `url` → extract URLs
* `price` → detect price patterns
* Custom regex patterns supported
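
Likewise, the `pattern` mapping used in the Pattern Searching section appears to hold named regexes. A quick sanity check against a sample string might look like this, assuming `pattern` is a dict of pattern names to regex strings:

```python
import re
from scrappify.patterns import pattern

sample = "Contact sales@example.com or call 555-123-4567. Price: $19.99"

# Assumes pattern maps names (e.g. 'email') to regex strings
for name in ("email", "phone", "url", "price"):
    regex = pattern.get(name)
    if regex:
        print(name, re.findall(regex, sample))
```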

---


## License

MIT License © 2025 MrFidal
