TezzCrawler

- Name: TezzCrawler
- Version: 0.3.1
- Summary: A web crawler that converts web pages to markdown and prepares them for LLM consumption
- Home page: https://github.com/TezzLabs/TezzCrawler
- Author: Japkeerat Singh
- Maintainer: None
- License: None
- Requires Python: >=3.10
- Upload time: 2024-12-15 12:43:32
- Requirements: requests==2.32.3, typer==0.13.0, python-dotenv==1.0.1, beautifulsoup4==4.12.3, markdownify==0.13.1, lxml==5.3.0

# TezzCrawler

A powerful web crawler that converts web pages to markdown format, making them ready for LLM consumption.

## Features

- Single page scraping with markdown conversion
- Full website crawling using sitemap.xml
- Proxy support for web scraping
- Simple CLI interface
- Easy to use as a Python package

## Installation

```bash
pip install TezzCrawler
```
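
If you prefer to track the repository directly, pip can also install straight from the GitHub home page listed above:

```bash
pip install git+https://github.com/TezzLabs/TezzCrawler.git
```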

## Usage

### Command Line Interface

1. Scrape a single page:
```bash
tezzcrawler scrape-page https://example.com --output ./output
```

2. Crawl from sitemap (see the sitemap sketch after these examples):
```bash
tezzcrawler crawl-from-sitemap https://example.com/sitemap.xml --output ./output
```

3. Using with proxy:
```bash
tezzcrawler scrape-page https://example.com \
    --proxy-url proxy.example.com \
    --proxy-port 8080 \
    --proxy-username user \
    --proxy-password pass \
    --output ./output
```
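
For reference, a sitemap is a plain XML file whose `<loc>` entries list the pages a crawl should visit. The sketch below enumerates those entries with `requests` and `lxml` (two of the pinned dependencies); it is an illustration of the idea only, not TezzCrawler's own crawl code.

```python
# Illustration: list the URLs a sitemap-driven crawl would visit.
# This is not TezzCrawler's internal implementation.
import requests
from lxml import etree

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_sitemap_urls(sitemap_url: str) -> list[str]:
    """Fetch a sitemap.xml and return every <loc> entry it contains."""
    response = requests.get(sitemap_url, timeout=30)
    response.raise_for_status()
    root = etree.fromstring(response.content)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS)]


# for url in list_sitemap_urls("https://example.com/sitemap.xml"):
#     print(url)
```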

### Python Package

```python
from tezzcrawler import Scraper, Crawler
from pathlib import Path

# Scrape a single page
scraper = Scraper()
scraper.scrape_page("https://example.com", Path("./output"))

# Crawl from sitemap
crawler = Crawler()
crawler.crawl_sitemap("https://example.com/sitemap.xml", Path("./output"))

# With proxy configuration
scraper = Scraper(
    proxy_url="proxy.example.com",
    proxy_port=8080,
    proxy_username="user",
    proxy_password="pass"
)
```
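
To make the "converts web pages to markdown" step concrete, the sketch below shows roughly what such a conversion involves, using only the pinned dependencies (`requests`, `beautifulsoup4`, `markdownify`, `lxml`). It illustrates the general technique; TezzCrawler's actual implementation may differ.

```python
# Rough illustration of HTML-to-markdown conversion with the pinned
# dependencies; not TezzCrawler's own implementation.
import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md


def page_to_markdown(url: str) -> str:
    """Fetch a page, drop non-content tags, and convert the remaining HTML to markdown."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "lxml")
    for tag in soup(["script", "style"]):
        tag.decompose()  # remove script/style elements before conversion
    return md(str(soup))


# print(page_to_markdown("https://example.com"))
```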

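Since `python-dotenv` is among the pinned dependencies, proxy credentials can plausibly be kept in a `.env` file instead of being hard-coded. The `PROXY_*` variable names below are hypothetical, chosen for illustration rather than defined by TezzCrawler.

```python
# Hypothetical .env-based proxy setup; the PROXY_* names are illustrative,
# not variables defined by TezzCrawler.
import os
from pathlib import Path

from dotenv import load_dotenv

from tezzcrawler import Scraper

load_dotenv()  # read a local .env file, if present

scraper = Scraper(
    proxy_url=os.getenv("PROXY_URL"),
    proxy_port=int(os.getenv("PROXY_PORT", "8080")),
    proxy_username=os.getenv("PROXY_USERNAME"),
    proxy_password=os.getenv("PROXY_PASSWORD"),
)
scraper.scrape_page("https://example.com", Path("./output"))
```
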
## Development

1. Clone the repository:
```bash
git clone https://github.com/TezzLabs/TezzCrawler.git
cd TezzCrawler
```

2. Install development dependencies:
```bash
pip install -e ".[dev]"
```

## License

MIT License - see LICENSE file for details.


            
