# TezzCrawler
A powerful web crawler that converts web pages to markdown format, making them ready for LLM consumption.
## Features
- Single page scraping with markdown conversion
- Full website crawling using sitemap.xml
- Proxy support for web scraping
- Simple CLI interface
- Easy to use as a Python package
## Installation
Requires Python 3.10 or newer.
```bash
pip install TezzCrawler
```
## Usage
### Command Line Interface
1. Scrape a single page:
```bash
tezzcrawler scrape-page https://example.com --output ./output
```
2. Crawl from sitemap:
```bash
tezzcrawler crawl-from-sitemap https://example.com/sitemap.xml --output ./output
```
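The sitemap is expected in the standard sitemaps.org format. As a rough sketch of what crawling from a sitemap involves (an illustration using the standard library, not TezzCrawler's actual implementation), the page URLs can be extracted like this:

```python
import xml.etree.ElementTree as ET

# Minimal sitemap in the standard sitemaps.org format.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

# The sitemap namespace must be registered for findall() lookups.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def extract_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text for loc in root.findall(".//sm:loc", NS)]

print(extract_urls(SITEMAP))
```

Each extracted URL is then scraped and converted to markdown, one file per page.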
3. Using with proxy:
```bash
tezzcrawler scrape-page https://example.com \
    --proxy-url proxy.example.com \
    --proxy-port 8080 \
    --proxy-username user \
    --proxy-password pass \
    --output ./output
```
### Python Package
```python
from tezzcrawler import Scraper, Crawler
from pathlib import Path
# Scrape a single page
scraper = Scraper()
scraper.scrape_page("https://example.com", Path("./output"))
# Crawl from sitemap
crawler = Crawler()
crawler.crawl_sitemap("https://example.com/sitemap.xml", Path("./output"))
# With proxy configuration
scraper = Scraper(
    proxy_url="proxy.example.com",
    proxy_port=8080,
    proxy_username="user",
    proxy_password="pass",
)
```
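TezzCrawler fetches pages with `requests`, so the four proxy parameters above presumably combine into a single proxy URL of the form `scheme://user:pass@host:port`. A hedged sketch of that mapping (an assumption about the wiring, not the library's actual code; `build_proxy_url` is a hypothetical helper):

```python
def build_proxy_url(url: str, port: int, username: str, password: str) -> str:
    """Combine separate proxy settings into a requests-style proxy URL.

    Illustrative only -- this mirrors how the CLI/constructor arguments
    are assumed to be used, not TezzCrawler's internals.
    """
    return f"http://{username}:{password}@{url}:{port}"

proxy = build_proxy_url("proxy.example.com", 8080, "user", "pass")
# The mapping shape that requests accepts via its `proxies=` argument.
proxies = {"http": proxy, "https": proxy}
```

Note that credentials embedded in a proxy URL are visible in logs and process listings, so prefer loading them from the environment (the package depends on `python-dotenv`).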
## Development
1. Clone the repository:
```bash
git clone https://github.com/TezzLabs/TezzCrawler.git
cd TezzCrawler
```
2. Install development dependencies:
```bash
pip install -e ".[dev]"
```
## License
MIT License - see LICENSE file for details.