| Field | Value |
|-------|-------|
| Name | AsyncURLCrawler |
| Version | 0.0.4 |
| Summary | AsyncURLCrawler navigates through web pages concurrently by following hyperlinks to collect URLs. |
| Upload time | 2024-08-31 21:01:25 |
| Home page | None |
| Author | None |
| Maintainer | None |
| Docs URL | None |
| Requires Python | >=3.8 |
| License | None |
| Keywords | None |
| Requirements | No requirements were recorded. |
| Travis-CI | No Travis builds recorded. |
| Coveralls test coverage | No Coveralls data recorded. |
# AsyncURLCrawler
`AsyncURLCrawler` navigates through web pages concurrently by following hyperlinks to collect URLs.
`AsyncURLCrawler` traverses pages using the BFS (breadth-first search) algorithm. Before crawling, check the `robots.txt` of the target domains.
**👉 For complete documentation read [here](https://asyncurlcrawlerdocs.pages.dev/)**
**👉 Source code on Github [here](https://github.com/PouyaEsmaeili/AsyncURLCrawler)**
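
Before crawling a domain, its `robots.txt` can be checked with Python's standard library. The sketch below is illustrative only and is not part of the `AsyncURLCrawler` API:

```python
# Minimal robots.txt check using only the standard library (illustrative sketch,
# not part of AsyncURLCrawler).
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def is_allowed(url: str, user_agent: str = "Mozilla") -> bool:
    """Return True if the domain's robots.txt allows fetching the given URL."""
    parsed = urlparse(url)
    robots = RobotFileParser(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    robots.read()  # Fetches and parses the robots.txt file.
    return robots.can_fetch(user_agent, url)


print(is_allowed("https://pouyae.ir"))
```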
---
### Install Package
```commandline
pip install AsyncURLCrawler
```
Or pin a specific version:
```commandline
pip install AsyncURLCrawler==<version>
```
👉 The project's official [page](https://pypi.org/project/AsyncURLCrawler) on PyPI.
---
### Usage Example in Code
Here is a simple Python script showing how to use the package:
```python
import asyncio
import os

import yaml

from AsyncURLCrawler.parser import Parser
from AsyncURLCrawler.crawler import Crawler

# Directory where the crawl result is written.
output_path = "."


async def main():
    parser = Parser(
        delay_start=0.1,
        max_retries=5,
        request_timeout=1,
        user_agent="Mozilla",
    )
    crawler = Crawler(
        seed_urls=["https://pouyae.ir"],
        parser=parser,
        exact=True,
        deep=False,
        delay=0.1,
    )
    result = await crawler.crawl()
    with open(os.path.join(output_path, "result.yaml"), "w") as file:
        # crawl() returns sets of URLs; convert them to lists so YAML can dump them.
        for key in result:
            result[key] = list(result[key])
        yaml.dump(result, file)


if __name__ == "__main__":
    asyncio.run(main())
```
This is the output for the above code:
```yaml
https://pouyae.ir:
- https://github.com/PouyaEsmaeili/AsyncURLCrawler
- https://pouyae.ir/images/pouya3.jpg
- https://github.com/PouyaEsmaeili/CryptographicClientSideUserState
- https://github.com/PouyaEsmaeili/RateLimiter
- https://pouyae.ir/
- https://github.com/luizdepra/hugo-coder/
- https://duman.pouyae.ir/
- https://pouyae.ir/projects/
- https://pouyae.ir/images/pouya4.jpg
- https://pouyae.ir/images/pouya5.jpg
- https://pouyae.ir/gallery/
- https://github.com/PouyaEsmaeili
- https://pouyae.ir/blog/
- https://www.linkedin.com/in/pouya-esmaeili-9124b839/
- https://pouyae.ir/about/
- https://stackoverflow.com/users/13118327/pouya-esmaeili?tab=profile
- https://pouyae.ir/contact-me/
- https://github.com/PouyaEsmaeili/SnowflakeID
- https://pouyae.ir/images/pouya2.jpg
- https://github.com/PouyaEsmaeili/gFuzz
- https://linktr.ee/pouyae
- https://gohugo.io/
- https://pouyae.ir/images/pouya1.jpg
```
👉 There is also a blog post about using `AsyncURLCrawler` to find malicious URLs in a web page. [Read here](https://towardsdev.com/viruscan-a-website-for-malicious-url-with-asyncurlcrawler-and-virus-total-2adaef0201c3?source=friends_link&sk=b537f4ab5387b8172d70b73c933412d1).
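
The crawl result can also be written in other formats. The sketch below is a small variation of the example above (an assumption based on the API shown there, not an official recipe) that dumps the result as JSON:

```python
import asyncio
import json

from AsyncURLCrawler.parser import Parser
from AsyncURLCrawler.crawler import Crawler


async def main():
    parser = Parser(delay_start=0.1, max_retries=5, request_timeout=1, user_agent="Mozilla")
    crawler = Crawler(seed_urls=["https://pouyae.ir"], parser=parser, exact=True, deep=False, delay=0.1)
    result = await crawler.crawl()
    # crawl() maps each seed URL to a collection of found URLs; convert to
    # sorted lists so the result is JSON-serializable and stable.
    serializable = {seed: sorted(urls) for seed, urls in result.items()}
    with open("result.json", "w") as f:
        json.dump(serializable, f, indent=2)


if __name__ == "__main__":
    asyncio.run(main())
```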
---
### Commandline Tool
The crawler can also be run from the command line via `src/cmd/cmd.py`, which accepts the following arguments to configure its behavior:
| argument | description |
|-----------|---------------------|
| `--url` | Specifies a list of URLs to crawl. At least one URL must be provided. |
| `--exact` | Optional flag; if set, the crawler will restrict crawling to the specified subdomain/domain only. Default is False. |
| `--deep` | Optional flag; if enabled, the crawler will explore all visited URLs. Not recommended due to potential resource intensity. If `--deep` is set, the `--exact` flag is ignored. |
| `--delay` | Sets the delay between consecutive HTTP requests, in seconds. |
| `--output`| Specifies the path for the output file, which will be saved in YAML format. |
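
Assuming the argument names above, an invocation could look like the following (illustrative only; the exact form of `--url` for multiple URLs may differ):

```commandline
python src/cmd/cmd.py --url https://pouyae.ir --exact --delay 0.1 --output result.yaml
```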
---
### Run Commandline Tool in Docker Container 🐳
A `Dockerfile` in `src/cmd` runs the command-line tool described above inside a Docker container.
```commandline
docker build -t crawler .
```
```commandline
docker run -v my_dir:/src/output --name crawler crawler
```
After the container finishes, the resulting output file is available in `my_dir`, the volume defined above.
To configure the tool for your needs, check the `CMD` instruction in the `Dockerfile`.
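
As an example, and assuming the image runs the script via a plain `CMD` (an assumption about the provided `Dockerfile`), the arguments could also be overridden at run time; adjust the script path to match the image layout:

```commandline
docker run -v my_dir:/src/output crawler python cmd.py --url https://pouyae.ir --output /src/output/result.yaml
```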
---
### Build and Publish to the Python Package Index (PyPI)
Requirements:
```commandline
python3 -m pip install --upgrade build
```
```commandline
python3 -m pip install --upgrade twine
```
👉 For more details check [Packaging Python Projects](https://packaging.python.org/en/latest/tutorials/packaging-projects/).
Build and upload:
```commandline
python3 -m build
```
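
Optionally, the built artifacts can be validated before uploading with `twine check` (a standard Twine subcommand, not part of the original instructions):

```commandline
python3 -m twine check dist/*
```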
```commandline
python3 -m twine upload --repository pypi dist/*
```
---
### Build Documentation with Sphinx
Install packages listed in `docs/doc-requirements.txt`.
```commandline
cd docs
```
```commandline
pip install -r doc-requirements.txt
```
```commandline
make clean
```
```commandline
make html
```
HTML files are generated in `docs/build`. Push them to the repository and deploy on _pages.dev_.
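
To preview the generated site locally before pushing, Python's built-in web server can be used (an optional step; the exact output subdirectory depends on the Sphinx configuration):

```commandline
python3 -m http.server --directory docs/build 8000
```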
---
### Workflow
- Branch off, implement features and merge them to `main`. Remove feature branches.
- Update version in `pyproject.toml` and push to `main`.
- Add a release tag on [GitHub](https://github.com/PouyaEsmaeili/AsyncURLCrawler/releases).
- Build and push the package to [PyPI](https://pypi.org/project/AsyncURLCrawler/).
- Build the documentation and push the HTML files to the [AsyncURLCrawlerDocs repo](https://github.com/PouyaEsmaeili/AsyncURLCrawlerDocs).
- The documentation is deployed to [pages.dev](https://asyncurlcrawlerdocs.pages.dev/) automatically.
---
### Contact
**[Find me @ My Homepage](https://pouyae.ir)**
---
### Disclaimer
**⚠️ Use at your own risk. The author and contributors are not responsible for any misuse or consequences resulting from the use of this project.**
---