# AsyncURLCrawler
`AsyncURLCrawler` navigates through web pages concurrently by following hyperlinks to collect URLs.
It traverses links using a BFS (breadth-first search) strategy. Before crawling a domain, check its `robots.txt` first.
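
For instance, a `robots.txt` check can be done with Python's standard `urllib.robotparser` before handing a seed URL to the crawler (the user agent and URLs below are only illustrative):

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the target domain's robots.txt before crawling it.
robots = RobotFileParser()
robots.set_url("https://pouyae.ir/robots.txt")
robots.read()

# Only proceed if this user agent is allowed to fetch the seed URL.
if robots.can_fetch("Mozilla", "https://pouyae.ir/"):
    print("Crawling allowed by robots.txt")
else:
    print("Crawling disallowed by robots.txt")
```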

**👉 For the complete documentation, read [here](https://asyncurlcrawlerdocs.pages.dev/)**

**👉 Source code on Github [here](https://github.com/PouyaEsmaeili/AsyncURLCrawler)**

---

### Install Package

```commandline 
pip install AsyncURLCrawler
```

```commandline 
pip install AsyncURLCrawler==<version>
```

👉 The official [page](https://pypi.org/project/AsyncURLCrawler) of the project on PyPI.

---

### Usage Example in Code

Here is a simple Python script that shows how to use the package:

```python
import asyncio
import os
from AsyncURLCrawler.parser import Parser
from AsyncURLCrawler.crawler import Crawler
import yaml

# Directory where the crawl result is written.
output_path = os.getcwd()


async def main():
    # Parser handles HTTP fetching: retries, timeout, and the user agent header.
    parser = Parser(
        delay_start=0.1,
        max_retries=5,
        request_timeout=1,
        user_agent="Mozilla",
    )
    # Crawler follows hyperlinks starting from the seed URLs.
    crawler = Crawler(
        seed_urls=["https://pouyae.ir"],
        parser=parser,
        exact=True,
        deep=False,
        delay=0.1,
    )
    result = await crawler.crawl()
    # Dump the collected URLs, grouped by seed URL, to a YAML file.
    with open(os.path.join(output_path, 'result.yaml'), 'w') as file:
        for key in result:
            result[key] = list(result[key])
        yaml.dump(result, file)


if __name__ == "__main__":
    asyncio.run(main())
```

This is the output of the above code:
```yaml
https://pouyae.ir:
- https://github.com/PouyaEsmaeili/AsyncURLCrawler
- https://pouyae.ir/images/pouya3.jpg
- https://github.com/PouyaEsmaeili/CryptographicClientSideUserState
- https://github.com/PouyaEsmaeili/RateLimiter
- https://pouyae.ir/
- https://github.com/luizdepra/hugo-coder/
- https://duman.pouyae.ir/
- https://pouyae.ir/projects/
- https://pouyae.ir/images/pouya4.jpg
- https://pouyae.ir/images/pouya5.jpg
- https://pouyae.ir/gallery/
- https://github.com/PouyaEsmaeili
- https://pouyae.ir/blog/
- https://www.linkedin.com/in/pouya-esmaeili-9124b839/
- https://pouyae.ir/about/
- https://stackoverflow.com/users/13118327/pouya-esmaeili?tab=profile
- https://pouyae.ir/contact-me/
- https://github.com/PouyaEsmaeili/SnowflakeID
- https://pouyae.ir/images/pouya2.jpg
- https://github.com/PouyaEsmaeili/gFuzz
- https://linktr.ee/pouyae
- https://gohugo.io/
- https://pouyae.ir/images/pouya1.jpg
```
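
The YAML output maps each seed URL to the list of URLs collected from it. As a small follow-up sketch (assuming the file name used in the script above), the result can be loaded back for post-processing with PyYAML:

```python
import yaml

# Load the crawl result produced by the script above.
with open("result.yaml") as file:
    crawled = yaml.safe_load(file)

# Print how many URLs were collected per seed URL.
for seed_url, urls in crawled.items():
    print(f"{seed_url}: {len(urls)} URLs collected")
```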

👉 There is also a blog post about using `AsyncURLCrawler` to find malicious URLs in a web page. [Read here](https://towardsdev.com/viruscan-a-website-for-malicious-url-with-asyncurlcrawler-and-virus-total-2adaef0201c3?source=friends_link&sk=b537f4ab5387b8172d70b73c933412d1).

---
### Commandline Tool

The crawler can also be run as a command-line tool via `src/cmd/cmd.py`, which accepts several arguments to configure its behavior:

| argument | description |
|-----------|---------------------|
| `--url` | Specifies a list of URLs to crawl. At least one URL must be provided. |
| `--exact` | Optional flag; if set, the crawler restricts crawling to the specified subdomain/domain only. Default is False. |
| `--deep` | Optional flag; if enabled, the crawler explores all visited URLs. Not recommended due to potential resource intensity. If `--deep` is set, the `--exact` flag is ignored. |
| `--delay` | Sets the delay between consecutive HTTP requests, in seconds. |
| `--output` | Specifies the path of the output file, which is saved in YAML format. |
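
As an illustration only (the flag values are hypothetical; run the script with `--help` for the authoritative usage), an invocation might look like this:

```commandline
python src/cmd/cmd.py --url https://pouyae.ir --exact --delay 0.1 --output ./output
```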

---

### Run Commandline Tool in Docker Container 🐳

There is a Dockerfile in `src/cmd` for running the command-line tool described above in a Docker container.

```commandline 
docker build -t crawler .
```

```commandline
docker run -v my_dir:/src/output --name crawler crawler
```

After the container finishes, the resulting output file is available in the `my_dir` volume mounted in the command above.
To configure the tool for your needs, adjust the `CMD` instruction in the `Dockerfile`, or override the command at run time as sketched below.
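
A hedged example of such an override, assuming the image's working directory contains `cmd.py` and the `Dockerfile` uses `CMD` rather than an `ENTRYPOINT`:

```commandline
docker run -v my_dir:/src/output crawler \
    python cmd.py --url https://pouyae.ir --output /src/output/result.yaml
```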

---

### Build and Publish to the Python Package Index (PyPI)

Requirements:

```commandline
python3 -m pip install --upgrade build
```

```commandline
python3 -m pip install --upgrade twine
```
👉 For more details check [Packaging Python Projects](https://packaging.python.org/en/latest/tutorials/packaging-projects/).

Build and upload:

```commandline 
python3 -m build
```

```commandline
python3 -m twine upload --repository pypi dist/*
```
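
Optionally, the upload can be rehearsed against TestPyPI first, as described in the packaging tutorial linked above (a separate TestPyPI account and API token are required):

```commandline
python3 -m twine upload --repository testpypi dist/*
```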

---

### Build Documentation with Sphinx

Install packages listed in `docs/doc-requirements.txt`.

```commandline
cd docs
```

```commandline
pip install -r doc-requirements.txt
```

```commandline
make clean
```

```commandline
make html
```

HTML files will be generated in `docs/build`. Push them to the repository and deploy on _pages.dev_.

---

### Workflow

- Branch off, implement features, and merge them into `main`. Remove feature branches afterwards.
- Update the version in `pyproject.toml` and push to `main`.
- Add a release tag in [Github](https://github.com/PouyaEsmaeili/AsyncURLCrawler/releases); a command-line equivalent is sketched after this list.
- Build and push the package to [PyPi](https://pypi.org/project/AsyncURLCrawler/).
- Build the documentation and push the HTML files to the [AsyncURLCrawlerDocs repo](https://github.com/PouyaEsmaeili/AsyncURLCrawlerDocs).
- Documentation will be deployed on [pages.dev](https://asyncurlcrawlerdocs.pages.dev/) automatically.
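
The release tag mentioned above can also be created from the command line before drafting the GitHub release (the tag name is illustrative):

```commandline
git tag v0.0.4
git push origin v0.0.4
```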

---

### Contact

**[Find me @ My Homepage](https://pouyae.ir)**

---

### Disclaimer

**⚠️ Use at your own risk. The author and contributors are not responsible for any misuse or consequences resulting from the use of this project.**

--- 

            
