just-another-imgscrapper

Name	just-another-imgscrapper
Version	0.1.1
home_page	https://github.com/deshrit/just-another-imgscrapper
Summary	A utility for scrapping images from a HTML doc from a URL.
upload_time	2023-06-06 19:53:18
docs_url	None
author	Deshrit Baral
requires_python	>=3.7
license	MIT
keywords	image scrapper asyncio httpx beautifulsoup4 lxml
requirements	No requirements were recorded.

# just-another-imgscrapper
![](https://github.com/deshrit/just-another-imgscrapper/actions/workflows/tests.yml/badge.svg)

A utility for scraping images from an HTML document.

Uses `asyncio` to download images concurrently.
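As a rough illustration of what concurrent downloading with `asyncio` looks like, here is a minimal sketch. It is not the package's actual implementation: `fake_download` and `download_all` are hypothetical names, and the network request is replaced by a short sleep so the example runs offline.

```python
import asyncio
from typing import List


async def fake_download(url: str, delay: float = 0.05) -> bool:
    """Hypothetical stand-in for a network download; sleeps instead of fetching."""
    if not url:
        raise ValueError("empty URL")
    await asyncio.sleep(delay)
    return True


async def download_all(urls: List[str]) -> int:
    """Start every download at once and count how many succeeded."""
    results = await asyncio.gather(
        *(fake_download(u) for u in urls),
        return_exceptions=True,  # a failed download must not cancel the others
    )
    return sum(1 for r in results if r is True)


if __name__ == "__main__":
    # The empty URL fails, so two of the three downloads succeed.
    print(asyncio.run(download_all(["http://foo.com/a.png", "", "http://foo.com/b.png"])))
```

Because the coroutines wait at the same time rather than one after another, the total time is close to that of the slowest download instead of the sum of all of them.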

## Installation
```bash
$ pip install just-another-imgscrapper
```
## Usage
### 1. From the CLI
```bash
$ imgscrapper -h
```
The following command fetches the HTML document, extracts image links from the `src` attribute of its `<img>` tags, and downloads the images.
```
$ imgscrapper "http://foo.com/bar"
[2023-06-06 23:22:56] imgscrapper.utils:INFO: ### Initializing Scrapping ###
[2023-06-06 23:23:01] imgscrapper.utils:INFO: ### Downloaded 41 images out of extracted 41 links ###
```
Images are downloaded to the `imgs/` directory inside the current working directory. The directory is created if it does not exist.
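The extraction step described above (collecting the `src` of each `<img>` tag) can be sketched with the standard library alone. This is only an illustration under assumptions: the package lists `beautifulsoup4` and `lxml` among its keywords and probably parses differently, and `ImgSrcParser` and `extract_img_links` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class ImgSrcParser(HTMLParser):
    """Collects the `src` attribute of every <img> tag it sees."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative links such as "/foo.jpg" against the page URL.
            self.links.append(urljoin(self.base_url, src))


def extract_img_links(html: str, base_url: str) -> List[str]:
    """Return absolute image URLs found in `html`, in document order."""
    parser = ImgSrcParser(base_url)
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    page = '<body><img src="https://foo.com/bar.png"><img src="/foo.jpg"></body>'
    print(extract_img_links(page, "http://foo.com/bar"))
```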

### 2. From module
```python
>>> from imgscrapper import ImgScrapper
>>> d = ImgScrapper()
>>> d.download("http://foo.com/bar")
3
```
To store downloaded images in a specific directory, set `path`.
```python
>>> d = ImgScrapper()
>>> d.url = "http://foo.com/bar"
>>> d.path = "/path/download"
>>> d.download()  # returns the number of successful downloads
3
```
Some servers block scraping. Respect `robots.txt` and use this tool only on hosts that allow it.
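One way to check `robots.txt` before scraping is the standard library's `urllib.robotparser`. The sketch below is a suggestion rather than a feature of this package, and `is_allowed` is a hypothetical helper. It takes the `robots.txt` content as a string so it can run offline.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if `robots_txt` permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    print(is_allowed(rules, "http://foo.com/bar"))
    print(is_allowed(rules, "http://foo.com/private/secret.png"))
```

To read the rules from a live host instead, call `set_url()` with the address of the host's `robots.txt` followed by `read()` on the parser, then use `can_fetch()` in the same way.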

You can add request headers.
```python
>>> ...
>>> d.request_header = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0',
    'DNT': '1',
    }
>>> ...
```
You can select only specific `img` tags by specifying attributes of the HTML element.
```html
<!-- http://helloworld.com -->
<html>
    <body>
        <img src="https://foo.com/bar.png" class="apple ball">
        <img src="/foo.jpg" class="cat bar">
    </body>
</html>
```
To select only images with the class `cat`:
```python
>>> d = ImgScrapper()
>>> d.url = "http://helloworld.com"
>>> d.attrs = {
    'class': 'cat',
    }
>>> d.download()
1  # http://helloworld.com/foo.jpg
```
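The reason the second tag is selected even though its `class` is `cat bar` can be sketched as follows. The example assumes token-wise matching on `class`, which is how BeautifulSoup treats that attribute. Whether the package matches other attributes by exact equality is an assumption, and `img_matches` is a hypothetical helper.

```python
from typing import Dict, Optional


def img_matches(tag_attrs: Dict[str, Optional[str]], wanted: Dict[str, str]) -> bool:
    """Return True if an <img> tag's attributes satisfy every wanted attribute."""
    for name, value in wanted.items():
        actual = tag_attrs.get(name) or ""
        if name == "class":
            # `class` is a space-separated token list; one matching token is enough.
            if value not in actual.split():
                return False
        elif actual != value:
            return False
    return True


# The two <img> tags from the HTML example, written as attribute dictionaries.
tags = [
    {"src": "https://foo.com/bar.png", "class": "apple ball"},
    {"src": "/foo.jpg", "class": "cat bar"},
]
selected = [t["src"] for t in tags if img_matches(t, {"class": "cat"})]
print(selected)  # ['/foo.jpg']
```

The relative link `/foo.jpg` is then resolved against the page URL, which gives `http://helloworld.com/foo.jpg`.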
The downloader gives each downloaded image a unique `uuid`-based name while preserving the image's extension.
```python
>>> d = ImgScrapper(
    url = "http://helloworld.com",
    attrs = {'class': 'cat'},
    max = 5,
    path = "/home/images"
)
>>> d.download()
5
```
You can limit the number of images downloaded with the `max` value.
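The naming and limiting behaviour can be sketched like this. It is an illustration under assumptions, not the package's code: `unique_name` and `plan_downloads` are hypothetical helpers, and the exact name format the package produces may differ.

```python
import uuid
from itertools import islice
from pathlib import Path, PurePosixPath
from typing import Iterable, List, Optional, Tuple
from urllib.parse import urlparse


def unique_name(url: str) -> str:
    """Return a random UUID-based file name that keeps the URL's extension."""
    # Only the path part is inspected, so a query string does not leak into the name.
    suffix = PurePosixPath(urlparse(url).path).suffix
    return f"{uuid.uuid4().hex}{suffix}"


def plan_downloads(
    urls: Iterable[str], directory: str, max_count: Optional[int] = None
) -> List[Tuple[str, Path]]:
    """Pair at most `max_count` URLs with unique target paths in `directory`."""
    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)  # create the directory if missing
    return [(u, target_dir / unique_name(u)) for u in islice(urls, max_count)]
```

Passing `max_count=None` keeps every link, since `islice` with a stop value of `None` does not truncate.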

## License
`just-another-imgscrapper` is released under the MIT license. See LISCENSE for details.

## Contact
Follow me on Twitter: [@deshritbaral](https://twitter.com/deshritbaral)

Raw data

{
    "_id": null,
    "home_page": "https://github.com/deshrit/just-another-imgscrapper",
    "name": "just-another-imgscrapper",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "",
    "keywords": "image,scrapper,asyncio,httpx,beautifulsoup4,lxml",
    "author": "Deshrit Baral",
    "author_email": "deshritbaral@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/6c/58/6b99b7e0f7cdb616dcbcf185759c04158a82a1f64389519fc876ab064836/just-another-imgscrapper-0.1.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A utility for scrapping images from a HTML doc from  a URL.",
    "version": "0.1.1",
    "project_urls": {
        "Homepage": "https://github.com/deshrit/just-another-imgscrapper"
    },
    "split_keywords": [
        "image",
        "scrapper",
        "asyncio",
        "httpx",
        "beautifulsoup4",
        "lxml"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "157792f9695ebf14597ed8927c2962e46058dbe97135f4be95f48098f98c8407",
                "md5": "08d58104e2ea1e6629b2474442d9b596",
                "sha256": "bad00654ab2cf6e8f1e9af5d4ff3ef6452e1eda4129ae1e87796e091a3dc0b1c"
            },
            "downloads": -1,
            "filename": "just_another_imgscrapper-0.1.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "08d58104e2ea1e6629b2474442d9b596",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 7759,
            "upload_time": "2023-06-06T19:53:16",
            "upload_time_iso_8601": "2023-06-06T19:53:16.949177Z",
            "url": "https://files.pythonhosted.org/packages/15/77/92f9695ebf14597ed8927c2962e46058dbe97135f4be95f48098f98c8407/just_another_imgscrapper-0.1.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "6c586b99b7e0f7cdb616dcbcf185759c04158a82a1f64389519fc876ab064836",
                "md5": "45f3bd7038a67d6212ade9e0f07da669",
                "sha256": "86c3c05341848af68a1d32713d493938174787caf14f180f48b1153ef234b9a6"
            },
            "downloads": -1,
            "filename": "just-another-imgscrapper-0.1.1.tar.gz",
            "has_sig": false,
            "md5_digest": "45f3bd7038a67d6212ade9e0f07da669",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 8325,
            "upload_time": "2023-06-06T19:53:18",
            "upload_time_iso_8601": "2023-06-06T19:53:18.790388Z",
            "url": "https://files.pythonhosted.org/packages/6c/58/6b99b7e0f7cdb616dcbcf185759c04158a82a1f64389519fc876ab064836/just-another-imgscrapper-0.1.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-06-06 19:53:18",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "deshrit",
    "github_project": "just-another-imgscrapper",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "tox": true,
    "lcname": "just-another-imgscrapper"
}