Name | img-lurker |
Version | 1.0.3 |
Summary | Web gallery downloader |
upload_time | 2024-01-22 21:25:22 |
docs_url | None |
requires_python | >=3.6 |
keywords | crawling, download, gallery, image, photo, spider |
requirements | No requirements were recorded. |

# img-lurker
img-lurker is a gallery downloader.

img-lurker takes the URL of an (HTML) web page and downloads the images linked on it.
If the page contains only thumbnails, each linking to the full size version of
its image, img-lurker downloads the larger version rather than the thumbnail.
If the thumbnails link to other HTML pages (themselves containing the full size
image), img-lurker will follow those links to find the larger version.

img-lurker applies a "minimum image size" to decide whether an image is worth downloading and
is not UI decoration such as buttons or separators. img-lurker will not follow a link if the link tag does not
contain an image tag (assumed to be the thumbnail).

## Example
Consider a site with the following HTML:

    <a href="fullimage1.jpg">
        <img src="thumbnail1.jpg" />
    </a>
    <a href="fullimage2.jpg">
        <img src="thumbnail2.jpg" />
    </a>

img-lurker would download `fullimage1.jpg` and `fullimage2.jpg`.
If instead the links point to other HTML pages containing the full size version
of the images (for example `fullimage1.html` containing `fullimage1.jpg`),
img-lurker would still find `fullimage1.jpg` by following the page links.
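
A minimal run against such a page might look like the following sketch; the `img-lurker` command name and the URL are assumptions here (the actual entry point and argument form depend on how the package is installed):

    img-lurker https://example.com/gallery.html
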
## Options
### Cookies

    --cookie KEY=VALUE

Inject a specific cookie, which might be required to visit some restricted
access pages. For example, some subreddits require you to pass the cookie "over18=1".
The option can be passed several times to inject multiple cookies.
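
For instance, to pass the "over18=1" cookie mentioned above, an invocation could look like this sketch (the command name and URL are illustrative):

    img-lurker --cookie over18=1 https://example.com/gallery/
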
### Pagination

    --next-page-xpath HTML_XPATH

img-lurker can handle pagination for sites where a gallery contains so many
images that it is split into numbered pages.
`HTML_XPATH` should be an XPath expression locating the HTML link to the "next
page".
If this argument is given, after downloading all images of a "page", img-lurker
will follow the link pointed to by `HTML_XPATH` and repeat on the next page.
Warning: this can generate a lot of traffic for huge galleries. Be cautious or you
might get blocked by the website.
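
As a sketch, if the gallery's "next page" link were an `<a rel="next">` element, the invocation could look like this (the XPath and URL are hypothetical and must be adapted to the target site):

    img-lurker --next-page-xpath '//a[@rel="next"]' https://example.com/gallery/page1.html
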
### Stop/resume

    --history-file FILE

Record the URL of every downloaded image in this file, and skip URLs already
present in this file.
Useful when running img-lurker multiple times on the same gallery, typically when
the gallery has received fresh images. Also useful if you use the
`--next-page-xpath` option and stop img-lurker to avoid flooding the site, pause
(minutes? hours? days?), then restart img-lurker: the history file will
help it resume where it was interrupted.
This assumes that:

- each image will always have the same URL, e.g. no varying tokens/timestamps in the URL;
- conversely, a URL will always point to the same image and will never start pointing to another one, e.g. the
images are NOT simply numbered in ascending order (otherwise `1.jpg` would point to
different images over time).
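
A stop-and-resume workflow could then look like this sketch (the history file name, XPath and URL are illustrative); reusing the same history file on the second run skips already-downloaded URLs:

    img-lurker --history-file seen-urls.txt --next-page-xpath '//a[@rel="next"]' https://example.com/gallery/
    # ... interrupt, wait a while, then rerun the same command ...
    img-lurker --history-file seen-urls.txt --next-page-xpath '//a[@rel="next"]' https://example.com/gallery/
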
### Telling thumbnails apart from "big images" to download only the latter

    --min-thumb-size WIDTHxHEIGHT
    --min-image-size WIDTHxHEIGHT

Minimum size for an image to be considered, respectively, a thumbnail worth following or an
image worth downloading. Useful to avoid downloading navigation buttons, logos, etc.
Default values are `--min-thumb-size=128x128` and `--min-image-size=400x400`.
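
For a site whose thumbnails are unusually small, these thresholds could be lowered, for example (the values and URL are illustrative):

    img-lurker --min-thumb-size 64x64 --min-image-size 800x600 https://example.com/gallery/
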

    --max-aspect-ratio WIDTH:HEIGHT

Maximum ratio between WIDTH and HEIGHT (or HEIGHT and WIDTH; img-lurker figures
out which) for an image to be considered worth downloading.

For example, pass "16:9" and img-lurker will accept images with dimensions
1920x1080 or 1080x1920, as they are respectively 16:9 and 9:16, but also 1600x1200
or 1200x1600, because their 4:3 (and 3:4) ratio is lower (closer to a square)
than the "16:9" maximum. Portrait and landscape ratios are considered
equivalent.
However, passing "16:9" would discard a banner with dimensions 1200x300, because
its 4:1 ratio is far more elongated (a very thin rectangle) than 16:9.
It would also reject a banner with dimensions 300x1200, because its 1:4 ratio is
equivalent to 4:1.
A photo is rarely square, but it is almost never as elongated as 4:1, except for
panoramas, so adjust this option if you intend to download panoramas.
The default value is `--max-aspect-ratio=4:1`.
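
To also accept wide panoramas, the limit could be relaxed, for example (the ratio and URL are illustrative):

    img-lurker --max-aspect-ratio 10:1 https://example.com/panoramas/
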
### Debug

    --debug

Enable debug logging.
## Limitations
- img-lurker does not interpret JavaScript, though it has specific hints to detect
lazy-loaded images, so it might not work on sites like Instagram.
- img-lurker does not open iframes, so it will fail to download a few images from
Reddit.
- img-lurker does not crawl a whole site and does not support nested galleries; it only
takes one gallery and expects it to contain the desired images.
Raw data
{
"_id": null,
"home_page": "",
"name": "img-lurker",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": "",
"keywords": "crawling,download,gallery,image,photo,spider",
"author": "",
"author_email": "Hg <dev@indigo.re>",
"download_url": "https://files.pythonhosted.org/packages/91/35/8fb096453fbedc340f50c7d9b84b228b5729e81246851c27f372280b8ff4/img_lurker-1.0.3.tar.gz",
"platform": null,
"description": "# img-lurker\n\nimg-lurker is a gallery downloader.\n\nimg-lurker takes a URL of a (HTML) web page and downloads linked images on it.\nIf the page contains only thumbnails, linking to a the full size version of\nthe image, img-lurker will rather take the bigger one.\nIf there are links to other HTML pages (themselves containing a the full size\nimage), img-lurker will follow those links to find the bigger size.\n\nimg-lurker has a \"minimum image size\" for considering an image is worthy of being downloaded and\nisn't UI stuff like buttons/separators. img-lurker will not follow links if the link tag doesn't\ncontain an image tag (assumed to be the thumbnail).\n\n## Example\n\nConsider a site with following HTML:\n\n <a href=\"fullimage1.jpg\">\n <img src=\"thumbnail1.jpg\" />\n </a>\n <a href=\"fullimage2.jpg\">\n <img src=\"thumbnail2.jpg\" />\n </a>\n\nimg-lurker would download `fullimage1.jpg` and `fullimage2.jpg`.\nIf instead the links point to other HTML pages containing the full size version\nof the images (for example `fullimage1.html` containing `fullimage1.jpg`),\nimg-lurker would still find `fullimage1.jpg` by following the page links.\n\n## Options\n\n### Cookies\n\n --cookie KEY=VALUE\n\nInject a specific cookie, which might be required to visit some restricted\naccess pages. For example, some subreddits require you to pass the cookie \"over18=1\".\n\nThe option can be passed several times to inject multiple cookies.\n\n### Pagination\n\n --next-page-xpath HTML_XPATH\n\nimg-lurker can handle pagination for sites where a gallery contains so many\nimages that the site is split in numbered pages.\n`HTML_XPATH` should be an XPath expression locating the HTML link to the \"next\npage\".\nIf this argument is given, after downloading all images of a \"page\", img-lurker\nwill follow the link pointed to by `HTML_XPATH` and repeat on the next page.\n\nWarning: this can issue a lot of traffic for huge galleries. Be cautious or you\nmight get blocked by the website.\n\n### Stop/resume\n\n --history-file FILE\n\nMark all downloaded images URLs in this file and avoid redownloading URLs\npresent in this file.\nUseful when running img-lurker multiple times on the same gallery, typically if\nthe gallery has received fresh images. Also useful if you use\n`--next-page-xpath` option and kill img-lurker to avoid flooding the site, make\na pause (minutes? hours? days?) then restart img-lurker: the history file will\nhelp it resume where it was interrupted.\n\nThis makes the assumption that:\n\n- each image will always have the same URL, e.g. no varying tokens/timestamps in the URL, etc.\n- conversely, an URL will always point to the same image, it will not point to another image at some point, e.g. the\nimages are NOT numbered in ascending order (else `1.jpg` would point to\ndifferent images over time).\n\n### Tell apart thumbnails from \"big images\" to download only the latter\n\n --min-thumb-size WIDTHxHEIGHT\n --min-image-size WIDTHxHEIGHT\n\nMinimum size for an image to be considered a thumbnail worth following or an\nimage worth downloading. 
Useful not to download navigation buttons, logos, etc.\nDefault values are `--min-thumb-size=128x128` and `--min-image-size=400x400`.\n\n --max-aspect-ratio WIDTH:HEIGHT\n\nMaximum ratio between WIDTH and HEIGHT (or HEIGHT on WIDTH, img-lurker is smart\nenough to figure out) to consider an image is worth downloading.\n\nFor example, pass \"16:9\" and img-lurker will accept images with dimensions\n1920x1080 or 1080x1920 as they are respectively 16:9 and 9:16 but also 1600x1200\nor 1200x1600 because they are 4:3 (and 3:4) which is lower (more looking like\na square) than the max \"16:9\". Ratios of portrait and landscape are considered\nequivalent.\nHowever, passing \"16:9\" would discard a banner with dimensions 1200x300 because\nits ratio is 4:1 which is way more distorted (very thin rectangle) than 16:9.\nIt would also reject a banner with dimensions 300x1200 because it is 1:4,\nequivalent to 4:1.\n\nA photo is rarely square but is almost never thin like 4:1, except panoramas, so\nconfigure this option if you intend to download panoramas for example.\nThe default value is `--max-aspect-ratio=4:1`.\n\n### Debug\n\n --debug\n\nDebug log.\n\n## Limitations\n\n- img-lurker will not interpret javascript, though it has specific hints to detect\nlazy-loaded images, so it might not work on sites like instagram.\n- img-lurker will not open iframes, so it will fail to download a few images from\nreddit.\n- img-lurker does not crawl a site and does not support nested galleries, it only\ntakes one gallery and expects it to contain the images desired.\n",
"bugtrack_url": null,
"license": "",
"summary": "Web gallery downloader",
"version": "1.0.3",
"project_urls": {
"Homepage": "https://gitlab.com/hydrargyrum/img-lurker/"
},
"split_keywords": [
"crawling",
"download",
"gallery",
"image",
"photo",
"spider"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "18fc98d51a90e277e5d87816b4bdf6cc8f4d2d1bd897e34a9884dbdf2b9a50b6",
"md5": "aff559a841ec568246741511cfcfe9d1",
"sha256": "313b0d918cd17145a0053eedb82f4477322dfda44d1a7ca96624bc677dc732ae"
},
"downloads": -1,
"filename": "img_lurker-1.0.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "aff559a841ec568246741511cfcfe9d1",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.6",
"size": 7611,
"upload_time": "2024-01-22T21:25:21",
"upload_time_iso_8601": "2024-01-22T21:25:21.207485Z",
"url": "https://files.pythonhosted.org/packages/18/fc/98d51a90e277e5d87816b4bdf6cc8f4d2d1bd897e34a9884dbdf2b9a50b6/img_lurker-1.0.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "91358fb096453fbedc340f50c7d9b84b228b5729e81246851c27f372280b8ff4",
"md5": "01283d22e5eb66ff04f54b6a48b9a809",
"sha256": "44a6a27e2653c452401cb9b2e1713eea8f8c6e69fc864a921c1e3271111a1005"
},
"downloads": -1,
"filename": "img_lurker-1.0.3.tar.gz",
"has_sig": false,
"md5_digest": "01283d22e5eb66ff04f54b6a48b9a809",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 6793,
"upload_time": "2024-01-22T21:25:22",
"upload_time_iso_8601": "2024-01-22T21:25:22.954843Z",
"url": "https://files.pythonhosted.org/packages/91/35/8fb096453fbedc340f50c7d9b84b228b5729e81246851c27f372280b8ff4/img_lurker-1.0.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-01-22 21:25:22",
"github": false,
"gitlab": true,
"bitbucket": false,
"codeberg": false,
"gitlab_user": "hydrargyrum",
"gitlab_project": "img-lurker",
"lcname": "img-lurker"
}