Name | img-lurker |
Version | 1.0.3 |
Summary | Web gallery downloader |
upload_time | 2024-01-22 21:25:22 |
docs_url | None |
requires_python | >=3.6 |
keywords | crawling, download, gallery, image, photo, spider |
requirements | No requirements were recorded. |

# img-lurker
img-lurker is a gallery downloader.

img-lurker takes the URL of an (HTML) web page and downloads the images linked on it.
If the page contains only thumbnails, each linking to the full size version of
its image, img-lurker downloads the larger version rather than the thumbnail.
If the thumbnails link to other HTML pages (themselves containing the full size
image), img-lurker will follow those links to find the larger version.

img-lurker applies a "minimum image size" to decide whether an image is worth downloading and
is not UI decoration such as buttons or separators. img-lurker will not follow a link if the link tag does not
contain an image tag (assumed to be the thumbnail).

## Example
Consider a site with the following HTML:

    <a href="fullimage1.jpg">
        <img src="thumbnail1.jpg" />
    </a>
    <a href="fullimage2.jpg">
        <img src="thumbnail2.jpg" />
    </a>

img-lurker would download `fullimage1.jpg` and `fullimage2.jpg`.
If instead the links point to other HTML pages containing the full size version
of the images (for example `fullimage1.html` containing `fullimage1.jpg`),
img-lurker would still find `fullimage1.jpg` by following the page links.
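
A minimal run against such a page might look like the following sketch; the `img-lurker` command name and the URL are assumptions here (the actual entry point and argument form depend on how the package is installed):

    img-lurker https://example.com/gallery.html
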
## Options
### Cookies

    --cookie KEY=VALUE

Inject a specific cookie, which might be required to visit some restricted
access pages. For example, some subreddits require you to pass the cookie "over18=1".
The option can be passed several times to inject multiple cookies.
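
For instance, to pass the "over18=1" cookie mentioned above, an invocation could look like this sketch (the command name and URL are illustrative):

    img-lurker --cookie over18=1 https://example.com/gallery/
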
### Pagination

    --next-page-xpath HTML_XPATH

img-lurker can handle pagination for sites where a gallery contains so many
images that it is split into numbered pages.
`HTML_XPATH` should be an XPath expression locating the HTML link to the "next
page".
If this argument is given, after downloading all images of a "page", img-lurker
will follow the link pointed to by `HTML_XPATH` and repeat on the next page.
Warning: this can generate a lot of traffic for huge galleries. Be cautious or you
might get blocked by the website.
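
As a sketch, if the gallery's "next page" link were an `<a rel="next">` element, the invocation could look like this (the XPath and URL are hypothetical and must be adapted to the target site):

    img-lurker --next-page-xpath '//a[@rel="next"]' https://example.com/gallery/page1.html
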
### Stop/resume

    --history-file FILE

Record the URL of every downloaded image in this file, and skip URLs already
present in this file.
Useful when running img-lurker multiple times on the same gallery, typically when
the gallery has received fresh images. Also useful if you use the
`--next-page-xpath` option and stop img-lurker to avoid flooding the site, pause
(minutes? hours? days?), then restart img-lurker: the history file will
help it resume where it was interrupted.
This assumes that:

- each image will always have the same URL, e.g. no varying tokens/timestamps in the URL;
- conversely, a URL will always point to the same image and will never start pointing to another one, e.g. the
images are NOT simply numbered in ascending order (otherwise `1.jpg` would point to
different images over time).
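
A stop-and-resume workflow could then look like this sketch (the history file name, XPath and URL are illustrative); reusing the same history file on the second run skips already-downloaded URLs:

    img-lurker --history-file seen-urls.txt --next-page-xpath '//a[@rel="next"]' https://example.com/gallery/
    # ... interrupt, wait a while, then rerun the same command ...
    img-lurker --history-file seen-urls.txt --next-page-xpath '//a[@rel="next"]' https://example.com/gallery/
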
### Telling thumbnails apart from "big images" to download only the latter

    --min-thumb-size WIDTHxHEIGHT
    --min-image-size WIDTHxHEIGHT

Minimum size for an image to be considered, respectively, a thumbnail worth following or an
image worth downloading. Useful to avoid downloading navigation buttons, logos, etc.
Default values are `--min-thumb-size=128x128` and `--min-image-size=400x400`.
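
For a site whose thumbnails are unusually small, these thresholds could be lowered, for example (the values and URL are illustrative):

    img-lurker --min-thumb-size 64x64 --min-image-size 800x600 https://example.com/gallery/
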

    --max-aspect-ratio WIDTH:HEIGHT

Maximum ratio between WIDTH and HEIGHT (or HEIGHT and WIDTH; img-lurker figures
out which) for an image to be considered worth downloading.

For example, pass "16:9" and img-lurker will accept images with dimensions
1920x1080 or 1080x1920, as they are respectively 16:9 and 9:16, but also 1600x1200
or 1200x1600, because their 4:3 (and 3:4) ratio is lower (closer to a square)
than the "16:9" maximum. Portrait and landscape ratios are considered
equivalent.
However, passing "16:9" would discard a banner with dimensions 1200x300, because
its 4:1 ratio is far more elongated (a very thin rectangle) than 16:9.
It would also reject a banner with dimensions 300x1200, because its 1:4 ratio is
equivalent to 4:1.
A photo is rarely square, but it is almost never as elongated as 4:1, except for
panoramas, so adjust this option if you intend to download panoramas.
The default value is `--max-aspect-ratio=4:1`.
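
To also accept wide panoramas, the limit could be relaxed, for example (the ratio and URL are illustrative):

    img-lurker --max-aspect-ratio 10:1 https://example.com/panoramas/
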
### Debug

    --debug

Enable debug logging.
## Limitations
- img-lurker does not interpret JavaScript, though it has specific hints to detect
lazy-loaded images, so it might not work on sites like Instagram.
- img-lurker does not open iframes, so it will fail to download a few images from
Reddit.
- img-lurker does not crawl a whole site and does not support nested galleries; it only
takes one gallery and expects it to contain the desired images.
Raw data
{
"_id": null,
"home_page": "",
"name": "img-lurker",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": "",
"keywords": "crawling,download,gallery,image,photo,spider",
"author": "",
"author_email": "Hg <dev@indigo.re>",
"download_url": "https://files.pythonhosted.org/packages/91/35/8fb096453fbedc340f50c7d9b84b228b5729e81246851c27f372280b8ff4/img_lurker-1.0.3.tar.gz",
"platform": null,
"description": "# img-lurker\n\nimg-lurker is a gallery downloader.\n\nimg-lurker takes a URL of a (HTML) web page and downloads linked images on it.\nIf the page contains only thumbnails, linking to a the full size version of\nthe image, img-lurker will rather take the bigger one.\nIf there are links to other HTML pages (themselves containing a the full size\nimage), img-lurker will follow those links to find the bigger size.\n\nimg-lurker has a \"minimum image size\" for considering an image is worthy of being downloaded and\nisn't UI stuff like buttons/separators. img-lurker will not follow links if the link tag doesn't\ncontain an image tag (assumed to be the thumbnail).\n\n## Example\n\nConsider a site with following HTML:\n\n <a href=\"fullimage1.jpg\">\n <img src=\"thumbnail1.jpg\" />\n </a>\n <a href=\"fullimage2.jpg\">\n <img src=\"thumbnail2.jpg\" />\n </a>\n\nimg-lurker would download `fullimage1.jpg` and `fullimage2.jpg`.\nIf instead the links point to other HTML pages containing the full size version\nof the images (for example `fullimage1.html` containing `fullimage1.jpg`),\nimg-lurker would still find `fullimage1.jpg` by following the page links.\n\n## Options\n\n### Cookies\n\n --cookie KEY=VALUE\n\nInject a specific cookie, which might be required to visit some restricted\naccess pages. For example, some subreddits require you to pass the cookie \"over18=1\".\n\nThe option can be passed several times to inject multiple cookies.\n\n### Pagination\n\n --next-page-xpath HTML_XPATH\n\nimg-lurker can handle pagination for sites where a gallery contains so many\nimages that the site is split in numbered pages.\n`HTML_XPATH` should be an XPath expression locating the HTML link to the \"next\npage\".\nIf this argument is given, after downloading all images of a \"page\", img-lurker\nwill follow the link pointed to by `HTML_XPATH` and repeat on the next page.\n\nWarning: this can issue a lot of traffic for huge galleries. Be cautious or you\nmight get blocked by the website.\n\n### Stop/resume\n\n --history-file FILE\n\nMark all downloaded images URLs in this file and avoid redownloading URLs\npresent in this file.\nUseful when running img-lurker multiple times on the same gallery, typically if\nthe gallery has received fresh images. Also useful if you use\n`--next-page-xpath` option and kill img-lurker to avoid flooding the site, make\na pause (minutes? hours? days?) then restart img-lurker: the history file will\nhelp it resume where it was interrupted.\n\nThis makes the assumption that:\n\n- each image will always have the same URL, e.g. no varying tokens/timestamps in the URL, etc.\n- conversely, an URL will always point to the same image, it will not point to another image at some point, e.g. the\nimages are NOT numbered in ascending order (else `1.jpg` would point to\ndifferent images over time).\n\n### Tell apart thumbnails from \"big images\" to download only the latter\n\n --min-thumb-size WIDTHxHEIGHT\n --min-image-size WIDTHxHEIGHT\n\nMinimum size for an image to be considered a thumbnail worth following or an\nimage worth downloading. 
Useful not to download navigation buttons, logos, etc.\nDefault values are `--min-thumb-size=128x128` and `--min-image-size=400x400`.\n\n --max-aspect-ratio WIDTH:HEIGHT\n\nMaximum ratio between WIDTH and HEIGHT (or HEIGHT on WIDTH, img-lurker is smart\nenough to figure out) to consider an image is worth downloading.\n\nFor example, pass \"16:9\" and img-lurker will accept images with dimensions\n1920x1080 or 1080x1920 as they are respectively 16:9 and 9:16 but also 1600x1200\nor 1200x1600 because they are 4:3 (and 3:4) which is lower (more looking like\na square) than the max \"16:9\". Ratios of portrait and landscape are considered\nequivalent.\nHowever, passing \"16:9\" would discard a banner with dimensions 1200x300 because\nits ratio is 4:1 which is way more distorted (very thin rectangle) than 16:9.\nIt would also reject a banner with dimensions 300x1200 because it is 1:4,\nequivalent to 4:1.\n\nA photo is rarely square but is almost never thin like 4:1, except panoramas, so\nconfigure this option if you intend to download panoramas for example.\nThe default value is `--max-aspect-ratio=4:1`.\n\n### Debug\n\n --debug\n\nDebug log.\n\n## Limitations\n\n- img-lurker will not interpret javascript, though it has specific hints to detect\nlazy-loaded images, so it might not work on sites like instagram.\n- img-lurker will not open iframes, so it will fail to download a few images from\nreddit.\n- img-lurker does not crawl a site and does not support nested galleries, it only\ntakes one gallery and expects it to contain the images desired.\n",
"bugtrack_url": null,
"license": "",
"summary": "Web gallery downloader",
"version": "1.0.3",
"project_urls": {
"Homepage": "https://gitlab.com/hydrargyrum/img-lurker/"
},
"split_keywords": [
"crawling",
"download",
"gallery",
"image",
"photo",
"spider"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "18fc98d51a90e277e5d87816b4bdf6cc8f4d2d1bd897e34a9884dbdf2b9a50b6",
"md5": "aff559a841ec568246741511cfcfe9d1",
"sha256": "313b0d918cd17145a0053eedb82f4477322dfda44d1a7ca96624bc677dc732ae"
},
"downloads": -1,
"filename": "img_lurker-1.0.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "aff559a841ec568246741511cfcfe9d1",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.6",
"size": 7611,
"upload_time": "2024-01-22T21:25:21",
"upload_time_iso_8601": "2024-01-22T21:25:21.207485Z",
"url": "https://files.pythonhosted.org/packages/18/fc/98d51a90e277e5d87816b4bdf6cc8f4d2d1bd897e34a9884dbdf2b9a50b6/img_lurker-1.0.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "91358fb096453fbedc340f50c7d9b84b228b5729e81246851c27f372280b8ff4",
"md5": "01283d22e5eb66ff04f54b6a48b9a809",
"sha256": "44a6a27e2653c452401cb9b2e1713eea8f8c6e69fc864a921c1e3271111a1005"
},
"downloads": -1,
"filename": "img_lurker-1.0.3.tar.gz",
"has_sig": false,
"md5_digest": "01283d22e5eb66ff04f54b6a48b9a809",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 6793,
"upload_time": "2024-01-22T21:25:22",
"upload_time_iso_8601": "2024-01-22T21:25:22.954843Z",
"url": "https://files.pythonhosted.org/packages/91/35/8fb096453fbedc340f50c7d9b84b228b5729e81246851c27f372280b8ff4/img_lurker-1.0.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-01-22 21:25:22",
"github": false,
"gitlab": true,
"bitbucket": false,
"codeberg": false,
"gitlab_user": "hydrargyrum",
"gitlab_project": "img-lurker",
"lcname": "img-lurker"
}