=============
ImageResolver
=============
A Python clone of ImageResolver for finding significant images in HTML content.
See the excellent JS version at: https://github.com/mauricesvay/ImageResolver
USAGE
-----
::

    import imageresolver
    import sys

    try:
        i = imageresolver.ImageResolver()
        i.register(imageresolver.FileExtensionResolver())
        i.register(imageresolver.WebpageResolver(load_images=True, parser='lxml', blacklist='easylist.txt'))
        url = sys.argv[1]

        print(i.resolve(url))
    # Only the module is imported, so the exception is referenced through it
    except imageresolver.ImageResolverError:
        print("An error occurred")

Differences From the JavaScript Version
---------------------------------------
* Methods return values instead of invoking callbacks
* WebpageResolver has many new options (see below)
* Some debugging features have been added
* Exceptions are raised instead of being passed to an error callback
WebpageResolver Additions
-------------------------
* Rule syntax is now based on AdBlockPlus filters (https://adblockplus.org/en/filters)
* New rules can be added without writing a resolver
* Image sources can be blacklisted and whitelisted
* Loads as little of the image as possible when fetching image info. Downloading stops as soon as the dimensions are found or a configurable limit is reached.
* The original rules from the JS version are still implemented (see the *use_js_ruleset* option).
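
As a rough illustration of what basic AdBlock Plus filter syntax expresses, the sketch below translates a few simple blocking filters into regular expressions. It is not the parser used by this package (that code was forked from abpy). The names `abp_to_regex`, `load_rules` and `is_blacklisted` and the example rules are assumptions made for illustration, and filter options (`$...`), exception rules (`@@`) and element-hiding rules are not handled.

::

    import re

    # The ^ placeholder in ABP syntax matches a separator: anything except a
    # letter, a digit, or one of _ - . %, or the end of the address.
    SEPARATOR = r"(?:[^\w\-.%]|$)"


    def abp_to_regex(rule):
        """Translate one basic blocking filter into a compiled regex (sketch only)."""
        prefix = ""
        suffix = ""
        if rule.startswith("||"):
            # || anchors the match at the beginning of a domain name.
            prefix = r"^[a-z][a-z0-9+.\-]*://(?:[^/?#]*\.)?"
            rule = rule[2:]
        elif rule.startswith("|"):
            # A single leading | anchors the match at the start of the address.
            prefix = "^"
            rule = rule[1:]
        if rule.endswith("|"):
            # A trailing | anchors the match at the end of the address.
            suffix = "$"
            rule = rule[:-1]
        body = re.escape(rule)
        body = body.replace(r"\*", ".*")       # * matches any run of characters
        body = body.replace(r"\^", SEPARATOR)  # ^ matches a separator
        return re.compile(prefix + body + suffix, re.IGNORECASE)


    def load_rules(text):
        """Compile every usable line, skipping blank lines, comments (!) and headers ([)."""
        rules = []
        for line in text.splitlines():
            line = line.strip()
            if not line or line.startswith("!") or line.startswith("["):
                continue
            rules.append(abp_to_regex(line))
        return rules


    def is_blacklisted(url, rules):
        """Return True if any compiled rule matches the URL."""
        return any(rule.search(url) for rule in rules)


    # Hypothetical blacklist content, in the style of an easylist file.
    rules = load_rules("! example rules\n||ads.example.com^\n/banners/*\n")
    print(is_blacklisted("http://ads.example.com/a.gif", rules))

A real filter list contains many rule types that this sketch ignores, so it should be read only as a picture of the syntax, not as a replacement for the bundled parser.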
ImageResolver() METHODS
-----------------------
**__init__** *(\*\*kwargs)*
Keyword options:
* *max_read_size* - maximum number of bytes to read when looking for the width and height of an image. Default `10240`
* *chunk_size* - size of each chunk to read. Default `1024`
* *read_all* - set to true to read the entire image before detecting its info. Overrides *max_read_size*. Default `False`
* *debug* - set to true to enable debugging output (logger="ImageResolver"). Default `False`
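
The way *max_read_size*, *chunk_size* and *read_all* could interact can be sketched with a small stand-alone reader. This is an assumption-laden illustration, not the package's detection code (which came from bfg-pages): the functions `sniff_size` and `probe_stream` are hypothetical, and only PNG and GIF headers are recognised.

::

    import io
    import struct


    def sniff_size(data):
        """Return (extension, width, height) for PNG or GIF header bytes, else None."""
        if len(data) >= 24 and data[:8] == b"\x89PNG\r\n\x1a\n" and data[12:16] == b"IHDR":
            # PNG: width and height are big-endian 32-bit integers in the IHDR chunk.
            width, height = struct.unpack(">II", data[16:24])
            return ("png", width, height)
        if len(data) >= 10 and data[:6] in (b"GIF87a", b"GIF89a"):
            # GIF: width and height are little-endian 16-bit integers after the signature.
            width, height = struct.unpack("<HH", data[6:10])
            return ("gif", width, height)
        return None


    def probe_stream(stream, max_read_size=10240, chunk_size=1024, read_all=False):
        """Read chunk by chunk, stopping once dimensions are known or the limit is hit."""
        if read_all:
            return sniff_size(stream.read())
        data = b""
        while len(data) < max_read_size:
            chunk = stream.read(min(chunk_size, max_read_size - len(data)))
            if not chunk:
                break  # end of stream
            data += chunk
            info = sniff_size(data)
            if info is not None:
                return info
        return None


    # A fake PNG header followed by padding stands in for a network response.
    png_header = (b"\x89PNG\r\n\x1a\n" + struct.pack(">I", 13) + b"IHDR"
                  + struct.pack(">II", 640, 480) + b"\x00" * 100)
    print(probe_stream(io.BytesIO(png_header), chunk_size=8))  # ('png', 640, 480)

With a chunk size of 8 the loop stops after 24 bytes, which is the point of reading in small chunks: the rest of the image is never requested.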
**fetch** *(string url)*
Fetches a URL and returns the response data.
**fetch_image_info** *(string url)*
Fetches an image URL and examines the resulting image. Returns a tuple of the detected file extension, the width, and the height of the image.
**register** *(instance filter)*
Registers a filter used to examine a URL. The filter argument must be an instance of a class that has a `resolve()` method. `resolve()` must accept a string URL and return a URL or `None`.
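
A custom filter only needs the `resolve()` contract described above. The class below is a hypothetical example (`TrustedHostResolver` and the host name are not part of this package): it accepts any URL served from a known image host and declines everything else by returning `None`.

::

    from urllib.parse import urlparse


    class TrustedHostResolver:
        """Hypothetical filter: accept URLs served from known image hosts."""

        def __init__(self, hosts):
            self.hosts = set(host.lower() for host in hosts)

        def resolve(self, url):
            # urlparse() lower-cases the host name; it is None for relative URLs.
            host = urlparse(url).hostname
            if host and host in self.hosts:
                return url
            return None


    resolver = TrustedHostResolver(["images.example.com"])
    print(resolver.resolve("http://images.example.com/cat"))

An instance of such a class could then be passed to `register()` in the same way as the built-in resolvers in the USAGE example.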
**resolve** *(string url)*
Loops through the registered filters until one of them resolves a URL. Returns `None` if no URL is found.
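
The first-match behaviour described above can be sketched without the package installed. Everything here is a stand-in written for illustration (`FirstMatchChain`, `NeverResolver` and `EchoResolver` are hypothetical names), and trying the filters in registration order is an assumption.

::

    class FirstMatchChain:
        """Stand-in for the resolver loop: the first non-None result wins."""

        def __init__(self):
            self.filters = []

        def register(self, image_filter):
            self.filters.append(image_filter)

        def resolve(self, url):
            for image_filter in self.filters:
                result = image_filter.resolve(url)
                if result is not None:
                    return result
            return None


    class NeverResolver:
        """Stand-in filter that never finds an image."""

        def resolve(self, url):
            return None


    class EchoResolver:
        """Stand-in filter that treats every URL as an image."""

        def resolve(self, url):
            return url


    chain = FirstMatchChain()
    chain.register(NeverResolver())
    chain.register(EchoResolver())
    print(chain.resolve("http://example.com/photo.jpg"))  # http://example.com/photo.jpg

Under that assumption, cheap filters such as FileExtensionResolver are best registered before expensive ones such as WebpageResolver, as the USAGE example does.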
FileExtensionResolver() METHODS
-------------------------------
**resolve** *(string url)*
Returns the URL if its extension matches a possible image type.
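
An extension check of this kind can be sketched with the standard library alone. The function name `resolve_by_extension` and the extension list are assumptions. The package's own list may differ.

::

    import os.path
    from urllib.parse import urlparse

    # Assumed set of image extensions, for illustration only.
    IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}


    def resolve_by_extension(url):
        """Return the URL if its path ends in a known image extension, else None."""
        # Look only at the path so that query strings do not hide the extension.
        path = urlparse(url).path
        extension = os.path.splitext(path)[1].lower()
        return url if extension in IMAGE_EXTENSIONS else None


    print(resolve_by_extension("http://example.com/a/photo.JPG?size=large"))
    print(resolve_by_extension("http://example.com/article.html"))  # None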
WebpageResolver() METHODS
-------------------------
The workhorse of this module. Most of our use cases revolve around this filter, so it is the
most feature-complete and best-tested resolver.
**__init__** *(\*\*kwargs)*
Initialize the class with options:
* *load_image* - set to true to load the first 1k of images whose size is not set in the HTML. Default `False`
* *use_js_ruleset* - set to true to use the original rules from the JavaScript version. Default `False`
* *use_adblock_filters* - set to false to disable adblock filters. Default `True`
* *parser* - set to a BeautifulSoup compatible parser (lxml is recommended). Default `html.parser`
* *blacklist* - set to a file containing AdBlockPlus style filters used to lower an image's score. Default `blacklist.txt`
* *whitelist* - set to a file containing AdBlockPlus style filters used to raise an image's score. Default `whitelist.txt`
* *significant_surface* - surface (width x height) an image must have to receive additional score
* *boost_jpeg* - integer score boost added to JPEG files. Default `1`
* *boost_gif* - integer score boost added to GIF files. Default `0`
* *boost_png* - integer score boost added to PNG files. Default `0`
* *skip_fetch_errors* - set to true to skip exceptions raised by `fetch_image_info()`. The exception is logged and the image is skipped. Default `True`
The default parser for BeautifulSoup is html.parser, which is built into Python. We *highly* recommend that you install lxml and pass parser="lxml"
to WebpageResolver(). In our testing it was much faster and more accurate.
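
How the *boost_* and *significant_surface* options might feed into an image's score can be pictured with a toy function. This is purely a sketch under stated assumptions: the function `score_image`, the threshold of 100 x 100 and the bonus of 1 are hypothetical and do not describe the actual scoring rules of WebpageResolver.

::

    def score_image(extension, width, height,
                    significant_surface=100 * 100,  # hypothetical threshold
                    boost_jpeg=1, boost_gif=0, boost_png=0):
        """Toy score: a per-format boost plus a bonus for large images."""
        boosts = {"jpg": boost_jpeg, "jpeg": boost_jpeg,
                  "gif": boost_gif, "png": boost_png}
        score = boosts.get(extension.lower(), 0)
        if width * height >= significant_surface:
            score += 1  # hypothetical bonus for a significant surface
        return score


    print(score_image("jpg", 640, 480))  # 2
    print(score_image("png", 10, 10))    # 0

The point of the sketch is only that format boosts and surface are independent contributions, so a large PNG can still outrank a tiny JPEG.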
LOGGING
-------
Use the name "ImageResolver" to configure a logger. Skipped exceptions are logged to this logger at the error level. When debugging is enabled, debugging output goes to the same logger.
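
A minimal configuration with the standard `logging` module might look like the following. The handler and format string are arbitrary choices for illustration. Only the logger name comes from this document.

::

    import logging

    # The logger name is the one this package logs to.
    logger = logging.getLogger("ImageResolver")
    logger.setLevel(logging.ERROR)  # use logging.DEBUG to see debugging output too

    handler = logging.StreamHandler()  # writes to stderr by default
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
    logger.addHandler(handler)

    logger.error("example message: an image was skipped")

With the level set to `logging.ERROR`, skipped exceptions are shown and debugging messages are filtered out.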
EXCEPTIONS
----------
**ImageResolverError**
Base exception for the exceptions below.
**ImageInfoException**
Raised if the image could not be read or if its type, width, or height could not be determined.
By default this exception is logged and skipped. Pass the "skip_fetch_errors=False" option to WebpageResolver to have it raised instead.
**HTTPException**
Raised if the image could not be loaded from the URL.
By default this exception is logged and skipped. Pass the "skip_fetch_errors=False" option to WebpageResolver to have it raised instead.
TODO
----
Better caching needs to be implemented. The plan is to add a configurable cache method so that images seen across sessions can be cached for better performance.
AUTHOR
------
Chris Brown
BUGS
----
Probably. Send us an email or a patch if you find one.
COPYRIGHT / ACKNOWLEDGEMENTS
----------------------------
Copyright (c) 2023 Constituent Voice, LLC.
Original idea and basic setup came from Maurice Svay https://github.com/mauricesvay/ImageResolver
Image detection came from the bfg-pages project https://code.google.com/p/bfg-pages/
Reading AdBlock Plus filters forked from https://github.com/wildgarden/abpy
LICENSE
-------
Some of the source libraries are licensed under the BSD license. To avoid license messiness we've chosen to release this software under the BSD license as well.
The easylist.txt provided by AdBlockPlus is licensed under the GPL and should be updated regularly anyway. For these reasons we have chosen not to
include the file in the package. You can pass it as the "blacklist" or "whitelist" parameter to WebpageResolver.
Raw data
{
"_id": null,
"home_page": "https://github.com/constituentvoice/ImageResolverPython",
"name": "ImageResolver",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "",
"author": "Chris Brown",
"author_email": "chris.brown@nwyc.com",
"download_url": "https://files.pythonhosted.org/packages/88/6d/2e748b0b9d1d578b1f02264b5e01366f115d86eebbe9dda139f78104d7c4/ImageResolver-0.4.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "BSD",
"summary": "Find the most significant image in an article.",
"version": "0.4.2",
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "886d2e748b0b9d1d578b1f02264b5e01366f115d86eebbe9dda139f78104d7c4",
"md5": "e556072eaf1963b67b51505057ded0f2",
"sha256": "9ee40a3b6056f055e2f49ebe5c9ce5e3d209a4319d614f1cdc3651a788230d93"
},
"downloads": -1,
"filename": "ImageResolver-0.4.2.tar.gz",
"has_sig": false,
"md5_digest": "e556072eaf1963b67b51505057ded0f2",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 15846,
"upload_time": "2023-01-18T18:46:20",
"upload_time_iso_8601": "2023-01-18T18:46:20.758527Z",
"url": "https://files.pythonhosted.org/packages/88/6d/2e748b0b9d1d578b1f02264b5e01366f115d86eebbe9dda139f78104d7c4/ImageResolver-0.4.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-01-18 18:46:20",
"github": true,
"gitlab": false,
"bitbucket": false,
"github_user": "constituentvoice",
"github_project": "ImageResolverPython",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "imageresolver"
}