[![PyPi version](https://img.shields.io/pypi/v/url_cache.svg)](https://pypi.python.org/pypi/url_cache) [![Python3.7|3.8|3.9|3.10|3.11](https://img.shields.io/pypi/pyversions/url_cache.svg)](https://pypi.python.org/pypi/url_cache) [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square)](http://makeapullrequest.com)
This is currently very alpha and in development, so expect changes to the API/interface. It aims to walk the line between extracting enough text/data to be useful, but not so much that it takes enormous amounts of space.
As it stands, I'm somewhat pessimistic that this will ever be a silver bullet; getting useful info out of arbitrary HTML is hard, so you're mostly stuck writing parsers for each website you're interested in. However, I still use this frequently, especially as a cache for API information, as described [below](#api-cache-examples).
Current TODOs:
- [ ] Add more sites using the [abstract interface](https://github.com/seanbreckenridge/url_cache/blob/master/src/url_cache/sites/abstract.py), to get more info from sites I use commonly. Ideally, this should be able to re-use common scraper/parser/API-interface libraries in Python, instead of recreating everything from scratch
- [ ] Create a (separate repo/project) daemon which handles configuring this and slowly requests things in the background as they become available through given sources; allow the user to provide generators/inputs and to define include/exclude lists/regexes. Probably just integrate with [promnesia](https://github.com/karlicoss/promnesia) to avoid duplicating the work of searching for URLs on disk
## Installation
Requires `python3.7+`
To install with pip, run:
python3 -m pip install url_cache
As this is still in development, for the latest changes install from git: `python3 -m pip install git+https://github.com/seanbreckenridge/url_cache`
## Rationale
A file system cache which saves URL metadata and summarizes content
This is meant to provide more context to any of my tools which use URLs. If I [watched some youtube video](https://github.com/seanbreckenridge/mpv-history-daemon) and I have a URL, I'd like to have the subtitles for it, so I can do a text-search over all the videos I've watched. If I [read an article](https://github.com/seanbreckenridge/browserexport), I want the article text! This requests, parses and abstracts away that data for me locally, so I can do something like:
```python
>>> from url_cache.core import URLCache
>>> u = URLCache()
>>> data = u.get("https://sean.fish/")
>>> data.metadata["images"][-1]
{'src': 'https://raw.githubusercontent.com/seanbreckenridge/glue/master/assets/screenshot.png', 'alt': 'screenshot', 'type': 'body_image', 'width': 600}
>>> data.metadata["description"]
"sean.fish; Sean Breckenridge's Home Page"
```
If I ever request that URL again, the information is grabbed from a local cache instead.
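For example, continuing the interpreter session above, a second call for the same URL reads from the on-disk cache instead of making another request:

```python
>>> cached = u.get("https://sean.fish/")  # served from the local cache this time, no network request
>>> cached.metadata["description"]
"sean.fish; Sean Breckenridge's Home Page"
```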
Generally, this uses:
- [`lassie`](https://github.com/michaelhelmick/lassie) to get generic metadata; the title, description, opengraph information, links to images/videos on the page.
- [`readability`](https://github.com/buriy/python-readability) to convert/compress HTML to a summary of the HTML content.
Site-Specific Extractors:
- [Youtube](./docs/url_cache/sites/youtube/subtitles_downloader.md): to get manual/auto-generated captions (converted to a `.srt` file) from Youtube URLs
- Stackoverflow (Just a basic URL preprocessor to reduce the possibility of conflicts/duplicate data)
- MyAnimeList (using [Jikan v4](https://docs.api.jikan.moe/))
This is meant to be extensible -- so it's possible for you to write your own extractors/file loaders/dumpers (for new formats, e.g. `srt`) for sites you use commonly and pass those to `url_cache.core.URLCache` to extract richer data for those sites. Otherwise, it saves the information from `lassie` and the summarized HTML from `readability` for each URL.
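As a rough illustration only -- the real base class lives in [`sites/abstract.py`](https://github.com/seanbreckenridge/url_cache/blob/master/src/url_cache/sites/abstract.py) and its actual class/method names differ -- a site-specific extractor conceptually needs two pieces: a way to decide whether it handles a given URL, and a way to enrich the generic data with site-specific fields:

```python
# Hypothetical sketch -- class and method names here are illustrative,
# not the actual interface from url_cache.sites.abstract.
class ExampleSiteExtractor:
    def matches(self, url: str) -> bool:
        # decide whether this extractor should handle the URL
        return "example.com" in url

    def extract(self, url: str, metadata: dict) -> dict:
        # layer site-specific fields on top of the generic lassie/readability data
        metadata["example_field"] = "parsed from the page or a site API"
        return metadata
```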
To avoid scope creep, this probably won't support:
- Converting the HTML summary to text (use something like the `lynx` command below)
- Minimizing HTML - run something like `find ~/.local/share/url_cache/ -name '*.html' -exec <some tool/script that minimizes in place> \;` instead -- the data is just stored in individual files in the data directory
### Usage:
In Python, this can be configured using the `url_cache.core.URLCache` class. For example:
```python
import logging
from url_cache.core import URLCache
# make requests every 2 seconds
# debug logs
# save to a folder in my home directory
cache = URLCache(loglevel=logging.DEBUG, sleep_time=2, cache_dir="~/Documents/urldata")
c = cache.get("https://github.com/seanbreckenridge")
# just request information, don't read/save to cache
data = cache.request_data("https://www.wikipedia.org/")
```
For more information, see [the docs](./docs/url_cache/core.md)
The CLI interface provides some utility commands to get/list information from the cache.
```
Usage: url_cache [OPTIONS] COMMAND [ARGS]...
Options:
--cache-dir PATH Override default cache directory location
--debug / --no-debug Increase log verbosity
--sleep-time INTEGER How long to sleep between requests
--summarize-html / --no-summarize-html
Use readability to summarize html. Otherwise
saves the entire HTML document
--skip-subtitles / --no-skip-subtitles
Skip downloading Youtube Subtitles
--subtitle-language TEXT Subtitle language for Youtube Subtitles
--help Show this message and exit.
Commands:
cachedir Prints the location of the local cache directory
export Print all cached information as JSON
get Get information for one or more URLs Prints results as JSON
in-cache Prints if a URL is already cached
list List all cached URLs
```
An environment variable `URL_CACHE_DIR` can be set, which changes the default cache directory.
### API Cache Examples
I've also successfully used this to cache API responses in some of my projects, by subclassing and overriding the `request_data` function: I just make a request and return a summary, and it transparently handles caching the rest (a rough sketch of the pattern follows the links below). See:
- [`albums/discogs_cache`](https://github.com/seanbreckenridge/albums/blob/9d296c4abb8e9e16c8dd410aeae8e5bb760008de/nextalbums/discogs_cache.py)
- [`my_feed/tmdb`](https://github.com/seanbreckenridge/my_feed/blob/master/src/my_feed/sources/trakt/tmdb.py)
- [`dbsentinel/metadata`](https://github.com/seanbreckenridge/dbsentinel/blob/accfc70485644d8966a582204c6c47839d2d874e/mal_id/metadata_cache.py)
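A minimal sketch of that pattern, under my own assumptions about the library internals (the `Summary` import path, its constructor, and the exact `request_data` signature may differ -- the linked projects above are the real, working versions):

```python
# Rough sketch only: the Summary import/constructor and the request_data
# signature are assumptions, not the confirmed url_cache API.
import requests

from url_cache.core import URLCache
from url_cache.summary import Summary  # import path is a guess


class ExampleApiCache(URLCache):
    def request_data(self, url: str) -> Summary:
        # swap the generic page scrape for an API call and keep its JSON;
        # URLCache takes care of reading/writing the on-disk cache around this
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        return Summary(url=url, metadata=resp.json())
```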
### CLI Examples
The `get` command emits `JSON`, so it can be used with other tools (e.g. [`jq`](https://stedolan.github.io/jq/)), like:
```shell
$ url_cache get "https://click.palletsprojects.com/en/7.x/arguments/" | \
jq -r '.[] | .html_summary' | lynx -stdin -dump | head -n 5
Arguments
Arguments work similarly to [1]options but are positional. They also
only support a subset of the features of options due to their
syntactical nature. Click will also not attempt to document arguments
```
```shell
$ url_cache export | jq -r '.[] | .metadata | .title'
seanbreckenridge - Overview
Arguments — Click Documentation (7.x)
```
```shell
url_cache list --location
/home/sean/.local/share/url_cache/data/2/c/7/6284b2f664f381372fab3276449b2/000
/home/sean/.local/share/url_cache/data/7/5/1/70fc230cd88f32e475ff4087f81d9/000
```
```shell
# to make a backup of the cache directory
$ tar -cvzf url_cache.tar.gz "$(url_cache cachedir)"
```
Accessible through the `url_cache` script and `python3 -m url_cache`
### Implementation Notes
This stores all of the information as individual files in a cache directory. In particular, it `MD5`-hashes the URL and stores information like:
```
.
└── a
└── a
└── e
└── cf0118bb22340e18fff20f2db8abd
└── 000
├── data
│ └── subtitles.srt
├── key
├── metadata.json
└── timestamp.datetime.txt
```
In other words, this is a file system hash table which implements separate chaining.
You're free to delete any of the directories in the cache if you want; this doesn't maintain a strict index -- it uses a hash of the URL and then searches for a matching `key` file.
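As a rough illustration (not the library's actual code), the lookup conceptually works like this: hash the URL, shard the hex digest into nested directories, then walk the numbered buckets until one's `key` file matches:

```python
# Illustrative sketch of the on-disk layout described above -- not the
# library's actual implementation.
import hashlib
from pathlib import Path
from typing import Optional


def lookup(base: Path, url: str) -> Optional[Path]:
    digest = hashlib.md5(url.encode()).hexdigest()
    # the first three hex characters become nested directories, the rest is the leaf directory
    shard = base / digest[0] / digest[1] / digest[2] / digest[3:]
    if not shard.exists():
        return None
    # "000", "001", ... buckets handle hash collisions (separate chaining);
    # the key file in each bucket records which URL it actually holds
    for bucket in sorted(shard.iterdir()):
        keyfile = bucket / "key"
        if keyfile.exists() and keyfile.read_text().strip() == url:
            return bucket
    return None
```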
By default this waits 5 seconds between requests. Since all the info is cached, I use this by requesting all the info from one data source (e.g. my bookmarks, or videos I've watched recently) in a loop in the background, which saves all the information to my computer. The next time I do that same loop, it doesn't have to make any requests and it just grabs all the info from local cache.
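For example, such a background loop might look like this (the URL list here is a stand-in for whatever source you use -- bookmarks, watch history, etc.):

```python
from url_cache.core import URLCache

cache = URLCache(sleep_time=5)  # 5 seconds between requests is the default

# stand-in for whatever source you use (bookmarks, videos watched recently, ...)
urls = [
    "https://sean.fish/",
    "https://github.com/seanbreckenridge",
]

for url in urls:
    summary = cache.get(url)  # network request + save on the first run, local read afterwards
    print(summary.metadata.get("title"))
```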
Originally created for [`HPI`](https://github.com/seanbreckenridge/HPI).
---
### Testing
```
git clone 'https://github.com/seanbreckenridge/url_cache'
cd ./url_cache
pip install '.[testing]'
mypy ./src/url_cache
flake8 ./src/url_cache
pytest
```