# graze
Cache (a tiny part of) the internet.
(For the technically inclined, graze is meant to enable the separation of the concerns
of getting and caching data from the internet.)
## install
```pip install graze```
# Example
```python
from graze import Graze
import os
rootdir = os.path.expanduser('~/graze')
g = Graze(rootdir)
list(g)
```
If this is your first time, you got nothing:
```
[]
```
So get something. For no particular reason let's be self-referential and get myself:
```python
url = 'https://raw.githubusercontent.com/thorwhalen/graze/master/README.md'
content = g[url]
type(content), len(content)
```
Before I grew up, I had only 46 petty bytes:
```
(bytes, 46)
```
These were:
```python
print(content.decode())
```
```
# graze
Cache (a tiny part of) the internet
```
But now, here's the deal. List your ``g`` keys now. Go ahead, don't be shy!
```python
list(g)
```
```
['https://raw.githubusercontent.com/thorwhalen/graze/master/README.md']
```
What does that mean?
It means you have a local copy of these contents.
The file path isn't really ``https://...``, it's `rootdir/https/...`, but you
only have to care about that if you actually have to go get the file with
something other than graze. Because graze will give it to you.
How? Same way you got it in the first place:
```python
content_2 = g[url]
assert content_2 == content
```
But this time, it didn't ask the internet. It just got its local copy.
And if you want a fresh copy?
No problem, just delete your local one. You guessed it!
The same way you would delete a key from a dict:
```python
del g[url]
```
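
The get, cached get, and delete cycle above is the classic cache-on-miss pattern. Here is a minimal in-memory sketch of that idea. It is not graze's implementation (graze persists to files under `rootdir`), and `ToyCache` and `fake_fetch` are hypothetical names used only for illustration:

```python
class ToyCache(dict):
    """A dict that calls `fetch(key)` on a miss and remembers the result."""

    def __init__(self, fetch):
        super().__init__()
        self.fetch = fetch  # hypothetical: any callable mapping url -> bytes

    def __missing__(self, key):
        # Only reached when `key` is not stored yet
        value = self.fetch(key)
        self[key] = value
        return value


calls = []  # records every simulated "internet" access


def fake_fetch(url):
    calls.append(url)
    return b'contents of ' + url.encode()


cache = ToyCache(fake_fetch)
first = cache['http://example.com']   # miss: fetches and stores
second = cache['http://example.com']  # hit: served from the cache
del cache['http://example.com']       # drop the local copy
third = cache['http://example.com']   # miss again: fetches a fresh copy
```

In this toy, `fake_fetch` runs only on the first access and on the access after the `del`, which is the same behavior described for `g[url]` above.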
# Q&A
## The pages I need to slurp need to be rendered, can I use selenium or other such engines?
Sure!
We understand that sometimes you might have special slurping needs -- such
as needing to let the JS render the page fully, and/or extract something
specific, in a specific way, from the page.
Selenium is a popular choice for these needs.
`graze` doesn't install selenium for you, but if you've done that, you just
need to specify a different `Internet` object for `Graze` to source from.
To make an `Internet` object, you just need to specify a
`url_to_contents` function that does exactly what it says: it takes a url and returns its contents.
Note that the contents need to be returned in bytes for `Graze` to work.
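
To make the "url in, bytes out" contract concrete, here is a minimal sketch of such a function built on the standard library only. The name `url_to_contents_urllib` and its `timeout` parameter are assumptions made for this example; they are not part of graze:

```python
from urllib.request import urlopen


def url_to_contents_urllib(url: str, timeout: float = 10.0) -> bytes:
    """Return the raw bytes served at `url` (hypothetical helper name)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()  # bytes, as Graze requires
```

Assuming `Internet` accepts any callable with this shape, as the text above suggests, it would be wired in with `Graze(source=Internet(url_to_contents=url_to_contents_urllib))`.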
If you want to use some of the default `selenium` `url_to_contents` functions
to make an `Internet` (we've got Chrome, Firefox, Safari, and Opera),
go ahead! Here's an example using the default Chrome driver
(again, you need to have the driver installed already for this to work;
see https://selenium-python.readthedocs.io/):
```python
from graze import Graze, url_to_contents, Internet
g = Graze(source=Internet(url_to_contents=url_to_contents.selenium_chrome))
```
And if you'll be using it often, just do:
```python
from graze import Graze, url_to_contents, Internet
from functools import partial
my_graze = partial(
    Graze,
    rootdir='a_specific_root_dir_for_your_project',
    source=Internet(url_to_contents=url_to_contents.selenium_chrome)
)
# and then you can just do
g = my_graze()
# and get on with the fun...
```
## What if I want a fresh copy of the data?
Classic caching problem.
You like the convenience of having a local copy, but then how do you keep in sync with the data source if it changes?
If you KNOW the source data changed and want to sync, it's easy. You delete the local copy
(like deleting a key from a dict: `del Graze()[url]`)
and you try to access it again.
Since you don't have a local copy, it will get one from the `url` source.
What if you want this to happen automatically?
Well, there are several ways to do that.
If you have a way to know if the source and local are different (through modified dates, or hashes, etc.),
then you can write a little function to keep things in sync.
But that's context dependent; `graze` doesn't offer you any default way to do it.
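
As an illustration of such a "little function", here is a hash-based sketch. It assumes you can obtain the expected SHA-256 of the source by some other means (for instance a checksum published next to the data); `drop_if_stale` and `expected_sha256` are hypothetical names, and the function relies only on the dict-like operations shown earlier (iterating keys, `store[url]`, and `del store[url]`):

```python
import hashlib


def drop_if_stale(store, url, expected_sha256):
    """Delete the cached copy of `url` if its SHA-256 differs from `expected_sha256`.

    Returns True if a stale copy was deleted, False otherwise.
    """
    if url not in list(store):  # iterate keys, so nothing gets downloaded here
        return False
    local_digest = hashlib.sha256(store[url]).hexdigest()
    if local_digest == expected_sha256:
        return False
    del store[url]  # the next `store[url]` will have to fetch a fresh copy
    return True
```

Called as `drop_if_stale(g, url, expected_sha256)` before `g[url]`, this would leave an up-to-date local copy alone and force a re-download only when the digests differ.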
Another way to do this is sometimes known as a `TTL Cache` (time-to-live cache).
You get such functionality with the `graze.GrazeWithDataRefresh` store or, in most cases,
simply by getting your data through the `graze` function
with a `max_age` value (in seconds):
```python
from graze import graze
content_bytes = graze(url, max_age=in_seconds)
```
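
For intuition about what a time-to-live check amounts to, here is a small sketch that treats a file's modification time as its age. How graze measures age internally is not shown in this README, so this is an assumption for illustration only; `is_expired` and its `now` parameter are hypothetical:

```python
import os
import time


def is_expired(filepath, max_age, now=None):
    """True if the file was last modified more than `max_age` seconds ago."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(filepath) > max_age
```

A TTL cache would re-download whenever such a check returns True and serve the local copy otherwise.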
## Can I make graze notify me when it gets a new copy of the data?
Sure! Just specify a `key_ingress` function when you make your `Graze` object, or
when you call `graze`. This function will be called on the key (the url) just before
the contents are downloaded from the internet. The typical function would be:
```python
key_ingress = lambda key: print(f"Getting {key} from the internet")
```
## Does graze work for dropbox links?
Yes it does, but you need to be aware that Dropbox systematically sends the data as a zip, **even if there's only one file in it**.
Here's some code that can help.
```python
def zip_store_of_dropbox_url(dropbox_url: str):
    """Get a key-value perspective of the (folder) contents
    of the zip a dropbox url gets you"""
    from graze import graze
    from py2store import FilesOfZip

    return FilesOfZip(graze(dropbox_url))


def filebytes_of_dropbox_url(dropbox_url: str, assert_only_one_file=True):
    """Get the bytes of the first file in a zip that a dropbox url gives you"""
    zip_store = zip_store_of_dropbox_url(dropbox_url)
    zip_filepaths = iter(zip_store)
    first_filepath = next(zip_filepaths)
    if assert_only_one_file:
        assert next(zip_filepaths, None) is None, f"More than one file in {dropbox_url}"
    return zip_store[first_filepath]
```
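
If you would rather not depend on `py2store`, the same "first file of the zip" idea can be sketched with the standard library alone. `first_file_bytes_of_zip` is a hypothetical helper written for this example, not something graze provides:

```python
import io
import zipfile


def first_file_bytes_of_zip(zip_bytes: bytes, assert_only_one_file: bool = True) -> bytes:
    """Return the bytes of the first file in an in-memory zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        # Skip directory entries, whose names end with '/'
        names = [name for name in zf.namelist() if not name.endswith('/')]
        if not names:
            raise ValueError('The zip archive contains no files')
        if assert_only_one_file and len(names) > 1:
            raise ValueError(f'More than one file in the archive: {names}')
        return zf.read(names[0])
```

Since `graze(url)` returns bytes, this would be used as `first_file_bytes_of_zip(graze(dropbox_url))`.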
# Notes
## New url-to-path mapping
`graze` used to have a more straightforward url-to-local_filepath mapping,
but it ended up being problematic. In a nutshell,
if you slurp `abc.com` and it goes to a file of that name,
where is `abc.com/data.zip` supposed to go? (`abc.com` would need to be a folder
in that case.)
See [issue](https://github.com/thorwhalen/graze/issues/1).
It's with a heavy heart that I changed the mapping to one that is still
straightforward, but has the disadvantage of mapping all files to the
same file name, without extension.
Hopefully a better solution will show up soon.
If you already have graze files from the old way, you can
use the `change_files_to_new_url_to_filepath_format` function to change these
to the new format.
Raw data
{
"_id": null,
"home_page": "https://github.com/thorwhalen/graze",
"name": "graze",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": null,
"author": "Thor Whalen",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/01/69/e28a4c6371f77452adb340a2ccd7b5c5c312e227516cf9d297a2c4cc0433/graze-0.1.27.tar.gz",
"platform": "any",
"bugtrack_url": null,
"license": "mit",
"summary": "Cache (a tiny part of) the internet",
"version": "0.1.27",
"project_urls": {
"Homepage": "https://github.com/thorwhalen/graze"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "9303c630b6ca891b0b7acfcc6a2c59c6544084c37e9f6f0b64638daeeacc8267",
"md5": "b485da012efa2be92e780772d52566da",
"sha256": "4255a7669875b640f634aa514a455763b53ae5d701a5e50d4fdc69bef0e29a90"
},
"downloads": -1,
"filename": "graze-0.1.27-py3-none-any.whl",
"has_sig": false,
"md5_digest": "b485da012efa2be92e780772d52566da",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 19018,
"upload_time": "2024-11-20T10:54:52",
"upload_time_iso_8601": "2024-11-20T10:54:52.610207Z",
"url": "https://files.pythonhosted.org/packages/93/03/c630b6ca891b0b7acfcc6a2c59c6544084c37e9f6f0b64638daeeacc8267/graze-0.1.27-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "0169e28a4c6371f77452adb340a2ccd7b5c5c312e227516cf9d297a2c4cc0433",
"md5": "8b0927fd889ed9b1d9d2232b27587e32",
"sha256": "365fd55f65584b09efe36480f3e676a7902808ab321596775652b2258117f773"
},
"downloads": -1,
"filename": "graze-0.1.27.tar.gz",
"has_sig": false,
"md5_digest": "8b0927fd889ed9b1d9d2232b27587e32",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 19713,
"upload_time": "2024-11-20T10:54:53",
"upload_time_iso_8601": "2024-11-20T10:54:53.481120Z",
"url": "https://files.pythonhosted.org/packages/01/69/e28a4c6371f77452adb340a2ccd7b5c5c312e227516cf9d297a2c4cc0433/graze-0.1.27.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-11-20 10:54:53",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "thorwhalen",
"github_project": "graze",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "graze"
}