# scraped
Tools for scraping.
To install: `pip install scraped`
# Showcase of main functionalities
Note that when installed with pip, `scraped` comes with a command-line tool of the same name.
Run this in your terminal:
```bash
scraped -h
```
Output:
```
usage: tools.py [-h] {markdown-of-site,download-site,scrape-multiple-sites} ...
...
```
These tools are written in Python, so you can also use them directly by importing:
```python
from scraped import markdown_of_site, download_site, scrape_multiple_sites
```
`download_site` downloads one page (the default, `depth=1`) or several pages (if you
specify a larger `depth`) starting from a target URL, saving them as files in a folder
of your (optional) choice.
`scrape_multiple_sites` can be used to download several sites.
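For instance, a minimal sketch (the assumption that it takes an iterable of start URLs, plus the same kind of keyword arguments as `download_site`, is mine; check its docstring for the exact signature):

```python
>>> from scraped import scrape_multiple_sites  # doctest: +SKIP
>>> scrape_multiple_sites(
...     ['http://www.example.com', 'https://i2mint.github.io/dol/'],
...     depth=1,  # assumed: forwarded to each site's crawl
... )  # doctest: +SKIP
```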
`markdown_of_site` uses `download_site` (by default, saving to a temporary folder),
then aggregates all the pages into a single markdown string, which it can save for
you if you ask (by specifying a `save_filepath`).
Below you'll find more details on these functionalities.
There are more useful functions in the code, but the three mentioned here are
the "top" ones I use most often.
## markdown_of_site
Download a site and convert it to markdown.
This can be quite useful when you want to perform some NLP analysis on a site,
feed some information to an AI model, or simply want to read the site offline.
Markdown offers a happy medium between readability and simplicity, and is
supported by many tools and platforms.
Args:
- url: The URL of the site to download.
- depth: The maximum depth to follow links.
- filter_urls: A function to filter URLs to download.
- save_filepath: The file path where the combined Markdown will be saved.
- verbosity: The verbosity level.
- dir_to_save_page_slurps: The directory to save the downloaded pages.
- extra_kwargs: Extra keyword arguments to pass to the Scrapy spider.
Returns:
- The Markdown string of the site (if save_filepath is None), otherwise the save_filepath.
```python
>>> markdown_of_site(
... "https://i2mint.github.io/dol/",
... depth=2,
... save_filepath='~/dol_documentation.md'
... ) # doctest: +SKIP
'~/dol_documentation.md'
```
If you don't specify a `save_filepath`, the function will return the Markdown
string, which you can then analyze directly, and/or store as you wish.
```python
>>> markdown_string = markdown_of_site("https://i2mint.github.io/dol/") # doctest: +SKIP
>>> print(f"{type(markdown_string).__name__} of length {len(markdown_string)}") # doctest: +SKIP
str of length 626439
```
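If you only want certain pages included, you can pass a predicate via `filter_urls`. Here's a minimal sketch (the assumption that the predicate receives a URL string and returns a boolean is mine; `'module_docs'` is just a hypothetical substring to match):

```python
>>> md = markdown_of_site(
...     "https://i2mint.github.io/dol/",
...     depth=2,
...     filter_urls=lambda url: 'module_docs' in url,  # hypothetical: keep only matching URLs
... )  # doctest: +SKIP
```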
## download_site
```python
download_site('http://www.example.com')
```
will just download the page the URL points to, storing it in the default root directory,
which, on unix/mac for example, is `~/.config/scraped/data`, and can be configured
through the `SCRAPED_DFLT_ROOTDIR` environment variable.
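For example, a sketch of pointing `scraped` at a different root directory via that environment variable (setting it before importing, so it is picked up when the default is resolved, is an assumption on my part):

```python
>>> import os
>>> os.environ['SCRAPED_DFLT_ROOTDIR'] = '/tmp/scraped_data'  # assumed: set before importing scraped
>>> from scraped import download_site  # doctest: +SKIP
>>> download_site('http://www.example.com')  # doctest: +SKIP
```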
The `depth` argument lets you download more content, following links starting from that URL:
```python
download_site('http://www.example.com', depth=3)
```
And there are more arguments (a usage sketch follows the list):
* `start_url`: The URL to start downloading from.
* `url_to_filepath`: The function to convert URLs to local filepaths.
* `depth`: The maximum depth to follow links.
* `filter_urls`: A function to filter URLs to download.
* `mk_missing_dirs`: Whether to create missing directories.
* `verbosity`: The verbosity level.
* `rootdir`: The root directory to save the downloaded files.
* `extra_kwargs`: Extra keyword arguments to pass to the Scrapy spider.
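Putting a few of these together, here's a hedged sketch (the predicate form of `filter_urls` and passing `rootdir` as a path string are assumptions based on the descriptions above):

```python
>>> download_site(
...     'http://www.example.com',
...     depth=2,
...     filter_urls=lambda url: '/blog/' not in url,  # hypothetical: skip blog pages
...     rootdir='~/my_scrapes',  # assumed: where the downloaded files should go
... )  # doctest: +SKIP
```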