# scraped
Tools for scraping.
To install: ```pip install scraped```
# Showcase of main functionalities
Note that when installed via pip, `scraped` comes with a command-line tool of the same name.
Run this in your terminal:
```bash
scraped -h
```
Output:
```
usage: tools.py [-h] {markdown-of-site,download-site,scrape-multiple-sites} ...
...
```
These tools are written in Python, so you can also use them by importing:
```python
from scraped import markdown_of_site, download_site, scrape_multiple_sites
```
`download_site` downloads one page (the default, `depth=1`) or several pages (if you
specify a larger `depth`) starting from a target URL, saving them as files in a folder
you can optionally specify.
`scrape_multiple_sites` can be used to download several sites.
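Its signature isn't detailed in this README; here's a minimal sketch, assuming it accepts an iterable of start URLs and forwards `download_site`-style keyword arguments such as `depth`:
```python
from scraped import scrape_multiple_sites

# Hypothetical usage: an iterable of start URLs, plus download_site-style options.
scrape_multiple_sites(
    ['http://www.example.com', 'https://i2mint.github.io/dol/'],
    depth=1,
)
```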
`markdown_of_site` uses `download_site` (by default, saving to a temporary folder),
then aggregates all the pages into a single markdown string, which it can save for
you if you ask (by specifying a `save_filepath`).
Below you'll find more details on these functionalities.
You'll find more useful functions in the code, but the three I mention here are
the "top" ones I use most often.
## markdown_of_site
Download a site and convert it to markdown.
This can be quite useful when you want to perform some NLP analysis on a site,
feed some information to an AI model, or simply want to read the site offline.
Markdown offers a happy medium between readability and simplicity, and is
supported by many tools and platforms.
Args:
- url: The URL of the site to download.
- depth: The maximum depth to follow links.
- filter_urls: A function to filter URLs to download.
- save_filepath: The file path where the combined Markdown will be saved.
- verbosity: The verbosity level.
- dir_to_save_page_slurps: The directory to save the downloaded pages.
- extra_kwargs: Extra keyword arguments to pass to the Scrapy spider.
Returns:
- The Markdown string of the site (if save_filepath is None), otherwise the save_filepath.
```python
>>> markdown_of_site(
... "https://i2mint.github.io/dol/",
... depth=2,
... save_filepath='~/dol_documentation.md'
... ) # doctest: +SKIP
'~/dol_documentation.md'
```
If you don't specify a `save_filepath`, the function will return the Markdown
string, which you can then analyze directly, and/or store as you wish.
```python
>>> markdown_string = markdown_of_site("https://i2mint.github.io/dol/") # doctest: +SKIP
>>> print(f"{type(markdown_string).__name__} of length {len(markdown_string)}") # doctest: +SKIP
str of length 626439
```
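The `filter_urls` argument (listed above) can restrict which linked pages get slurped before aggregation. A hedged sketch, assuming `filter_urls` is a predicate that takes a URL and returns `True` for URLs to keep:
```python
>>> markdown_string = markdown_of_site(
...     "https://i2mint.github.io/dol/",
...     depth=2,
...     # Assumption: filter_urls is a URL -> bool predicate.
...     filter_urls=lambda url: url.startswith("https://i2mint.github.io/dol/"),
... )  # doctest: +SKIP
```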
## download_site
```python
download_site('http://www.example.com')
```
will just download the page the URL points to, storing it in the default rootdir,
which, on Unix/macOS for example, is `~/.config/scraped/data`, but can be configured
through a `SCRAPED_DFLT_ROOTDIR` environment variable.
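For instance (a sketch, assuming the environment variable is consulted when `scraped` resolves its default rootdir, so it should be set before you trigger a download):
```python
import os

# Assumption: SCRAPED_DFLT_ROOTDIR is read when the default rootdir is resolved,
# so set it before importing/calling the download functions.
os.environ['SCRAPED_DFLT_ROOTDIR'] = os.path.expanduser('~/my_scraped_data')

from scraped import download_site

download_site('http://www.example.com')  # pages land under ~/my_scraped_data
```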
The `depth` argument lets you download more content starting from the URL:
```python
download_site('http://www.example.com', depth=3)
```
And there are more arguments:
* `start_url`: The URL to start downloading from.
* `url_to_filepath`: The function to convert URLs to local filepaths.
* `depth`: The maximum depth to follow links.
* `filter_urls`: A function to filter URLs to download.
* `mk_missing_dirs`: Whether to create missing directories.
* `verbosity`: The verbosity level.
* `rootdir`: The root directory to save the downloaded files.
* `extra_kwargs`: Extra keyword arguments to pass to the Scrapy spider.
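Putting a few of these together (a sketch; the argument names come from the list above, but I'm assuming `filter_urls` is a URL-to-bool predicate and that `rootdir` takes a local directory path):
```python
from scraped import download_site

download_site(
    'http://www.example.com',
    depth=2,
    # Assumption: filter_urls is a predicate deciding which linked URLs to follow.
    filter_urls=lambda url: 'example.com' in url,
    rootdir='/tmp/my_scrapes',  # hypothetical local folder to save the files in
    verbosity=1,
)
```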