scraped 0.0.11 (PyPI metadata)

- Summary: Tools for scraping
- Home page: https://github.com/thorwhalen/scraped
- Author: Thor Whalen
- License: MIT
- Requires Python: not specified
- Upload time: 2025-08-20 15:46:05
- Requirements: none recorded
# scraped

Tools for scraping.

To install: `pip install scraped`


# Showcase of main functionalities

Note that when installed with pip, `scraped` comes with a command-line tool of the same name.
Run this in your terminal:

```bash
scraped -h
```

Output:

```
usage: tools.py [-h] {markdown-of-site,download-site,scrape-multiple-sites} ...

...
```

These tools are written in Python, so you can also use them by importing them:

```python
from scraped import markdown_of_site, download_site, scrape_multiple_sites
```

`download_site` downloads one page (the default, `depth=1`) or, if you specify a larger
`depth`, several pages starting from a target URL, saving them as files in a folder of
your choosing (optional; a default folder is used otherwise).

`scrape_multiple_sites` can be used to download several sites.
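
For instance, a minimal sketch, assuming (this is an assumption, not a documented signature) that it accepts an iterable of start URLs; check `help(scrape_multiple_sites)` for the real signature before relying on it:

```python
from scraped import scrape_multiple_sites

# Hypothetical sketch: the list-of-URLs argument is an assumption based on the
# function's name; verify with help(scrape_multiple_sites) in your installed version.
scrape_multiple_sites(
    ['https://i2mint.github.io/dol/', 'http://www.example.com'],
)
```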

`markdown_of_site` uses `download_site` (by default, saving to a temporary folder), 
then aggregates all the downloaded pages into a single Markdown string, which it can 
also save for you if you ask (by specifying a `save_filepath`).

Below you'll find more details on these functionalities. 

There are more useful functions in the code, but the three I mention here are 
the "top" ones I use most often.

## markdown_of_site

Download a site and convert it to markdown.

This can be quite useful when you want to perform some NLP analysis on a site, 
feed some information to an AI model, or simply want to read the site offline.
Markdown offers a happy medium between readability and simplicity, and is
supported by many tools and platforms.

Args:
- url: The URL of the site to download.
- depth: The maximum depth to follow links.
- filter_urls: A function to filter URLs to download.
- save_filepath: The file path where the combined Markdown will be saved.
- verbosity: The verbosity level.
- dir_to_save_page_slurps: The directory to save the downloaded pages.
- extra_kwargs: Extra keyword arguments to pass to the Scrapy spider.

Returns:
- The Markdown string of the site (if save_filepath is None), otherwise the save_filepath.

```python
>>> markdown_of_site(
...     "https://i2mint.github.io/dol/",
...     depth=2,
...     save_filepath='~/dol_documentation.md'
... )  # doctest: +SKIP
'~/dol_documentation.md'
```

If you don't specify a `save_filepath`, the function will return the Markdown 
string, which you can then analyze directly, and/or store as you wish.

```python
>>> markdown_string = markdown_of_site("https://i2mint.github.io/dol/")  # doctest: +SKIP
>>> print(f"{type(markdown_string).__name__} of length {len(markdown_string)}")  # doctest: +SKIP
str of length 626439
```
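
The `filter_urls` argument lets you restrict what gets crawled, for example to a given subpath. A minimal sketch, assuming `filter_urls` is called with each candidate URL string and keeps it when the predicate returns a truthy value (that calling convention is an assumption; the docstring only says it is "a function to filter URLs to download"):

```python
from scraped import markdown_of_site

# Sketch: keep only URLs under the documentation root.
# Assumption: filter_urls is a per-URL predicate (truthy = keep).
docs_md = markdown_of_site(
    "https://i2mint.github.io/dol/",
    depth=2,
    filter_urls=lambda url: url.startswith("https://i2mint.github.io/dol/"),
)
```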

## download_site

```python
download_site('http://www.example.com')
```

will download just the page the URL points to, storing it in the default root directory, 
which on Unix/macOS is, for example, `~/.config/scraped/data`, and can be configured 
through a `SCRAPED_DFLT_ROOTDIR` environment variable.
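
A sketch of configuring that default location via the documented environment variable (whether the value is read at import time or at call time isn't documented, so setting it before the import is a cautious assumption):

```python
import os

# Point scraped at a custom root directory via the documented env variable.
# Assumption: set it before importing scraped, in case the default is read at import time.
os.environ["SCRAPED_DFLT_ROOTDIR"] = os.path.expanduser("~/my_scrapes")

from scraped import download_site

download_site("http://www.example.com")  # pages should land under ~/my_scrapes
```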

The `depth` argument lets you download more content by following links from the start URL:


```python
download_site('http://www.example.com', depth=3)
```

There are more arguments (illustrated in the sketch after this list):
* `start_url`: The URL to start downloading from.
* `url_to_filepath`: The function to convert URLs to local filepaths.
* `depth`: The maximum depth to follow links.
* `filter_urls`: A function to filter URLs to download.
* `mk_missing_dirs`: Whether to create missing directories.
* `verbosity`: The verbosity level.
* `rootdir`: The root directory to save the downloaded files.
* `extra_kwargs`: Extra keyword arguments to pass to the Scrapy spider.
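
Putting a few of these together, a hedged sketch: the argument names are exactly the ones listed above, but the lambda and the path are illustrative, and whether `rootdir` expands `~` isn't documented, so `expanduser` is used here to be safe.

```python
import os
from scraped import download_site

# Sketch combining documented arguments: crawl two levels deep, keep only URLs
# on the same host (assumes filter_urls is a per-URL predicate), and save under
# a custom root directory.
download_site(
    "http://www.example.com",
    depth=2,
    filter_urls=lambda url: "example.com" in url,
    rootdir=os.path.expanduser("~/scrapes/example_com"),
)
```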


            
