scraped

Name: scraped
Version: 0.0.9
Summary: Tools for scraping
Home page: https://github.com/thorwhalen/scraped
Author: Thor Whalen
License: MIT
Upload time: 2024-11-08 12:55:18
Requirements: none recorded
# scraped

Tools for scraping.

To install: `pip install scraped`


# Showcase of main functionalities

Note that when installed with pip, `scraped` comes with a command-line tool of the same name.
Run this in your terminal:

```bash
scraped -h
```

Output:

```
usage: tools.py [-h] {markdown-of-site,download-site,scrape-multiple-sites} ...

...
```

These tools are written in Python, so you can also use them by importing them:

```python
from scraped import markdown_of_site, download_site, scrape_multiple_sites
```

`download_site` downloads one page (the default, `depth=1`) or several pages (if you specify
a larger `depth`) starting from a target URL, saving them as files in a folder of
your choice (optional).

`scrape_multiple_sites` can be used to download several sites in one call (see the sketch below).
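
Here's a minimal sketch of how that might look; the exact signature (an iterable of start URLs, with download options forwarded to the underlying downloads) is an assumption on my part, not taken from the package's documentation:

```python
from scraped import scrape_multiple_sites

# Assumption: scrape_multiple_sites takes an iterable of start URLs and forwards
# options such as depth to each site's download.
scrape_multiple_sites(
    ['http://www.example.com', 'https://i2mint.github.io/dol/'],
    depth=1,
)
```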

`markdown_of_site` uses `download_site` (by default, saving to a temporary folder),
then aggregates all the pages into a single Markdown string, which it can also save for
you if you ask (by specifying a `save_filepath`).

Below you'll find more details on these functionalities. 

You'll find more useful functions in the code, but the three I mention here are 
the "top" ones I use most often.

## markdown_of_site

Download a site and convert it to markdown.

This can be quite useful when you want to perform some NLP analysis on a site, 
feed some information to an AI model, or simply want to read the site offline.
Markdown offers a happy medium between readability and simplicity, and is
supported by many tools and platforms.

Args:
- url: The URL of the site to download.
- depth: The maximum depth to follow links.
- filter_urls: A function to filter URLs to download.
- save_filepath: The file path where the combined Markdown will be saved.
- verbosity: The verbosity level.
- dir_to_save_page_slurps: The directory to save the downloaded pages.
- extra_kwargs: Extra keyword arguments to pass to the Scrapy spider.

Returns:
- The Markdown string of the site (if save_filepath is None), otherwise the save_filepath.

```python
>>> markdown_of_site(
...     "https://i2mint.github.io/dol/",
...     depth=2,
...     save_filepath='~/dol_documentation.md'
... )  # doctest: +SKIP
'~/dol_documentation.md'
```

If you don't specify a `save_filepath`, the function will return the Markdown 
string, which you can then analyze directly, and/or store as you wish.

```python
>>> markdown_string = markdown_of_site("https://i2mint.github.io/dol/")  # doctest: +SKIP
>>> print(f"{type(markdown_string).__name__} of length {len(markdown_string)}")  # doctest: +SKIP
str of length 626439
```
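
If you only want part of a site, the `filter_urls` argument listed above can restrict which pages get included. A minimal sketch, assuming `filter_urls` is a predicate that takes a URL string and returns a boolean:

```python
from scraped import markdown_of_site

def under_dol_docs(url: str) -> bool:
    """Assumed predicate form: keep only pages under the dol documentation."""
    return url.startswith("https://i2mint.github.io/dol/")

markdown_string = markdown_of_site(
    "https://i2mint.github.io/dol/",
    depth=2,
    filter_urls=under_dol_docs,  # assumption: a URL -> bool predicate
)
```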

## download_site

```python
download_site('http://www.example.com')
```

will just download the page the URL points to, storing it in the default rootdir,
which, on Unix/Mac for example, is `~/.config/scraped/data`, but can be configured
through a `SCRAPED_DFLT_ROOTDIR` environment variable.
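
Here's a sketch of two ways to control where files land. That the environment variable is read when `scraped` resolves its default root directory (so it should be set before importing) is my assumption; the explicit `rootdir` argument is listed further below.

```python
import os

# Assumption: set SCRAPED_DFLT_ROOTDIR before importing scraped so the default
# root directory picks it up.
os.environ["SCRAPED_DFLT_ROOTDIR"] = os.path.expanduser("~/my_scrapes")

from scraped import download_site

# Alternatively, pass an explicit rootdir for a single call.
download_site("http://www.example.com", rootdir=os.path.expanduser("~/my_scrapes"))
```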

The `depth` argument lets you download more content, following links starting from that URL:


```python
download_site('http://www.example.com', depth=3)
```

And there are more arguments (a combined sketch follows this list):
* `start_url`: The URL to start downloading from.
* `url_to_filepath`: The function to convert URLs to local filepaths.
* `depth`: The maximum depth to follow links.
* `filter_urls`: A function to filter URLs to download.
* `mk_missing_dirs`: Whether to create missing directories.
* `verbosity`: The verbosity level.
* `rootdir`: The root directory to save the downloaded files.
* `extra_kwargs`: Extra keyword arguments to pass to the Scrapy spider.
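
Here's a sketch combining a few of these; that `filter_urls` is a URL-to-boolean predicate and that `verbosity` is a small integer are assumptions of mine, not statements from the package's documentation:

```python
from scraped import download_site

def same_site(url: str) -> bool:
    """Assumed predicate form: keep only URLs under the starting domain."""
    return url.startswith("http://www.example.com")

download_site(
    "http://www.example.com",
    depth=3,                # follow links up to three levels deep
    filter_urls=same_site,  # assumption: a URL -> bool predicate
    verbosity=1,            # assumption: higher values print more progress
)
```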


            
