scrapy-ipfs-filecoin

- Name: scrapy-ipfs-filecoin
- Version: 0.0.3
- Home page: https://github.com/pawanpaudel93/scrapy-ipfs-filecoin
- Summary: Scrapy is a popular open-source and collaborative Python framework for extracting the data you need from websites. scrapy-ipfs-filecoin provides Scrapy pipelines and feed exports to store items on IPFS and Filecoin using services like Web3.Storage, LightHouse.Storage, Estuary, Pinata, Moralis, Filebase or any S3-compatible service.
- Upload time: 2022-12-14 18:54:54
- Author: Pawan Paudel
- Requires Python: >=3.0
- License: ISC

<p align="center"><img src="https://raw.githubusercontent.com/pawanpaudel93/scrapy-ipfs-filecoin/main/logo.png" alt="original" width="400" height="300"></p>

<h1 align="center">Welcome to Scrapy-IPFS-Filecoin</h1>
<p>
  <img alt="Version" src="https://img.shields.io/badge/version-0.0.3-blue.svg?cacheSeconds=2592000" />
</p>

Scrapy is a popular open-source and collaborative Python framework for extracting the data you need from websites. scrapy-ipfs-filecoin provides Scrapy pipelines and feed exports to store items on IPFS and Filecoin using services like [Web3.Storage](https://web3.storage/), [LightHouse.Storage](https://lighthouse.storage/), [Estuary](https://estuary.tech/), [Pinata](https://www.pinata.cloud/), [Moralis](https://moralis.io/), [Filebase](https://filebase.com/) or any S3-compatible service.

### 🏠 [Homepage](https://github.com/pawanpaudel93/scrapy-ipfs-filecoin)

## Install

```shell
npm install -g https://github.com/pawanpaudel93/ipfs-only-hash.git
```

```shell
pip install scrapy-ipfs-filecoin
```

## Example

[scrapy-ipfs-filecoin-example](https://github.com/pawanpaudel93/scrapy-ipfs-filecoin-example)

## Usage

1. Install ipfs-only-hash and scrapy-ipfs-filecoin.

 ```shell
 npm install -g https://github.com/pawanpaudel93/ipfs-only-hash.git
 ```

 ```shell
 pip install scrapy-ipfs-filecoin
 ```

2. Add `scrapy_ipfs_filecoin.pipelines.ImagesPipeline` and/or `scrapy_ipfs_filecoin.pipelines.FilesPipeline` to the `ITEM_PIPELINES` setting in your Scrapy project if you need to store images or other files on IPFS and Filecoin.
 For the Images Pipeline, use:

 ```python
 ITEM_PIPELINES = {'scrapy_ipfs_filecoin.pipelines.ImagesPipeline': 1}
 ```

 For the Files Pipeline, use:

 ```python
 ITEM_PIPELINES = {'scrapy_ipfs_filecoin.pipelines.FilesPipeline': 1}
 ```

 The advantage of using the ImagesPipeline for image files is that you can configure extra features such as thumbnail generation and filtering images by size.

 You can also use both the Files and Images Pipelines at the same time:

 ```python
 ITEM_PIPELINES = {
  'scrapy_ipfs_filecoin.pipelines.ImagesPipeline': 1,
  'scrapy_ipfs_filecoin.pipelines.FilesPipeline': 2
 }
 ```

 If you are using the ImagesPipeline, make sure to install the Pillow package. The Images Pipeline requires Pillow 7.1.0 or greater; it is used for thumbnailing and normalizing images to JPEG/RGB format.

 ```shell
 pip install pillow
 ```
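
 The thumbnail generation and size filtering mentioned above are configured through Scrapy's standard image settings; a minimal sketch in settings.py (the dimension values here are illustrative assumptions, not defaults):

 ```python
# Generate extra thumbnail copies of every stored image
# (thumbnail names and sizes below are illustrative).
IMAGES_THUMBS = {
    "small": (50, 50),
    "big": (270, 270),
}

# Skip images smaller than these dimensions (values are illustrative).
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
 ```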

 Then, configure the target storage setting to a valid value that will be used for storing the downloaded images. Otherwise the pipeline will remain disabled, even if you include it in the ITEM_PIPELINES setting.

 Set the store path for files or images for Web3Storage, LightHouse, Moralis, Pinata, Estuary, or an S3-compatible service as required.

 ```python
 # for ImagesPipeline
 IMAGES_STORE = 'w3s://images' # For Web3Storage
 IMAGES_STORE = 'es://images' # For Estuary
 IMAGES_STORE = 'lh://images' # For LightHouse
 IMAGES_STORE = 'pn://images' # For Pinata
 IMAGES_STORE = 'ms://images' # For Moralis
 IMAGES_STORE = "s3://bucket-name/images/"  # For Filebase or other s3 compatible services
 
 # For FilesPipeline
 FILES_STORE = 'w3s://files' # For Web3Storage
 FILES_STORE = 'es://files' # For Estuary
 FILES_STORE = 'lh://files' # For LightHouse
 FILES_STORE = 'pn://files' # For Pinata
 FILES_STORE = 'ms://files' # For Moralis
 FILES_STORE = "s3://bucket-name/files/"  # For Filebase or other s3 compatible services
 ```

 For more info regarding the ImagesPipeline and FilesPipeline, [see here](https://docs.scrapy.org/en/latest/topics/media-pipeline.html).
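
 Both pipelines follow Scrapy's media-pipeline field conventions: the FilesPipeline reads download URLs from a `file_urls` field and stores results in `files`, while the ImagesPipeline uses `image_urls` and `images`. A sketch of the item shape a spider would yield (plain dicts work as Scrapy items; the URLs are hypothetical placeholders):

 ```python
# The pipelines only care about the well-known field names below;
# the URLs are hypothetical placeholders, not real downloads.
item = {
    "title": "Example listing",
    "file_urls": ["https://example.com/report.pdf"],  # read by FilesPipeline
    "image_urls": ["https://example.com/cover.png"],  # read by ImagesPipeline
}
 ```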

3. For feed storage, to store the scraping output as json, jsonlines, jsonl, jl, csv, xml, marshal, pickle, etc., set FEED_STORAGES as follows for the desired output format:

 ```python
 from scrapy_ipfs_filecoin.feedexport import get_feed_storages
 FEED_STORAGES = get_feed_storages()
 ```

 Then set the API key for one of the storage services, i.e. Web3Storage, LightHouse, Moralis, Pinata or Estuary, and set FEEDS as follows to finally store the scraped data.

 For Web3Storage:

 ```python
 W3S_API_KEY = "<W3S_API_KEY>"

 FEEDS = {
  'w3s://house.json': {
   "format": "json"
  },
 }
 ```

 For LightHouse:

 ```python
 LH_API_KEY = "<LH_API_KEY>"

 FEEDS = {
  'lh://house.json': {
   "format": "json"
  },
 }
 ```

 For Estuary:

 ```python
 ES_API_KEY = "<ES_API_KEY>"

 FEEDS = {
  'es://house.json': {
   "format": "json"
  },
 }
 ```

 For Pinata:

 ```python
 PN_JWT_TOKEN = "<PN_JWT_TOKEN>"

 FEEDS = {
  'pn://house.json': {
   "format": "json"
  },
 }
 ```

 For Moralis:

 ```python
 MS_API_KEY = "<MS_API_KEY>"

 FEEDS = {
  'ms://house.json': {
   "format": "json"
  },
 }
 ```

 For Filebase or other S3-compatible services:

 The S3 feed storage requires botocore, so install it:

 ```shell
 pip install botocore
 ```

 ```python
 S3_ACCESS_KEY_ID = "<S3_ACCESS_KEY_ID>"
 S3_SECRET_ACCESS_KEY = "<S3_SECRET_ACCESS_KEY>"
 S3_ENDPOINT_URL = "https://s3.filebase.com"
 S3_IPFS_URL_FORMAT = "https://ipfs.filebase.io/ipfs/{cid}"

 FEEDS = {
  "s3://bucket-name/foldername/%(name)s_%(time)s.json": {"format": "json"},
  "s3://bucket-name/foldername/%(name)s_%(time)s.csv": {"format": "csv"},
 }
 ```

 See more on FEEDS [here](https://docs.scrapy.org/en/latest/topics/feed-exports.html#feeds).
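
 The `%(name)s` and `%(time)s` tokens in the feed URIs above are printf-style placeholders that Scrapy fills with the spider name and a scrape timestamp when the feed is stored; a rough illustration of the expansion (the parameter values below are made up, and the exact timestamp format is determined by Scrapy):

 ```python
# Scrapy expands printf-style placeholders in feed URIs; "name" is the
# spider name and "time" a timestamp. Values below are made-up stand-ins.
uri_template = "s3://bucket-name/foldername/%(name)s_%(time)s.json"
params = {"name": "houses", "time": "2022-12-14T18-54-54"}
expanded = uri_template % params
# expanded == "s3://bucket-name/foldername/houses_2022-12-14T18-54-54.json"
 ```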

4. Now perform the scraping as you would normally.

## Author

👤 **Pawan Paudel**

- Github: [@pawanpaudel93](https://github.com/pawanpaudel93)

## 🤝 Contributing

Contributions, issues and feature requests are welcome!<br />Feel free to check the [issues page](https://github.com/pawanpaudel93/scrapy-ipfs-filecoin/issues).

## Show your support

Give a ⭐️ if this project helped you!

Copyright © 2022 [Pawan Paudel](https://github.com/pawanpaudel93).<br />

            
