scrapyrunner

Name: scrapyrunner
Version: 0.0.8
Home page: https://github.com/hupe1980/scrapyrunner
Summary: A Python library to run Scrapy spiders directly from your code.
Upload time: 2025-01-13 21:08:55
Author: hupe1980
Requires Python: >=3.10
License: MIT
Keywords: scrapy, scraping, web scraping, scrapy wrapper

# ScrapyRunner

A Python library to run Scrapy spiders directly from your code.

## Overview

ScrapyRunner is a lightweight library that lets you run Scrapy spiders from your Python code, process scraped items with custom processors, and hook into Scrapy signals. It takes care of starting and managing the Scrapy crawler, so it slots into your existing Python workflows.

## Features

- Run Scrapy spiders directly from Python code.
- Process scraped items in batches with a custom processor.
- Manage Scrapy signals (e.g., on item scraped, on engine stopped); see the sketch after this list.
- Easy integration with the Scrapy framework.
- Asynchronous processing of items using Twisted.
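
ScrapyRunner connects these signals for you. For context, the sketch below shows the equivalent wiring in plain Scrapy, using only Scrapy's documented signals API (the spider and handler names are illustrative; this is not ScrapyRunner's internal code):

```python
import scrapy
from scrapy import signals
from scrapy.crawler import CrawlerProcess


class TitleSpider(scrapy.Spider):
    name = "titles"
    start_urls = ["https://example.org"]

    def parse(self, response):
        yield {"title": response.xpath("//title/text()").get()}


def on_item_scraped(item, response, spider):
    # Called once per scraped item.
    print("scraped:", item)


def on_engine_stopped():
    # Called when the Scrapy engine shuts down.
    print("engine stopped")


process = CrawlerProcess(settings={"LOG_LEVEL": "ERROR"})
crawler = process.create_crawler(TitleSpider)
crawler.signals.connect(on_item_scraped, signal=signals.item_scraped)
crawler.signals.connect(on_engine_stopped, signal=signals.engine_stopped)
process.crawl(crawler)
process.start()
```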

## Installation

To install ScrapyRunner, you can use `pip`:

```bash
pip install scrapyrunner
```

## Usage

### Example

```python
# Importing necessary libraries
from dataclasses import dataclass  # Used to create data classes for the processor
from time import sleep  # Used to simulate a delay during item processing

import scrapy  # Scrapy library for creating spiders

from scrapyrunner import ItemProcessor, ScrapyRunner  # Importing the custom Scrapy runner and processor classes


# Define the spider to crawl a webpage and extract data
class MySpider(scrapy.Spider):
    name = 'example'  # Name of the spider, used to identify it when running

    def parse(self, response):
        # This method is called to parse the response from the URL.
        # We extract the title of the page using XPath and return it as a dictionary.
        data = response.xpath("//title/text()").get()
        return {"title": data}  # Return the extracted title as a dictionary

# Define the item processor to process the items after they are scraped
@dataclass(kw_only=True)
class MyProcessor(ItemProcessor):
    prefix: str
    suffix: str

    def process_item(self, item: scrapy.Item) -> None:
        print(self.prefix, item, self.suffix)  # Print the processed item to the console
        # A simulated delay mimics real processing time. In a real-world scenario,
        # this could be a time-consuming task like data validation or saving to a database.
        sleep(2)  # Sleep for 2 seconds to simulate processing time

# Main block to execute the spider and processor
if __name__ == '__main__':
    # Create an instance of ScrapyRunner with the specified spider and processor.
    # ScrapyRunner will handle crawling and managing the queue for items.
    scrapy_runner = ScrapyRunner(
        spider=MySpider,
        processor=MyProcessor,
        processor_kwargs={"prefix": ">>>", "suffix": "<<<"},
    )

    # Run the Scrapy crawler, passing the start URLs to the spider.
    # The spider scrapes each URL and the processor handles the scraped items.
    scrapy_runner.run(start_urls=["https://example.org", "https://scrapy.org"])
```

### How it works

1. **Define a Spider**: In this example, `MySpider` extracts the title of a webpage; a slightly richer spider is sketched after this list.
2. **Define a Processor**: `MyProcessor` processes scraped items (here it simply sleeps for 2 seconds to simulate real processing).
3. **Run the ScrapyRunner**: The `ScrapyRunner` class is used to run the spider and process the items. The `run()` method triggers the scraping, and each item scraped is passed to the custom processor.
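
Step 1 generalizes to any Scrapy spider: a spider can yield several items per page and follow links to further pages. The sketch below shows a slightly richer spider (the selectors and the five-link cap are illustrative; only standard Scrapy APIs are used):

```python
import scrapy


class LinkTitleSpider(scrapy.Spider):
    name = "link-titles"

    def parse(self, response):
        # Yield one item for the current page ...
        yield {"url": response.url, "title": response.xpath("//title/text()").get()}
        # ... and follow a few links to scrape further pages.
        for href in response.xpath("//a/@href").getall()[:5]:
            yield response.follow(href, callback=self.parse)
```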

## Customization

### Custom Processor

To create your own custom processor:

1. Subclass `ItemProcessor`.
2. Override the `process_item()` method to handle scraped items.
3. Process each item as needed (e.g., save it to a database or perform additional transformations).

```python
import scrapy

from scrapyrunner import ItemProcessor


class MyCustomProcessor(ItemProcessor):
    def process_item(self, item: scrapy.Item) -> None:
        # Custom processing logic goes here
        print("Processing item:", item)
```
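
As a more concrete example, here is a processor that persists scraped titles to SQLite (a sketch: the `items.db` path, the `titles` table, and the assumption that each item carries a `title` field are illustrative, not part of ScrapyRunner):

```python
import sqlite3

import scrapy

from scrapyrunner import ItemProcessor


class SqliteProcessor(ItemProcessor):
    def process_item(self, item: scrapy.Item) -> None:
        conn = sqlite3.connect("items.db")  # illustrative database path
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute("CREATE TABLE IF NOT EXISTS titles (title TEXT)")
                conn.execute(
                    "INSERT INTO titles (title) VALUES (?)",
                    (dict(item).get("title"),),
                )
        finally:
            conn.close()
```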

### Custom Settings

You can pass custom Scrapy settings to `ScrapyRunner`:

```python
scrapy_settings = {
    "LOG_LEVEL": "DEBUG",
    "USER_AGENT": "MyCustomAgent",
    # Add more custom settings as needed
}

runner = ScrapyRunner(spider=MySpider, processor=MyCustomProcessor, scrapy_settings=scrapy_settings)
runner.run(start_urls=["https://example.org", "https://scrapy.org"])
```

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

            
