scrapy-vectors


Name: scrapy-vectors
Version: 0.2.0
Home page: None
Summary: Vector embeddings generation and storage for Scrapy
Upload time: 2025-09-04 00:19:43
Maintainer: None
Docs URL: None
Author: Kyle Kai Hang Tan
Requires Python: >=3.9
License: None
Keywords: scrapy, vectors, embeddings, s3, llm
# scrapy-vectors

[![Tests](https://github.com/kyleissuper/scrapy-vectors/actions/workflows/tests.yml/badge.svg)](https://github.com/kyleissuper/scrapy-vectors/actions/workflows/tests.yml)
[![Code Quality](https://github.com/kyleissuper/scrapy-vectors/actions/workflows/checks.yml/badge.svg)](https://github.com/kyleissuper/scrapy-vectors/actions/workflows/checks.yml)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: ISC](https://img.shields.io/badge/License-ISC-blue.svg)](https://opensource.org/licenses/ISC)
[![PyPI version](https://badge.fury.io/py/scrapy-vectors.svg)](https://badge.fury.io/py/scrapy-vectors)

Vector embeddings generation and storage for Scrapy spiders.

## Features

- **Embeddings Pipeline**: Generate vector embeddings using LiteLLM (supports OpenAI, Cohere, and other providers)
- **S3 Vectors Storage**: Store embeddings in AWS S3 Vectors service

## Installation

```bash
pip install scrapy-vectors
```

## Quick Start

In your Scrapy project's settings file (typically `settings.py`):
```python
ITEM_PIPELINES = {
    # Outputs as jsonlines in Pinecone format, which s3-vectors can use
    "scrapy_vectors.EmbeddingsLiteLLMPipeline": 300,
}
EXTENSIONS = {
    "scrapy.extensions.feedexport.FeedExporter": None,  # Disable standard
    "scrapy_vectors.S3VectorsFeedExporter": 300,        # Use custom
}
FEED_STORAGES = {
    "s3-vectors": "scrapy_vectors.S3VectorsFeedStorage",
}
FEEDS = {
    "s3-vectors://vectors-bucket/vectors-index": {
        "format": "jsonlines",
        "batch_item_count": 100,
    }
}

# LiteLLM routes the request to the right provider based on the model name
LITELLM_API_KEY = "your_provider_api_key"          # e.g. your OpenAI API key
LITELLM_EMBEDDING_MODEL = "text-embedding-3-small" # default when unspecified

AWS_REGION_NAME = "us-east-1"
AWS_ACCESS_KEY_ID = "access_key_id"
AWS_SECRET_ACCESS_KEY = "access_key"
```

In your scraper:
```python
import scrapy


class MySpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com"]

    # Yielded items must include: id, page_content, and metadata
    def parse(self, response):
        yield {
            "id": response.url,
            "page_content": response.css("article::text").get(),
            "metadata": {
                "title": response.css("h1::text").get(),
                "url": response.url,
            }
        }
```
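
With the settings above, each item the spider yields is embedded by the pipeline and written out as one JSON line per item in Pinecone-style format. As a rough illustration (assuming the usual Pinecone record shape of `id`, `values`, and `metadata`; the exact field names and vector length depend on the pipeline and the embedding model), an exported record resembles:

```python
# Hypothetical exported record, shown as a Python dict for readability.
# The real vector has as many dimensions as the model produces
# (1536 for text-embedding-3-small); it is truncated here.
record = {
    "id": "https://example.com",
    "values": [0.0123, -0.0456, 0.0789],
    "metadata": {
        "title": "Example Domain",
        "url": "https://example.com",
    },
}
```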

## Configuration

### Embeddings Pipeline Settings

- `LITELLM_API_KEY`: API key for your embedding provider (required)
- `LITELLM_EMBEDDING_MODEL`: Model to use (default: OpenAI's `text-embedding-3-small`)
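
The pipeline's internals aren't reproduced here, but generating an embedding through LiteLLM amounts to a single `litellm.embedding()` call. A minimal sketch, assuming an OpenAI-style response and a hypothetical `embed_item` helper (not part of this package):

```python
import litellm


def embed_item(item, api_key, model="text-embedding-3-small"):
    """Sketch: attach an embedding of page_content to a scraped item."""
    response = litellm.embedding(
        model=model,
        input=[item["page_content"]],
        api_key=api_key,
    )
    # LiteLLM returns an OpenAI-style response: a list of embeddings under .data
    item["values"] = response.data[0]["embedding"]
    return item
```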

### S3 Vectors Storage Settings

- `AWS_REGION_NAME`: AWS region (required)
- `AWS_ACCESS_KEY_ID`: AWS access key
- `AWS_SECRET_ACCESS_KEY`: AWS secret key
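
Under the hood, batches are written to the configured S3 Vectors bucket and index. The sketch below only illustrates how such an upload could be done with boto3's `s3vectors` client; the parameter names follow the S3 Vectors `PutVectors` operation and are an assumption, not this package's actual code:

```python
import boto3


def put_batch(records, bucket="vectors-bucket", index="vectors-index",
              region="us-east-1"):
    """Sketch: upload Pinecone-style records to an S3 Vectors index."""
    client = boto3.client("s3vectors", region_name=region)
    # Assumed request shape for the S3 Vectors PutVectors operation;
    # check the boto3 documentation for the exact parameters.
    client.put_vectors(
        vectorBucketName=bucket,
        indexName=index,
        vectors=[
            {
                "key": r["id"],
                "data": {"float32": r["values"]},
                "metadata": r.get("metadata", {}),
            }
            for r in records
        ],
    )
```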

            
