scrapy-vectors


- Name: scrapy-vectors
- Version: 0.1.0
- Summary: Vector embeddings generation and storage for Scrapy
- Upload time: 2025-08-23 00:16:26
- Author: Kyle Kai Hang Tan
- Requires Python: >=3.9
- Keywords: scrapy, vectors, embeddings, s3, llm
# scrapy-vectors

[![Tests](https://github.com/kyleissuper/scrapy-vectors/actions/workflows/tests.yml/badge.svg)](https://github.com/kyleissuper/scrapy-vectors/actions/workflows/tests.yml)
[![Code Quality](https://github.com/kyleissuper/scrapy-vectors/actions/workflows/checks.yml/badge.svg)](https://github.com/kyleissuper/scrapy-vectors/actions/workflows/checks.yml)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: ISC](https://img.shields.io/badge/License-ISC-blue.svg)](https://opensource.org/licenses/ISC)
[![PyPI version](https://badge.fury.io/py/scrapy-vectors.svg)](https://badge.fury.io/py/scrapy-vectors)

Vector embeddings generation and storage for Scrapy spiders.

## Features

- **Embeddings Pipeline**: Generate vector embeddings using LiteLLM (supports OpenAI, Cohere, and other providers)
- **S3 Vectors Storage**: Store embeddings in the AWS S3 Vectors service

## Installation

```bash
pip install scrapy-vectors
```

## Quick Start

In your Scrapy settings file (`scrapy_settings.py` here; a default Scrapy project names it `settings.py`):
```python
ITEM_PIPELINES = {
    # Outputs as jsonlines in Pinecone format, which s3-vectors can use
    "scrapy_vectors.EmbeddingsLiteLLMPipeline": 300,
}
FEED_STORAGES = {
    "s3-vectors": "scrapy_vectors.S3VectorsFeedStorage",
}
FEEDS = {
    "s3-vectors://vectors-bucket/vectors-index": {
        "format": "jsonlines",
    }
}

# LiteLLM routes the request to the provider that matches the model name
LITELLM_API_KEY = "your_provider_api_key"           # e.g. an OpenAI API key
LITELLM_EMBEDDING_MODEL = "text-embedding-3-small"  # default when unspecified

AWS_REGION_NAME = "us-east-1"
AWS_ACCESS_KEY_ID = "access_key_id"
AWS_SECRET_ACCESS_KEY = "secret_access_key"
```
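
The placeholder strings above are fine for a first run, but keys hardcoded in a settings file are easy to leak. A minimal sketch, assuming the same setting names, that reads them from environment variables instead. The environment variable names and the `env_setting` helper are illustrative choices, not something scrapy-vectors defines:

```python
import os


def env_setting(name, default=None):
    """Return the environment variable `name`, or `default` if unset or blank."""
    value = os.environ.get(name, "").strip()
    return value or default


# Hypothetical environment variable names; use whatever your deployment provides.
LITELLM_API_KEY = env_setting("LITELLM_API_KEY")
LITELLM_EMBEDDING_MODEL = env_setting("LITELLM_EMBEDDING_MODEL", "text-embedding-3-small")

AWS_REGION_NAME = env_setting("AWS_REGION_NAME", "us-east-1")
AWS_ACCESS_KEY_ID = env_setting("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = env_setting("AWS_SECRET_ACCESS_KEY")
```

Settings left as `None` will need to be supplied some other way before the crawl runs.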

In your scraper:
```python
import scrapy


class MySpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com"]
    
    # Each yielded item must include: id, page_content, and metadata
    def parse(self, response):
        yield {
            "id": response.url,
            "page_content": response.css("article::text").get(),
            "metadata": {
                "title": response.css("h1::text").get(),
                "url": response.url,
            }
        }
```
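
Because `.get()` returns `None` when a selector matches nothing, an item can end up without usable `page_content`. A small sketch of a check you could run before yielding. `missing_fields` is a hypothetical helper, not part of scrapy-vectors, and how the pipeline itself treats incomplete items is not documented here:

```python
REQUIRED_FIELDS = ("id", "page_content", "metadata")


def missing_fields(item):
    """Return the required keys that are absent, None, or blank in `item`."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = item.get(field)
        # Treat None and whitespace-only strings as missing.
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing
```

In `parse`, you could build the item as a dict first and only `yield` it when `missing_fields(item)` returns an empty list.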

## Configuration

### Embeddings Pipeline Settings

- `LITELLM_API_KEY`: API key for your embedding provider (required)
- `LITELLM_EMBEDDING_MODEL`: Model to use (default: OpenAI's `text-embedding-3-small`)
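
The Quick Start comment says the pipeline writes JSON Lines in Pinecone format. As a rough illustration of what one such line could look like, the sketch below builds a record with `id`, `values`, and `metadata` keys, which is the usual shape of a Pinecone vector record. The exact keys scrapy-vectors emits are an assumption here, `to_vector_line` is a hypothetical helper, and the three-number vector stands in for a real embedding:

```python
import json


def to_vector_line(item, embedding):
    """Serialize one item plus its embedding as a single JSON Lines record."""
    record = {
        "id": item["id"],
        "values": [float(x) for x in embedding],  # assumed key name for the vector
        "metadata": dict(item.get("metadata") or {}),
    }
    return json.dumps(record)


item = {
    "id": "https://example.com",
    "page_content": "Hello",
    "metadata": {"title": "Example"},
}
line = to_vector_line(item, [0.1, 0.2, 0.3])  # placeholder vector, not a real embedding
```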

### S3 Vectors Storage Settings

- `AWS_REGION_NAME`: AWS region (required)
- `AWS_ACCESS_KEY_ID`: AWS access key ID
- `AWS_SECRET_ACCESS_KEY`: AWS secret access key
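
The feed URI in the Quick Start, `s3-vectors://vectors-bucket/vectors-index`, reads as a bucket name followed by an index name. That reading is an assumption based on the placeholder names, and `split_feed_uri` below is a hypothetical helper (not the library's own parser) that shows how such a URI decomposes with the standard library:

```python
from urllib.parse import urlparse


def split_feed_uri(uri):
    """Split an s3-vectors:// feed URI into (bucket, index)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3-vectors":
        raise ValueError(f"unexpected scheme: {parsed.scheme!r}")
    bucket = parsed.netloc            # assumed to be the vector bucket name
    index = parsed.path.strip("/")    # assumed to be the vector index name
    if not bucket or not index:
        raise ValueError("expected s3-vectors://<bucket>/<index>")
    return bucket, index
```

Running a check like this on your `FEEDS` keys before a crawl can catch a malformed URI early.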

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "scrapy-vectors",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "scrapy, vectors, embeddings, s3, llm",
    "author": "Kyle Kai Hang Tan",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/ab/4e/803d8c4c44038688b365a6682b4b038b96ac47c2448514de095940399c53/scrapy_vectors-0.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Vector embeddings generation and storage for Scrapy",
    "version": "0.1.0",
    "project_urls": {
        "Homepage": "https://github.com/kyleissuper/scrapy-vectors",
        "Issues": "https://github.com/kyleissuper/scrapy-vectors/issues",
        "Repository": "https://github.com/kyleissuper/scrapy-vectors"
    },
    "split_keywords": [
        "scrapy",
        " vectors",
        " embeddings",
        " s3",
        " llm"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "09861b46ae1facbedcba84ac85292b3e045f86d57b05dadd685a0873becd74ad",
                "md5": "c8d3c2515b8ef97e4aa6ca89f0938355",
                "sha256": "5c09d5ccc19a59e1df717c5c0ee231bf066f3085c6f22c77e2c05b12e592f18c"
            },
            "downloads": -1,
            "filename": "scrapy_vectors-0.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "c8d3c2515b8ef97e4aa6ca89f0938355",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 6114,
            "upload_time": "2025-08-23T00:16:25",
            "upload_time_iso_8601": "2025-08-23T00:16:25.293525Z",
            "url": "https://files.pythonhosted.org/packages/09/86/1b46ae1facbedcba84ac85292b3e045f86d57b05dadd685a0873becd74ad/scrapy_vectors-0.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "ab4e803d8c4c44038688b365a6682b4b038b96ac47c2448514de095940399c53",
                "md5": "dfd5d48a25ba6aae8e3903b421df5745",
                "sha256": "35ad718a2f2a84bf01ed31085d08dabef3a417d3d1143bbc04e298f869b031ed"
            },
            "downloads": -1,
            "filename": "scrapy_vectors-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "dfd5d48a25ba6aae8e3903b421df5745",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 6641,
            "upload_time": "2025-08-23T00:16:26",
            "upload_time_iso_8601": "2025-08-23T00:16:26.604045Z",
            "url": "https://files.pythonhosted.org/packages/ab/4e/803d8c4c44038688b365a6682b4b038b96ac47c2448514de095940399c53/scrapy_vectors-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-23 00:16:26",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "kyleissuper",
    "github_project": "scrapy-vectors",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "tox": true,
    "lcname": "scrapy-vectors"
}
        