# scrapy-vectors

[Tests](https://github.com/kyleissuper/scrapy-vectors/actions/workflows/tests.yml)
[Checks](https://github.com/kyleissuper/scrapy-vectors/actions/workflows/checks.yml)
[Python](https://www.python.org/downloads/)
[License: ISC](https://opensource.org/licenses/ISC)
[PyPI](https://badge.fury.io/py/scrapy-vectors)

Vector embeddings generation and storage for Scrapy spiders.
## Features
- **Embeddings Pipeline**: Generate vector embeddings using LiteLLM (supports OpenAI, Cohere, and other providers)
- **S3 Vectors Storage**: Store embeddings in an Amazon S3 Vectors index
## Installation
```bash
pip install scrapy-vectors
```
## Quick Start
In your `scrapy_settings.py`:
```python
ITEM_PIPELINES = {
    # Outputs as jsonlines in Pinecone format, which s3-vectors can use
    "scrapy_vectors.EmbeddingsLiteLLMPipeline": 300,
}
FEED_STORAGES = {
    "s3-vectors": "scrapy_vectors.S3VectorsFeedStorage",
}
FEEDS = {
    "s3-vectors://vectors-bucket/vectors-index": {
        "format": "jsonlines",
    }
}

# LiteLLM will route for you
LITELLM_API_KEY = "your_provider_api_key"  # (e.g. OpenAI API Key)
LITELLM_EMBEDDING_MODEL = "text-embedding-3-small"  # This is default when unspecified

AWS_REGION_NAME = "us-east-1"
AWS_ACCESS_KEY_ID = "access_key_id"
AWS_SECRET_ACCESS_KEY = "access_key"
```
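
The feed URI in the example reads naturally as a vector bucket name followed by an index name. That reading is an assumption, since the README does not spell it out. Under that assumption, a minimal sketch of how such a URI decomposes (`split_feed_uri` is a hypothetical helper, not part of `scrapy_vectors`):

```python
from urllib.parse import urlparse


def split_feed_uri(uri):
    """Split an s3-vectors feed URI into (bucket, index).

    Hypothetical helper: assumes the layout s3-vectors://<bucket>/<index>.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3-vectors":
        raise ValueError(f"unexpected scheme: {parsed.scheme!r}")
    bucket = parsed.netloc
    index = parsed.path.lstrip("/")
    if not bucket or not index:
        raise ValueError("expected s3-vectors://<bucket>/<index>")
    return bucket, index


bucket, index = split_feed_uri("s3-vectors://vectors-bucket/vectors-index")
print(bucket, index)
```

Running a check like this on your `FEEDS` keys before starting a crawl can catch a malformed URI early.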
In your spider:
```python
import scrapy


class MySpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com"]

    # Each yielded item must include: id, page_content, and metadata
    def parse(self, response):
        yield {
            "id": response.url,
            "page_content": response.css("article::text").get(),
            "metadata": {
                "title": response.css("h1::text").get(),
                "url": response.url,
            }
        }
```
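
Because `.get()` returns `None` when a selector matches nothing, it can help to check the three required fields before an item reaches the pipeline. The sketch below does that and then shows the record shape one would expect from "Pinecone format" (`id`, `values`, `metadata`). Both helpers are hypothetical, and the exact output of `EmbeddingsLiteLLMPipeline`, including whether `page_content` is carried into the record, is an assumption here rather than something the README confirms.

```python
import json

REQUIRED_FIELDS = ("id", "page_content", "metadata")


def validate_item(item):
    """Raise ValueError if a scraped item lacks a required field (hypothetical helper)."""
    missing = [key for key in REQUIRED_FIELDS if item.get(key) is None]
    if missing:
        raise ValueError(f"item is missing required fields: {missing}")
    return item


def to_vector_record(item, embedding):
    """Build a Pinecone-style record. The shape is assumed, not taken from the library."""
    return {
        "id": item["id"],
        "values": [float(x) for x in embedding],
        "metadata": dict(item["metadata"]),
    }


item = validate_item({
    "id": "https://example.com",
    "page_content": "Example Domain",
    "metadata": {"title": "Example Domain", "url": "https://example.com"},
})
# Placeholder three-dimensional vector; real embeddings are much longer.
print(json.dumps(to_vector_record(item, [0.1, 0.2, 0.3])))
```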
## Configuration
### Embeddings Pipeline Settings
- `LITELLM_API_KEY`: API key for your embedding provider (required)
- `LITELLM_EMBEDDING_MODEL`: Model to use (default: OpenAI's `text-embedding-3-small`)
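
For orientation, these two settings map onto the arguments of LiteLLM's `embedding()` function. The following is a minimal sketch of such a call, not the pipeline's actual code: `embed_texts` and its `embed_fn` parameter are hypothetical names, and the injectable function exists only so the sketch can be exercised without network access.

```python
def embed_texts(texts, model="text-embedding-3-small", api_key=None, embed_fn=None):
    """Return one embedding (a list of floats) per input text.

    Hypothetical helper. When embed_fn is omitted it falls back to
    litellm.embedding, which requires `pip install litellm` and a valid key.
    """
    if embed_fn is None:
        import litellm  # imported lazily so the sketch loads without litellm

        embed_fn = litellm.embedding
    response = embed_fn(model=model, input=texts, api_key=api_key)
    # The response follows the OpenAI embeddings layout: data[i]["embedding"].
    return [row["embedding"] for row in response["data"]]


# Usage with a real provider (needs network access and an API key):
# vectors = embed_texts(["hello world"], api_key="your_provider_api_key")
```

Switching providers should then only require changing the model name and key, which is the routing behavior the Quick Start comment refers to.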
### S3 Vectors Storage Settings
- `AWS_REGION_NAME`: AWS region (required)
- `AWS_ACCESS_KEY_ID`: AWS access key ID
- `AWS_SECRET_ACCESS_KEY`: AWS secret access key
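
Since a Scrapy settings file is ordinary Python, the credentials above do not have to be hard-coded. A minimal sketch that reads them from environment variables follows. The environment variable names are an assumption, chosen to mirror the setting names. The README does not say how the storage behaves when the two AWS key settings are left unset.

```python
import os

# Fall back to the region used in the Quick Start when the variable is unset.
AWS_REGION_NAME = os.environ.get("AWS_REGION_NAME", "us-east-1")

# These evaluate to None when the corresponding variable is not defined.
AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY")
LITELLM_API_KEY = os.environ.get("LITELLM_API_KEY")
```

This keeps secrets out of version control while leaving the rest of the configuration unchanged.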

## Raw data
{
"_id": null,
"home_page": null,
"name": "scrapy-vectors",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "scrapy, vectors, embeddings, s3, llm",
"author": "Kyle Kai Hang Tan",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/ab/4e/803d8c4c44038688b365a6682b4b038b96ac47c2448514de095940399c53/scrapy_vectors-0.1.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "Vector embeddings generation and storage for Scrapy",
"version": "0.1.0",
"project_urls": {
"Homepage": "https://github.com/kyleissuper/scrapy-vectors",
"Issues": "https://github.com/kyleissuper/scrapy-vectors/issues",
"Repository": "https://github.com/kyleissuper/scrapy-vectors"
},
"split_keywords": [
"scrapy",
" vectors",
" embeddings",
" s3",
" llm"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "09861b46ae1facbedcba84ac85292b3e045f86d57b05dadd685a0873becd74ad",
"md5": "c8d3c2515b8ef97e4aa6ca89f0938355",
"sha256": "5c09d5ccc19a59e1df717c5c0ee231bf066f3085c6f22c77e2c05b12e592f18c"
},
"downloads": -1,
"filename": "scrapy_vectors-0.1.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "c8d3c2515b8ef97e4aa6ca89f0938355",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 6114,
"upload_time": "2025-08-23T00:16:25",
"upload_time_iso_8601": "2025-08-23T00:16:25.293525Z",
"url": "https://files.pythonhosted.org/packages/09/86/1b46ae1facbedcba84ac85292b3e045f86d57b05dadd685a0873becd74ad/scrapy_vectors-0.1.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "ab4e803d8c4c44038688b365a6682b4b038b96ac47c2448514de095940399c53",
"md5": "dfd5d48a25ba6aae8e3903b421df5745",
"sha256": "35ad718a2f2a84bf01ed31085d08dabef3a417d3d1143bbc04e298f869b031ed"
},
"downloads": -1,
"filename": "scrapy_vectors-0.1.0.tar.gz",
"has_sig": false,
"md5_digest": "dfd5d48a25ba6aae8e3903b421df5745",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 6641,
"upload_time": "2025-08-23T00:16:26",
"upload_time_iso_8601": "2025-08-23T00:16:26.604045Z",
"url": "https://files.pythonhosted.org/packages/ab/4e/803d8c4c44038688b365a6682b4b038b96ac47c2448514de095940399c53/scrapy_vectors-0.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-23 00:16:26",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "kyleissuper",
"github_project": "scrapy-vectors",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"tox": true,
"lcname": "scrapy-vectors"
}