# scrapy-vectors
Vector embeddings generation and storage for Scrapy spiders.
## Features
- **Embeddings Pipeline**: Generate vector embeddings using LiteLLM (supports OpenAI, Cohere, and other providers)
- **S3 Vectors Storage**: Store embeddings in AWS S3 Vectors service
## Installation
```bash
pip install scrapy-vectors
```
## Quick Start
In your project's Scrapy `settings.py`:
```python
ITEM_PIPELINES = {
    # Outputs jsonlines in Pinecone format, which S3 Vectors can ingest
    "scrapy_vectors.EmbeddingsLiteLLMPipeline": 300,
}

EXTENSIONS = {
    "scrapy.extensions.feedexport.FeedExporter": None,  # Disable the standard exporter
    "scrapy_vectors.S3VectorsFeedExporter": 300,        # Use the custom one
}

FEED_STORAGES = {
    "s3-vectors": "scrapy_vectors.S3VectorsFeedStorage",
}

FEEDS = {
    "s3-vectors://vectors-bucket/vectors-index": {
        "format": "jsonlines",
        "batch_item_count": 100,
    }
}
# LiteLLM routes to the right provider for you
LITELLM_API_KEY = "your_provider_api_key"  # e.g. an OpenAI API key
LITELLM_EMBEDDING_MODEL = "text-embedding-3-small"  # Default when unspecified
AWS_REGION_NAME = "us-east-1"
AWS_ACCESS_KEY_ID = "access_key_id"
AWS_SECRET_ACCESS_KEY = "access_key"
```
In your spider:
```python
import scrapy


class MySpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com"]

    # Each yielded item must include: id, page_content, and metadata
    def parse(self, response):
        yield {
            "id": response.url,
            "page_content": response.css("article::text").get(),
            "metadata": {
                "title": response.css("h1::text").get(),
                "url": response.url,
            },
        }
```
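For orientation, each item the spider yields is enriched with an embedding and written as one jsonlines record. A minimal sketch of what such a record might look like, assuming the common Pinecone record convention (`id` / `values` / `metadata`) — the exact schema emitted by `EmbeddingsLiteLLMPipeline` may differ:

```python
import json

# Hypothetical record after the pipeline attaches an embedding vector.
# The field names follow the Pinecone convention referenced above; the
# vector here is a truncated placeholder, not real model output.
record = {
    "id": "https://example.com",
    "values": [0.013, -0.021, 0.044],  # embedding vector (truncated)
    "metadata": {
        "title": "Example Domain",
        "url": "https://example.com",
    },
}

# One record per line, as in a jsonlines feed
line = json.dumps(record)
print(line)
```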
## Configuration
### Embeddings Pipeline Settings
- `LITELLM_API_KEY`: API key for your embedding provider (required)
- `LITELLM_EMBEDDING_MODEL`: Model to use (default: OpenAI's `text-embedding-3-small`)
### S3 Vectors Storage Settings
- `AWS_REGION_NAME`: AWS region (required)
- `AWS_ACCESS_KEY_ID`: AWS access key
- `AWS_SECRET_ACCESS_KEY`: AWS secret key
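Hardcoding credentials in `settings.py` is fine for a quick start, but in practice you would typically read them from the environment. A minimal sketch — the setting names are the ones listed above, while the environment variable names and fallbacks are assumptions for illustration:

```python
import os

# Pull provider and AWS credentials from the environment instead of
# committing them to settings.py. Defaults here are placeholders.
LITELLM_API_KEY = os.environ.get("LITELLM_API_KEY", "")
LITELLM_EMBEDDING_MODEL = os.environ.get(
    "LITELLM_EMBEDDING_MODEL", "text-embedding-3-small"
)

AWS_REGION_NAME = os.environ.get("AWS_REGION_NAME", "us-east-1")
AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID", "")
AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY", "")
```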