# Apify-Haystack integration
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://github.com/apify/apify-haystack/blob/main/LICENSE)
[![PyPi Package](https://badge.fury.io/py/apify-haystack.svg)](https://badge.fury.io/py/apify-haystack)
[![Python](https://img.shields.io/pypi/pyversions/apify-haystack)](https://pypi.org/project/apify-haystack)
The Apify-Haystack integration allows easy interaction between the [Apify](https://apify.com/) platform and [Haystack](https://haystack.deepset.ai/).
Apify is a platform for web scraping, data extraction, and web automation tasks.
It provides serverless applications called Actors for tasks such as crawling websites and scraping Facebook, Instagram, and Google results.
Haystack offers an ecosystem of tools for building, managing, and deploying search engines and LLM applications.
## Installation
Apify-haystack is available as the [`apify-haystack`](https://pypi.org/project/apify-haystack/) package on PyPI and requires Python 3.9 or later.
```sh
pip install apify-haystack
```
## Examples
### Crawl a website using Apify's Website Content Crawler and convert it to Haystack Documents
You need to have an Apify account and API token to run this example.
You can start with a free account at [Apify](https://apify.com/) and get your [API token](https://docs.apify.com/platform/integrations/api).
In the example below, set the `APIFY_API_TOKEN` environment variable (directly or in a `.env` file) and run the script:
```python
import os

from dotenv import load_dotenv
from haystack import Document

from apify_haystack import ApifyDatasetFromActorCall

# Load APIFY_API_TOKEN from a .env file, or export it in your shell beforehand
load_dotenv()
if not os.getenv("APIFY_API_TOKEN"):
    raise RuntimeError("Set APIFY_API_TOKEN in the environment or in a .env file")

actor_id = "apify/website-content-crawler"
run_input = {
    "maxCrawlPages": 3,  # limit the number of pages to crawl
    "startUrls": [{"url": "https://haystack.deepset.ai/"}],
}


def dataset_mapping_function(dataset_item: dict) -> Document:
    # Map one dataset item to a Haystack Document: page text as content, URL as metadata
    return Document(content=dataset_item.get("text"), meta={"url": dataset_item.get("url")})


actor = ApifyDatasetFromActorCall(
    actor_id=actor_id, run_input=run_input, dataset_mapping_function=dataset_mapping_function
)
print(f"Calling the Apify Actor {actor_id} ... crawling will take some time ...")
print("You can monitor the progress at: https://console.apify.com/actors/runs")

# run() returns a dictionary; the converted Documents are under the "documents" key
dataset = actor.run().get("documents")

print(f"Loaded {len(dataset)} documents from the Apify Actor {actor_id}:")
for d in dataset:
    print(d)
```
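The `dataset_mapping_function` decides how each dataset item becomes a document, so it can be worth trying the mapping on a few hand-written items before starting a crawl. The sketch below is not part of `apify-haystack`. To stay runnable without Haystack or an Apify token, it uses a hypothetical `Doc` dataclass as a stand-in for `haystack.Document`. The sample items and the `map_items` helper are invented for illustration, and the sketch assumes that some items may have no `text` field, in which case they are skipped.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Doc:
    """Hypothetical stand-in for haystack.Document (content and meta only)."""

    content: Optional[str]
    meta: dict = field(default_factory=dict)


def dataset_mapping_function(dataset_item: dict) -> Doc:
    # Same field mapping as in the example: "text" -> content, "url" -> meta
    return Doc(content=dataset_item.get("text"), meta={"url": dataset_item.get("url")})


def map_items(items: list) -> list:
    """Hypothetical helper: map every item and drop those without any text."""
    docs = [dataset_mapping_function(item) for item in items]
    return [d for d in docs if d.content]


# Invented sample items using the "url" and "text" keys from the example
sample_items = [
    {"url": "https://example.com/a", "text": "Page A"},
    {"url": "https://example.com/b"},  # no "text" key
    {"url": "https://example.com/c", "text": ""},  # empty text
]

docs = map_items(sample_items)
for d in docs:
    print(d.meta["url"], "->", d.content)
```

In a real pipeline the same function body would return `haystack.Document` instead of `Doc`; whether to drop or keep empty items is a design choice that depends on what the downstream components expect.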
### More examples
See the [examples directory](https://github.com/apify/apify-haystack/blob/master/src/apify_haystack/examples) for more examples. Here are a few of them:
- Load a dataset from Apify and convert it to a Haystack Document
- Call [Website Content Crawler](https://apify.com/apify/website-content-crawler) and convert the data into Haystack Documents
- Crawl websites, retrieve text content, and store it in the `InMemoryDocumentStore`
- Retrieval-Augmented Generation (RAG): Extracting text from a website & question answering <a href="https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/apify_haystack_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
- Analyze Your Instagram Comments’ Vibe with Apify and Haystack <a href="https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/apify_haystack_instagram_comments_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
## Support
If you find any bug or issue, please [submit an issue on GitHub](https://github.com/apify/apify-haystack/issues).
For questions, you can ask on [Stack Overflow](https://stackoverflow.com/questions/tagged/apify) or in GitHub Discussions, or join our [Discord server](https://discord.com/invite/jyEM2PRvMU).
## Contributing
Your code contributions are welcome.
If you have any ideas for improvements, either submit an issue or create a pull request.
For contribution guidelines and the code of conduct, see [CONTRIBUTING.md](https://github.com/apify/apify-haystack/blob/master/CONTRIBUTING.md).
## License
This project is licensed under the Apache License 2.0 - see the [LICENSE](https://github.com/apify/apify-haystack/blob/master/LICENSE) file for details.