# Apify Loaders
```bash
pip install llama-index-readers-apify
```
## Apify Actor Loader
[Apify](https://apify.com/) is a cloud platform for web scraping and data extraction,
which provides an [ecosystem](https://apify.com/store) of more than a thousand
ready-made apps called _Actors_ for various scraping, crawling, and extraction use cases.
This loader runs a specific Actor and loads its results.
## Usage
In this example, we’ll use the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor,
which can deeply crawl websites such as documentation, knowledge bases, help centers,
or blogs, and extract text content from the web pages.
The extracted text can then be fed to a vector index or a language model like GPT
to answer questions about it.
To use this loader, you need to have a (free) Apify account
and set your [Apify API token](https://console.apify.com/account/integrations) in the code.
```python
from llama_index.core import Document
from llama_index.readers.apify import ApifyActor
reader = ApifyActor("<My Apify API token>")
documents = reader.load_data(
    actor_id="apify/website-content-crawler",
    run_input={
        "startUrls": [{"url": "https://docs.llamaindex.ai/en/latest/"}]
    },
    dataset_mapping_function=lambda item: Document(
        text=item.get("text"),
        metadata={
            "url": item.get("url"),
        },
    ),
)
```
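Each item in the Actor's default dataset is a plain JSON object, and the `dataset_mapping_function` above turns one item into a `Document`. As a minimal stand-alone sketch of that transformation, here is the same mapping written with a plain dict in place of `Document`, so it runs without LlamaIndex installed (the sample item is hypothetical, shaped like Website Content Crawler output):

```python
# Hypothetical sample item, shaped like one record of
# Website Content Crawler's output dataset.
sample_item = {
    "url": "https://docs.llamaindex.ai/en/latest/",
    "text": "LlamaIndex is a data framework for LLM applications.",
}


def map_item(item: dict) -> dict:
    # Mirrors the dataset_mapping_function above: the crawled page text
    # becomes the document body, and the source URL is kept as metadata.
    return {
        "text": item.get("text"),
        "metadata": {"url": item.get("url")},
    }


doc = map_item(sample_item)
print(doc["metadata"]["url"])  # → https://docs.llamaindex.ai/en/latest/
```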
This loader is designed to load data into
[LlamaIndex](https://github.com/run-llama/llama_index/tree/main/llama_index) and/or to be
used subsequently as a Tool in a [LangChain](https://github.com/hwchase17/langchain) Agent.
## Apify Dataset Loader
[Apify](https://apify.com/) is a cloud platform for web scraping and data extraction,
which provides an [ecosystem](https://apify.com/store) of more than a thousand
ready-made apps called _Actors_ for various scraping, crawling, and extraction use cases.
This loader loads documents from an existing [Apify dataset](https://docs.apify.com/platform/storage/dataset).
## Usage
In this example, we’ll load a dataset generated by
the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor,
which can deeply crawl websites such as documentation, knowledge bases, help centers,
or blogs, and extract text content from the web pages.
The extracted text can then be fed to a vector index or a language model like GPT
to answer questions about it.
To use this loader, you need to have a (free) Apify account
and set your [Apify API token](https://console.apify.com/account/integrations) in the code.
```python
from llama_index.core import Document
from llama_index.readers.apify import ApifyDataset
reader = ApifyDataset("<Your Apify API token>")
documents = reader.load_data(
    dataset_id="<Apify Dataset ID>",
    dataset_mapping_function=lambda item: Document(
        text=item.get("text"),
        metadata={
            "url": item.get("url"),
        },
    ),
)
```
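Crawler datasets can contain items with an empty or missing `"text"` field, and `item.get("text")` then returns `None`. A defensive variant of the mapping function (a sketch, not part of the reader's API; shown with a plain dict in place of `Document`) substitutes an empty string so such items can be filtered out before indexing:

```python
def safe_map(item: dict) -> dict:
    # Fall back to an empty string when an item has no usable "text"
    # field, since a None document body is usually rejected downstream.
    return {
        "text": item.get("text") or "",
        "metadata": {"url": item.get("url", "")},
    }


# Hypothetical items: one normal page, one with no extracted text.
items = [
    {"url": "https://example.com/a", "text": "Page content."},
    {"url": "https://example.com/b"},
]

# Keep only documents that actually carry text.
docs = [d for d in map(safe_map, items) if d["text"]]
print(len(docs))  # → 1
```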