# Apify Loaders
## Apify Actor Loader
[Apify](https://apify.com/) is a cloud platform for web scraping and data extraction,
which provides an [ecosystem](https://apify.com/store) of more than a thousand
ready-made apps called _Actors_ for various scraping, crawling, and extraction use cases.
This loader runs a specific Actor and loads its results.
## Usage
In this example, we’ll use the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor,
which can deeply crawl websites such as documentation, knowledge bases, help centers,
or blogs, and extract text content from the web pages.
The extracted text can then be fed to a vector index or a language model such as GPT
to answer questions about the crawled content.
To use this loader, you need to have a (free) Apify account
and set your [Apify API token](https://console.apify.com/account/integrations) in the code.
```python
from llama_index import download_loader
from llama_index.readers.schema import Document


# Converts a single record from the Actor's resulting dataset
# into the LlamaIndex Document format
def transform_dataset_item(item):
    return Document(
        text=item.get("text"),
        extra_info={
            "url": item.get("url"),
        },
    )


ApifyActor = download_loader("ApifyActor")

reader = ApifyActor("<My Apify API token>")
documents = reader.load_data(
    actor_id="apify/website-content-crawler",
    run_input={
        "startUrls": [{"url": "https://gpt-index.readthedocs.io/en/latest"}]
    },
    dataset_mapping_function=transform_dataset_item,
)
```
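To clarify what `dataset_mapping_function` receives and returns, here is a minimal, self-contained sketch. The `Document` dataclass below is only a stand-in for the real `llama_index.readers.schema.Document`, and the sample `item` is a hypothetical record shaped like typical Website Content Crawler output:

```python
from dataclasses import dataclass, field


# Stand-in for llama_index's Document class, used here only so the
# mapping can be demonstrated without installing llama_index.
@dataclass
class Document:
    text: str
    extra_info: dict = field(default_factory=dict)


# A hypothetical dataset record, shaped like Website Content Crawler output.
item = {
    "url": "https://example.com/docs/intro",
    "text": "Welcome to the documentation.",
}


def transform_dataset_item(item):
    # Pick out the fields we care about and wrap them in a Document.
    return Document(
        text=item.get("text"),
        extra_info={"url": item.get("url")},
    )


doc = transform_dataset_item(item)
print(doc.text)        # the page text
print(doc.extra_info)  # the URL metadata
```

The mapping function is called once per dataset record, so any per-page metadata you want to keep (URL, title, etc.) should be copied into `extra_info` here.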
This loader is designed to load data into
[LlamaIndex](https://github.com/run-llama/llama_index/tree/main/llama_index) and/or to be
used as a Tool in a [LangChain](https://github.com/hwchase17/langchain) agent.
See [here](https://github.com/emptycrown/llama-hub/tree/main) for examples.
## Apify Dataset Loader
[Apify](https://apify.com/) is a cloud platform for web scraping and data extraction,
which provides an [ecosystem](https://apify.com/store) of more than a thousand
ready-made apps called _Actors_ for various scraping, crawling, and extraction use cases.
This loader loads documents from an existing [Apify dataset](https://docs.apify.com/platform/storage/dataset).
## Usage
In this example, we’ll load a dataset generated by
the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor,
which can deeply crawl websites such as documentation, knowledge bases, help centers,
or blogs, and extract text content from the web pages.
The extracted text can then be fed to a vector index or a language model such as GPT
to answer questions about the crawled content.
To use this loader, you need to have a (free) Apify account
and set your [Apify API token](https://console.apify.com/account/integrations) in the code.
```python
from llama_index import download_loader
from llama_index.readers.schema import Document


# Converts a single record from the Apify dataset
# into the LlamaIndex Document format
def transform_dataset_item(item):
    return Document(
        text=item.get("text"),
        extra_info={
            "url": item.get("url"),
        },
    )


ApifyDataset = download_loader("ApifyDataset")

reader = ApifyDataset("<Your Apify API token>")
documents = reader.load_data(
    dataset_id="<Apify Dataset ID>",
    dataset_mapping_function=transform_dataset_item,
)
```
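Crawl output often contains pages with no extractable text, or the same URL captured more than once. Before mapping records into documents, it can help to clean them up first. The helper below is a hypothetical pre-processing step (not part of the loader) that drops empty pages and de-duplicates by URL; its output can then be passed through the mapping function above:

```python
# Hypothetical helper: drop records with empty text and
# keep only the first occurrence of each URL.
def clean_items(items):
    seen_urls = set()
    cleaned = []
    for item in items:
        url = item.get("url")
        if not item.get("text") or url in seen_urls:
            continue  # skip empty pages and duplicate URLs
        seen_urls.add(url)
        cleaned.append(item)
    return cleaned


# Sample records shaped like crawler output.
items = [
    {"url": "https://example.com/a", "text": "Page A"},
    {"url": "https://example.com/a", "text": "Page A again"},
    {"url": "https://example.com/b", "text": ""},
    {"url": "https://example.com/c", "text": "Page C"},
]

print([item["url"] for item in clean_items(items)])
# ['https://example.com/a', 'https://example.com/c']
```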