llama-index-readers-apify

Name: llama-index-readers-apify
Version: 0.1.3
Summary: llama-index readers apify integration
Upload time: 2024-02-21 19:23:06
Maintainer: drobnikj
Author: Your Name
Requires Python: >=3.8.1,<4.0
License: MIT
Keywords: apify, crawler, scraper, scraping
# Apify Loaders

## Apify Actor Loader

[Apify](https://apify.com/) is a cloud platform for web scraping and data extraction,
which provides an [ecosystem](https://apify.com/store) of more than a thousand
ready-made apps called _Actors_ for various scraping, crawling, and extraction use cases.

This loader runs a specific Actor and loads its results.

## Usage

In this example, we’ll use the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor,
which can deeply crawl websites such as documentation, knowledge bases, help centers,
or blogs, and extract text content from the web pages.
The extracted text can then be fed to a vector index or a language model such as GPT
to answer questions about the crawled content.

To use this loader, you need to have a (free) Apify account
and set your [Apify API token](https://console.apify.com/account/integrations) in the code.

```python
from llama_index import download_loader
from llama_index.readers.schema import Document


# Converts a single record from the Actor's resulting dataset to the LlamaIndex format
def transform_dataset_item(item):
    return Document(
        text=item.get("text"),
        extra_info={
            "url": item.get("url"),
        },
    )


ApifyActor = download_loader("ApifyActor")

reader = ApifyActor("<My Apify API token>")
documents = reader.load_data(
    actor_id="apify/website-content-crawler",
    run_input={
        "startUrls": [{"url": "https://gpt-index.readthedocs.io/en/latest"}]
    },
    dataset_mapping_function=transform_dataset_item,
)
```
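The `dataset_mapping_function` is just a plain function that maps one dataset record (a dict) to a `Document`. As a standalone sketch of that transformation — using a hypothetical stand-in dataclass rather than LlamaIndex's actual `Document` class, and a made-up record — it looks like this:

```python
from dataclasses import dataclass, field


# Stand-in for llama_index.readers.schema.Document, only to illustrate the shape.
@dataclass
class Document:
    text: str
    extra_info: dict = field(default_factory=dict)


def transform_dataset_item(item: dict) -> Document:
    # Each Website Content Crawler record contains (at least) "text" and "url".
    return Document(
        text=item.get("text"),
        extra_info={"url": item.get("url")},
    )


record = {"text": "LlamaIndex is a data framework.", "url": "https://example.com/docs"}
doc = transform_dataset_item(record)
print(doc.text)        # LlamaIndex is a data framework.
print(doc.extra_info)  # {'url': 'https://example.com/docs'}
```

Records missing a field simply produce `None` for that attribute, since `dict.get` is used rather than indexing.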

This loader is designed to load data into
[LlamaIndex](https://github.com/run-llama/llama_index/tree/main/llama_index) and/or to be used
subsequently as a Tool in a [LangChain](https://github.com/hwchase17/langchain) Agent.
See [here](https://github.com/emptycrown/llama-hub/tree/main) for examples.

## Apify Dataset Loader

[Apify](https://apify.com/) is a cloud platform for web scraping and data extraction,
which provides an [ecosystem](https://apify.com/store) of more than a thousand
ready-made apps called _Actors_ for various scraping, crawling, and extraction use cases.

This loader loads documents from an existing [Apify dataset](https://docs.apify.com/platform/storage/dataset).

## Usage

In this example, we’ll load a dataset generated by
the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor,
which can deeply crawl websites such as documentation, knowledge bases, help centers,
or blogs, and extract text content from the web pages.
The extracted text can then be fed to a vector index or a language model such as GPT
to answer questions about the crawled content.

To use this loader, you need to have a (free) Apify account
and set your [Apify API token](https://console.apify.com/account/integrations) in the code.

```python
from llama_index import download_loader
from llama_index.readers.schema import Document


# Converts a single record from the Apify dataset to the LlamaIndex format
def transform_dataset_item(item):
    return Document(
        text=item.get("text"),
        extra_info={
            "url": item.get("url"),
        },
    )


ApifyDataset = download_loader("ApifyDataset")

reader = ApifyDataset("<Your Apify API token>")
documents = reader.load_data(
    dataset_id="<Apify Dataset ID>",
    dataset_mapping_function=transform_dataset_item,
)
```
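Conceptually, the dataset loader fetches the dataset's records and applies `dataset_mapping_function` to each one. A minimal pure-Python sketch of that loop (the record data and the helper name below are made up for illustration; the real reader pulls records via the Apify API):

```python
def map_dataset_records(records, dataset_mapping_function):
    # Apply the user-supplied mapping function to every record in the dataset.
    return [dataset_mapping_function(record) for record in records]


# Hypothetical records, shaped like Website Content Crawler output.
records = [
    {"text": "Page one", "url": "https://example.com/1"},
    {"text": "Page two", "url": "https://example.com/2"},
]
docs = map_dataset_records(records, lambda r: {"text": r["text"], "url": r["url"]})
print(len(docs))  # 2
```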

            
