llama-index-readers-apify


- Name: llama-index-readers-apify
- Version: 0.2.0
- Summary: llama-index readers apify integration
- Upload time: 2024-08-22 05:45:09
- Maintainer: drobnikj
- Author: Your Name
- Requires Python: <4.0,>=3.8.1
- License: MIT
- Keywords: apify, crawler, scraper, scraping
# Apify Loaders

```bash
pip install llama-index-readers-apify
```

## Apify Actor Loader

[Apify](https://apify.com/) is a cloud platform for web scraping and data extraction,
which provides an [ecosystem](https://apify.com/store) of more than a thousand
ready-made apps called _Actors_ for various scraping, crawling, and extraction use cases.

This loader runs a specific Actor and loads its results.

## Usage

In this example, we’ll use the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor,
which can deeply crawl websites such as documentation, knowledge bases, help centers,
or blogs, and extract the text content from their pages.
The extracted text can then be fed to a vector index or a language model such as GPT
to answer questions about the crawled content.

To use this loader, you need to have a (free) Apify account
and set your [Apify API token](https://console.apify.com/account/integrations) in the code.
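
Pasting the token into source is fine for a quick test, but it is easy to commit by accident. One option is to read it from an environment variable instead. The sketch below assumes a variable named `APIFY_API_TOKEN`; both that name and the `get_apify_token` helper are illustrative and not part of this package.

```python
import os


def get_apify_token(env_var: str = "APIFY_API_TOKEN") -> str:
    """Return the Apify API token stored in an environment variable.

    The variable name is an assumption made for this sketch; use whatever
    name your own setup defines.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        # Fail early with a clear message instead of sending an empty token.
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Apify API token."
        )
    return token
```

The returned string can be passed to the reader in place of the literal placeholder used in the example that follows.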

```python
from llama_index.core import Document
from llama_index.readers.apify import ApifyActor

# Authenticate with your Apify API token.
reader = ApifyActor("<Your Apify API token>")

# Run the Actor and map each item of its results to a Document.
documents = reader.load_data(
    actor_id="apify/website-content-crawler",
    run_input={
        "startUrls": [{"url": "https://docs.llamaindex.ai/en/latest/"}]
    },
    dataset_mapping_function=lambda item: Document(
        text=item.get("text"),
        metadata={
            "url": item.get("url"),
        },
    ),
)
```
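
The `run_input` above contains a single start URL. When crawling several sites, it may be convenient to build that dictionary from a plain list of URLs. The following is a minimal sketch that relies only on the `startUrls` shape shown in the example; `build_run_input` is a hypothetical helper, and any other input fields the Actor accepts are left out.

```python
from typing import Any, Dict, Iterable, List


def build_run_input(urls: Iterable[str]) -> Dict[str, Any]:
    """Build a run_input dict with one startUrls entry per distinct URL."""
    seen = set()
    start_urls: List[Dict[str, str]] = []
    for url in urls:
        url = url.strip()
        # Skip blank entries and repeats while preserving the input order.
        if url and url not in seen:
            seen.add(url)
            start_urls.append({"url": url})
    if not start_urls:
        raise ValueError("at least one start URL is required")
    return {"startUrls": start_urls}
```

The result can be passed as `run_input=build_run_input([...])` in the call to `reader.load_data`.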

This loader is designed to load data into
[LlamaIndex](https://github.com/run-llama/llama_index/tree/main/llama_index) and/or to be used subsequently
as a Tool in a [LangChain](https://github.com/hwchase17/langchain) Agent.

## Apify Dataset Loader

This loader loads documents from an existing [Apify dataset](https://docs.apify.com/platform/storage/dataset).

## Usage

In this example, we’ll load a dataset generated by
the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor,
which can deeply crawl websites such as documentation, knowledge bases, help centers,
or blogs, and extract the text content from their pages.
The extracted text can then be fed to a vector index or a language model such as GPT
to answer questions about the crawled content.

To use this loader, you need to have a (free) Apify account
and set your [Apify API token](https://console.apify.com/account/integrations) in the code.

```python
from llama_index.core import Document
from llama_index.readers.apify import ApifyDataset

# Authenticate with your Apify API token.
reader = ApifyDataset("<Your Apify API token>")

# Read the existing dataset and map each of its items to a Document.
documents = reader.load_data(
    dataset_id="<Apify Dataset ID>",
    dataset_mapping_function=lambda item: Document(
        text=item.get("text"),
        metadata={
            "url": item.get("url"),
        },
    ),
)
```
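
The lambda in the example passes `item.get("text")` straight through, which yields `None` for any dataset item that has no `text` field. A named mapping function makes it easier to handle such items. The sketch below is one possible approach, not part of this package: `item_to_document_fields` is a hypothetical helper that substitutes an empty string for missing text and omits the `url` key when it is absent.

```python
from typing import Any, Dict


def item_to_document_fields(item: Dict[str, Any]) -> Dict[str, Any]:
    """Turn one dataset item into keyword arguments for Document(...)."""
    # Fall back to an empty string when the item has no (or null) text.
    text = item.get("text") or ""
    metadata: Dict[str, Any] = {}
    url = item.get("url")
    if url is not None:
        metadata["url"] = url
    return {"text": text, "metadata": metadata}
```

It can be plugged into either loader as `dataset_mapping_function=lambda item: Document(**item_to_document_fields(item))`.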

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "llama-index-readers-apify",
    "maintainer": "drobnikj",
    "docs_url": null,
    "requires_python": "<4.0,>=3.8.1",
    "maintainer_email": null,
    "keywords": "apify, crawler, scraper, scraping",
    "author": "Your Name",
    "author_email": "you@example.com",
    "download_url": "https://files.pythonhosted.org/packages/0d/dd/d3267ce374892a304086b99ddae4a8df342baf286b9c05ea9f87cd2eb373/llama_index_readers_apify-0.2.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "llama-index readers apify integration",
    "version": "0.2.0",
    "project_urls": null,
    "split_keywords": [
        "apify",
        " crawler",
        " scraper",
        " scraping"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "438f3a0c3edcd1e50818c63cbecdd2b3fe9c73b7bdfafa6f577f1950653af4d8",
                "md5": "f2527b2b3772ca75e659656e96218d00",
                "sha256": "df54112c941a51baf4737e3e13da8c80c492fb26582bde3bbff60a8fe8917330"
            },
            "downloads": -1,
            "filename": "llama_index_readers_apify-0.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "f2527b2b3772ca75e659656e96218d00",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.8.1",
            "size": 4790,
            "upload_time": "2024-08-22T05:45:08",
            "upload_time_iso_8601": "2024-08-22T05:45:08.699853Z",
            "url": "https://files.pythonhosted.org/packages/43/8f/3a0c3edcd1e50818c63cbecdd2b3fe9c73b7bdfafa6f577f1950653af4d8/llama_index_readers_apify-0.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "0dddd3267ce374892a304086b99ddae4a8df342baf286b9c05ea9f87cd2eb373",
                "md5": "d8340acf226393ccaaf782f71e778355",
                "sha256": "c4b0eb2c7043b0ff21225c41447aeefc274b27fcf4bc39dec61db1cd1b19522d"
            },
            "downloads": -1,
            "filename": "llama_index_readers_apify-0.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "d8340acf226393ccaaf782f71e778355",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.8.1",
            "size": 3584,
            "upload_time": "2024-08-22T05:45:09",
            "upload_time_iso_8601": "2024-08-22T05:45:09.712585Z",
            "url": "https://files.pythonhosted.org/packages/0d/dd/d3267ce374892a304086b99ddae4a8df342baf286b9c05ea9f87cd2eb373/llama_index_readers_apify-0.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-08-22 05:45:09",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "llama-index-readers-apify"
}
        