llama-index-readers-github

Name: llama-index-readers-github
Version: 0.8.0
Uploaded: 2025-07-31 02:44:36
Maintainers: ahmetkca, moncho, rwood-97
Summary: llama-index readers github integration
Requires Python: <4.0,>=3.9
Keywords: code, collaborators, git, github, issues, placeholder, repository, source code
# LlamaIndex Readers Integration: Github

`pip install llama-index-readers-github`

The GitHub readers package consists of three separate readers:

1. Repository Reader
2. Issues Reader
3. Collaborators Reader

All three readers require a GitHub personal access token, which you can generate under your account's developer settings.
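For example, the token can be read from an environment variable rather than hard-coded. The variable name `GITHUB_TOKEN` used here is a common convention, not something the package requires:

```python
import os

# Read the personal access token from the environment instead of
# embedding it in source code. GITHUB_TOKEN is just a conventional name.
github_token = os.environ.get("GITHUB_TOKEN", "")
if not github_token:
    print("Warning: GITHUB_TOKEN is not set")
```

The `github_token` variable defined this way is what the client examples below expect.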

## Repository Reader

This reader traverses a repository, with options to filter by directory, file extension, or file path, and to apply custom processing logic.

### Basic Usage

```python
from llama_index.readers.github import GithubRepositoryReader, GithubClient

github_client = GithubClient(github_token=github_token, verbose=False)

reader = GithubRepositoryReader(
    github_client=github_client,
    owner="run-llama",
    repo="llama_index",
    use_parser=False,
    verbose=True,
    filter_directories=(
        ["docs"],
        GithubRepositoryReader.FilterType.INCLUDE,
    ),
    filter_file_extensions=(
        [
            ".png",
            ".jpg",
            ".jpeg",
            ".gif",
            ".svg",
            ".ico",
            ".json",
            ".ipynb",
        ],
        GithubRepositoryReader.FilterType.EXCLUDE,
    ),
)

documents = reader.load_data(branch="main")
```

### Advanced Filtering Options

#### Filter Specific File Paths

```python
# Include only specific files
reader = GithubRepositoryReader(
    github_client=github_client,
    owner="run-llama",
    repo="llama_index",
    filter_file_paths=(
        ["README.md", "src/main.py", "docs/guide.md"],
        GithubRepositoryReader.FilterType.INCLUDE,
    ),
)

# Exclude specific files
reader = GithubRepositoryReader(
    github_client=github_client,
    owner="run-llama",
    repo="llama_index",
    filter_file_paths=(
        ["tests/test_file.py", "temp/cache.txt"],
        GithubRepositoryReader.FilterType.EXCLUDE,
    ),
)
```

#### Custom File Processing Callback

```python
def process_file_callback(file_path: str, file_size: int) -> tuple[bool, str]:
    """Custom logic to determine if a file should be processed.

    Args:
        file_path: The full path to the file
        file_size: The size of the file in bytes

    Returns:
        Tuple of (should_process: bool, reason: str)
    """
    # Skip large files
    if file_size > 1024 * 1024:  # 1MB
        return False, f"File too large: {file_size} bytes"

    # Skip test files
    if "test" in file_path.lower():
        return False, "Skipping test files"

    # Skip binary files by extension
    binary_extensions = [".exe", ".bin", ".so", ".dylib"]
    if any(file_path.endswith(ext) for ext in binary_extensions):
        return False, "Skipping binary files"

    return True, ""


reader = GithubRepositoryReader(
    github_client=github_client,
    owner="run-llama",
    repo="llama_index",
    process_file_callback=process_file_callback,
    fail_on_error=False,  # Continue processing if callback fails
)
```

#### Custom Parsers and Temporary Folder

```python
from llama_index.core.readers.base import BaseReader


# Custom parser for specific file types
class CustomMarkdownParser(BaseReader):
    def load_data(self, file_path, extra_info=None):
        # Custom parsing logic here; should return a list of Document objects
        ...


reader = GithubRepositoryReader(
    github_client=github_client,
    owner="run-llama",
    repo="llama_index",
    use_parser=True,
    custom_parsers={".md": CustomMarkdownParser()},
    custom_folder="/tmp/github_processing",  # Custom temp directory
)
```

### Event System Integration

The reader integrates with LlamaIndex's instrumentation system to provide detailed events during processing:

```python
from llama_index.core.instrumentation import get_dispatcher
from llama_index.core.instrumentation.event_handlers import BaseEventHandler
from llama_index.readers.github.repository.event import (
    GitHubFileProcessedEvent,
    GitHubFileSkippedEvent,
    GitHubFileFailedEvent,
    GitHubRepositoryProcessingStartedEvent,
    GitHubRepositoryProcessingCompletedEvent,
)


class GitHubEventHandler(BaseEventHandler):
    def handle(self, event):
        if isinstance(event, GitHubRepositoryProcessingStartedEvent):
            print(f"Started processing repository: {event.repository_name}")
        elif isinstance(event, GitHubFileProcessedEvent):
            print(
                f"Processed file: {event.file_path} ({event.file_size} bytes)"
            )
        elif isinstance(event, GitHubFileSkippedEvent):
            print(f"Skipped file: {event.file_path} - {event.reason}")
        elif isinstance(event, GitHubFileFailedEvent):
            print(f"Failed to process file: {event.file_path} - {event.error}")
        elif isinstance(event, GitHubRepositoryProcessingCompletedEvent):
            print(
                f"Completed processing. Total documents: {event.total_documents}"
            )


# Register the event handler
dispatcher = get_dispatcher()
handler = GitHubEventHandler()
dispatcher.add_event_handler(handler)

# Use the reader - events will be automatically dispatched
reader = GithubRepositoryReader(
    github_client=github_client,
    owner="run-llama",
    repo="llama_index",
)
documents = reader.load_data(branch="main")
```

#### Available Events

The following events are dispatched during repository processing:

- **`GitHubRepositoryProcessingStartedEvent`**: Fired when repository processing begins

  - `repository_name`: Name of the repository (owner/repo)
  - `branch_or_commit`: Branch name or commit SHA being processed

- **`GitHubRepositoryProcessingCompletedEvent`**: Fired when repository processing completes

  - `repository_name`: Name of the repository
  - `branch_or_commit`: Branch name or commit SHA
  - `total_documents`: Number of documents created

- **`GitHubTotalFilesToProcessEvent`**: Fired with the total count of files to be processed

  - `repository_name`: Name of the repository
  - `branch_or_commit`: Branch name or commit SHA
  - `total_files`: Total number of files found

- **`GitHubFileProcessingStartedEvent`**: Fired when individual file processing starts

  - `file_path`: Path to the file being processed
  - `file_type`: File extension

- **`GitHubFileProcessedEvent`**: Fired when a file is successfully processed

  - `file_path`: Path to the processed file
  - `file_type`: File extension
  - `file_size`: Size of the file in bytes
  - `document`: The created Document object

- **`GitHubFileSkippedEvent`**: Fired when a file is skipped

  - `file_path`: Path to the skipped file
  - `file_type`: File extension
  - `reason`: Reason why the file was skipped

- **`GitHubFileFailedEvent`**: Fired when file processing fails
  - `file_path`: Path to the failed file
  - `file_type`: File extension
  - `error`: Error message describing the failure

## Issues Reader

```python
from llama_index.readers.github import (
    GitHubRepositoryIssuesReader,
    GitHubIssuesClient,
)

github_client = GitHubIssuesClient(github_token=github_token, verbose=True)

reader = GitHubRepositoryIssuesReader(
    github_client=github_client,
    owner="moncho",
    repo="dry",
    verbose=True,
)

documents = reader.load_data(
    state=GitHubRepositoryIssuesReader.IssueState.ALL,
    labelFilters=[("bug", GitHubRepositoryIssuesReader.FilterType.INCLUDE)],
)
```

## Collaborators Reader

```python
from llama_index.readers.github import (
    GitHubRepositoryCollaboratorsReader,
    GitHubCollaboratorsClient,
)

github_client = GitHubCollaboratorsClient(
    github_token=github_token, verbose=True
)

reader = GitHubRepositoryCollaboratorsReader(
    github_client=github_client,
    owner="moncho",
    repo="dry",
    verbose=True,
)

documents = reader.load_data()
```
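Whichever reader you use, `load_data` returns a list of Document objects. A small helper can summarize what came back before indexing; this helper is illustrative and only assumes each document exposes a `.text` attribute, which llama-index Documents do:

```python
def summarize_documents(documents):
    """Return basic stats about a list of loaded documents."""
    return {
        "count": len(documents),
        "total_chars": sum(len(doc.text) for doc in documents),
    }
```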

            
