# GitHub Repository Extractor
This Python script extracts the contents of a GitHub repository into a single text file. Packing an entire codebase into one file makes it easy to pass to Large Language Models (LLMs) with large context windows.
## Features
- Support for both local and remote GitHub repositories
- Flexible ignore and include lists for files, folders, and extensions
- Progress bar to track the extraction process
- Handles binary files
- Option to clone remote repositories temporarily
- Ideal for preparing codebases for analysis by LLMs
## Dependencies
This project requires the following Python packages:
- `pygithub`: For interacting with the GitHub API
- `tqdm`: For displaying progress bars
- `gitpython`: For handling Git operations
You can install these dependencies using pip:
```
pip install pygithub tqdm gitpython
```
## Installation
1. Clone this repository:
   ```
   git clone https://github.com/ltoscano/github-repo-extractor.git
   ```
2. Install the required dependencies:
   ```
   pip install -r requirements.txt
   ```
## Usage
1. Import the `GitHubRepoExtractor` class from the script.
2. Create an instance of `GitHubRepoExtractor` with your repository details.
3. Set ignore and include lists as needed.
4. Call the `extract_to_file()` method to start the extraction process.
Example:
```python
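from github_repo_extractor import GitHubRepoExtractor

# Point the extractor at a repository; replace the placeholder token with your own.
extractor = GitHubRepoExtractor(
    repo_input='https://github.com/username/repo.git',
    access_token='your_github_token'
)

# Files, folders, and extensions to skip.
extractor.set_ignore_list(
    files=['.gitignore'],
    folders=['tests', '.github'],
    extensions=['.log']
)

# Files and extensions to include.
extractor.set_include_list(
    files=['README.md'],
    extensions=['.py'],
    exclusive=True
)

# Write the extracted contents to a single text file.
extractor.extract_to_file('output.txt')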
```
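To make the interplay of the ignore and include lists concrete, here is a minimal standalone sketch of one way such rules could be evaluated. It is not this package's implementation: the `Rules` class and the `should_extract` function are hypothetical, and it assumes that ignore rules are checked first and that `exclusive=True` means "keep only files that match the include list".

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class Rules:
    """Hypothetical container mirroring the files/folders/extensions arguments."""
    files: set = field(default_factory=set)
    folders: set = field(default_factory=set)
    extensions: set = field(default_factory=set)

    def matches(self, path: str) -> bool:
        p = PurePosixPath(path)
        return (
            p.name in self.files                                    # exact file name
            or p.suffix in self.extensions                          # extension such as ".py"
            or any(part in self.folders for part in p.parts[:-1])   # any parent folder
        )


def should_extract(path: str, ignore: Rules, include: Rules, exclusive: bool) -> bool:
    """Assumed semantics: ignore rules win; exclusive mode keeps only include matches."""
    if ignore.matches(path):
        return False
    if exclusive:
        return include.matches(path)
    return True


ignore = Rules(files={".gitignore"}, folders={"tests", ".github"}, extensions={".log"})
include = Rules(files={"README.md"}, extensions={".py"})

for candidate in ["src/app.py", "tests/test_app.py", "README.md", "docs/guide.rst"]:
    print(candidate, should_extract(candidate, ignore, include, exclusive=True))
```

Under these assumed semantics, a file inside an ignored folder is skipped even when its extension is on the include list. Check the package source if you depend on a specific precedence.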
## Authentication
Rather than entering the GitHub authentication token on every run, you can store it in a centralized secrets manager such as keyvault. We recommend the keyvault library available at [https://github.com/ltoscano/keyvault](https://github.com/ltoscano/keyvault).
This approach gives you a more secure way to manage your GitHub token and keeps it in one place.
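If you only need to keep the token out of your source code, reading it from an environment variable is a common lightweight alternative. The sketch below does not use the keyvault API; the helper `load_github_token` and the variable name `GITHUB_TOKEN` are assumptions chosen for illustration.

```python
import os
from typing import Mapping, Optional


def load_github_token(
    env: Optional[Mapping[str, str]] = None,
    var_name: str = "GITHUB_TOKEN",
) -> str:
    """Return the token stored in `var_name`, or raise if it is missing or blank."""
    source = os.environ if env is None else env
    token = source.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {var_name} environment variable to a GitHub access token."
        )
    return token


# An explicit mapping is passed here so the snippet runs without touching
# the real environment; omit the argument to read from os.environ.
print(load_github_token({"GITHUB_TOKEN": "ghp_example"}))
```

The returned string can then be passed as the `access_token` argument when creating the `GitHubRepoExtractor` instance.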
## Use Case: Preparing Codebases for LLMs
This tool is particularly valuable when working with Large Language Models (LLMs) with large context windows. By packing an entire codebase into a single file, you can:
1. Easily feed the entire codebase into an LLM for analysis, code review, or understanding.
2. Maintain context across multiple files and directories when discussing code with an LLM.
3. Simplify the process of asking LLMs to perform tasks that require understanding of the entire project structure.
This approach allows for more comprehensive and context-aware interactions with LLMs when working with large software projects.
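Before sending an extracted file to a model, it can help to check roughly whether it fits the model's context window. The sketch below is a rough estimate under stated assumptions: the ratio of about four characters per token is a common heuristic rather than a real tokenizer, and the function names `estimate_tokens` and `fits_in_context` are hypothetical.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the characters-per-token ratio is a heuristic, not a tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(text: str, context_limit: int, reserve: int = 0) -> bool:
    """True if the estimated tokens plus a reserve for the reply stay within the limit."""
    return estimate_tokens(text) + reserve <= context_limit


sample = "def add(a, b):\n    return a + b\n"
print(estimate_tokens(sample), fits_in_context(sample, context_limit=100))
```

To apply this to an extraction result, read the contents of `output.txt` into a string and pass it to `fits_in_context` with the limit documented for your model. For an exact count, use the tokenizer that belongs to that model.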
## Contributing
Contributions are welcome! Please see the [CONTRIBUTING.md](CONTRIBUTING.md) file for guidelines on how to contribute to this project.
## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
Raw data
{
"_id": null,
"home_page": "https://github.com/ltoscano/github-repo-extractor",
"name": "github-repo-extractor",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": null,
"keywords": null,
"author": "Lorenzo Toscano",
"author_email": "lorenzo.toscano@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/6c/b8/5ad39f3d8a40ae339e1111ea5073538a325aa1dcbc1e64d800a9cdc5494a/github_repo_extractor-0.1.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "A tool to extract GitHub repositories into a single file",
"version": "0.1.1",
"project_urls": {
"Homepage": "https://github.com/ltoscano/github-repo-extractor"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "0e38763fa4a20a6464b17033662e732a66f5ccf49f684d026970a19a4f867fb0",
"md5": "b5a2cf8660e9ee98870d7629f492b058",
"sha256": "afa08911d6d808f37646496b019db065a66422f0ff0fae9bb450e9a03184e50b"
},
"downloads": -1,
"filename": "github_repo_extractor-0.1.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "b5a2cf8660e9ee98870d7629f492b058",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 6845,
"upload_time": "2024-09-15T15:10:26",
"upload_time_iso_8601": "2024-09-15T15:10:26.976707Z",
"url": "https://files.pythonhosted.org/packages/0e/38/763fa4a20a6464b17033662e732a66f5ccf49f684d026970a19a4f867fb0/github_repo_extractor-0.1.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "6cb85ad39f3d8a40ae339e1111ea5073538a325aa1dcbc1e64d800a9cdc5494a",
"md5": "5d9ddd01f684ba0f0edeb4c36750a91d",
"sha256": "efbdfa598dfaf5f16d3f6989a133bc20a6ed43a97b0432fbbe2e006434f028d2"
},
"downloads": -1,
"filename": "github_repo_extractor-0.1.1.tar.gz",
"has_sig": false,
"md5_digest": "5d9ddd01f684ba0f0edeb4c36750a91d",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 7488,
"upload_time": "2024-09-15T15:10:29",
"upload_time_iso_8601": "2024-09-15T15:10:29.263468Z",
"url": "https://files.pythonhosted.org/packages/6c/b8/5ad39f3d8a40ae339e1111ea5073538a325aa1dcbc1e64d800a9cdc5494a/github_repo_extractor-0.1.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-09-15 15:10:29",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "ltoscano",
"github_project": "github-repo-extractor",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [],
"lcname": "github-repo-extractor"
}