# Codebase to Text Converter
A Python tool that converts a codebase (a folder tree and its files) into a single text file or Microsoft Word document (.docx), preserving both the folder structure and the file contents. The output is suited to AI/LLM processing, documentation generation, and code analysis.
## ✨ Features
- **Multi-source input**: Local directories and GitHub repositories
- **Flexible output**: Text files (.txt) and Microsoft Word documents (.docx)
- **Smart exclusions**: Advanced pattern matching for files and directories
- **Performance optimized**: Efficient traversal of large codebases
- **Comprehensive logging**: Detailed verbose mode for transparency
- **Encoding support**: Handles various file encodings gracefully
## 🚀 Installation
```bash
pip install codebase-to-text
```
## 📖 Usage
### Command Line Interface (CLI)
#### Basic Usage
```bash
codebase-to-text --input "path_or_github_url" --output "output_path" --output_type "txt"
```
#### Advanced Usage with Exclusions
```bash
# Exclude specific patterns
codebase-to-text --input "./my_project" --output "output.txt" --output_type "txt" --exclude "*.log,temp/,**/__pycache__/**"
# Multiple exclude arguments
codebase-to-text --input "./my_project" --output "output.txt" --output_type "txt" --exclude "*.pyc" --exclude "build/" --exclude "venv/"
# Exclude hidden files
codebase-to-text --input "./my_project" --output "output.txt" --output_type "txt" --exclude_hidden
# Verbose mode for detailed logging
codebase-to-text --input "./my_project" --output "output.txt" --output_type "txt" --verbose
```
### Python API
```python
from codebase_to_text import CodebaseToText

# Basic usage
converter = CodebaseToText(
    input_path="path_or_github_url",
    output_path="output_path",
    output_type="txt",
)
converter.get_file()

# Advanced usage with exclusions
converter = CodebaseToText(
    input_path="./my_project",
    output_path="./output.txt",
    output_type="txt",
    exclude=["*.log", "temp/", "**/__pycache__/**"],
    exclude_hidden=True,
    verbose=True,
)
converter.get_file()

# Get text content without saving to file
text_content = converter.get_text()
print(text_content)
```
## 🎯 Exclusion Patterns
The tool supports powerful exclusion patterns to filter out unwanted files and directories:
### Pattern Types
1. **Exact filename**: `README.md`, `config.yaml`
2. **Wildcard patterns**: `*.log`, `*.tmp`, `test_*`
3. **Directory patterns**: `__pycache__/`, `.git/`, `node_modules/`
4. **Recursive patterns**: `**/__pycache__/**`, `**/node_modules/**`
5. **Path-based patterns**: `src/temp/`, `docs/build/`
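
To make the five pattern types above concrete, here is a minimal sketch of how such patterns could be matched against a relative path using Python's standard `fnmatch` module. It is an illustration only, not the tool's actual matcher: the function name `is_excluded` and the exact matching rules (for example, that a bare directory pattern matches any path component) are assumptions.

```python
import fnmatch
from pathlib import PurePosixPath
from typing import Iterable


def is_excluded(rel_path: str, patterns: Iterable[str]) -> bool:
    """Hypothetical matcher: True if rel_path (POSIX-style, relative) hits a pattern."""
    path = PurePosixPath(rel_path)
    for pattern in patterns:
        if pattern.startswith("**/") and pattern.endswith("/**"):
            # Recursive pattern: the named component may appear at any depth.
            name = pattern[3:-3]
            if any(fnmatch.fnmatchcase(part, name) for part in path.parts):
                return True
        elif pattern.endswith("/"):
            prefix = pattern.rstrip("/")
            if "/" in prefix:
                # Path-based pattern: assumed to be anchored at the project root.
                if rel_path == prefix or rel_path.startswith(prefix + "/"):
                    return True
            elif any(fnmatch.fnmatchcase(part, prefix) for part in path.parts):
                # Directory pattern: assumed to match any path component.
                return True
        elif fnmatch.fnmatchcase(path.name, pattern) or fnmatch.fnmatchcase(rel_path, pattern):
            # Exact filename or wildcard, tried on the basename and the full path.
            return True
    return False
```

For instance, `is_excluded("src/__pycache__/mod.pyc", ["**/__pycache__/**"])` is true under these rules, while `is_excluded("lib/src/temp/a.txt", ["src/temp/"])` is false because the path-based pattern is anchored at the root.
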
### Exclusion Sources
1. **CLI Arguments**: Use `--exclude` flag (can be used multiple times)
2. **`.exclude` file**: Place in your project root (see example below)
3. **Default patterns**: Common files/folders are excluded automatically
### Default Exclusions
The tool automatically excludes common development files:
- `.git/`, `__pycache__/`, `*.pyc`, `*.pyo`
- `node_modules/`, `.venv/`, `venv/`, `env/`
- `*.log`, `*.tmp`, `.DS_Store`
- `.pytest_cache/`, `build/`, `dist/`
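
The three exclusion sources have to end up in one pattern list. The sketch below shows one way that could work, using the default patterns listed above. The names `DEFAULT_EXCLUDES` and `merge_patterns`, and the choice to keep first occurrences in order, are assumptions for illustration; the tool's real precedence rules are not documented here.

```python
from typing import Iterable, List

# Default patterns copied from the list above.
DEFAULT_EXCLUDES = [
    ".git/", "__pycache__/", "*.pyc", "*.pyo",
    "node_modules/", ".venv/", "venv/", "env/",
    "*.log", "*.tmp", ".DS_Store",
    ".pytest_cache/", "build/", "dist/",
]


def merge_patterns(*sources: Iterable[str]) -> List[str]:
    """Concatenate pattern sources, dropping duplicates but keeping order."""
    seen = set()
    merged = []
    for source in sources:
        for pattern in source:
            if pattern not in seen:
                seen.add(pattern)
                merged.append(pattern)
    return merged


# Hypothetical inputs: patterns read from a .exclude file and from --exclude.
file_patterns = ["*.log", "docs/"]
cli_patterns = ["docs/", "tmp/"]
all_patterns = merge_patterns(DEFAULT_EXCLUDES, file_patterns, cli_patterns)
```
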
## 📝 .exclude File Example
Create a `.exclude` file in your project root:
```bash
# .exclude file - Patterns for files/folders to exclude
# Version control
.git/
.gitignore
# Python
__pycache__/
*.pyc
venv/
.pytest_cache/
# Node.js
node_modules/
*.log
# IDE files
.vscode/
.idea/
# Project specific
config/secrets.yaml
data/large_files/
```
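
A file in this format is simple to parse: skip blank lines and `#` comments, and keep everything else as a pattern. The following sketch assumes exactly that; `parse_exclude_text` and `load_exclude_file` are hypothetical helper names, not part of the package's API.

```python
from pathlib import Path
from typing import List


def parse_exclude_text(text: str) -> List[str]:
    """Return the patterns in text, ignoring blank lines and '#' comments."""
    patterns = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        patterns.append(stripped)
    return patterns


def load_exclude_file(project_root: str, filename: str = ".exclude") -> List[str]:
    """Read patterns from <project_root>/<filename>, or return [] if it is absent."""
    exclude_path = Path(project_root) / filename
    if not exclude_path.is_file():
        return []
    return parse_exclude_text(exclude_path.read_text(encoding="utf-8"))
```
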
## 🔧 CLI Parameters
| Parameter | Description | Example |
|-----------|-------------|---------|
| `--input` | Input path (local folder or GitHub URL) | `./my_project` or `https://github.com/user/repo` |
| `--output` | Output file path | `./output.txt` |
| `--output_type` | Output format (`txt` or `docx`) | `txt` |
| `--exclude` | Exclusion patterns (repeatable) | `--exclude "*.log" --exclude "temp/"` |
| `--exclude_hidden` | Exclude hidden files/folders | Flag (no value) |
| `--verbose` | Enable detailed logging | Flag (no value) |
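
For readers who want to see how flags like these fit together, here is a rough `argparse` sketch. The flag names come from the table above; everything else (which flags are required, the `txt` default, and the `split_patterns` helper that handles comma-separated values) is an assumption, not the package's actual parser.

```python
import argparse
from typing import Iterable, List


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags in the table above."""
    parser = argparse.ArgumentParser(prog="codebase-to-text")
    parser.add_argument("--input", required=True, help="Local folder or GitHub URL")
    parser.add_argument("--output", required=True, help="Output file path")
    parser.add_argument("--output_type", choices=["txt", "docx"], default="txt")
    # action="append" collects one list entry per --exclude occurrence.
    parser.add_argument("--exclude", action="append", default=None)
    parser.add_argument("--exclude_hidden", action="store_true")
    parser.add_argument("--verbose", action="store_true")
    return parser


def split_patterns(values: Iterable[str]) -> List[str]:
    """Flatten repeated and comma-separated --exclude values into one list."""
    patterns = []
    for value in values:
        patterns.extend(part.strip() for part in value.split(",") if part.strip())
    return patterns
```

With this sketch, `--exclude "*.log,temp/" --exclude "build/"` yields the pattern list `["*.log", "temp/", "build/"]` after `split_patterns(args.exclude or [])`.
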
## 💡 Examples
### Convert Local Project
```bash
# Basic conversion (the path is unquoted so the shell expands ~)
codebase-to-text --input ~/projects/my_app --output "my_app_code.txt" --output_type "txt"
# With custom exclusions
codebase-to-text --input ~/projects/my_app --output "my_app_code.txt" --output_type "txt" --exclude "*.log,build/,dist/" --verbose
```
### Convert GitHub Repository
```bash
# Public repository
codebase-to-text --input "https://github.com/username/repo" --output "repo_analysis.docx" --output_type "docx"
# With exclusions for cleaner output
codebase-to-text --input "https://github.com/username/repo" --output "repo_clean.txt" --output_type "txt" --exclude "*.md,docs/,examples/"
```
### Python Integration
```python
# Analyze a codebase programmatically
from codebase_to_text import CodebaseToText


def analyze_codebase(project_path):
    converter = CodebaseToText(
        input_path=project_path,
        output_path="analysis.txt",
        output_type="txt",
        exclude=["*.log", "test/", "**/__pycache__/**"],
        verbose=True,
    )

    # Get the content
    content = converter.get_text()

    # Process with your preferred LLM/AI tool
    # analysis_result = your_ai_tool.analyze(content)

    return content


# Usage
code_content = analyze_codebase("./my_project")
```
## 🎯 Use Cases
- **AI/LLM Training**: Prepare codebases for language model training
- **Code Review**: Generate comprehensive code overviews for review
- **Documentation**: Create single-file documentation from projects
- **Analysis**: Feed entire codebases to AI tools for analysis
- **Migration**: Document legacy codebases before migration
- **Learning**: Study open-source projects more effectively
## 🔄 Output Format
The generated output includes:
1. **Folder Structure**: Tree-like representation of the directory structure
2. **File Contents**: Full content of each file with metadata
3. **Clear Separators**: Distinct sections for easy navigation
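
To give a feel for this kind of layout, the sketch below builds a folder tree followed by per-file sections using only the standard library. It is not the tool's real output format: the section titles, underline separators, and function names are all made up for illustration, and exclusions are left out for brevity.

```python
import os


def render_tree(root: str) -> str:
    """Indented listing of every folder and file under root."""
    lines = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sort in place so os.walk visits folders in a stable order
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        name = os.path.basename(os.path.abspath(dirpath))
        lines.append("    " * depth + name + "/")
        for filename in sorted(filenames):
            lines.append("    " * (depth + 1) + filename)
    return "\n".join(lines)


def render_files(root: str) -> str:
    """One section per file: relative path, an underline, then the content."""
    sections = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()
        for filename in sorted(filenames):
            full_path = os.path.join(dirpath, filename)
            rel_path = os.path.relpath(full_path, root)
            # errors="replace" keeps going when a file is not valid UTF-8.
            with open(full_path, "r", encoding="utf-8", errors="replace") as handle:
                body = handle.read()
            sections.append(f"{rel_path}\n{'-' * len(rel_path)}\n{body}")
    return "\n\n".join(sections)


def render_codebase(root: str) -> str:
    """Combine the tree and the file sections under hypothetical headings."""
    return (
        "Folder Structure\n================\n" + render_tree(root)
        + "\n\nFile Contents\n=============\n" + render_files(root)
    )
```
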
## ✒️ License
This project is licensed under the MIT License. See the LICENSE file for details.
Raw data
{
"_id": null,
"home_page": "https://github.com/QaisarRajput/codebase_to_text",
"name": "codebase-to-text",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "codebase, code conversion, text conversion, folder structure, file contents, text extraction, document conversion, Python package, GitHub repository, command-line tool, code analysis, file parsing, code documentation, formatting preservation, readability",
"author": "Qaisar Tanvir",
"author_email": "qaisartanvir.dev+codebase_to_text@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/f2/cd/b7ef3e1100d7f44c89c6a86d9ada97890edd6cf25bc11a1c6e8780b6da08/codebase_to_text-1.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "A Python package to convert codebase to text",
"version": "1.2",
"project_urls": {
"Download": "https://github.com/QaisarRajput/codebase_to_text/archive/refs/tags/1.2.tar.gz",
"Homepage": "https://github.com/QaisarRajput/codebase_to_text"
},
"split_keywords": [
"codebase",
" code conversion",
" text conversion",
" folder structure",
" file contents",
" text extraction",
" document conversion",
" python package",
" github repository",
" command-line tool",
" code analysis",
" file parsing",
" code documentation",
" formatting preservation",
" readability"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "9b156436d3b1a5c5f180eb6cc30de96e24319b35219412b93c5386704b489263",
"md5": "5eb942c999c5bb08579018b3926c7667",
"sha256": "fa02b8dcdc08842d301f5e34547c0859c5448e0e5ae61c64a86f9b979c37b481"
},
"downloads": -1,
"filename": "codebase_to_text-1.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "5eb942c999c5bb08579018b3926c7667",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 17277,
"upload_time": "2025-07-20T20:26:40",
"upload_time_iso_8601": "2025-07-20T20:26:40.199786Z",
"url": "https://files.pythonhosted.org/packages/9b/15/6436d3b1a5c5f180eb6cc30de96e24319b35219412b93c5386704b489263/codebase_to_text-1.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "f2cdb7ef3e1100d7f44c89c6a86d9ada97890edd6cf25bc11a1c6e8780b6da08",
"md5": "0f6837dae44f79db632707c015971a4b",
"sha256": "0e88bfdcab4378e35b1fbe2f0a9a8070abe16701fd8b1e9baff9f9a6446b286a"
},
"downloads": -1,
"filename": "codebase_to_text-1.2.tar.gz",
"has_sig": false,
"md5_digest": "0f6837dae44f79db632707c015971a4b",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 17997,
"upload_time": "2025-07-20T20:26:41",
"upload_time_iso_8601": "2025-07-20T20:26:41.383289Z",
"url": "https://files.pythonhosted.org/packages/f2/cd/b7ef3e1100d7f44c89c6a86d9ada97890edd6cf25bc11a1c6e8780b6da08/codebase_to_text-1.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-20 20:26:41",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "QaisarRajput",
"github_project": "codebase_to_text",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "python-docx",
"specs": [
[
">=",
"0.8.11"
]
]
},
{
"name": "gitpython",
"specs": []
}
],
"lcname": "codebase-to-text"
}