# dir2text
A Python library and command-line tool for expressing directory structures and file contents in formats suitable for Large Language Models (LLMs). It combines directory tree visualization with file contents in a memory-efficient, streaming format.
## Features
- Tree-style directory structure visualization
- Complete file contents with proper escaping
- Memory-efficient streaming processing
- Multiple output formats (XML, JSON)
- Easy extensibility for new formats
- Support for exclusion patterns (e.g., .gitignore rules)
- Optional token counting for LLM context management
- Safe handling of large files and directories
## Installation
This project uses [Poetry](https://python-poetry.org/) for dependency management. We recommend using Poetry for the best development experience, but we also provide traditional pip installation.
### Using Poetry (Recommended)
1. First, [install Poetry](https://python-poetry.org/docs/#installation) if you haven't already.
2. Install dir2text:
```bash
poetry add dir2text
```
### Using pip
```bash
pip install dir2text
```
### Optional Features
Install with token counting support (for LLM context management):
```bash
# With Poetry
poetry add "dir2text[token_counting]"
# With pip
pip install "dir2text[token_counting]"
```
**Note:** The `token_counting` feature requires the `tiktoken` package, which needs a Rust compiler (e.g., `rustc`) and Cargo to be available during installation.
## Usage
### Command Line Interface
Basic usage:
```bash
dir2text /path/to/project
# Exclude files matching .gitignore patterns
dir2text -e .gitignore /path/to/project
# Count tokens for LLM context management
dir2text -c /path/to/project
# Generate JSON output and save to file
dir2text --format json -o output.json /path/to/project
# Skip tree or content sections
dir2text -T /path/to/project # Skip tree visualization
dir2text -C /path/to/project # Skip file contents
```
### Python API
Basic usage:
```python
from dir2text import StreamingDir2Text
# Initialize the analyzer
analyzer = StreamingDir2Text("path/to/project")
# Stream the directory tree
for line in analyzer.stream_tree():
    print(line, end='')

# Stream file contents
for chunk in analyzer.stream_contents():
    print(chunk, end='')
# Get metrics
print(f"Processed {analyzer.file_count} files in {analyzer.directory_count} directories")
```
Memory-efficient processing with exclusions and token counting:
```python
from dir2text import StreamingDir2Text
# Initialize with options
analyzer = StreamingDir2Text(
    directory="path/to/project",
    exclude_file=".gitignore",
    output_format="json",
    tokenizer_model="gpt-4"
)

# Process content incrementally
with open("output.json", "w") as f:
    for line in analyzer.stream_tree():
        f.write(line)
    for chunk in analyzer.stream_contents():
        f.write(chunk)
# Print statistics
print(f"Files: {analyzer.file_count}")
print(f"Directories: {analyzer.directory_count}")
print(f"Lines: {analyzer.line_count}")
print(f"Tokens: {analyzer.token_count}")
print(f"Characters: {analyzer.character_count}")
```
Immediate processing (for smaller directories):
```python
from dir2text import Dir2Text
# Process everything immediately
analyzer = Dir2Text("path/to/project")
# Access complete content
print(analyzer.tree_string)
print(analyzer.content_string)
```
## Output Formats
### XML Format
```xml
<file path="relative/path/to/file.py" tokens="150">
def example():
    print("Hello, world!")
</file>
```
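The XML output is a sequence of `<file>` elements like the one above. A minimal sketch of consuming it with the standard library, assuming the elements are wrapped in a synthetic root so the stream parses as a single document (the snippet below is illustrative, not dir2text's own output):

```python
import xml.etree.ElementTree as ET

# A single <file> element in the shape shown above (illustrative).
snippet = '''<file path="relative/path/to/file.py" tokens="150">
def example():
    print("Hello, world!")
</file>'''

# Wrapping the element stream in a synthetic root makes it parseable
# with xml.etree.ElementTree.
root = ET.fromstring(f"<files>{snippet}</files>")
for node in root.findall("file"):
    # Attributes carry the path and token count; node.text holds the
    # file contents with XML escaping undone by the parser.
    print(node.get("path"), node.get("tokens"))
```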
### JSON Format
```json
{
  "path": "relative/path/to/file.py",
  "content": "def example():\n    print(\"Hello, world!\")",
  "tokens": 150
}
```
## Development
### Setup Development Environment
1. Clone the repository:
```bash
git clone https://github.com/rlichtenwalter/dir2text.git
cd dir2text
```
2. Install development dependencies:
```bash
poetry install --with dev
```
3. Install pre-commit hooks:
```bash
poetry run pre-commit install
```
### Running Tests
```bash
# Run specific quality control categories
poetry run tox -e format # Run formatters
poetry run tox -e lint # Run linters
poetry run tox -e test # Run tests
poetry run tox -e coverage # Run test coverage analysis
```
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create a new branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Run the test suite
5. Commit your changes (`git commit -m 'Add some amazing feature'`)
6. Push to the branch (`git push origin feature/amazing-feature`)
7. Open a Pull Request
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
- This project uses [anytree](https://github.com/c0fec0de/anytree) for tree data structures
- .gitignore pattern matching uses [pathspec](https://github.com/cpburnz/python-pathspec)
- Token counting functionality is provided by OpenAI's [tiktoken](https://github.com/openai/tiktoken)
## Requirements
- Python 3.9+
- Poetry (recommended) or pip
- Optional: Rust compiler and Cargo (for token counting feature)
## Project Status
This project is actively maintained. Issues and pull requests are welcome.
## FAQ
**Q: Why use streaming processing?**
A: Streaming allows large directories and files to be processed with constant memory usage, making the tool suitable for repositories of any size.
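The constant-memory property comes from generator-style iteration: output is produced in fixed-size pieces rather than built up as one string. A generic illustration of the idea (not dir2text's actual internals):

```python
# Generic sketch of streaming: a generator yields fixed-size chunks,
# so peak memory is bounded by the chunk size, not the file size.
def stream_chunks(path, chunk_size=64 * 1024):
    with open(path, "r", encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

# A consumer writes each chunk as it arrives instead of buffering:
#     for chunk in stream_chunks("big_file.txt"):
#         out.write(chunk)
```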
**Q: How does dir2text handle symbolic links?**
A: dir2text follows symbolic links to both files and directories. While this enables complete directory traversal, no symlink loop detection is currently implemented. Users should be cautious when processing directories with circular symlinks, as this could lead to infinite recursion. Future versions may add configuration options for symlink handling and loop prevention.
**Q: Can I use this with binary files?**
A: The tool is designed for text files. Binary files should be excluded using the exclusion rules feature.
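One common way to find binary files to exclude is the NUL-byte heuristic: text files essentially never contain a NUL byte in their first block. The helper below is hypothetical (not part of dir2text) and just sketches the idea:

```python
# Hypothetical helper, not part of dir2text: flag files whose first
# block contains a NUL byte, a strong signal of binary content.
def looks_binary(path, probe_size=8192):
    with open(path, "rb") as f:
        return b"\x00" in f.read(probe_size)
```

Files flagged this way can then be listed in an exclusion file alongside the usual .gitignore-style patterns.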
**Q: What models are supported for token counting?**
A: The token counting feature uses OpenAI's tiktoken library with the following primary models and encodings:
- cl100k_base encoding:
- GPT-4 models (gpt-4, gpt-4-32k)
- GPT-3.5-Turbo models (gpt-3.5-turbo)
- p50k_base encoding:
- Text Davinci models (text-davinci-003)
For other language models, using a similar model's tokenizer (like gpt-4) can provide useful approximations of token counts. While the counts may not exactly match your target model's tokenization, they can give a good general estimate. The default model is "gpt-4", which uses cl100k_base encoding and provides a good general-purpose tokenization.
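The model-to-encoding table above can be sketched as a simple lookup with a cl100k_base fallback. This mapping is illustrative only; tiktoken's own `tiktoken.encoding_for_model()` is the authoritative lookup:

```python
# Illustrative mapping mirroring the table above; consult
# tiktoken.encoding_for_model() for the authoritative answer.
MODEL_TO_ENCODING = {
    "gpt-4": "cl100k_base",
    "gpt-4-32k": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "text-davinci-003": "p50k_base",
}

def encoding_for(model, default="cl100k_base"):
    # Fall back to the general-purpose cl100k_base encoding for
    # models without a dedicated entry.
    return MODEL_TO_ENCODING.get(model, default)
```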
**Q: What happens if I specify a model that doesn't have a dedicated tokenizer?**
A: The library will suggest using a well-supported model like 'gpt-4' or 'text-davinci-003' for token counting. While token counts may not exactly match your target model, they can provide useful approximations for most modern language models.
## Contact
Ryan N. Lichtenwalter - rlichtenwalter@gmail.com
Project Link: [https://github.com/rlichtenwalter/dir2text](https://github.com/rlichtenwalter/dir2text)