# mdtoken - Markdown Token Limit Pre-commit Hook
[Tests](https://github.com/applied-artificial-intelligence/mdtoken/actions/workflows/test.yml) | [Coverage](https://codecov.io/gh/applied-artificial-intelligence/mdtoken) | [License: MIT](https://opensource.org/licenses/MIT) | [Python 3.8+](https://www.python.org/downloads/)
A pre-commit hook that enforces token count limits on markdown files to prevent unbounded growth and context window bloat in AI-assisted development workflows.
## Problem
Markdown documentation files (like `CLAUDE.md`, memory files, project docs) grow unbounded over time, consuming valuable AI context windows. When your documentation files exceed LLM context limits, they become fragmented, outdated, or, worse, silently truncated without your knowledge.
Manual monitoring is tedious and error-prone. You need automated guardrails that **fail fast** when documentation becomes too large.
## Solution
**mdtoken** provides automated token counting checks during your git workflow with clear, actionable feedback when limits are exceeded. Think of it as a linter for your AI context consumption.
## Features
- ✅ **Accurate token counting** using the tiktoken library
- ✅ **Configurable tokenizers** - Support for GPT-4, GPT-4o, Claude, Codex, and more
- ✅ **Flexible configuration** with per-file limits and glob patterns
- ✅ **Fast execution** (< 1 second for typical projects, 158 tests in < 2s)
- ✅ **Clear error messages** with actionable suggestions for remediation
- ✅ **Pre-commit integration** - Blocks commits that violate token limits
- ✅ **Directory-level limits** - Different limits for commands, skills, agents
- ✅ **Total token budgets** - Enforce aggregate limits across all files
- ✅ **Dry-run mode** - Preview violations without failing
- ✅ **Minimal dependencies** - Requires only Python 3.8+ and tiktoken
## Installation
### From PyPI (Coming Soon)
```bash
pip install mdtoken
```
### From Source (Development)
```bash
git clone https://github.com/applied-artificial-intelligence/mdtoken.git
cd mdtoken
pip install -e .
```
## Quick Start
### 1. Create Configuration File
Create `.mdtokenrc.yaml` in your project root:
```yaml
# .mdtokenrc.yaml
default_limit: 4000
model: "gpt-4" # Use GPT-4 tokenizer
limits:
"CLAUDE.md": 8000
"README.md": 6000
".claude/commands/**": 2000
".claude/skills/**": 3000
exclude:
- "node_modules/**"
- "**/archive/**"
- "venv/**"
total_limit: 50000
fail_on_exceed: true
```
### 2. Run Manually
```bash
# Check all markdown files
mdtoken check

# Check specific files
mdtoken check README.md CLAUDE.md

# Dry run (don't fail on violations)
mdtoken check --dry-run

# Verbose output with suggestions
mdtoken check --verbose
```
### 3. Integrate with Pre-commit
Add to `.pre-commit-config.yaml`:
```yaml
repos:
  - repo: https://github.com/applied-artificial-intelligence/mdtoken
    rev: v1.0.0  # Use latest release tag
    hooks:
      - id: markdown-token-limit
        args: ['--config=.mdtokenrc.yaml']
```
Then install the hook:
```bash
pre-commit install
```
Now every commit will check markdown files against your token limits!
## Configuration Guide
### Basic Options
```yaml
# Default token limit for all markdown files
default_limit: 4000

# Whether to fail (exit 1) when limits are exceeded
# Set to false for warning-only mode
fail_on_exceed: true

# Optional: Total token limit across all files
total_limit: 50000
```
### Tokenizer Configuration
Choose your tokenizer based on the LLM you're using:
```yaml
# Option 1: User-friendly model name (recommended)
model: "gpt-4"

# Option 2: Direct tiktoken encoding name
encoding: "cl100k_base"
```
**Supported Models:**
- `gpt-4` → cl100k_base (default)
- `gpt-4o` → o200k_base
- `gpt-3.5-turbo` → cl100k_base
- `claude-3` → cl100k_base (Claude uses similar tokenization)
- `claude-3.5` → cl100k_base
- `codex` → p50k_base
- `text-davinci-003` → p50k_base
- `text-davinci-002` → p50k_base
- `gpt-3` → r50k_base
**Note:** If both `model` and `encoding` are specified, `encoding` takes precedence.
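To sanity-check a limit before committing, you can reproduce a count directly with tiktoken, the library mdtoken uses for counting. This is a minimal sketch, not mdtoken's internal API; the file name is just a placeholder:

```python
# Minimal sketch using tiktoken directly (mdtoken's internals may wrap this differently).
import tiktoken

with open("CLAUDE.md", encoding="utf-8") as f:
    text = f.read()

# Resolve an encoding either from a model name or from an encoding name.
enc = tiktoken.encoding_for_model("gpt-4")    # resolves to cl100k_base
# enc = tiktoken.get_encoding("cl100k_base")  # equivalent, by encoding name

print(len(enc.encode(text)))  # number of tokens in the file
```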
### Per-File and Pattern-Based Limits
```yaml
limits:
  # Exact file match
  "CLAUDE.md": 8000

  # Pattern matching (substring)
  "README.md": 6000  # Matches any path ending with README.md

  # Directory patterns
  ".claude/commands/**": 2000
  ".claude/skills/**": 3000
  ".claude/agents/**": 4000
  "docs/*.md": 5000
```
**Pattern Matching Rules:**
- Exact path match has highest priority
- Substring match (e.g., "README.md" matches "docs/README.md")
- Falls back to `default_limit` if no pattern matches (see the sketch below)
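The following is a hypothetical sketch of that resolution order; `resolve_limit` is an illustrative helper, not mdtoken's actual matcher (which also handles glob patterns such as `.claude/commands/**`):

```python
# Hypothetical helper illustrating the documented priority:
# 1) exact path match, 2) substring match, 3) default_limit.
def resolve_limit(path, limits, default_limit):
    if path in limits:                     # exact path match wins
        return limits[path]
    for pattern, limit in limits.items():  # first substring match
        if pattern in path:
            return limit
    return default_limit                   # nothing matched

limits = {"CLAUDE.md": 8000, "README.md": 6000}
print(resolve_limit("docs/README.md", limits, 4000))  # 6000 via substring match
```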
### Exclusion Patterns
```yaml
exclude:
  # Default exclusions (automatically included)
  - ".git/**"
  - "node_modules/**"
  - "venv/**"
  - ".venv/**"
  - "build/**"
  - "dist/**"
  - "__pycache__/**"

  # Custom exclusions
  - "**/archive/**"
  - "**/old_docs/**"
  - "README.md"  # Exclude specific files entirely
```
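As a rough illustration of how such patterns behave, the sketch below approximates exclusion matching with Python's `fnmatch`; it is not mdtoken's implementation (and `fnmatch` treats `**` loosely), but the intent is the same: any file whose path matches an exclude pattern is skipped.

```python
# Approximate exclusion check using fnmatch (illustrative only; not mdtoken's
# implementation, and fnmatch's "*" already crosses "/" boundaries).
from fnmatch import fnmatch

EXCLUDE = ["node_modules/**", "**/archive/**", "venv/**"]

def is_excluded(path, patterns=EXCLUDE):
    return any(fnmatch(path, pattern) for pattern in patterns)

print(is_excluded("docs/archive/old.md"))  # True  (matches "**/archive/**")
print(is_excluded("docs/guide.md"))        # False (no pattern matches)
```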
## Usage Examples
### Example 1: Claude Code Project
For a Claude Code project with commands, skills, and agents:
```yaml
# .mdtokenrc.yaml
model: "claude-3.5"
default_limit: 4000
limits:
".claude/CLAUDE.md": 10000 # Main instruction file
".claude/commands/**": 2000 # Keep commands concise
".claude/skills/**": 3000
".claude/agents/**": 5000
"README.md": 8000
exclude:
- ".claude/memory/archived/**"
total_limit: 100000 # Aggregate limit
```
### Example 2: Documentation-Heavy Project
```yaml
# .mdtokenrc.yaml
model: "gpt-4"
default_limit: 5000
limits:
"README.md": 8000
"docs/api.md": 10000
"docs/tutorials/**": 6000
"docs/reference/**": 8000
exclude:
- "docs/archive/**"
- "docs/drafts/**"
fail_on_exceed: true
```
### Example 3: Multi-Model Support
If you're optimizing for different models:
```yaml
# For GPT-4o with larger context window
model: "gpt-4o"
default_limit: 8000  # Can use larger limits

limits:
  "CLAUDE.md": 15000
  "docs/**": 10000
```
## Output Examples
### Passing Check
```
✓ All files within token limits
3 files checked, 8,245 tokens total
```
### Failing Check
```
✗ Token limit violations found:

docs/README.md: 5,234 tokens (limit: 4,000, over by 1,234)
  Suggestions:
  - Consider splitting this file into multiple smaller files
  - Target: reduce by ~1,234 tokens to get under the limit
  - Move detailed documentation to separate docs/ files

.claude/CLAUDE.md: 9,876 tokens (limit: 8,000, over by 1,876)
  Suggestions:
  - Review content and remove unnecessary sections
  - Consider moving older content to an archived directory

2/3 files over limit, 15,110 tokens total
```
## Programmatic Usage
For programmatic usage in Python scripts, see the [API Documentation](docs/api.md) for detailed information on:
- **TokenCounter** - Count tokens in text and files
- **Config** - Load and manage configuration
- **LimitEnforcer** - Check files against token limits
- **FileMatcher** - Discover and filter markdown files
- **Reporter** - Format and display results
Quick example:
```python
from mdtoken.config import Config
from mdtoken.enforcer import LimitEnforcer
from mdtoken.reporter import Reporter
config = Config.from_file()            # load .mdtokenrc.yaml (or fall back to defaults)
enforcer = LimitEnforcer(config)
result = enforcer.check_files()        # count tokens and apply the configured limits

reporter = Reporter(enforcer)
reporter.report(result, verbose=True)  # print results, with suggestions in verbose mode
```
See [docs/api.md](docs/api.md) for complete API reference with examples.
## Documentation
- **[API Reference](docs/api.md)** - Complete API documentation for programmatic usage
- **[Usage Examples](docs/examples.md)** - Project-specific configurations and workflows
- **[Troubleshooting](docs/troubleshooting.md)** - Common issues and solutions
- **[FAQ](docs/faq.md)** - Frequently asked questions
## Troubleshooting
### "Config file not found" Warning
**Problem:** mdtoken warns that no config file was found. By default it looks for `.mdtokenrc.yaml` in the current directory.
**Solution:** Do one of the following:
1. Create `.mdtokenrc.yaml` in your project root
2. Specify config path: `mdtoken check --config path/to/config.yaml`
3. Use defaults (no config file needed)
### "Unknown model" Error
**Problem:** Model name not recognized in MODEL_ENCODING_MAP.
**Solution:** Either:
- Use a supported model name (see "Supported Models" above)
- Use direct encoding: `encoding: "cl100k_base"`
### Pre-commit Hook Not Running
**Problem:** Hook doesn't execute on commit.
**Solution:**
1. Verify `.pre-commit-config.yaml` exists
2. Run `pre-commit install`
3. Test with `pre-commit run --all-files`
### Performance Issues
**Problem:** Slow execution on large repositories.
**Current Status:** mdtoken is highly optimized:
- 158 tests run in < 2 seconds
- Token counting: 6.15ms for 8K tokens (16x faster than requirement)
If you're experiencing slowness:
1. Use exclusion patterns to skip unnecessary directories
2. Limit to specific file patterns
3. Consider running only on changed files (pre-commit does this automatically)
## Development
### Setup Development Environment
```bash
# Clone repository
git clone https://github.com/applied-artificial-intelligence/mdtoken.git
cd mdtoken

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run linting
make lint

# Run type checking
make typecheck

# Format code
make format
```
### Running Tests
```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=src/mdtoken --cov-report=term

# Run specific test file
pytest tests/test_config.py

# Run verbose
pytest -xvs
```
### Project Structure
```
mdtoken/
├── src/mdtoken/
│   ├── __init__.py
│   ├── __version__.py
│   ├── cli.py                   # Command-line interface
│   ├── config.py                # Configuration loading/validation
│   ├── counter.py               # Token counting with tiktoken
│   ├── enforcer.py              # Limit enforcement logic
│   ├── matcher.py               # File pattern matching
│   ├── reporter.py              # Output formatting
│   └── commands/
│       └── check.py             # Check command implementation
├── tests/
│   ├── test_config.py           # Config tests (64 tests)
│   ├── test_counter.py          # Token counting tests
│   ├── test_enforcer.py         # Enforcement logic tests (72 tests)
│   ├── test_matcher.py          # File matching tests
│   ├── test_edge_cases.py       # Edge case handling (27 tests)
│   └── integration/
│       └── test_git_workflow.py # End-to-end tests (12 tests)
├── .pre-commit-hooks.yaml       # Pre-commit hook definition
├── pyproject.toml               # Package configuration
└── README.md
```
### Test Coverage
**Current Coverage: 84%** (176 tests passing)
| Module | Coverage | Notes |
|--------|----------|-------|
| config.py | 97% | Model/encoding resolution fully tested |
| counter.py | 90% | Token counting core |
| enforcer.py | 96% | Limit enforcement logic |
| matcher.py | 97% | File pattern matching |
| reporter.py | 90% | Output formatting |
| cli.py | 0% | Integration tests cover CLI |
| commands/check.py | 0% | Integration tests cover commands |
## Roadmap
### v1.0.0 (Current - Ready for Release)
- ✅ Core token counting functionality
- ✅ Configurable tokenizers (encoding/model)
- ✅ YAML configuration support
- ✅ Pre-commit hook integration
- ✅ Per-file and pattern-based limits
- ✅ Total token budgets
- ✅ Clear error messages with suggestions
- ✅ Comprehensive test suite (176 tests, 84% coverage)
- ✅ CI/CD with GitHub Actions
- 🚧 PyPI distribution (pending)
- 🚧 Comprehensive documentation (in progress)
### v1.1+ (Future Enhancements)
- Auto-fix/splitting suggestions with AI
- Token count caching for performance
- Parallel processing for large repos
- GitHub Action for PR checks
- IDE integrations (VS Code extension)
- Watch mode for live feedback
- HTML/JSON output formats
- Integration with documentation generators
## FAQ
**Q: Why another markdown linter?**
A: mdtoken is specifically designed for AI-assisted workflows where token counts matter. Traditional linters check syntax/style; mdtoken checks token consumption.
**Q: Does it work with Claude, GPT-4o, or other models?**
A: Yes! Use the `model` config parameter to select the appropriate tokenizer. While exact tokenization may vary slightly between models, tiktoken provides excellent approximations.
**Q: Can I use this without pre-commit?**
A: Absolutely! Run `mdtoken check` manually as part of your CI/CD pipeline, as a git hook, or integrate it into your build process.
**Q: What if I want different limits for different branches?**
A: Create multiple config files (e.g., `.mdtokenrc.production.yaml`, `.mdtokenrc.dev.yaml`) and use `--config` flag to specify which one to use.
**Q: How accurate is the token counting?**
A: mdtoken uses tiktoken, the same library used by OpenAI's models. Accuracy is within 1-2% of actual model tokenization.
**Q: Does it support languages other than markdown?**
A: Currently markdown-only. Future versions may support additional file types.
## Contributing
Contributions are welcome! This project follows standard open-source practices:
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes with tests
4. Run the test suite (`pytest`)
5. Ensure code quality (`make lint`)
6. Commit your changes (`git commit -m 'Add amazing feature'`)
7. Push to the branch (`git push origin feature/amazing-feature`)
8. Open a Pull Request
See [CONTRIBUTING.md](CONTRIBUTING.md) (coming soon) for detailed guidelines.
## License
MIT License - See [LICENSE](LICENSE) file for details.
## Author
Stefan Rummer ([@stefanrmmr](https://github.com/stefanrmmr))
## Acknowledgments
- Built with [Claude Code](https://claude.com/claude-code) as part of an AI-assisted development workflow
- Uses OpenAI's [tiktoken](https://github.com/openai/tiktoken) library for accurate token counting
- Inspired by real-world challenges managing context windows in LLM-assisted development
- Thanks to the pre-commit framework team for excellent hook infrastructure
---
**Status**: Ready for v1.0.0 release. Star ⭐ and watch 👀 for updates!