mdtoken


Name: mdtoken
Version: 1.0.0
Summary: Pre-commit hook to enforce token count limits on markdown files
Upload time: 2025-11-03 03:25:39
Requires Python: >=3.8
License: MIT
Keywords: markdown, tokens, pre-commit, ai-tools, context-window, token-limit, git-hooks
# mdtoken - Markdown Token Limit Pre-commit Hook

[![Tests](https://github.com/applied-artificial-intelligence/mdtoken/actions/workflows/test.yml/badge.svg)](https://github.com/applied-artificial-intelligence/mdtoken/actions/workflows/test.yml)
[![codecov](https://codecov.io/gh/applied-artificial-intelligence/mdtoken/branch/main/graph/badge.svg)](https://codecov.io/gh/applied-artificial-intelligence/mdtoken)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)

A pre-commit hook that enforces token count limits on markdown files to prevent unbounded growth and context window bloat in AI-assisted development workflows.

## Problem

Markdown documentation files (like `CLAUDE.md`, memory files, project docs) grow without bound over time, consuming valuable AI context windows. When your documentation files exceed LLM context limits, they become fragmented, outdated, or, worse, silently truncated without your knowledge.

Manual monitoring is tedious and error-prone. You need automated guardrails that **fail fast** when documentation becomes too large.

## Solution

**mdtoken** provides automated token counting checks during your git workflow with clear, actionable feedback when limits are exceeded. Think of it as a linter for your AI context consumption.

## Features

- ✅ **Accurate token counting** using tiktoken library
- ✅ **Configurable tokenizers** - Support for GPT-4, GPT-4o, Claude, Codex, and more
- ✅ **Flexible configuration** with per-file limits and glob patterns
- ✅ **Fast execution** (< 1 second for typical projects)
- ✅ **Clear error messages** with actionable suggestions for remediation
- ✅ **Pre-commit integration** - Blocks commits that violate token limits
- ✅ **Directory-level limits** - Different limits for commands, skills, agents
- ✅ **Total token budgets** - Enforce aggregate limits across all files
- ✅ **Dry-run mode** - Preview violations without failing
- ✅ **Minimal dependencies** - Requires only Python 3.8+ plus a small set of runtime dependencies (notably tiktoken)

## Installation

### From PyPI

```bash
pip install mdtoken
```

### From Source (Development)

```bash
git clone https://github.com/applied-artificial-intelligence/mdtoken.git
cd mdtoken
pip install -e .
```

## Quick Start

### 1. Create Configuration File

Create `.mdtokenrc.yaml` in your project root:

```yaml
# .mdtokenrc.yaml
default_limit: 4000
model: "gpt-4"  # Use GPT-4 tokenizer

limits:
  "CLAUDE.md": 8000
  "README.md": 6000
  ".claude/commands/**": 2000
  ".claude/skills/**": 3000

exclude:
  - "node_modules/**"
  - "**/archive/**"
  - "venv/**"

total_limit: 50000
fail_on_exceed: true
```

### 2. Run Manually

```bash
# Check all markdown files
mdtoken check

# Check specific files
mdtoken check README.md CLAUDE.md

# Dry run (don't fail on violations)
mdtoken check --dry-run

# Verbose output with suggestions
mdtoken check --verbose
```

### 3. Integrate with Pre-commit

Add to `.pre-commit-config.yaml`:

```yaml
repos:
  - repo: https://github.com/applied-artificial-intelligence/mdtoken
    rev: v1.0.0  # Use latest release tag
    hooks:
      - id: markdown-token-limit
        args: ['--config=.mdtokenrc.yaml']
```

Then install the hook:

```bash
pre-commit install
```

Now every commit will check markdown files against your token limits!

## Configuration Guide

### Basic Options

```yaml
# Default token limit for all markdown files
default_limit: 4000

# Whether to fail (exit 1) when limits are exceeded
# Set to false for warning-only mode
fail_on_exceed: true

# Optional: Total token limit across all files
total_limit: 50000
```
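
For intuition, here is a minimal Python sketch of how these options interact; the `evaluate` helper and its data shapes are illustrative assumptions, not mdtoken's actual internals:

```python
# Hypothetical sketch of how these options interact; `evaluate` and its
# arguments are illustrative, not part of mdtoken's actual API.
import sys
from typing import Dict, Optional

def evaluate(counts: Dict[str, int], limits: Dict[str, int],
             default_limit: int, total_limit: Optional[int],
             fail_on_exceed: bool) -> int:
    """Return the exit code implied by the configuration above."""
    violations = []
    for path, tokens in counts.items():
        limit = limits.get(path, default_limit)
        if tokens > limit:
            violations.append(f"{path}: {tokens} tokens (limit: {limit})")
    total = sum(counts.values())
    if total_limit is not None and total > total_limit:
        violations.append(f"total: {total} tokens (limit: {total_limit})")
    for line in violations:
        print(line)
    # fail_on_exceed: false downgrades violations to warnings (exit 0)
    return 1 if violations and fail_on_exceed else 0

if __name__ == "__main__":
    counts = {"README.md": 6500, "CLAUDE.md": 3200}  # pretend token counts
    sys.exit(evaluate(counts, {"README.md": 6000}, 4000, 50000, True))
```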

### Tokenizer Configuration

Choose your tokenizer based on the LLM you're using:

```yaml
# Option 1: User-friendly model name (recommended)
model: "gpt-4"

# Option 2: Direct tiktoken encoding name
encoding: "cl100k_base"
```

**Supported Models:**
- `gpt-4` → cl100k_base (default; "100k" refers to the encoding's roughly 100K-token vocabulary, not a context length)
- `gpt-4o` → o200k_base (roughly 200K-token vocabulary)
- `gpt-3.5-turbo` → cl100k_base
- `claude-3` → cl100k_base (approximation; Claude uses its own tokenizer)
- `claude-3.5` → cl100k_base
- `codex` → p50k_base
- `text-davinci-003` → p50k_base
- `text-davinci-002` → p50k_base
- `gpt-3` → r50k_base

**Note:** If both `model` and `encoding` are specified, `encoding` takes precedence.
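
For intuition, here is a rough sketch of how a model name might resolve to a tiktoken encoding and how the precedence rule applies; the table below is a simplified assumption, not mdtoken's exact MODEL_ENCODING_MAP:

```python
# Sketch only: a simplified model -> encoding table and the precedence rule.
# The real MODEL_ENCODING_MAP inside mdtoken may contain more entries.
from typing import Optional

import tiktoken

MODEL_ENCODING_MAP = {
    "gpt-4": "cl100k_base",
    "gpt-4o": "o200k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "claude-3": "cl100k_base",
    "claude-3.5": "cl100k_base",
    "codex": "p50k_base",
}

def resolve_encoding(model: Optional[str], encoding: Optional[str]) -> "tiktoken.Encoding":
    if encoding:  # explicit `encoding:` wins over `model:`
        return tiktoken.get_encoding(encoding)
    if model:
        return tiktoken.get_encoding(MODEL_ENCODING_MAP[model])
    return tiktoken.get_encoding("cl100k_base")  # gpt-4 default

enc = resolve_encoding(model="gpt-4o", encoding=None)
print(len(enc.encode("# My Project\n\nSome markdown text.")))  # token count
```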

### Per-File and Pattern-Based Limits

```yaml
limits:
  # Exact file match
  "CLAUDE.md": 8000

  # Pattern matching (substring)
  "README.md": 6000  # Matches any path ending with README.md

  # Directory patterns
  ".claude/commands/**": 2000
  ".claude/skills/**": 3000
  ".claude/agents/**": 4000
  "docs/*.md": 5000
```

**Pattern Matching Rules:**
- Exact path match has highest priority
- Substring match (e.g., "README.md" matches "docs/README.md")
- Falls back to `default_limit` if no pattern matches
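
A small sketch of that resolution order, using a hypothetical `resolve_limit` helper (mdtoken's real matcher also handles glob patterns):

```python
# Hypothetical, simplified resolution: exact match, then substring, then default.
# mdtoken's real matcher also understands glob patterns such as "docs/*.md".
from typing import Dict

def resolve_limit(path: str, limits: Dict[str, int], default_limit: int) -> int:
    if path in limits:                       # 1. exact path match wins
        return limits[path]
    for pattern, limit in limits.items():    # 2. substring match
        if pattern in path:
            return limit
    return default_limit                     # 3. fall back to default_limit

limits = {"CLAUDE.md": 8000, "README.md": 6000}
print(resolve_limit("docs/README.md", limits, 4000))  # 6000 (substring match)
print(resolve_limit("docs/guide.md", limits, 4000))   # 4000 (default)
```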

### Exclusion Patterns

```yaml
exclude:
  # Default exclusions (automatically included)
  - ".git/**"
  - "node_modules/**"
  - "venv/**"
  - ".venv/**"
  - "build/**"
  - "dist/**"
  - "__pycache__/**"

  # Custom exclusions
  - "**/archive/**"
  - "**/old_docs/**"
  - "README.md"  # Exclude specific files entirely
```

## Usage Examples

### Example 1: Claude Code Project

For a Claude Code project with commands, skills, and agents:

```yaml
# .mdtokenrc.yaml
model: "claude-3.5"
default_limit: 4000

limits:
  ".claude/CLAUDE.md": 10000  # Main instruction file
  ".claude/commands/**": 2000  # Keep commands concise
  ".claude/skills/**": 3000
  ".claude/agents/**": 5000
  "README.md": 8000

exclude:
  - ".claude/memory/archived/**"

total_limit: 100000  # Aggregate limit
```

### Example 2: Documentation-Heavy Project

```yaml
# .mdtokenrc.yaml
model: "gpt-4"
default_limit: 5000

limits:
  "README.md": 8000
  "docs/api.md": 10000
  "docs/tutorials/**": 6000
  "docs/reference/**": 8000

exclude:
  - "docs/archive/**"
  - "docs/drafts/**"

fail_on_exceed: true
```

### Example 3: Multi-Model Support

If you target different models, adjust the tokenizer and limits per configuration; for example, for GPT-4o with its larger context window:

```yaml
# For GPT-4o with larger context window
model: "gpt-4o"
default_limit: 8000  # Can use larger limits

limits:
  "CLAUDE.md": 15000
  "docs/**": 10000
```

## Output Examples

### Passing Check

```
✓ All files within token limits
3 files checked, 8,245 tokens total
```

### Failing Check

```
✗ Token limit violations found:

docs/README.md: 5,234 tokens (limit: 4,000, over by 1,234)
  Suggestions:
  - Consider splitting this file into multiple smaller files
  - Target: reduce by ~1,234 tokens to get under the limit
  - Move detailed documentation to separate docs/ files

.claude/CLAUDE.md: 9,876 tokens (limit: 8,000, over by 1,876)
  Suggestions:
  - Review content and remove unnecessary sections
  - Consider moving older content to an archived directory

2/3 files over limit, 15,110 tokens total
```

## Programmatic Usage

For programmatic usage in Python scripts, see the [API Documentation](docs/api.md) for detailed information on:

- **TokenCounter** - Count tokens in text and files
- **Config** - Load and manage configuration
- **LimitEnforcer** - Check files against token limits
- **FileMatcher** - Discover and filter markdown files
- **Reporter** - Format and display results

Quick example:

```python
from mdtoken.config import Config
from mdtoken.enforcer import LimitEnforcer
from mdtoken.reporter import Reporter

config = Config.from_file()
enforcer = LimitEnforcer(config)
result = enforcer.check_files()

reporter = Reporter(enforcer)
reporter.report(result, verbose=True)
```

See [docs/api.md](docs/api.md) for complete API reference with examples.

## Documentation

- **[API Reference](docs/api.md)** - Complete API documentation for programmatic usage
- **[Usage Examples](docs/examples.md)** - Project-specific configurations and workflows
- **[Troubleshooting](docs/troubleshooting.md)** - Common issues and solutions
- **[FAQ](docs/faq.md)** - Frequently asked questions

## Troubleshooting

### "Config file not found" Warning

**Problem:** mdtoken looks for `.mdtokenrc.yaml` in the current directory and warns when it cannot find one.

**Solution:** Either:
1. Create `.mdtokenrc.yaml` in your project root
2. Specify config path: `mdtoken check --config path/to/config.yaml`
3. Use defaults (no config file needed)

### "Unknown model" Error

**Problem:** Model name not recognized in MODEL_ENCODING_MAP.

**Solution:** Either:
- Use a supported model name (see "Supported Models" above)
- Use direct encoding: `encoding: "cl100k_base"`
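
To see which encoding names your installed tiktoken version accepts, you can list them directly:

```python
import tiktoken

# Every name printed here is valid for the `encoding:` option,
# e.g. cl100k_base, o200k_base, p50k_base, r50k_base.
print(tiktoken.list_encoding_names())
```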

### Pre-commit Hook Not Running

**Problem:** Hook doesn't execute on commit.

**Solution:**
1. Verify `.pre-commit-config.yaml` exists
2. Run `pre-commit install`
3. Test with `pre-commit run --all-files`

### Performance Issues

**Problem:** Slow execution on large repositories.

**Current Status:** mdtoken is highly optimized:
- The full test suite runs in under 2 seconds
- Token counting takes about 6.15 ms for an 8K-token file, well under the project's performance target

If you're experiencing slowness:
1. Use exclusion patterns to skip unnecessary directories
2. Limit to specific file patterns
3. Consider running only on changed files (pre-commit does this automatically)
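
If you want to reproduce the changed-files behavior outside of pre-commit, here is a rough sketch (assuming `mdtoken` is on your PATH; the file-selection logic is illustrative):

```python
# Sketch: check only the markdown files staged for commit, mirroring what
# the pre-commit framework does automatically.
import subprocess
import sys

staged = subprocess.run(
    ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

md_files = [f for f in staged if f.endswith(".md")]
if md_files:
    # `mdtoken check <files>` accepts an explicit file list (see Quick Start)
    sys.exit(subprocess.run(["mdtoken", "check", *md_files]).returncode)
```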

## Development

### Setup Development Environment

```bash
# Clone repository
git clone https://github.com/applied-artificial-intelligence/mdtoken.git
cd mdtoken

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run linting
make lint

# Run type checking
make typecheck

# Format code
make format
```

### Running Tests

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=src/mdtoken --cov-report=term

# Run specific test file
pytest tests/test_config.py

# Run verbose
pytest -xvs
```

### Project Structure

```
mdtoken/
├── src/mdtoken/
│   ├── __init__.py
│   ├── __version__.py
│   ├── cli.py              # Command-line interface
│   ├── config.py           # Configuration loading/validation
│   ├── counter.py          # Token counting with tiktoken
│   ├── enforcer.py         # Limit enforcement logic
│   ├── matcher.py          # File pattern matching
│   ├── reporter.py         # Output formatting
│   └── commands/
│       └── check.py        # Check command implementation
├── tests/
│   ├── test_config.py      # Config tests (64 tests)
│   ├── test_counter.py     # Token counting tests
│   ├── test_enforcer.py    # Enforcement logic tests (72 tests)
│   ├── test_matcher.py     # File matching tests
│   ├── test_edge_cases.py  # Edge case handling (27 tests)
│   └── integration/
│       └── test_git_workflow.py  # End-to-end tests (12 tests)
├── .pre-commit-hooks.yaml  # Pre-commit hook definition
├── pyproject.toml          # Package configuration
└── README.md
```

### Test Coverage

**Current Coverage: 84%** (176 tests passing)

| Module | Coverage | Notes |
|--------|----------|-------|
| config.py | 97% | Model/encoding resolution fully tested |
| counter.py | 90% | Token counting core |
| enforcer.py | 96% | Limit enforcement logic |
| matcher.py | 97% | File pattern matching |
| reporter.py | 90% | Output formatting |
| cli.py | 0% | Integration tests cover CLI |
| commands/check.py | 0% | Integration tests cover commands |

## Roadmap

### v1.0.0 (Current)
- ✅ Core token counting functionality
- ✅ Configurable tokenizers (encoding/model)
- ✅ YAML configuration support
- ✅ Pre-commit hook integration
- ✅ Per-file and pattern-based limits
- ✅ Total token budgets
- ✅ Clear error messages with suggestions
- ✅ Comprehensive test suite (176 tests, 84% coverage)
- ✅ CI/CD with GitHub Actions
- ✅ PyPI distribution
- 🚧 Comprehensive documentation (in progress)

### v1.1+ (Future Enhancements)
- Auto-fix/splitting suggestions with AI
- Token count caching for performance
- Parallel processing for large repos
- GitHub Action for PR checks
- IDE integrations (VS Code extension)
- Watch mode for live feedback
- HTML/JSON output formats
- Integration with documentation generators

## FAQ

**Q: Why another markdown linter?**
A: mdtoken is specifically designed for AI-assisted workflows where token counts matter. Traditional linters check syntax/style; mdtoken checks token consumption.

**Q: Does it work with Claude, GPT-4o, or other models?**
A: Yes! Use the `model` config parameter to select the appropriate tokenizer. Tokenization differs between vendors (Claude, for example, uses its own tokenizer), so treat counts for non-OpenAI models as approximations.

**Q: Can I use this without pre-commit?**
A: Absolutely! Run `mdtoken check` manually as part of your CI/CD pipeline, as a git hook, or integrate it into your build process.

**Q: What if I want different limits for different branches?**
A: Create multiple config files (e.g., `.mdtokenrc.production.yaml`, `.mdtokenrc.dev.yaml`) and use `--config` flag to specify which one to use.

**Q: How accurate is the token counting?**
A: mdtoken uses tiktoken, the tokenizer library OpenAI publishes for its models, so counts for OpenAI models match the real tokenization. For other vendors' models, treat the counts as approximations.

**Q: Does it support languages other than markdown?**
A: Currently markdown-only. Future versions may support additional file types.

## Contributing

Contributions are welcome! This project follows standard open-source practices:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes with tests
4. Run the test suite (`pytest`)
5. Ensure code quality (`make lint`)
6. Commit your changes (`git commit -m 'Add amazing feature'`)
7. Push to the branch (`git push origin feature/amazing-feature`)
8. Open a Pull Request

See [CONTRIBUTING.md](CONTRIBUTING.md) (coming soon) for detailed guidelines.

## License

MIT License - See [LICENSE](LICENSE) file for details.

## Author

Stefan Rummer ([@stefanrmmr](https://github.com/stefanrmmr))

## Acknowledgments

- Built with [Claude Code](https://claude.com/claude-code) as part of an AI-assisted development workflow
- Uses OpenAI's [tiktoken](https://github.com/openai/tiktoken) library for accurate token counting
- Inspired by real-world challenges managing context windows in LLM-assisted development
- Thanks to the pre-commit framework team for excellent hook infrastructure

---

**Status**: v1.0.0 released. Star ⭐ and watch 👀 for updates!

            
