# Code Duplicates CLI — Version 3
A multi-language code duplicate detector built on **semantic embeddings**, **Tree-Sitter parsing**, and **parallel processing**,
with **minimum size filtering**, **configurable exclusions**, **syntax-highlighted HTML reports**, and **JSON output**.
## ✨ Features
- **Multi-language support**: `py, js, ts, tsx, jsx, go, rs (rust), c, cpp, java` (extensible)
- **AST extraction** with `tree-sitter-languages` (functions, methods, classes)
- **Semantic embeddings** with `SentenceTransformers` (default: `sentence-transformers/all-mpnet-base-v2`)
- **Fast similarity search** with `faiss` (Inner Product + L2 normalization)
- **Parallel processing** for snippet extraction and embedding generation
- **Smart caching** of embeddings by file content hash
- **Configurable exclusions** for folders and files (node_modules, .git, etc.)
- **Minimum block size filtering** (by lines)
- **Rich HTML reports** with **Pygments syntax highlighting** + **JSON export**
- **Configurable similarity threshold** and top-k results per snippet
- **Verbose mode** and **dry-run** capabilities
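
The "Inner Product + L2 normalization" combination listed above is a standard way to get cosine similarity: once every vector is scaled to unit length, the inner product of two vectors equals the cosine of the angle between them. The sketch below shows that identity in plain Python. It does not use `faiss`, and the function names are illustrative assumptions rather than the tool's own code.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity computed as the inner product of normalized vectors."""
    return inner_product(l2_normalize(a), l2_normalize(b))


if __name__ == "__main__":
    # Vectors pointing in the same direction score close to 1.0,
    # orthogonal vectors score 0.0.
    print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Because the scores are cosine values, a `--threshold` of 0.85 can be read as "at least 0.85 cosine similarity between the two embeddings".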
## 🚀 Installation
```bash
python -m venv .venv
source .venv/bin/activate # (Windows: .venv\Scripts\activate)
pip install -r requirements.txt
```
> **Note**: `tree-sitter-languages` includes pre-compiled grammars; no manual compilation required.
## 📖 Usage
### Basic Usage
```bash
# Simple scan with defaults
python cli.py ./path/to/project
# Scan current directory
python cli.py .
```
### Advanced Usage
```bash
python cli.py ./path/to/project \
--languages py,ts,tsx,js,go \
--threshold 0.85 \
--min-lines 5 \
--top-k 10 \
--report-html reports/duplicates.html \
--report-json reports/duplicates.json \
--model sentence-transformers/all-mpnet-base-v2 \
--cache-dir .cache_embeddings \
--exclusions-file .codeduplicates-ignore \
--verbose
```
### Command Line Options
```bash
Options:
  --languages TEXT        Comma-separated list of file extensions (default: py,js,ts,tsx,jsx,go,rs,c,cpp,java)
  --threshold FLOAT       Similarity threshold (0.0-1.0, default: 0.85)
  --min-lines INTEGER     Minimum lines per code block (default: 3)
  --top-k INTEGER         Max similar snippets per block (default: 5)
  --report-html PATH      HTML report output path
  --report-json PATH      JSON report output path
  --model TEXT            Embedding model name (default: sentence-transformers/all-mpnet-base-v2)
  --cache-dir PATH        Directory for embedding cache
  --exclusions-file PATH  Custom exclusions file (default: .codeduplicates-ignore)
  --verbose               Enable verbose output
  --dry-run               Show what would be processed without running
  --help                  Show this message and exit
```
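
To make the interaction between `--threshold` and `--top-k` concrete, here is a minimal sketch of how the two limits could combine for a single snippet. The function name `select_matches`, the dictionary input, and the inclusive `>=` comparison are assumptions for illustration; the tool's actual selection code may differ.

```python
from typing import Dict, List, Tuple


def select_matches(
    scores: Dict[str, float],
    threshold: float = 0.85,
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Keep candidates scoring at or above `threshold`, best first, at most `top_k`."""
    # ASSUMPTION: the comparison is inclusive (>=); the real tool may use >.
    kept = [(name, score) for name, score in scores.items() if score >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_k]


if __name__ == "__main__":
    # Hypothetical similarity scores of one snippet against four others.
    candidates = {"a.py:10": 0.95, "b.py:42": 0.80, "c.py:7": 0.90, "d.py:3": 0.99}
    print(select_matches(candidates, threshold=0.85, top_k=2))
```

Filtering by threshold and then truncating gives the same result as taking the `top_k` nearest neighbours first and then dropping those below the threshold, so the order of the two steps does not matter.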
## 🚫 Exclusions
The CLI includes default exclusions for common folders like `node_modules`, `.git`, `__pycache__`, etc.
Create a custom exclusions file:
```bash
# .codeduplicates-ignore
node_modules
.git
dist
build
*.min.js
my-generated-folder
test-data
venv
.venv
__pycache__
.pytest_cache
coverage
.coverage
```
The exclusions file supports glob patterns (`*`, `?`, `[]`) and `#` comments.
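
As a rough illustration of how such a file could be interpreted, the sketch below parses the patterns and tests each path component against them with the standard-library `fnmatch` module. The helper names and the component-by-component matching rule are assumptions, not the tool's documented behaviour, and only whole-line `#` comments are handled.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Iterable, List


def parse_patterns(text: str) -> List[str]:
    """Return the non-empty, non-comment lines of an exclusions file."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and whole-line comments
        patterns.append(line)
    return patterns


def is_excluded(path: str, patterns: Iterable[str]) -> bool:
    """True if any component of `path` matches any glob pattern."""
    # ASSUMPTION: patterns are matched against each path component
    # (directory or file name), not against the full path string.
    parts = PurePosixPath(path).parts
    return any(fnmatchcase(part, pattern) for part in parts for pattern in patterns)


if __name__ == "__main__":
    rules = parse_patterns("# generated code\nnode_modules\n*.min.js\n")
    for candidate in ("web/node_modules/lib/index.js", "static/app.min.js", "src/app.js"):
        print(candidate, is_excluded(candidate, rules))
```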
## 📊 Examples
### Language-Specific Scans
```bash
# Scan TypeScript/JavaScript with high threshold
python cli.py . --languages ts,tsx,js --threshold 0.92 --report-html reports/js-duplicates.html
# Python and Go only, minimum 5 lines, top-3 results
python cli.py src --languages py,go --min-lines 5 --top-k 3 --report-json reports/py-go-duplicates.json
# Rust projects with custom exclusions
python cli.py . --languages rs --exclusions-file rust-exclusions.txt --verbose
```
### Performance Optimization
```bash
# Large codebase with caching
python cli.py . --cache-dir .embeddings-cache --min-lines 10 --threshold 0.80
# Quick scan with dry-run
python cli.py . --dry-run --verbose
```
## 🏗️ Project Structure
```
code-duplicates-cli-v3/
├── cli.py                       # Main CLI entry point
├── core/
│   ├── config.py                # Configuration constants
│   ├── utils.py                 # Utility functions
│   ├── embeddings.py            # Embedding generation engine
│   ├── similarity.py            # Similarity calculation engine
│   ├── report.py                # HTML/JSON report generation
│   └── extractors/
│       ├── base_extractor.py    # Base extractor interface
│       ├── parser_manager.py    # Tree-sitter parser management
│       ├── code_extractor.py    # AST-based code extraction
│       ├── regex_extractor.py   # Regex-based extraction
│       └── simple_extractor.py  # Simple text extraction
├── ARCHITECTURE.md              # Detailed architecture documentation
├── MODEL_ANALYSIS.md            # Embedding model analysis
└── requirements.txt             # Python dependencies
```
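
To give a feel for what AST-based extraction with a minimum block size means, here is a small stand-in that uses Python's built-in `ast` module instead of Tree-sitter, so it only handles Python source. The `Snippet` type and `extract_snippets` function are hypothetical names and are not taken from `code_extractor.py`.

```python
import ast
from typing import List, NamedTuple


class Snippet(NamedTuple):
    name: str
    start_line: int
    end_line: int


def extract_snippets(source: str, min_lines: int = 3) -> List[Snippet]:
    """Collect functions, methods, and classes spanning at least `min_lines` lines."""
    wanted = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    snippets = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, wanted):
            length = node.end_lineno - node.lineno + 1  # inclusive line count
            if length >= min_lines:
                snippets.append(Snippet(node.name, node.lineno, node.end_lineno))
    return sorted(snippets, key=lambda s: s.start_line)


SAMPLE = '''\
def tiny():
    return 1

def bigger(a, b):
    total = a + b
    total *= 2
    return total

class Box:
    def area(self):
        w = 2
        h = 3
        return w * h
'''

if __name__ == "__main__":
    # tiny() spans only two lines, so it is filtered out at min_lines=3.
    for snippet in extract_snippets(SAMPLE, min_lines=3):
        print(snippet)
```

Note that a class and the methods inside it are both reported, which mirrors the "functions, methods, classes" granularity listed under Features.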
## 🎯 Model Information
### Current Default Model
- **Model**: `sentence-transformers/all-mpnet-base-v2`
- **Type**: Semantic similarity model based on MPNet
- **Dimensions**: 768
- **Strengths**: Better semantic understanding, fewer false positives
- **Use Case**: General-purpose semantic similarity with good code understanding
### Alternative Models
```bash
# Code-specific models
python cli.py . --model microsoft/codebert-base
python cli.py . --model microsoft/graphcodebert-base
# Faster alternatives
python cli.py . --model sentence-transformers/all-MiniLM-L6-v2
python cli.py . --model sentence-transformers/paraphrase-MiniLM-L6-v2
```
## ⚡ Performance Tips
### For Large Repositories
- **Use caching**: `--cache-dir .embeddings-cache` to avoid recomputing embeddings
- **Increase minimum lines**: `--min-lines 10` to reduce noise from trivial blocks
- **Adjust threshold**: `--threshold 0.80` for broader matches, `0.90+` for strict matches
- **Limit top-k**: `--top-k 3` for faster processing
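
The caching tip works because embeddings are keyed by a hash of the content, so unchanged code is never re-embedded. A minimal sketch of that idea follows, assuming a JSON-file-per-entry layout and a key that also includes the model name. Both are illustrative choices, not a description of the tool's real cache format, and `embed` stands in for the actual model call.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def content_key(source: str, model_name: str) -> str:
    """Hash the model name together with the source text."""
    # ASSUMPTION: including the model name keeps vectors from different
    # models apart; the tool itself is only documented to hash content.
    digest = hashlib.sha256()
    digest.update(model_name.encode("utf-8"))
    digest.update(b"\0")
    digest.update(source.encode("utf-8"))
    return digest.hexdigest()


def cached_embedding(
    source: str,
    model_name: str,
    cache_dir: Path,
    embed: Callable[[str], List[float]],
) -> List[float]:
    """Return the cached vector for `source`, computing and storing it on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry = cache_dir / f"{content_key(source, model_name)}.json"
    if entry.exists():
        return json.loads(entry.read_text(encoding="utf-8"))
    vector = embed(source)
    entry.write_text(json.dumps(vector), encoding="utf-8")
    return vector


if __name__ == "__main__":
    def toy_embed(text: str) -> List[float]:
        # Placeholder for a real embedding model.
        return [float(len(text))]

    with tempfile.TemporaryDirectory() as tmp:
        print(cached_embedding("def f(): pass", "toy-model", Path(tmp), toy_embed))
        # Second call is served from disk; toy_embed is not invoked again.
        print(cached_embedding("def f(): pass", "toy-model", Path(tmp), toy_embed))
```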
### Memory Optimization
- **Smaller models**: Use `all-MiniLM-L6-v2` for a smaller memory footprint and faster inference (384-dimensional embeddings instead of 768)
- **Batch processing**: The tool automatically handles large codebases efficiently
- **Exclusions**: Use comprehensive exclusion files to skip irrelevant directories
## 🐛 Troubleshooting
### Common Issues
1. **Too many false positives**: Increase `--threshold` to 0.90 or higher
2. **Missing duplicates**: Decrease `--threshold` to 0.70-0.80
3. **Too many results**: Increase `--min-lines` or use stricter exclusions
4. **Slow performance**: Use `--cache-dir` and consider a smaller model
### Debug Mode
```bash
# Verbose output for debugging
python cli.py . --verbose --dry-run
# Test specific file types
python cli.py test-project --languages py --verbose
```
## 📄 Output Formats
### HTML Report
- **Syntax highlighting** with Pygments
- **Side-by-side comparison** of duplicate code blocks
- **Similarity scores** and file locations
- **Interactive navigation** between duplicates
### JSON Report
- **Structured data** for programmatic processing
- **Complete metadata** including file paths, line numbers, similarity scores
- **Easy integration** with CI/CD pipelines and other tools
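
As one possible CI integration, a pipeline step could load the JSON report and fail the build when any pair exceeds a chosen similarity limit. The report schema is not documented here, so the `duplicates` and `similarity` keys below are purely hypothetical placeholders; adjust them to match the file the tool actually writes.

```python
import json
from typing import Any, Dict, List


def pairs_over_limit(report: Dict[str, Any], limit: float) -> List[Dict[str, Any]]:
    """Return report entries whose similarity is at or above `limit`."""
    # ASSUMPTION: "duplicates" and "similarity" are hypothetical key names.
    entries = report.get("duplicates", [])
    return [entry for entry in entries if entry.get("similarity", 0.0) >= limit]


def exit_code(report: Dict[str, Any], limit: float) -> int:
    """0 when no pair reaches the limit, 1 otherwise (suitable for sys.exit)."""
    return 1 if pairs_over_limit(report, limit) else 0


if __name__ == "__main__":
    # Inline sample standing in for the contents of a JSON report file.
    sample = json.loads(
        '{"duplicates": ['
        '{"file_a": "a.py", "file_b": "b.py", "similarity": 0.97},'
        '{"file_a": "a.py", "file_b": "c.py", "similarity": 0.86}'
        "]}"
    )
    print(pairs_over_limit(sample, limit=0.95))
    print(exit_code(sample, limit=0.95))
```

In a real pipeline the sample would be replaced by `json.load` on the path passed to `--report-json`, and the script would end with `sys.exit(exit_code(report, limit))`.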
## 🤝 Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## 📝 License
This project is licensed under the MIT License.
Raw data
{
"_id": null,
"home_page": null,
"name": "code-duplicates-cli",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "Dario Soria <dssupertech@gmail.com>",
"keywords": "code-analysis, duplicate-detection, semantic-embeddings, tree-sitter, cli",
"author": null,
"author_email": "Dario Soria <dssupertech@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/b0/e3/a37fe79272268241544d6ecc52ad5d8ecaab2545612c0e6996573029679b/code_duplicates_cli-3.0.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Multi-language code duplicate detector using semantic embeddings and Tree-Sitter parsing",
"version": "3.0.0",
"project_urls": {
"Documentation": "https://github.com/RDSoria/code-duplicates-cli#readme",
"Homepage": "https://github.com/RDSoria/code-duplicates-cli",
"Issues": "https://github.com/RDSoria/code-duplicates-cli/issues",
"Repository": "https://github.com/RDSoria/code-duplicates-cli"
},
"split_keywords": [
"code-analysis",
" duplicate-detection",
" semantic-embeddings",
" tree-sitter",
" cli"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "7e25444c0bf302fc8fe5a59199752be334fd874c4a05148366aba009d19c4453",
"md5": "e4c062fae2b192a43eb1d04e03de48e0",
"sha256": "4d2a03f84f7bf93db1d9a7270dee23d65c47ef4a9b3262fc08c0c83964b25990"
},
"downloads": -1,
"filename": "code_duplicates_cli-3.0.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "e4c062fae2b192a43eb1d04e03de48e0",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 18281,
"upload_time": "2025-10-09T10:53:08",
"upload_time_iso_8601": "2025-10-09T10:53:08.068502Z",
"url": "https://files.pythonhosted.org/packages/7e/25/444c0bf302fc8fe5a59199752be334fd874c4a05148366aba009d19c4453/code_duplicates_cli-3.0.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "b0e3a37fe79272268241544d6ecc52ad5d8ecaab2545612c0e6996573029679b",
"md5": "61b9bc30b5e8754295d83cbd4d12b21c",
"sha256": "40f2dd933082b7e810a4cfa591c530e7be5927423d88ddd5a629cc3bae5ea65b"
},
"downloads": -1,
"filename": "code_duplicates_cli-3.0.0.tar.gz",
"has_sig": false,
"md5_digest": "61b9bc30b5e8754295d83cbd4d12b21c",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 18515,
"upload_time": "2025-10-09T10:53:09",
"upload_time_iso_8601": "2025-10-09T10:53:09.461144Z",
"url": "https://files.pythonhosted.org/packages/b0/e3/a37fe79272268241544d6ecc52ad5d8ecaab2545612c0e6996573029679b/code_duplicates_cli-3.0.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-09 10:53:09",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "RDSoria",
"github_project": "code-duplicates-cli#readme",
"github_not_found": true,
"lcname": "code-duplicates-cli"
}