code-duplicates-cli


- **Name**: code-duplicates-cli
- **Version**: 3.0.0
- **Summary**: Multi-language code duplicate detector using semantic embeddings and Tree-Sitter parsing
- **Upload time**: 2025-10-09 10:53:09
- **Requires Python**: >=3.8
- **License**: MIT
- **Keywords**: code-analysis, duplicate-detection, semantic-embeddings, tree-sitter, cli

# Code Duplicates CLI - Version 3

Multi-language code duplicate detector built on **semantic embeddings** and **Tree-Sitter parsing**, with **parallelism**,
**minimum size filtering**, **configurable exclusions**, **HTML reports with syntax highlighting**, and **JSON output**.

## ✨ Features
- **Multi-language support**: `py, js, ts, tsx, jsx, go, rs (rust), c, cpp, java` (extensible)
- **AST extraction** with `tree-sitter-languages` (functions, methods, classes)
- **Semantic embeddings** with `SentenceTransformers` (default: `sentence-transformers/all-mpnet-base-v2`)
- **Fast similarity search** with `faiss` (inner product over L2-normalized vectors, which is equivalent to cosine similarity)
- **Parallel processing** for snippet extraction and embedding generation
- **Smart caching** of embeddings by file content hash
- **Configurable exclusions** for folders and files (node_modules, .git, etc.)
- **Minimum block size filtering** (by lines)
- **Rich HTML reports** with **Pygments syntax highlighting** + **JSON export**
- **Configurable similarity threshold** and top-k results per snippet
- **Verbose mode** and **dry-run** capabilities
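
To make the similarity step concrete, here is a minimal sketch of threshold and top-k filtering over L2-normalized vectors. It uses plain Python instead of `faiss`, and the names `l2_normalize` and `top_k_matches` are hypothetical helpers for illustration, not functions from this project. The toy vectors stand in for real model embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k_matches(embeddings, threshold=0.85, top_k=5):
    """Map each index to its best matches as (other_index, score) pairs."""
    unit = [l2_normalize(v) for v in embeddings]
    matches = {}
    for i, a in enumerate(unit):
        scored = []
        for j, b in enumerate(unit):
            if i == j:
                continue  # a snippet is never its own duplicate
            # On unit vectors the inner product equals cosine similarity.
            score = sum(x * y for x, y in zip(a, b))
            if score >= threshold:
                scored.append((j, score))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        matches[i] = scored[:top_k]
    return matches


# Toy 3-dimensional "embeddings"; real ones would come from the model.
toy = [[1.0, 0.0, 0.0], [2.0, 0.1, 0.0], [0.0, 1.0, 0.0]]
result = top_k_matches(toy, threshold=0.85, top_k=5)
```

Here the first two vectors point in nearly the same direction and match each other, while the third is orthogonal to the first and has no matches. A brute-force loop like this is quadratic in the number of snippets, which is the cost an index such as `faiss` is meant to reduce.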

## 🚀 Installation
```bash
python -m venv .venv
source .venv/bin/activate  # (Windows: .venv\Scripts\activate)
pip install -r requirements.txt
```

> **Note**: `tree-sitter-languages` includes pre-compiled grammars; no manual compilation required.

## 📖 Usage

### Basic Usage
```bash
# Simple scan with defaults
python cli.py ./path/to/project

# Scan current directory
python cli.py .
```

### Advanced Usage
```bash
python cli.py ./path/to/project \
  --languages py,ts,tsx,js,go \
  --threshold 0.85 \
  --min-lines 5 \
  --top-k 10 \
  --report-html reports/duplicates.html \
  --report-json reports/duplicates.json \
  --model sentence-transformers/all-mpnet-base-v2 \
  --cache-dir .cache_embeddings \
  --exclusions-file .codeduplicates-ignore \
  --verbose
```

### Command Line Options
```text
Options:
  --languages TEXT        Comma-separated list of file extensions (default: py,js,ts,tsx,jsx,go,rs,c,cpp,java)
  --threshold FLOAT       Similarity threshold (0.0-1.0, default: 0.85)
  --min-lines INTEGER     Minimum lines per code block (default: 3)
  --top-k INTEGER         Max similar snippets per block (default: 5)
  --report-html PATH      HTML report output path
  --report-json PATH      JSON report output path
  --model TEXT            Embedding model name (default: sentence-transformers/all-mpnet-base-v2)
  --cache-dir PATH        Directory for embedding cache
  --exclusions-file PATH  Custom exclusions file (default: .codeduplicates-ignore)
  --verbose               Enable verbose output
  --dry-run               Show what would be processed without running
  --help                  Show this message and exit
```

## 🚫 Exclusions
The CLI includes default exclusions for common folders like `node_modules`, `.git`, `__pycache__`, etc.

Create a custom exclusions file:
```bash
# .codeduplicates-ignore
node_modules
.git
dist
build
*.min.js
my-generated-folder
test-data
venv
.venv
__pycache__
.pytest_cache
coverage
.coverage
```

The exclusions file supports glob patterns (`*`, `?`, `[]`) and comments starting with `#`.
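
As a rough illustration of how such a file could be interpreted, the sketch below parses patterns and tests paths with the standard-library `fnmatch` module. Matching each path component against each pattern is an assumption made for this example; the CLI's actual matching rules may differ. `parse_exclusions` and `is_excluded` are hypothetical names.

```python
import fnmatch
from pathlib import PurePosixPath


def parse_exclusions(text):
    """Return the non-empty, non-comment patterns from an exclusions file body."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


def is_excluded(path, patterns):
    """True if any component of the path matches any pattern (assumed rule)."""
    parts = PurePosixPath(path).parts
    return any(
        fnmatch.fnmatchcase(part, pattern)
        for part in parts
        for pattern in patterns
    )


patterns = parse_exclusions("# .codeduplicates-ignore\nnode_modules\n*.min.js\n")
```

With these patterns, `is_excluded("web/node_modules/lib/index.js", patterns)` and `is_excluded("static/app.min.js", patterns)` are both true, while `is_excluded("src/app.js", patterns)` is false.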

## 📊 Examples

### Language-Specific Scans
```bash
# Scan TypeScript/JavaScript with high threshold
python cli.py . --languages ts,tsx,js --threshold 0.92 --report-html reports/js-duplicates.html

# Python and Go only, minimum 5 lines, top-3 results
python cli.py src --languages py,go --min-lines 5 --top-k 3 --report-json reports/py-go-duplicates.json

# Rust projects with custom exclusions
python cli.py . --languages rs --exclusions-file rust-exclusions.txt --verbose
```

### Performance Optimization
```bash
# Large codebase with caching
python cli.py . --cache-dir .embeddings-cache --min-lines 10 --threshold 0.80

# Quick scan with dry-run
python cli.py . --dry-run --verbose
```

## 🏗️ Project Structure
```
code-duplicates-cli-v3/
├── cli.py                    # Main CLI entry point
├── core/
│   ├── config.py            # Configuration constants
│   ├── utils.py             # Utility functions
│   ├── embeddings.py        # Embedding generation engine
│   ├── similarity.py        # Similarity calculation engine
│   ├── report.py            # HTML/JSON report generation
│   └── extractors/
│       ├── base_extractor.py    # Base extractor interface
│       ├── parser_manager.py    # Tree-sitter parser management
│       ├── code_extractor.py    # AST-based code extraction
│       ├── regex_extractor.py   # Regex-based extraction
│       └── simple_extractor.py  # Simple text extraction
├── ARCHITECTURE.md          # Detailed architecture documentation
├── MODEL_ANALYSIS.md        # Embedding model analysis
└── requirements.txt         # Python dependencies
```
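
The extractors listed above pull functions, methods, and classes out of source files. The sketch below shows the general idea for Python input only, using the standard-library `ast` module as a stand-in for Tree-Sitter, combined with a minimum-lines filter. It is not the project's `code_extractor.py`; `extract_blocks` is a hypothetical name.

```python
import ast


def extract_blocks(source, min_lines=3):
    """Return (name, start_line, end_line, text) for functions and classes."""
    tree = ast.parse(source)
    lines = source.splitlines()
    blocks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start, end = node.lineno, node.end_lineno
            # Skip blocks shorter than the minimum size.
            if end - start + 1 >= min_lines:
                text = "\n".join(lines[start - 1:end])
                blocks.append((node.name, start, end, text))
    return blocks


sample = '''def tiny():
    return 1

def total(values):
    result = 0
    for v in values:
        result += v
    return result
'''
blocks = extract_blocks(sample, min_lines=3)
```

Only `total` survives the filter here, because `tiny` spans two lines. Each extracted block keeps its line range, which is the kind of location data a report needs to point back at the source.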

## 🎯 Model Information

### Current Default Model
- **Model**: `sentence-transformers/all-mpnet-base-v2`
- **Type**: Semantic similarity model based on MPNet
- **Dimensions**: 768
- **Strengths**: Better semantic understanding, fewer false positives
- **Use Case**: General-purpose semantic similarity with good code understanding

### Alternative Models
```bash
# Code-specific models
python cli.py . --model microsoft/codebert-base
python cli.py . --model microsoft/graphcodebert-base

# Faster alternatives
python cli.py . --model sentence-transformers/all-MiniLM-L6-v2
python cli.py . --model sentence-transformers/paraphrase-MiniLM-L6-v2
```

## ⚡ Performance Tips

### For Large Repositories
- **Use caching**: `--cache-dir .embeddings-cache` to avoid recomputing embeddings
- **Increase minimum lines**: `--min-lines 10` to reduce noise from trivial blocks
- **Adjust threshold**: `--threshold 0.80` for broader matches, `0.90+` for strict matches
- **Limit top-k**: `--top-k 3` for faster processing
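
Caching by content hash can be sketched as follows. This is an illustrative design, not the tool's cache format: storing one JSON file per entry and mixing the model name into the key are assumptions, and `content_key`, `cached_embedding`, and `fake_embed` are hypothetical names.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def content_key(text, model_name):
    """Hash the content together with the model name (assumed key design)."""
    digest = hashlib.sha256()
    digest.update(model_name.encode("utf-8"))
    digest.update(b"\0")
    digest.update(text.encode("utf-8"))
    return digest.hexdigest()


def cached_embedding(text, model_name, cache_dir, compute):
    """Return the cached vector for text, calling compute(text) only on a miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry = cache_dir / (content_key(text, model_name) + ".json")
    if entry.exists():
        return json.loads(entry.read_text(encoding="utf-8"))
    vector = compute(text)
    entry.write_text(json.dumps(vector), encoding="utf-8")
    return vector


calls = []


def fake_embed(text):
    """Stand-in for a real model call; records each invocation."""
    calls.append(text)
    return [float(len(text)), 1.0]


with tempfile.TemporaryDirectory() as tmp:
    first = cached_embedding("def f(): pass", "demo-model", tmp, fake_embed)
    second = cached_embedding("def f(): pass", "demo-model", tmp, fake_embed)
```

The second call is served from disk, so `fake_embed` runs once. Because the key depends only on content and model name, unchanged code is not re-embedded between runs, and switching models cannot return vectors from a different model.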

### Memory Optimization
- **Smaller models**: Use `all-MiniLM-L6-v2` (384-dimensional embeddings, versus 768 for the default) for a smaller index and faster inference
- **Batch processing**: The tool automatically handles large codebases efficiently
- **Exclusions**: Use comprehensive exclusion files to skip irrelevant directories

## 🐛 Troubleshooting

### Common Issues
1. **Too many false positives**: Increase `--threshold` to 0.90 or higher
2. **Missing duplicates**: Decrease `--threshold` to 0.70-0.80
3. **Too many results**: Increase `--min-lines` or use stricter exclusions
4. **Slow performance**: Use `--cache-dir` and consider a smaller model

### Debug Mode
```bash
# Verbose output for debugging
python cli.py . --verbose --dry-run

# Test specific file types
python cli.py test-project --languages py --verbose
```

## 📄 Output Formats

### HTML Report
- **Syntax highlighting** with Pygments
- **Side-by-side comparison** of duplicate code blocks
- **Similarity scores** and file locations
- **Interactive navigation** between duplicates

### JSON Report
- **Structured data** for programmatic processing
- **Complete metadata** including file paths, line numbers, similarity scores
- **Easy integration** with CI/CD pipelines and other tools
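
For CI use, a script could read the JSON report and fail the build when near-identical blocks appear. The sketch below assumes a report shaped as a list of objects with a `similarity` key; that schema, the `file_a`/`file_b` keys, and the `count_over_limit` helper are all hypothetical, so check a real report before relying on them.

```python
import json


def count_over_limit(report_text, limit=0.95):
    """Count entries whose similarity is at or above the limit."""
    entries = json.loads(report_text)
    return sum(1 for entry in entries if entry["similarity"] >= limit)


# Hypothetical report shape; the real keys may be different.
sample_report = json.dumps([
    {"file_a": "a.py", "file_b": "b.py", "similarity": 0.97},
    {"file_a": "a.py", "file_b": "c.py", "similarity": 0.88},
])

over = count_over_limit(sample_report, limit=0.95)
exit_code = 1 if over else 0  # a CI job would pass this to sys.exit()
```

In a pipeline, the report text would come from the file written by `--report-json` rather than from an inline sample.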

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

## 📝 License

This project is licensed under the MIT License.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "code-duplicates-cli",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "Dario Soria <dssupertech@gmail.com>",
    "keywords": "code-analysis, duplicate-detection, semantic-embeddings, tree-sitter, cli",
    "author": null,
    "author_email": "Dario Soria <dssupertech@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/b0/e3/a37fe79272268241544d6ecc52ad5d8ecaab2545612c0e6996573029679b/code_duplicates_cli-3.0.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Multi-language code duplicate detector using semantic embeddings and Tree-Sitter parsing",
    "version": "3.0.0",
    "project_urls": {
        "Documentation": "https://github.com/RDSoria/code-duplicates-cli#readme",
        "Homepage": "https://github.com/RDSoria/code-duplicates-cli",
        "Issues": "https://github.com/RDSoria/code-duplicates-cli/issues",
        "Repository": "https://github.com/RDSoria/code-duplicates-cli"
    },
    "split_keywords": [
        "code-analysis",
        " duplicate-detection",
        " semantic-embeddings",
        " tree-sitter",
        " cli"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "7e25444c0bf302fc8fe5a59199752be334fd874c4a05148366aba009d19c4453",
                "md5": "e4c062fae2b192a43eb1d04e03de48e0",
                "sha256": "4d2a03f84f7bf93db1d9a7270dee23d65c47ef4a9b3262fc08c0c83964b25990"
            },
            "downloads": -1,
            "filename": "code_duplicates_cli-3.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "e4c062fae2b192a43eb1d04e03de48e0",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 18281,
            "upload_time": "2025-10-09T10:53:08",
            "upload_time_iso_8601": "2025-10-09T10:53:08.068502Z",
            "url": "https://files.pythonhosted.org/packages/7e/25/444c0bf302fc8fe5a59199752be334fd874c4a05148366aba009d19c4453/code_duplicates_cli-3.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "b0e3a37fe79272268241544d6ecc52ad5d8ecaab2545612c0e6996573029679b",
                "md5": "61b9bc30b5e8754295d83cbd4d12b21c",
                "sha256": "40f2dd933082b7e810a4cfa591c530e7be5927423d88ddd5a629cc3bae5ea65b"
            },
            "downloads": -1,
            "filename": "code_duplicates_cli-3.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "61b9bc30b5e8754295d83cbd4d12b21c",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 18515,
            "upload_time": "2025-10-09T10:53:09",
            "upload_time_iso_8601": "2025-10-09T10:53:09.461144Z",
            "url": "https://files.pythonhosted.org/packages/b0/e3/a37fe79272268241544d6ecc52ad5d8ecaab2545612c0e6996573029679b/code_duplicates_cli-3.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-09 10:53:09",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "RDSoria",
    "github_project": "code-duplicates-cli#readme",
    "github_not_found": true,
    "lcname": "code-duplicates-cli"
}
        