# Semantic Copycat Miner (CopycatM) v1.6.3
A comprehensive defensive security tool for detecting intellectual property contamination in AI-generated code. CopycatM extracts semantic signatures to identify when AI systems trained on proprietary codebases reproduce protected patterns, enabling organizations to assess IP risks and ensure legal compliance.
## Key Features
- **Enhanced Three-Tier Architecture**: Hybrid analysis with intelligent fallbacks
  - Tier 1: Baseline analysis with SWHID support (all files)
  - Tier 2: Traditional pattern matching with 60+ algorithm patterns + **NEW Semantic Analysis**
  - Tier 3: Semantic AI analysis with automatic fallback
- **Multi-language Support**: Python, JavaScript/TypeScript, Java, C/C++, Go, Rust
- **Enhanced Cross-Language Semantic Features** (v1.6.3): Improved similarity detection
  - **MinHash: 3.1% → 62.1%** (20x improvement), with **SimHash** performance maintained at **72.7%**
  - **All features now exceed the 40% cross-language similarity threshold**
  - Semantic algorithmic feature extraction replacing character/token n-grams
  - Language-agnostic control flow, algorithmic, and mathematical patterns
  - Multi-granularity similarity calculation with enhanced sensitivity
  - Validated: Quicksort (66.7% MinHash, 80.5% SimHash), Binary Search (57.5% MinHash, 64.8% SimHash)
- **🆕 Enhanced Semantic Similarity Detection** (v1.6.2+): Patent/codec reimplementation detection
  - Multi-granularity winnowing (k=5, 15, 30) for comprehensive analysis
  - Control Flow Graph (CFG) and Data Flow Graph (DFG) extraction
  - Domain-specific feature extraction for codecs/DSP patterns
  - Reference implementation database for known algorithm matching
- **Domain-Specific Algorithm Detection**: Advanced multimedia codec recognition
  - **H.264/AVC**: CAVLC entropy coding, CABAC arithmetic coding, IDCT transforms
  - **AAC Audio**: Psychoacoustic models, MDCT transforms, perceptual coding
  - **Video Processing**: Deblocking filters, motion compensation, quantization
  - **Mathematical Transforms**: FFT butterfly operations, DCT coefficients, wavelets
- **Advanced Signature Matching**: 5 signature types with pre/post normalization
  - EXACT, REGEX, STRUCTURAL, NORMALIZED, SEMANTIC pattern recognition
  - Context-aware detection with domain-specific bonuses
  - **52.5% F1 Score** on FFmpeg codec validation (6x improvement)
- **Cross-Language Normalization**: 84.2% file coverage
- **Signature Aggregation & Ranking**: 68.4% file coverage with 25% reduction
- **Dynamic Transformation Resistance**: ~64% average resistance
- **External Pattern Configuration**: JSON-based pattern definitions
- **Comprehensive Hashing**: Direct, fuzzy (TLSH), and semantic hashes
- **Multimedia Processing Detection**: Video, audio, image, and signal processing
- **Software Heritage ID Support**: Persistent identifiers for reproducible research
- **Control Flow Analysis**: Extract and hash control structures (if/for/while blocks)
- **Minified Code Detection**: Specialized analysis for compressed JavaScript/TypeScript
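
As a rough illustration of how MinHash can compare language-agnostic feature sets like those described in the list above, the sketch below uses only the standard library. The feature names, the `minhash_signature` and `estimate_similarity` helpers, and the 256-hash setting are all assumptions made for this example; CopycatM's own feature extraction and hashing are not shown here and may differ.

```python
import hashlib


def _hash(feature: str, seed: int) -> int:
    """Seeded 64-bit hash of one feature string (one 'permutation' per seed)."""
    data = f"{seed}:{feature}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(features, num_perm=256):
    """Keep the minimum hash of the feature set under each seeded hash."""
    feats = set(features)
    if not feats:
        raise ValueError("need at least one feature")
    return [min(_hash(f, seed) for f in feats) for seed in range(num_perm)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching slots, which estimates the Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Hypothetical feature sets for the same algorithm written in two languages.
python_quicksort = {"recursion", "partition", "pivot_select",
                    "compare_swap", "loop:for", "list_concat"}
c_quicksort = {"recursion", "partition", "pivot_select",
               "compare_swap", "loop:while", "pointer_arith"}

exact = jaccard(python_quicksort, c_quicksort)  # 4 shared / 8 total = 0.5
approx = estimate_similarity(minhash_signature(python_quicksort),
                             minhash_signature(c_quicksort))
print(f"exact Jaccard: {exact:.3f}, MinHash estimate: {approx:.3f}")
```

Because the features describe what the code does rather than how it is spelled, two implementations in different languages can share most of their features even when they share almost no tokens.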
## Installation
### From PyPI
```bash
pip install semantic-copycat-miner
```
### From Source
```bash
git clone https://github.com/oscarvalenzuelab/semantic-copycat-miner
cd semantic-copycat-miner
pip install -e .
```
### Optional GNN Support
```bash
# Quote the extra so shells such as zsh do not expand the brackets
pip install -e ".[gnn]"
```
## Quick Start
### CLI Usage
```bash
# Single file analysis
copycatm src/algorithm.py -o results.json
# Directory analysis with parallel processing
copycatm ./codebase --parallel 4 -o results.json
# Enhanced configuration
copycatm algorithm.py --complexity-threshold 2 --min-lines 5 -o results.json
# Enable SWHID for persistent identification
copycatm algorithm.py --enable-swhid -o results.json
```
### Python API
```python
from copycatm import CopycatAnalyzer, AnalysisConfig

# Enhanced three-tier analysis (recommended)
analyzer = CopycatAnalyzer()
result = analyzer.analyze_file_enhanced("algorithm.py")

# Check for potential IP contamination
if result.get('algorithms'):
    for algo in result['algorithms']:
        if algo['confidence'] > 0.75:
            print(f"High-confidence pattern detected: {algo['algorithm_subtype']}")

# Custom configuration
config = AnalysisConfig(
    complexity_threshold=2,
    min_lines=5,
    enable_swhid=True
)
analyzer = CopycatAnalyzer(config)
result = analyzer.analyze_file_enhanced("algorithm.py")
```
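
The confidence check above can be wrapped in a small helper when several files are triaged at once. This is a minimal sketch that relies only on the `algorithms`, `confidence`, and `algorithm_subtype` keys used in the example; the helper name `high_confidence_algorithms` is hypothetical, and `sample_result` is hand-written illustrative data, not real CopycatM output.

```python
from typing import Any, Dict, List


def high_confidence_algorithms(result: Dict[str, Any],
                               threshold: float = 0.75) -> List[Dict[str, Any]]:
    """Return detections above the threshold, highest confidence first."""
    algorithms = result.get("algorithms") or []
    hits = [a for a in algorithms if a.get("confidence", 0.0) > threshold]
    return sorted(hits, key=lambda a: a["confidence"], reverse=True)


# Hand-written stand-in for an analysis result (subtype names are made up).
sample_result = {
    "algorithms": [
        {"algorithm_subtype": "binary_search", "confidence": 0.40},
        {"algorithm_subtype": "fft_butterfly", "confidence": 0.80},
        {"algorithm_subtype": "quicksort", "confidence": 0.91},
    ]
}

for algo in high_confidence_algorithms(sample_result):
    print(f"{algo['algorithm_subtype']}: {algo['confidence']:.2f}")
```

A result with no `algorithms` key, or with an empty list, simply yields an empty list.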
## Documentation
For comprehensive documentation, see the **[Documentation Index](docs/index.md)**.
## Algorithm Detection
Detects 60+ algorithmic patterns across 8 major categories:
- **Core CS**: Sorting, searching, graph algorithms, dynamic programming
- **Security**: Encryption, hashing, authentication, anti-tampering
- **🆕 Multimedia Codecs**: H.264 CAVLC/CABAC/IDCT, AAC psychoacoustic models, deblocking filters
- **Mathematical Transforms**: FFT butterfly operations, DCT coefficient processing, wavelets
- **Video Processing**: Motion compensation, quantization, intra prediction
- **Audio Processing**: MDCT transforms, temporal noise shaping, spectral analysis
- **System Level**: Drivers, firmware, bootloaders, kernel modules
- **Domain Specific**: ML, graphics, financial, medical, automotive
## Performance Metrics
- **🎯 FFmpeg Codec Detection**: 52.5% F1-Score (6x improvement from 8.7% baseline)
  - H.264 CAVLC: Perfect detection of coefficient tokens, total zeros, run before
  - H.264 IDCT: Butterfly operations with z0/z1 transforms detected
  - H.264 CABAC: Context table and arithmetic coding recognition
  - Deblocking Filter: 100% F1 score (perfect detection)
- **Algorithm Detection F1-Score**: 0.857 (precision: 0.750, recall: 1.000)
- **Source Code Verification**: 100% success rate
- **Cross-Language Consistency**: 100% (JavaScript, Python, C)
- **Processing Speed**: ~0.061s average per file
- **Transformation Resistance**: ~64% average (dynamic calculation)
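
The F1 figures above follow from the usual harmonic-mean definition. The short check below recomputes the reported algorithm detection score from its precision and recall, and the improvement factor from the two FFmpeg F1 values; the function name `f1_score` is just a local helper for this example.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Algorithm detection: precision 0.750, recall 1.000
print(round(f1_score(0.750, 1.000), 3))  # 0.857

# FFmpeg codec detection: 52.5% F1 against the 8.7% baseline
print(round(52.5 / 8.7, 1))  # 6.0
```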
## Configuration
Create `copycatm.json` in your project directory:
```json
{
  "analysis": {
    "complexity_threshold": 3,
    "min_lines": 20,
    "confidence_threshold": 0.0
  },
  "hashing": {
    "algorithms": ["sha256", "tlsh", "minhash"]
  },
  "performance": {
    "parallel_workers": 4,
    "chunk_size": 100
  }
}
```
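
Before running an analysis, it can help to sanity-check the file. The sketch below loads `copycatm.json` with the standard library and flags a few obviously wrong values. The key names come from the example above, but the validation rules (confidence between 0 and 1, positive counts) and the helper names `load_config` and `check_config` are assumptions for illustration; they are not CopycatM's own loader.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_config(path: str = "copycatm.json") -> Dict[str, Any]:
    """Read the JSON configuration file into a dictionary."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def check_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    analysis = config.get("analysis", {})
    performance = config.get("performance", {})

    # Assumed rule: confidence values lie in the range [0, 1].
    threshold = analysis.get("confidence_threshold", 0.0)
    if not 0.0 <= threshold <= 1.0:
        problems.append("analysis.confidence_threshold should be between 0 and 1")

    # Assumed rule: line and worker counts must be positive.
    if analysis.get("min_lines", 1) < 1:
        problems.append("analysis.min_lines should be at least 1")
    if performance.get("parallel_workers", 1) < 1:
        problems.append("performance.parallel_workers should be at least 1")

    return problems


example = {
    "analysis": {"complexity_threshold": 3, "min_lines": 20,
                 "confidence_threshold": 0.0},
    "hashing": {"algorithms": ["sha256", "tlsh", "minhash"]},
    "performance": {"parallel_workers": 4, "chunk_size": 100},
}
print(check_config(example))  # []
```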
## Development
```bash
# Setup
pip install -e ".[dev]"
# Run tests
pytest tests/
# Code quality
black copycatm/
flake8 copycatm/
mypy copycatm/
```
## License
GNU Affero General Public License v3.0 - see LICENSE file for details.
## Acknowledgments
- Tree-sitter for robust parsing
- TLSH for fuzzy hashing
- DataSketch for MinHash implementation
- NetworkX for graph analysis