semantic-copycat-miner


- **Name**: semantic-copycat-miner
- **Version**: 1.6.3
- **Summary**: Semantic analysis tool for detecting AI-generated code derived from copyrighted sources
- **Upload time**: 2025-08-02 06:15:50
- **Requires Python**: >=3.8
- **License**: AGPL-3.0
- **Keywords**: code-analysis, ai-detection, semantic-analysis, copyright, gpl

# Semantic Copycat Miner (CopycatM) v1.6.3

A comprehensive defensive security tool for detecting intellectual property contamination in AI-generated code. CopycatM extracts semantic signatures to identify when AI systems trained on proprietary codebases reproduce protected patterns, enabling organizations to assess IP risks and ensure legal compliance.

## Key Features

- **Enhanced Three-Tier Architecture**: Hybrid analysis with intelligent fallbacks
  - Tier 1: Baseline analysis with SWHID support (all files)
  - Tier 2: Traditional pattern matching with 60+ algorithm patterns + **NEW Semantic Analysis**
  - Tier 3: Semantic AI analysis with automatic fallback
- **Multi-language Support**: Python, JavaScript/TypeScript, Java, C/C++, Go, Rust
- **Enhanced Cross-Language Semantic Features** (v1.6.3): Improved similarity detection
  - **MinHash: 3.1% → 62.1%** (a 20x improvement), with **SimHash** holding steady at **72.7%**
  - **All features now exceed 40% cross-language similarity threshold**
  - Semantic algorithmic feature extraction replacing character/token n-grams
  - Language-agnostic control flow, algorithmic, and mathematical patterns
  - Multi-granularity similarity calculation with enhanced sensitivity
  - Validated: Quicksort (66.7% MinHash, 80.5% SimHash), Binary Search (57.5% MinHash, 64.8% SimHash)
- **🆕 Enhanced Semantic Similarity Detection** (v1.6.2+): Patent/codec reimplementation detection
  - Multi-granularity winnowing (k=5, 15, 30) for comprehensive analysis
  - Control Flow Graph (CFG) and Data Flow Graph (DFG) extraction
  - Domain-specific feature extraction for codecs/DSP patterns
  - Reference implementation database for known algorithm matching
- **Domain-Specific Algorithm Detection**: Advanced multimedia codec recognition
  - **H.264/AVC**: CAVLC entropy coding, CABAC arithmetic coding, IDCT transforms
  - **AAC Audio**: Psychoacoustic models, MDCT transforms, perceptual coding
  - **Video Processing**: Deblocking filters, motion compensation, quantization
  - **Mathematical Transforms**: FFT butterfly operations, DCT coefficients, wavelets
- **Advanced Signature Matching**: 5 signature types with pre/post normalization
  - EXACT, REGEX, STRUCTURAL, NORMALIZED, SEMANTIC pattern recognition
  - Context-aware detection with domain-specific bonuses
  - **52.5% F1 Score** on FFmpeg codec validation (6x improvement)
- **Cross-Language Normalization**: 84.2% file coverage
- **Signature Aggregation & Ranking**: 68.4% file coverage with 25% reduction
- **Dynamic Transformation Resistance**: ~64% average resistance
- **External Pattern Configuration**: JSON-based pattern definitions
- **Comprehensive Hashing**: Direct, fuzzy (TLSH), and semantic hashes
- **Multimedia Processing Detection**: Video, audio, image, and signal processing
- **Software Heritage ID Support**: Persistent identifiers for reproducible research
- **Control Flow Analysis**: Extract and hash control structures (if/for/while blocks)
- **Minified Code Detection**: Specialized analysis for compressed JavaScript/TypeScript
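
The cross-language figures above rest on comparing sets of language-agnostic features rather than raw character or token n-grams. As a rough illustration of that idea only, and not CopycatM's actual implementation, the sketch below estimates Jaccard similarity between two feature sets with a hand-rolled MinHash built on `hashlib`. The feature labels (`recursion`, `partition`, and so on) and all function names are hypothetical.

```python
import hashlib


def _hash64(seed: int, feature: str) -> int:
    """Hash a feature under a given seed to a 64-bit integer."""
    digest = hashlib.sha256(f"{seed}:{feature}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(features, num_perm=64):
    """Return one minimum hash per seed; `features` must be non-empty."""
    return [min(_hash64(seed, f) for f in features) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature slots that agree, an estimate of Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical feature sets for a quicksort written in two languages.
python_features = {"recursion", "partition", "compare", "swap", "loop:for"}
c_features = {"recursion", "partition", "compare", "swap", "loop:while"}

similarity = estimated_jaccard(
    minhash_signature(python_features), minhash_signature(c_features)
)
print(f"estimated similarity: {similarity:.2f}")
```

The two sets share four of six distinct features, so the exact Jaccard similarity is 4/6. The MinHash estimate is approximate and tends toward that value as `num_perm` grows.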

## Installation

### From PyPI
```bash
pip install semantic-copycat-miner
```

### From Source
```bash
git clone https://github.com/oscarvalenzuelab/semantic-copycat-miner
cd semantic-copycat-miner
pip install -e .
```

### Optional GNN Support
```bash
# Quote the extras spec so shells such as zsh do not expand the brackets
pip install -e ".[gnn]"
```

## Quick Start

### CLI Usage
```bash
# Single file analysis
copycatm src/algorithm.py -o results.json

# Directory analysis with parallel processing
copycatm ./codebase --parallel 4 -o results.json

# Enhanced configuration
copycatm algorithm.py --complexity-threshold 2 --min-lines 5 -o results.json

# Enable SWHID for persistent identification
copycatm algorithm.py --enable-swhid -o results.json
```

### Python API
```python
from copycatm import CopycatAnalyzer, AnalysisConfig

# Enhanced three-tier analysis (recommended)
analyzer = CopycatAnalyzer()
result = analyzer.analyze_file_enhanced("algorithm.py")

# Check for potential IP contamination
if result.get('algorithms'):
    for algo in result['algorithms']:
        if algo['confidence'] > 0.75:
            print(f"High-confidence pattern detected: {algo['algorithm_subtype']}")

# Custom configuration
config = AnalysisConfig(
    complexity_threshold=2,
    min_lines=5,
    enable_swhid=True
)
analyzer = CopycatAnalyzer(config)
result = analyzer.analyze_file_enhanced("algorithm.py")
```
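
The confidence check in the example above can be factored into a small helper. This is a minimal sketch that assumes only the result shape used there (an `algorithms` list whose entries carry `confidence` and `algorithm_subtype`). The helper name and the sample dictionary are hypothetical, and the sample values are made up for illustration rather than taken from real tool output.

```python
def high_confidence_algorithms(result, threshold=0.75):
    """Return (subtype, confidence) pairs above the threshold, highest first."""
    findings = []
    # Tolerate a missing or empty "algorithms" entry.
    for algo in result.get("algorithms") or []:
        confidence = algo.get("confidence", 0.0)
        if confidence > threshold:
            findings.append((algo.get("algorithm_subtype", "unknown"), confidence))
    return sorted(findings, key=lambda item: item[1], reverse=True)


# Made-up sample data in the assumed shape, not real analyzer output.
sample_result = {
    "algorithms": [
        {"algorithm_subtype": "quicksort", "confidence": 0.91},
        {"algorithm_subtype": "binary_search", "confidence": 0.42},
    ]
}

print(high_confidence_algorithms(sample_result))  # → [('quicksort', 0.91)]
```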

## Documentation

For comprehensive documentation, see the **[Documentation Index](docs/index.md)**.

## Algorithm Detection

Detects 60+ algorithmic patterns across 8 major categories:

- **Core CS**: Sorting, searching, graph algorithms, dynamic programming
- **Security**: Encryption, hashing, authentication, anti-tampering
- **🆕 Multimedia Codecs**: H.264 CAVLC/CABAC/IDCT, AAC psychoacoustic models, deblocking filters
- **Mathematical Transforms**: FFT butterfly operations, DCT coefficient processing, wavelets
- **Video Processing**: Motion compensation, quantization, intra prediction
- **Audio Processing**: MDCT transforms, temporal noise shaping, spectral analysis
- **System Level**: Drivers, firmware, bootloaders, kernel modules
- **Domain Specific**: ML, graphics, financial, medical, automotive

## Performance Metrics

- **🎯 FFmpeg Codec Detection**: 52.5% F1-Score (6x improvement over the 8.7% baseline)
  - H.264 CAVLC: Perfect detection of the coefficient-token, total-zeros, and run-before syntax elements
  - H.264 IDCT: Butterfly operations with z0/z1 transforms detected
  - H.264 CABAC: Context table and arithmetic coding recognition
  - Deblocking Filter: 100% F1 score (perfect detection)
- **Algorithm Detection F1-Score**: 0.857 (precision: 0.750, recall: 1.000)
- **Source Code Verification**: 100% success rate
- **Cross-Language Consistency**: 100% (JavaScript, Python, C)
- **Processing Speed**: ~0.061s average per file
- **Transformation Resistance**: ~64% average (dynamic calculation)
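
The algorithm-detection F1-score follows directly from the reported precision and recall, since F1 is their harmonic mean. The short check below reproduces the figure. The function name is only illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision 0.750 and recall 1.000, as reported above.
print(round(f1_score(0.750, 1.000), 3))  # → 0.857
```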

## Configuration

Create `copycatm.json` in your project directory:

```json
{
  "analysis": {
    "complexity_threshold": 3,
    "min_lines": 20,
    "confidence_threshold": 0.0
  },
  "hashing": {
    "algorithms": ["sha256", "tlsh", "minhash"]
  },
  "performance": {
    "parallel_workers": 4,
    "chunk_size": 100
  }
}
```
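
Before handing a configuration file to the tool, it can help to confirm that it has the shape shown above. The following is a minimal sketch under that assumption: it checks only the keys and value types from the example, and says nothing about which keys CopycatM actually requires or how it treats missing ones. `EXPECTED_TYPES` and `check_config` are hypothetical names.

```python
import json

# Key paths and types taken from the example configuration above.
EXPECTED_TYPES = {
    ("analysis", "complexity_threshold"): int,
    ("analysis", "min_lines"): int,
    ("analysis", "confidence_threshold"): (int, float),
    ("hashing", "algorithms"): list,
    ("performance", "parallel_workers"): int,
    ("performance", "chunk_size"): int,
}


def check_config(config):
    """Return a list of problems; an empty list means the shape matches."""
    problems = []
    for (section, key), expected in EXPECTED_TYPES.items():
        value = config.get(section, {}).get(key)
        if value is None:
            problems.append(f"missing {section}.{key}")
        elif isinstance(value, bool) or not isinstance(value, expected):
            # bool is a subclass of int in Python, so reject it explicitly.
            problems.append(f"{section}.{key} has unexpected type {type(value).__name__}")
    return problems


example = json.loads("""
{
  "analysis": {"complexity_threshold": 3, "min_lines": 20, "confidence_threshold": 0.0},
  "hashing": {"algorithms": ["sha256", "tlsh", "minhash"]},
  "performance": {"parallel_workers": 4, "chunk_size": 100}
}
""")

print(check_config(example))  # → []
```

To check a file on disk, load it with `json.load` and pass the resulting dictionary to `check_config`.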

## Development

```bash
# Setup
pip install -e ".[dev]"

# Run tests
pytest tests/

# Code quality
black copycatm/
flake8 copycatm/
mypy copycatm/
```

## License

GNU Affero General Public License v3.0 - see LICENSE file for details.

## Acknowledgments

- Tree-sitter for robust parsing
- TLSH for fuzzy hashing
- DataSketch for MinHash implementation
- NetworkX for graph analysis

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "semantic-copycat-miner",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "\"Oscar Valenzuela B.\" <oscar.valenzuela.b@gmail.com>",
    "keywords": "code-analysis, ai-detection, semantic-analysis, copyright, gpl",
    "author": null,
    "author_email": "\"Oscar Valenzuela B.\" <oscar.valenzuela.b@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/0a/fc/435f33cc2338961c09ce1dcd1cbdf6a7a55b89cc35ab45547e94ba1573a6/semantic_copycat_miner-1.6.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "AGPL-3.0",
    "summary": "Semantic analysis tool for detecting AI-generated code derived from copyrighted sources",
    "version": "1.6.3",
    "project_urls": {
        "Bug Tracker": "https://github.com/oscarvalenzuelab/semantic-copycat-miner/issues",
        "Documentation": "https://github.com/oscarvalenzuelab/semantic-copycat-miner#readme",
        "Homepage": "https://github.com/oscarvalenzuelab/semantic-copycat-miner",
        "Repository": "https://github.com/oscarvalenzuelab/semantic-copycat-miner"
    },
    "split_keywords": [
        "code-analysis",
        " ai-detection",
        " semantic-analysis",
        " copyright",
        " gpl"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "2590c973a9f87467914b47ecb29693a5bfef633e6998021eac7ff6b025f9d486",
                "md5": "28d4d37af80f912e9065693547f7bdd1",
                "sha256": "1ef99a86768fbd898c95a91c4ce691d2615a88f0640270bc70992b9c0d961e43"
            },
            "downloads": -1,
            "filename": "semantic_copycat_miner-1.6.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "28d4d37af80f912e9065693547f7bdd1",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 345161,
            "upload_time": "2025-08-02T06:15:48",
            "upload_time_iso_8601": "2025-08-02T06:15:48.780601Z",
            "url": "https://files.pythonhosted.org/packages/25/90/c973a9f87467914b47ecb29693a5bfef633e6998021eac7ff6b025f9d486/semantic_copycat_miner-1.6.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "0afc435f33cc2338961c09ce1dcd1cbdf6a7a55b89cc35ab45547e94ba1573a6",
                "md5": "fe4affc0a5875167fe24e78a2935f6ac",
                "sha256": "661c909754a8898ed87f0130754278feb379014516d5b768607f7aa0a509a70a"
            },
            "downloads": -1,
            "filename": "semantic_copycat_miner-1.6.3.tar.gz",
            "has_sig": false,
            "md5_digest": "fe4affc0a5875167fe24e78a2935f6ac",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 313183,
            "upload_time": "2025-08-02T06:15:50",
            "upload_time_iso_8601": "2025-08-02T06:15:50.490331Z",
            "url": "https://files.pythonhosted.org/packages/0a/fc/435f33cc2338961c09ce1dcd1cbdf6a7a55b89cc35ab45547e94ba1573a6/semantic_copycat_miner-1.6.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-02 06:15:50",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "oscarvalenzuelab",
    "github_project": "semantic-copycat-miner",
    "github_not_found": true,
    "lcname": "semantic-copycat-miner"
}
        