semantic-copycat-minner

Name: semantic-copycat-minner
Version: 1.4.2
Summary: Semantic analysis tool for detecting AI-generated code derived from copyrighted sources
Upload time: 2025-07-29 23:02:46
Requires Python: >=3.8
License: AGPL-3.0
Keywords: code-analysis, ai-detection, semantic-analysis, copyright, gpl
# Semantic Copycat Minner (CopycatM)

A semantic analysis tool for extracting code hashes, algorithms, and structural features for similarity analysis and copyright detection.

## Features

- **Multi-language Support**: Python, JavaScript/TypeScript, Java, C/C++, Go, Rust
- **Semantic Analysis**: AST-based code analysis with tree-sitter parsers
- **Algorithm Detection**: Pattern recognition for 40+ algorithm types across 8 categories
- **Unknown Algorithm Detection**: Structural complexity analysis to identify novel algorithms (v1.2.0+)
- **Cross-Language Consistency**: 100% MinHash similarity for same algorithms across languages (v1.4.0+)
- **Transformation Resistance**: 86.8% average resistance to code transformations (v1.4.0+)
- **Audio/Video Codec Detection**: Comprehensive detection of 40+ multimedia codecs (v1.4.0+)
  - Successfully tested on FFmpeg source code
  - Detects MP3, AAC, Opus, FLAC, PCM audio codecs
  - Detects H.264, H.265, VP8/9, AV1 video codecs
- **Fuzzy Hashing**: TLSH with optimized preprocessing for code similarity
- **Semantic Hashing**: MinHash and SimHash for structural similarity
- **CLI Interface**: Easy-to-use command-line tool with batch processing
- **JSON Output**: Structured output for integration with other tools

## Installation

### From PyPI

```bash
pip install semantic-copycat-minner
```

### From Source

```bash
git clone https://github.com/oscarvalenzuelab/semantic-copycat-minner
cd semantic-copycat-minner
pip install -e .
```

## Quick Start

### Analyze a Single File

```bash
copycatm src/algorithm.py -o results.json
```

### Analyze a Directory

```bash
copycatm ./codebase -o results.json
```

### Custom Configuration

```bash
copycatm algorithm.py --complexity-threshold 5 --min-lines 50 -o results.json
```

## CLI Usage

### Basic Commands

```bash
# Single file analysis
copycatm <file_path> [options]

# Batch directory analysis
copycatm <directory_path> [options]
```

### Options

```bash
# Core options
--output, -o           Output JSON file path (default: stdout)
--verbose, -v          Verbose output (can be repeated: -v, -vv, -vvv)
--quiet, -q           Suppress all output except errors
--debug               Enable debug mode with intermediate representations

# Analysis configuration
--complexity-threshold, -c    Cyclomatic complexity threshold (default: 3)
--min-lines                   Minimum lines for algorithm analysis (default: 20, recommend 2 for utility libraries)
--include-intermediates       Include AST and control flow graphs in output
--languages                   Comma-separated list of languages to analyze

# Hash configuration
--hash-algorithms            Comma-separated hash types (default: sha256,tlsh,minhash)
--tlsh-threshold            TLSH similarity threshold (default: 100)
--lsh-bands                 LSH band count for similarity detection (default: 20)

# Output filtering
--only-algorithms           Only output algorithmic signatures
--only-metadata            Only output file metadata
--confidence-threshold     Minimum confidence score to include (0.0-1.0)

# Performance
--parallel, -p             Number of parallel workers (default: CPU count)
--chunk-size              Files per chunk for batch processing (default: 100)
```

## Library API

The library provides different levels of API access, with `CopycatAnalyzer` as the main entry point for most use cases.

### Main Entry Point: CopycatAnalyzer

`CopycatAnalyzer` is the primary interface that orchestrates all analysis components including parsing, algorithm detection, hashing, and complexity analysis.

```python
from semantic_copycat_minner import CopycatAnalyzer, AnalysisConfig

# Create analyzer with default configuration
analyzer = CopycatAnalyzer()

# Analyze a file (auto-detect language from extension)
result = analyzer.analyze_file("src/algorithm.py")

# Force specific language (useful for non-standard extensions)
result = analyzer.analyze_file("script.txt", force_language="python")

# Analyze code string directly
result = analyzer.analyze_code(code, "python", "algorithm.py")

# Analyze directory
results = analyzer.analyze_directory("./codebase")
```

### Lower-Level Components

For advanced use cases, you can access individual components directly:

```python
from semantic_copycat_minner import AlgorithmDetector
from semantic_copycat_minner.parsers import TreeSitterParser

# Direct algorithm detection with flexible input
detector = AlgorithmDetector()

# Option 1: Provide raw content (convenience method)
algorithms = detector.detect_algorithms_from_input(content, "python")

# Option 2: Provide pre-parsed AST (for reuse across components)
parser = TreeSitterParser()
ast_tree = parser.parse(content, "python")
algorithms = detector.detect_algorithms_from_input(ast_tree, "python")

# Option 3: Use original method (backward compatible)
algorithms = detector.detect_algorithms(ast_tree, "python")
```

### Custom Configuration

```python
from semantic_copycat_minner import CopycatAnalyzer, AnalysisConfig

# Create custom configuration
config = AnalysisConfig(
    complexity_threshold=5,
    min_lines=50,
    include_intermediates=True,
    hash_algorithms=["sha256", "tlsh", "minhash"],
    confidence_threshold=0.8
)

analyzer = CopycatAnalyzer(config)
result = analyzer.analyze_file("src/algorithm.py")
```

## Output Format

The tool outputs structured JSON with the following components:

### File Metadata

```json
{
  "file_metadata": {
    "file_name": "algorithm.py",
    "language": "python",
    "line_count": 85,
    "is_source_code": true,
    "analysis_timestamp": "2025-07-25T10:30:00Z"
  }
}
```

### Algorithm Detection

```json
{
  "algorithms": [
    {
      "id": "algo_001",
      "type": "algorithm",
      "name": "quicksort_implementation",
      "confidence": 0.92,
      "complexity_metric": 8,
      "evidence": {
        "pattern_type": "divide_and_conquer",
        "control_flow": "recursive_partition"
      },
      "hashes": {
        "direct": {"sha256": "abc123..."},
        "fuzzy": {"tlsh": "T1A2B3C4..."},
        "semantic": {"minhash": "123456789abcdef"}
      },
      "transformation_resistance": {
        "variable_renaming": 0.95,
        "language_translation": 0.85
      }
    }
  ]
}
```
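
Because the report is plain JSON, downstream filtering needs nothing beyond the standard library. A minimal sketch (the `results.json` file name and 0.9 cutoff are arbitrary; only the documented `algorithms` and `confidence` fields are assumed):

```python
import json

def high_confidence_algorithms(path, threshold=0.9):
    """Return detected algorithms at or above a confidence cutoff."""
    with open(path) as f:
        report = json.load(f)
    return [a for a in report.get("algorithms", [])
            if a.get("confidence", 0.0) >= threshold]

# e.g. hits = high_confidence_algorithms("results.json", threshold=0.9)
```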

### Unknown Algorithm Detection (v1.2.0+)

For complex code that doesn't match known algorithm patterns, CopycatM performs structural complexity analysis to identify unknown algorithms. This feature automatically activates for files with 50+ lines to optimize performance.

```json
{
  "algorithms": [
    {
      "id": "unknown_a1b2c3d4",
      "algorithm_type": "unknown_complex_algorithm",
      "subtype_classification": "bitwise_manipulation_algorithm",
      "confidence_score": 0.79,
      "evidence": {
        "complexity_score": 0.79,
        "cyclomatic_complexity": 33,
        "nesting_depth": 5,
        "operation_density": 4.2,
        "unique_operations": 25,
        "structural_hash": "abc123def456",
        "algorithmic_fingerprint": "ALG-E66468BA743C"
      },
      "transformation_resistance": {
        "structural_hash": 0.9,
        "operation_patterns": 0.85,
        "complexity_metrics": 0.95
      }
    }
  ]
}
```

Unknown algorithms are classified into subtypes based on their dominant characteristics:
- `complex_iteration_pattern` - Nested loops and complex iteration
- `bitwise_manipulation_algorithm` - Heavy use of bitwise operations
- `mathematical_computation` - Dense mathematical operations
- `complex_decision_logic` - High conditional complexity
- `data_transformation_algorithm` - Complex data flow patterns
- `deeply_nested_algorithm` - Extreme nesting depth
- `unclassified_complex_pattern` - Other complex patterns
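
The `cyclomatic_complexity` and `nesting_depth` evidence values are standard structural metrics. As a rough illustration of what they measure — not CopycatM's implementation, which works on tree-sitter ASTs across languages — here is a stdlib `ast` approximation for Python input:

```python
import ast

# Node types counted as branch points (a simplification of the real metric).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try,
                ast.BoolOp, ast.ExceptHandler)
NESTING_NODES = (ast.If, ast.For, ast.While, ast.Try,
                 ast.FunctionDef, ast.With)

def complexity_metrics(source: str):
    """Approximate cyclomatic complexity (1 + branch points) and
    maximum nesting depth for a Python snippet."""
    tree = ast.parse(source)
    branches = sum(isinstance(n, BRANCH_NODES) for n in ast.walk(tree))

    def depth(node, d=0):
        # Children of a nesting construct sit one level deeper.
        bump = isinstance(node, NESTING_NODES)
        kids = list(ast.iter_child_nodes(node))
        return max([depth(k, d + bump) for k in kids], default=d)

    return {"cyclomatic_complexity": 1 + branches,
            "nesting_depth": depth(tree)}
```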

### Mathematical Invariants

```json
{
  "mathematical_invariants": [
    {
      "id": "inv_001",
      "type": "mathematical_expression",
      "confidence": 0.78,
      "evidence": {
        "expression_type": "arithmetic_calculation"
      }
    }
  ]
}
```

## Configuration

### Configuration File

Create `copycatm.json` in your project directory:

```json
{
  "analysis": {
    "complexity_threshold": 3,
    "min_lines": 2,
    "confidence_threshold": 0.0,
    "unknown_algorithm_threshold": 50
  },
  "languages": {
    "enabled": ["python", "javascript", "java", "c", "cpp", "go", "rust"]
  },
  "hashing": {
    "algorithms": ["sha256", "tlsh", "minhash"],
    "tlsh_threshold": 100,
    "lsh_bands": 20
  },
  "performance": {
    "parallel_workers": null,
    "chunk_size": 100
  },
  "output": {
    "include_intermediates": false
  }
}
```

Note that JSON does not allow comments, so keep the file comment-free. Recommended `min_lines`: 2 for utility libraries, 20 for general code. `unknown_algorithm_threshold` is the line count above which unknown algorithm detection activates.
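
One way to consume this file is to merge it over built-in defaults, section by section. The helper below is a hypothetical sketch (not the tool's actual loader) using only a subset of the documented keys:

```python
import json
from pathlib import Path

# Defaults mirroring a subset of the documented configuration file.
DEFAULTS = {
    "analysis": {"complexity_threshold": 3, "min_lines": 20,
                 "confidence_threshold": 0.0},
    "hashing": {"algorithms": ["sha256", "tlsh", "minhash"]},
}

def load_config(path="copycatm.json"):
    """Merge a project copycatm.json over the defaults, section by section."""
    cfg = {section: dict(values) for section, values in DEFAULTS.items()}
    p = Path(path)
    if p.exists():
        for section, values in json.loads(p.read_text()).items():
            cfg.setdefault(section, {}).update(values)
    return cfg
```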

## Supported Languages

- **Python**: `.py`, `.pyx`, `.pyi`
- **JavaScript**: `.js`, `.jsx`
- **TypeScript**: `.ts`, `.tsx`
- **Java**: `.java`
- **C/C++**: `.c`, `.cpp`, `.cc`, `.cxx`, `.h`, `.hpp`
- **Go**: `.go`
- **Rust**: `.rs`

## Algorithm Detection

The tool can detect 40+ algorithmic patterns across 8 major categories:

### Core CS Algorithms
- **Sorting**: Quicksort, Mergesort, Bubblesort, Heapsort, Radix Sort
- **Searching**: Binary Search, Linear Search, DFS, BFS, A*, Jump Search
- **Graph Algorithms**: Dijkstra's, Bellman-Ford, Kruskal's, Prim's, Floyd-Warshall
- **Dynamic Programming**: Fibonacci, LCS, Knapsack, Edit Distance

### Security & Cryptography
- **Encryption**: AES, RSA, DES, ChaCha20, Elliptic Curve
- **Hashing**: SHA family, MD5, bcrypt, Argon2
- **Security**: Anti-tampering, obfuscation, authentication

### Media Processing
- **Audio Codecs**: MP3, AAC, Opus, FLAC, Vorbis, PCM, AC3, DTS
- **Video Codecs**: H.264, H.265/HEVC, VP8/9, AV1, MPEG-2, ProRes
- **Image Processing**: JPEG, PNG compression, filters, transforms

### System Level
- **Drivers**: Device drivers, kernel modules
- **Firmware**: Bootloaders, embedded systems
- **Low-level**: Memory management, interrupt handlers

### Domain Specific
- **Machine Learning**: Neural networks, gradient descent, k-means
- **Graphics**: Ray tracing, rasterization, shaders
- **Financial**: Options pricing, risk models
- **Medical**: Image reconstruction, signal processing
- **Automotive**: Control systems, sensor fusion

### Cross-Language Support (v1.4.0+)
- Same algorithms detected consistently across Python, JavaScript, C/C++, Java
- 100% MinHash similarity for identical algorithms in different languages
- Language-agnostic normalization for true semantic matching

## Hashing Methods

### Direct Hashing
- SHA256: Cryptographic hash for exact matching
- MD5: Fast hash for quick comparisons
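
Direct hashes match only byte-identical input, so even trivial reformatting defeats them unless the code is normalized first. A small sketch of that effect (the normalization scheme here is illustrative, not CopycatM's):

```python
import hashlib

def direct_hash(code: str) -> str:
    """SHA256 over the raw bytes: exact matching only."""
    return hashlib.sha256(code.encode()).hexdigest()

def normalized_hash(code: str) -> str:
    """SHA256 after dropping blank lines and trailing spaces,
    so cosmetic reformatting no longer changes the digest."""
    lines = [ln.rstrip() for ln in code.splitlines() if ln.strip()]
    return hashlib.sha256("\n".join(lines).encode()).hexdigest()
```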

### Fuzzy Hashing (Enhanced in v1.4.0)
- **TLSH**: Optimized preprocessing for code similarity
  - Algorithm-focused normalization
  - Smart padding for short functions
  - 15% transformation resistance (vs 5% standard)
- **ssdeep**: Primary fallback for code similarity
- **Enhanced Fallback**: Multi-component hashing when libraries unavailable

### Semantic Hashing (Cross-Language Support in v1.4.0)
- **MinHash**: 100% cross-language similarity with normalization
  - Language-agnostic code representation
  - Structural shingle extraction
  - 96.9% uniqueness (up from 61.9%)
- **SimHash**: Hamming distance for structural similarity
- **LSH**: Locality-sensitive hashing for approximate nearest neighbor search
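
The idea behind MinHash similarity can be sketched in pure Python (CopycatM itself builds on the DataSketch library; the tokenization, shingle size, and permutation count below are illustrative, not the tool's parameters):

```python
import hashlib

def shingles(code: str, k: int = 3):
    """k-token shingles over a whitespace-tokenized snippet."""
    toks = code.split()
    return {" ".join(toks[i:i + k]) for i in range(len(toks) - k + 1)}

def minhash(items, num_perm: int = 64):
    """Signature: for each salted 'permutation', the minimum hash over items."""
    return [min(int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(),
                                digest_size=8).digest(), "big")
                for s in items)
            for seed in range(num_perm)]

def similarity(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two structurally similar snippets share many shingles, so their signatures agree in many slots; unrelated code agrees in almost none.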

## Development

### Setup Development Environment

```bash
git clone https://github.com/oscarvalenzuelab/semantic-copycat-minner
cd semantic-copycat-minner
pip install -e .[dev]
```

### Running Tests

```bash
pytest tests/
```

## License

GNU Affero General Public License v3.0 - see LICENSE file for details.

## Acknowledgments

- Tree-sitter for robust parsing
- TLSH for fuzzy hashing
- DataSketch for MinHash implementation
- Radon for cyclomatic complexity analysis

            
