semantic-copycat-minner

Name: semantic-copycat-minner
Version: 1.4.2
Summary: Semantic analysis tool for detecting AI-generated code derived from copyrighted sources
Upload time: 2025-07-29 23:02:46
Requires Python: >=3.8
License: AGPL-3.0
Keywords: code-analysis, ai-detection, semantic-analysis, copyright, gpl
# Semantic Copycat Minner (CopycatM)

A semantic analysis tool for extracting code hashes, algorithms, and structural features for similarity analysis and copyright detection.

## Features

- **Multi-language Support**: Python, JavaScript/TypeScript, Java, C/C++, Go, Rust
- **Semantic Analysis**: AST-based code analysis with tree-sitter parsers
- **Algorithm Detection**: Pattern recognition for 40+ algorithm types across 8 categories
- **Unknown Algorithm Detection**: Structural complexity analysis to identify novel algorithms (v1.2.0+)
- **Cross-Language Consistency**: 100% MinHash similarity for same algorithms across languages (v1.4.0+)
- **Transformation Resistance**: 86.8% average resistance to code transformations (v1.4.0+)
- **Audio/Video Codec Detection**: Comprehensive detection of 40+ multimedia codecs (v1.4.0+)
  - Successfully tested on FFmpeg source code
  - Detects MP3, AAC, Opus, FLAC, PCM audio codecs
  - Detects H.264, H.265, VP8/9, AV1 video codecs
- **Fuzzy Hashing**: TLSH with optimized preprocessing for code similarity
- **Semantic Hashing**: MinHash and SimHash for structural similarity
- **CLI Interface**: Easy-to-use command-line tool with batch processing
- **JSON Output**: Structured output for integration with other tools

## Installation

### From PyPI

```bash
pip install semantic-copycat-minner
```

### From Source

```bash
git clone https://github.com/oscarvalenzuelab/semantic-copycat-minner
cd semantic-copycat-minner
pip install -e .
```

## Quick Start

### Analyze a Single File

```bash
copycatm src/algorithm.py -o results.json
```

### Analyze a Directory

```bash
copycatm ./codebase -o results.json
```

### Custom Configuration

```bash
copycatm algorithm.py --complexity-threshold 5 --min-lines 50 -o results.json
```

## CLI Usage

### Basic Commands

```bash
# Single file analysis
copycatm <file_path> [options]

# Batch directory analysis
copycatm <directory_path> [options]
```

### Options

```bash
# Core options
--output, -o           Output JSON file path (default: stdout)
--verbose, -v          Verbose output (can be repeated: -v, -vv, -vvv)
--quiet, -q           Suppress all output except errors
--debug               Enable debug mode with intermediate representations

# Analysis configuration
--complexity-threshold, -c    Cyclomatic complexity threshold (default: 3)
--min-lines                   Minimum lines for algorithm analysis (default: 20, recommend 2 for utility libraries)
--include-intermediates       Include AST and control flow graphs in output
--languages                   Comma-separated list of languages to analyze

# Hash configuration
--hash-algorithms            Comma-separated hash types (default: sha256,tlsh,minhash)
--tlsh-threshold            TLSH similarity threshold (default: 100)
--lsh-bands                 LSH band count for similarity detection (default: 20)

# Output filtering
--only-algorithms           Only output algorithmic signatures
--only-metadata            Only output file metadata
--confidence-threshold     Minimum confidence score to include (0.0-1.0)

# Performance
--parallel, -p             Number of parallel workers (default: CPU count)
--chunk-size              Files per chunk for batch processing (default: 100)
```

## Library API

The library provides different levels of API access, with `CopycatAnalyzer` as the main entry point for most use cases.

### Main Entry Point: CopycatAnalyzer

`CopycatAnalyzer` is the primary interface that orchestrates all analysis components including parsing, algorithm detection, hashing, and complexity analysis.

```python
from semantic_copycat_minner import CopycatAnalyzer, AnalysisConfig

# Create analyzer with default configuration
analyzer = CopycatAnalyzer()

# Analyze a file (auto-detect language from extension)
result = analyzer.analyze_file("src/algorithm.py")

# Force specific language (useful for non-standard extensions)
result = analyzer.analyze_file("script.txt", force_language="python")

# Analyze code string directly
result = analyzer.analyze_code(code, "python", "algorithm.py")

# Analyze directory
results = analyzer.analyze_directory("./codebase")
```

### Lower-Level Components

For advanced use cases, you can access individual components directly:

```python
from semantic_copycat_minner import AlgorithmDetector
from semantic_copycat_minner.parsers import TreeSitterParser

# Direct algorithm detection with flexible input
detector = AlgorithmDetector()

# Option 1: Provide raw content (convenience method)
algorithms = detector.detect_algorithms_from_input(content, "python")

# Option 2: Provide pre-parsed AST (for reuse across components)
parser = TreeSitterParser()
ast_tree = parser.parse(content, "python")
algorithms = detector.detect_algorithms_from_input(ast_tree, "python")

# Option 3: Use original method (backward compatible)
algorithms = detector.detect_algorithms(ast_tree, "python")
```

### Custom Configuration

```python
from semantic_copycat_minner import CopycatAnalyzer, AnalysisConfig

# Create custom configuration
config = AnalysisConfig(
    complexity_threshold=5,
    min_lines=50,
    include_intermediates=True,
    hash_algorithms=["sha256", "tlsh", "minhash"],
    confidence_threshold=0.8
)

analyzer = CopycatAnalyzer(config)
result = analyzer.analyze_file("src/algorithm.py")
```

## Output Format

The tool outputs structured JSON with the following components:

### File Metadata

```json
{
  "file_metadata": {
    "file_name": "algorithm.py",
    "language": "python",
    "line_count": 85,
    "is_source_code": true,
    "analysis_timestamp": "2025-07-25T10:30:00Z"
  }
}
```

### Algorithm Detection

```json
{
  "algorithms": [
    {
      "id": "algo_001",
      "type": "algorithm",
      "name": "quicksort_implementation",
      "confidence": 0.92,
      "complexity_metric": 8,
      "evidence": {
        "pattern_type": "divide_and_conquer",
        "control_flow": "recursive_partition"
      },
      "hashes": {
        "direct": {"sha256": "abc123..."},
        "fuzzy": {"tlsh": "T1A2B3C4..."},
        "semantic": {"minhash": "123456789abcdef"}
      },
      "transformation_resistance": {
        "variable_renaming": 0.95,
        "language_translation": 0.85
      }
    }
  ]
}
```
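
Because the report is plain JSON, downstream filtering needs nothing beyond the standard library. A minimal sketch (the `results.json` file name and 0.9 cutoff are arbitrary; only the documented `algorithms` and `confidence` fields are assumed):

```python
import json

def high_confidence_algorithms(path, threshold=0.9):
    """Return detected algorithms at or above a confidence cutoff."""
    with open(path) as f:
        report = json.load(f)
    return [a for a in report.get("algorithms", [])
            if a.get("confidence", 0.0) >= threshold]

# e.g. hits = high_confidence_algorithms("results.json", threshold=0.9)
```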

### Unknown Algorithm Detection (v1.2.0+)

For complex code that doesn't match known algorithm patterns, CopycatM performs structural complexity analysis to identify unknown algorithms. This feature automatically activates for files with 50+ lines to optimize performance.

```json
{
  "algorithms": [
    {
      "id": "unknown_a1b2c3d4",
      "algorithm_type": "unknown_complex_algorithm",
      "subtype_classification": "bitwise_manipulation_algorithm",
      "confidence_score": 0.79,
      "evidence": {
        "complexity_score": 0.79,
        "cyclomatic_complexity": 33,
        "nesting_depth": 5,
        "operation_density": 4.2,
        "unique_operations": 25,
        "structural_hash": "abc123def456",
        "algorithmic_fingerprint": "ALG-E66468BA743C"
      },
      "transformation_resistance": {
        "structural_hash": 0.9,
        "operation_patterns": 0.85,
        "complexity_metrics": 0.95
      }
    }
  ]
}
```

Unknown algorithms are classified into subtypes based on their dominant characteristics:
- `complex_iteration_pattern` - Nested loops and complex iteration
- `bitwise_manipulation_algorithm` - Heavy use of bitwise operations
- `mathematical_computation` - Dense mathematical operations
- `complex_decision_logic` - High conditional complexity
- `data_transformation_algorithm` - Complex data flow patterns
- `deeply_nested_algorithm` - Extreme nesting depth
- `unclassified_complex_pattern` - Other complex patterns
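
The `cyclomatic_complexity` and `nesting_depth` evidence values are standard structural metrics. As a rough illustration of what they measure — not CopycatM's implementation, which works on tree-sitter ASTs across languages — here is a stdlib `ast` approximation for Python input:

```python
import ast

# Node types counted as branch points (a simplification of the real metric).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try,
                ast.BoolOp, ast.ExceptHandler)
NESTING_NODES = (ast.If, ast.For, ast.While, ast.Try,
                 ast.FunctionDef, ast.With)

def complexity_metrics(source: str):
    """Approximate cyclomatic complexity (1 + branch points) and
    maximum nesting depth for a Python snippet."""
    tree = ast.parse(source)
    branches = sum(isinstance(n, BRANCH_NODES) for n in ast.walk(tree))

    def depth(node, d=0):
        # Children of a nesting construct sit one level deeper.
        bump = isinstance(node, NESTING_NODES)
        kids = list(ast.iter_child_nodes(node))
        return max([depth(k, d + bump) for k in kids], default=d)

    return {"cyclomatic_complexity": 1 + branches,
            "nesting_depth": depth(tree)}
```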

### Mathematical Invariants

```json
{
  "mathematical_invariants": [
    {
      "id": "inv_001",
      "type": "mathematical_expression",
      "confidence": 0.78,
      "evidence": {
        "expression_type": "arithmetic_calculation"
      }
    }
  ]
}
```

## Configuration

### Configuration File

Create `copycatm.json` in your project directory:

```json
{
  "analysis": {
    "complexity_threshold": 3,
    "min_lines": 2,
    "confidence_threshold": 0.0,
    "unknown_algorithm_threshold": 50
  },
  "languages": {
    "enabled": ["python", "javascript", "java", "c", "cpp", "go", "rust"]
  },
  "hashing": {
    "algorithms": ["sha256", "tlsh", "minhash"],
    "tlsh_threshold": 100,
    "lsh_bands": 20
  },
  "performance": {
    "parallel_workers": null,
    "chunk_size": 100
  },
  "output": {
    "include_intermediates": false
  }
}
```

Note that JSON does not allow comments, so keep the file comment-free. Recommended `min_lines`: 2 for utility libraries, 20 for general code. `unknown_algorithm_threshold` is the line count above which unknown algorithm detection activates.
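
One way to consume this file is to merge it over built-in defaults, section by section. The helper below is a hypothetical sketch (not the tool's actual loader) using only a subset of the documented keys:

```python
import json
from pathlib import Path

# Defaults mirroring a subset of the documented configuration file.
DEFAULTS = {
    "analysis": {"complexity_threshold": 3, "min_lines": 20,
                 "confidence_threshold": 0.0},
    "hashing": {"algorithms": ["sha256", "tlsh", "minhash"]},
}

def load_config(path="copycatm.json"):
    """Merge a project copycatm.json over the defaults, section by section."""
    cfg = {section: dict(values) for section, values in DEFAULTS.items()}
    p = Path(path)
    if p.exists():
        for section, values in json.loads(p.read_text()).items():
            cfg.setdefault(section, {}).update(values)
    return cfg
```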

## Supported Languages

- **Python**: `.py`, `.pyx`, `.pyi`
- **JavaScript**: `.js`, `.jsx`
- **TypeScript**: `.ts`, `.tsx`
- **Java**: `.java`
- **C/C++**: `.c`, `.cpp`, `.cc`, `.cxx`, `.h`, `.hpp`
- **Go**: `.go`
- **Rust**: `.rs`

## Algorithm Detection

The tool can detect 40+ algorithmic patterns across 8 major categories:

### Core CS Algorithms
- **Sorting**: Quicksort, Mergesort, Bubblesort, Heapsort, Radix Sort
- **Searching**: Binary Search, Linear Search, DFS, BFS, A*, Jump Search
- **Graph Algorithms**: Dijkstra's, Bellman-Ford, Kruskal's, Prim's, Floyd-Warshall
- **Dynamic Programming**: Fibonacci, LCS, Knapsack, Edit Distance

### Security & Cryptography
- **Encryption**: AES, RSA, DES, ChaCha20, Elliptic Curve
- **Hashing**: SHA family, MD5, bcrypt, Argon2
- **Security**: Anti-tampering, obfuscation, authentication

### Media Processing
- **Audio Codecs**: MP3, AAC, Opus, FLAC, Vorbis, PCM, AC3, DTS
- **Video Codecs**: H.264, H.265/HEVC, VP8/9, AV1, MPEG-2, ProRes
- **Image Processing**: JPEG, PNG compression, filters, transforms

### System Level
- **Drivers**: Device drivers, kernel modules
- **Firmware**: Bootloaders, embedded systems
- **Low-level**: Memory management, interrupt handlers

### Domain Specific
- **Machine Learning**: Neural networks, gradient descent, k-means
- **Graphics**: Ray tracing, rasterization, shaders
- **Financial**: Options pricing, risk models
- **Medical**: Image reconstruction, signal processing
- **Automotive**: Control systems, sensor fusion

### Cross-Language Support (v1.4.0+)
- Same algorithms detected consistently across Python, JavaScript, C/C++, Java
- 100% MinHash similarity for identical algorithms in different languages
- Language-agnostic normalization for true semantic matching

## Hashing Methods

### Direct Hashing
- SHA256: Cryptographic hash for exact matching
- MD5: Fast hash for quick comparisons
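
Direct hashes match only byte-identical input, so even trivial reformatting defeats them unless the code is normalized first. A small sketch of that effect (the normalization scheme here is illustrative, not CopycatM's):

```python
import hashlib

def direct_hash(code: str) -> str:
    """SHA256 over the raw bytes: exact matching only."""
    return hashlib.sha256(code.encode()).hexdigest()

def normalized_hash(code: str) -> str:
    """SHA256 after dropping blank lines and trailing spaces,
    so cosmetic reformatting no longer changes the digest."""
    lines = [ln.rstrip() for ln in code.splitlines() if ln.strip()]
    return hashlib.sha256("\n".join(lines).encode()).hexdigest()
```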

### Fuzzy Hashing (Enhanced in v1.4.0)
- **TLSH**: Optimized preprocessing for code similarity
  - Algorithm-focused normalization
  - Smart padding for short functions
  - 15% transformation resistance (vs 5% standard)
- **ssdeep**: Primary fallback for code similarity
- **Enhanced Fallback**: Multi-component hashing when libraries unavailable

### Semantic Hashing (Cross-Language Support in v1.4.0)
- **MinHash**: 100% cross-language similarity with normalization
  - Language-agnostic code representation
  - Structural shingle extraction
  - 96.9% uniqueness (up from 61.9%)
- **SimHash**: Hamming distance for structural similarity
- **LSH**: Locality-sensitive hashing for approximate nearest neighbor search
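
The idea behind MinHash similarity can be sketched in pure Python (CopycatM itself builds on the DataSketch library; the tokenization, shingle size, and permutation count below are illustrative, not the tool's parameters):

```python
import hashlib

def shingles(code: str, k: int = 3):
    """k-token shingles over a whitespace-tokenized snippet."""
    toks = code.split()
    return {" ".join(toks[i:i + k]) for i in range(len(toks) - k + 1)}

def minhash(items, num_perm: int = 64):
    """Signature: for each salted 'permutation', the minimum hash over items."""
    return [min(int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(),
                                digest_size=8).digest(), "big")
                for s in items)
            for seed in range(num_perm)]

def similarity(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two structurally similar snippets share many shingles, so their signatures agree in many slots; unrelated code agrees in almost none.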

## Development

### Setup Development Environment

```bash
git clone https://github.com/oscarvalenzuelab/semantic-copycat-minner
cd semantic-copycat-minner
pip install -e .[dev]
```

### Running Tests

```bash
pytest tests/
```

## License

GNU Affero General Public License v3.0 - see LICENSE file for details.

## Acknowledgments

- Tree-sitter for robust parsing
- TLSH for fuzzy hashing
- DataSketch for MinHash implementation
- Radon for cyclomatic complexity analysis

            
