# Semantic Copycat Minner (CopycatM)
A semantic analysis tool for extracting code hashes, algorithms, and structural features for similarity analysis and copyright detection.
## Features
- **Multi-language Support**: Python, JavaScript/TypeScript, Java, C/C++, Go, Rust
- **Semantic Analysis**: AST-based code analysis with tree-sitter parsers
- **Algorithm Detection**: Pattern recognition for 40+ algorithm types across 8 categories
- **Unknown Algorithm Detection**: Structural complexity analysis to identify novel algorithms (v1.2.0+)
- **Cross-Language Consistency**: 100% MinHash similarity for same algorithms across languages (v1.4.0+)
- **Transformation Resistance**: 86.8% average resistance to code transformations (v1.4.0+)
- **Audio/Video Codec Detection**: Comprehensive detection of 40+ multimedia codecs (v1.4.0+)
  - Successfully tested on FFmpeg source code
  - Detects MP3, AAC, Opus, FLAC, PCM audio codecs
  - Detects H.264, H.265, VP8/9, AV1 video codecs
- **Fuzzy Hashing**: TLSH with optimized preprocessing for code similarity
- **Semantic Hashing**: MinHash and SimHash for structural similarity
- **CLI Interface**: Easy-to-use command-line tool with batch processing
- **JSON Output**: Structured output for integration with other tools
## Installation
### From PyPI
```bash
pip install semantic-copycat-minner
```
### From Source
```bash
git clone https://github.com/oscarvalenzuelab/semantic-copycat-minner
cd semantic-copycat-minner
pip install -e .
```
## Quick Start
### Analyze a Single File
```bash
copycatm src/algorithm.py -o results.json
```
### Analyze a Directory
```bash
copycatm ./codebase -o results.json
```
### Custom Configuration
```bash
copycatm algorithm.py --complexity-threshold 5 --min-lines 50 -o results.json
```
## CLI Usage
### Basic Commands
```bash
# Single file analysis
copycatm <file_path> [options]
# Batch directory analysis
copycatm <directory_path> [options]
```
### Options
```bash
# Core options
--output, -o                 Output JSON file path (default: stdout)
--verbose, -v                Verbose output (can be repeated: -v, -vv, -vvv)
--quiet, -q                  Suppress all output except errors
--debug                      Enable debug mode with intermediate representations

# Analysis configuration
--complexity-threshold, -c   Cyclomatic complexity threshold (default: 3)
--min-lines                  Minimum lines for algorithm analysis (default: 20; recommend 2 for utility libraries)
--include-intermediates      Include AST and control flow graphs in output
--languages                  Comma-separated list of languages to analyze

# Hash configuration
--hash-algorithms            Comma-separated hash types (default: sha256,tlsh,minhash)
--tlsh-threshold             TLSH similarity threshold (default: 100)
--lsh-bands                  LSH band count for similarity detection (default: 20)

# Output filtering
--only-algorithms            Only output algorithmic signatures
--only-metadata              Only output file metadata
--confidence-threshold       Minimum confidence score to include (0.0-1.0)

# Performance
--parallel, -p               Number of parallel workers (default: CPU count)
--chunk-size                 Files per chunk for batch processing (default: 100)
```
## Library API
The library provides different levels of API access, with `CopycatAnalyzer` as the main entry point for most use cases.
### Main Entry Point: CopycatAnalyzer
`CopycatAnalyzer` is the primary interface that orchestrates all analysis components including parsing, algorithm detection, hashing, and complexity analysis.
```python
from semantic_copycat_minner import CopycatAnalyzer, AnalysisConfig
# Create analyzer with default configuration
analyzer = CopycatAnalyzer()
# Analyze a file (auto-detect language from extension)
result = analyzer.analyze_file("src/algorithm.py")
# Force specific language (useful for non-standard extensions)
result = analyzer.analyze_file("script.txt", force_language="python")
# Analyze code string directly
result = analyzer.analyze_code(code, "python", "algorithm.py")
# Analyze directory
results = analyzer.analyze_directory("./codebase")
```
### Lower-Level Components
For advanced use cases, you can access individual components directly:
```python
from semantic_copycat_minner import AlgorithmDetector
from semantic_copycat_minner.parsers import TreeSitterParser
# Direct algorithm detection with flexible input
detector = AlgorithmDetector()
# Option 1: Provide raw content (convenience method)
algorithms = detector.detect_algorithms_from_input(content, "python")
# Option 2: Provide pre-parsed AST (for reuse across components)
parser = TreeSitterParser()
ast_tree = parser.parse(content, "python")
algorithms = detector.detect_algorithms_from_input(ast_tree, "python")
# Option 3: Use original method (backward compatible)
algorithms = detector.detect_algorithms(ast_tree, "python")
```
### Custom Configuration
```python
from semantic_copycat_minner import CopycatAnalyzer, AnalysisConfig
# Create custom configuration
config = AnalysisConfig(
    complexity_threshold=5,
    min_lines=50,
    include_intermediates=True,
    hash_algorithms=["sha256", "tlsh", "minhash"],
    confidence_threshold=0.8
)
analyzer = CopycatAnalyzer(config)
result = analyzer.analyze_file("src/algorithm.py")
```
## Output Format
The tool outputs structured JSON with the following components:
### File Metadata
```json
{
  "file_metadata": {
    "file_name": "algorithm.py",
    "language": "python",
    "line_count": 85,
    "is_source_code": true,
    "analysis_timestamp": "2025-07-25T10:30:00Z"
  }
}
```
### Algorithm Detection
```json
{
  "algorithms": [
    {
      "id": "algo_001",
      "type": "algorithm",
      "name": "quicksort_implementation",
      "confidence": 0.92,
      "complexity_metric": 8,
      "evidence": {
        "pattern_type": "divide_and_conquer",
        "control_flow": "recursive_partition"
      },
      "hashes": {
        "direct": {"sha256": "abc123..."},
        "fuzzy": {"tlsh": "T1A2B3C4..."},
        "semantic": {"minhash": "123456789abcdef"}
      },
      "transformation_resistance": {
        "variable_renaming": 0.95,
        "language_translation": 0.85
      }
    }
  ]
}
```
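Downstream tools can consume this JSON directly. As a sketch (standard library only, with an inline stand-in for a `results.json` file matching the schema above), filtering detections by confidence looks like:

```python
import json

# Stand-in for a results file produced by `copycatm ... -o results.json`;
# the entries here are illustrative, not real tool output.
results = json.loads("""
{
  "algorithms": [
    {"id": "algo_001", "name": "quicksort_implementation", "confidence": 0.92},
    {"id": "algo_002", "name": "helper_loop", "confidence": 0.41}
  ]
}
""")

# Keep only high-confidence detections (the 0.8 threshold is illustrative).
high_confidence = [a for a in results["algorithms"] if a["confidence"] >= 0.8]
for algo in high_confidence:
    print(f'{algo["id"]}: {algo["name"]} ({algo["confidence"]:.2f})')
```

The same filtering is available at analysis time via `--confidence-threshold`.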
### Unknown Algorithm Detection (v1.2.0+)
For complex code that doesn't match known algorithm patterns, CopycatM performs structural complexity analysis to identify unknown algorithms. This feature automatically activates for files with 50+ lines to optimize performance.
```json
{
  "algorithms": [
    {
      "id": "unknown_a1b2c3d4",
      "algorithm_type": "unknown_complex_algorithm",
      "subtype_classification": "bitwise_manipulation_algorithm",
      "confidence_score": 0.79,
      "evidence": {
        "complexity_score": 0.79,
        "cyclomatic_complexity": 33,
        "nesting_depth": 5,
        "operation_density": 4.2,
        "unique_operations": 25,
        "structural_hash": "abc123def456",
        "algorithmic_fingerprint": "ALG-E66468BA743C"
      },
      "transformation_resistance": {
        "structural_hash": 0.9,
        "operation_patterns": 0.85,
        "complexity_metrics": 0.95
      }
    }
  ]
}
```
Unknown algorithms are classified into subtypes based on their dominant characteristics:
- `complex_iteration_pattern` - Nested loops and complex iteration
- `bitwise_manipulation_algorithm` - Heavy use of bitwise operations
- `mathematical_computation` - Dense mathematical operations
- `complex_decision_logic` - High conditional complexity
- `data_transformation_algorithm` - Complex data flow patterns
- `deeply_nested_algorithm` - Extreme nesting depth
- `unclassified_complex_pattern` - Other complex patterns
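A classifier of this kind can be pictured as a cascade of metric checks. The function below is an illustrative sketch only — the metric names, thresholds, and ordering are assumptions, not CopycatM's actual classification rules:

```python
def classify_subtype(metrics: dict) -> str:
    """Toy subtype heuristic. Thresholds and metric names are
    illustrative assumptions, not CopycatM's real logic."""
    if metrics.get("nesting_depth", 0) >= 6:
        return "deeply_nested_algorithm"
    if metrics.get("bitwise_op_ratio", 0.0) > 0.3:
        return "bitwise_manipulation_algorithm"
    if metrics.get("math_op_density", 0.0) > 3.0:
        return "mathematical_computation"
    if metrics.get("cyclomatic_complexity", 0) > 25:
        return "complex_decision_logic"
    return "unclassified_complex_pattern"

# A bitwise-heavy sample at moderate nesting depth:
print(classify_subtype({"nesting_depth": 5, "bitwise_op_ratio": 0.45}))
```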
### Mathematical Invariants
```json
{
  "mathematical_invariants": [
    {
      "id": "inv_001",
      "type": "mathematical_expression",
      "confidence": 0.78,
      "evidence": {
        "expression_type": "arithmetic_calculation"
      }
    }
  ]
}
```
## Configuration
### Configuration File
Create `copycatm.json` in your project directory:
```json
{
  "analysis": {
    "complexity_threshold": 3,
    "min_lines": 2,
    "confidence_threshold": 0.0,
    "unknown_algorithm_threshold": 50
  },
  "languages": {
    "enabled": ["python", "javascript", "java", "c", "cpp", "go", "rust"]
  },
  "hashing": {
    "algorithms": ["sha256", "tlsh", "minhash"],
    "tlsh_threshold": 100,
    "lsh_bands": 20
  },
  "performance": {
    "parallel_workers": null,
    "chunk_size": 100
  },
  "output": {
    "include_intermediates": false
  }
}
```

Notes: a `min_lines` of 2 is recommended for utility libraries (20 for general code), and `unknown_algorithm_threshold` is the line-count threshold above which unknown algorithm detection activates. (JSON does not allow comments, so these recommendations cannot live in the file itself.)
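Layering such a file over defaults can be done with the standard library alone. This is a sketch of one plausible approach — the defaults mirror those documented above, but CopycatM's own loader may differ:

```python
import json
from pathlib import Path

# Defaults taken from the documented option values above.
DEFAULTS = {
    "analysis": {"complexity_threshold": 3, "min_lines": 20,
                 "confidence_threshold": 0.0},
    "hashing": {"algorithms": ["sha256", "tlsh", "minhash"],
                "tlsh_threshold": 100, "lsh_bands": 20},
}

def load_config(path: str = "copycatm.json") -> dict:
    """Shallow-merge a project config file over the defaults.
    Illustrative only; not CopycatM's actual loader."""
    config = {section: dict(values) for section, values in DEFAULTS.items()}
    p = Path(path)
    if p.exists():
        for section, values in json.loads(p.read_text()).items():
            config.setdefault(section, {}).update(values)
    return config
```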
## Supported Languages
- **Python**: `.py`, `.pyx`, `.pyi`
- **JavaScript**: `.js`, `.jsx`
- **TypeScript**: `.ts`, `.tsx`
- **Java**: `.java`
- **C/C++**: `.c`, `.cpp`, `.cc`, `.cxx`, `.h`, `.hpp`
- **Go**: `.go`
- **Rust**: `.rs`
## Algorithm Detection
The tool can detect 40+ algorithmic patterns across 8 major categories:
### Core CS Algorithms
- **Sorting**: Quicksort, Mergesort, Bubblesort, Heapsort, Radix Sort
- **Searching**: Binary Search, Linear Search, DFS, BFS, A*, Jump Search
- **Graph Algorithms**: Dijkstra's, Bellman-Ford, Kruskal's, Prim's, Floyd-Warshall
- **Dynamic Programming**: Fibonacci, LCS, Knapsack, Edit Distance
### Security & Cryptography
- **Encryption**: AES, RSA, DES, ChaCha20, Elliptic Curve
- **Hashing**: SHA family, MD5, bcrypt, Argon2
- **Security**: Anti-tampering, obfuscation, authentication
### Media Processing
- **Audio Codecs**: MP3, AAC, Opus, FLAC, Vorbis, PCM, AC3, DTS
- **Video Codecs**: H.264, H.265/HEVC, VP8/9, AV1, MPEG-2, ProRes
- **Image Processing**: JPEG, PNG compression, filters, transforms
### System Level
- **Drivers**: Device drivers, kernel modules
- **Firmware**: Bootloaders, embedded systems
- **Low-level**: Memory management, interrupt handlers
### Domain Specific
- **Machine Learning**: Neural networks, gradient descent, k-means
- **Graphics**: Ray tracing, rasterization, shaders
- **Financial**: Options pricing, risk models
- **Medical**: Image reconstruction, signal processing
- **Automotive**: Control systems, sensor fusion
### Cross-Language Support (v1.4.0+)
- Same algorithms detected consistently across Python, JavaScript, C/C++, Java
- 100% MinHash similarity for identical algorithms in different languages
- Language-agnostic normalization for true semantic matching
## Hashing Methods
### Direct Hashing
- SHA256: Cryptographic hash for exact matching
- MD5: Fast hash for quick comparisons
### Fuzzy Hashing (Enhanced in v1.4.0)
- **TLSH**: Optimized preprocessing for code similarity
  - Algorithm-focused normalization
  - Smart padding for short functions
  - 15% transformation resistance (vs 5% standard)
- **ssdeep**: Primary fallback for code similarity
- **Enhanced Fallback**: Multi-component hashing when libraries unavailable
### Semantic Hashing (Cross-Language Support in v1.4.0)
- **MinHash**: 100% cross-language similarity with normalization
  - Language-agnostic code representation
  - Structural shingle extraction
  - 96.9% uniqueness (up from 61.9%)
- **SimHash**: Hamming distance for structural similarity
- **LSH**: Locality-sensitive hashing for approximate nearest neighbor search
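The normalize–shingle–MinHash pipeline can be illustrated end to end with a toy, stdlib-only sketch. The normalization rules, shingle size, and hash count below are illustrative stand-ins, not CopycatM's actual implementation:

```python
import hashlib
import re

def normalize(code: str) -> str:
    """Crude language-agnostic normalization: strip line comments,
    rename every identifier-like token to ID, collapse whitespace."""
    code = re.sub(r"#.*|//.*", "", code)
    code = re.sub(r"\b[A-Za-z_]\w*\b", "ID", code)
    return re.sub(r"\s+", " ", code).strip().lower()

def shingles(text: str, k: int = 4) -> set:
    """Overlapping character k-grams of the normalized code."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash(sh: set, n_hashes: int = 64) -> list:
    """MinHash signature: for each seeded hash function, keep the
    minimum hash value over all shingles."""
    return [
        min(int.from_bytes(
            hashlib.blake2b(s.encode(), digest_size=8,
                            salt=seed.to_bytes(8, "big")).digest(), "big")
            for s in sh)
        for seed in range(n_hashes)
    ]

def similarity(a: list, b: list) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

py = "def total(xs):\n    s = 0\n    for x in xs: s += x\n    return s"
js = "function total(ys) { let t = 0; for (const y of ys) t += y; return t; }"
sig_a = minhash(shingles(normalize(py)))
sig_b = minhash(shingles(normalize(js)))
print(f"estimated similarity: {similarity(sig_a, sig_b):.2f}")
```

Identical code always scores 1.0; how close two translations of the same algorithm get depends on how aggressively the normalization erases language-specific syntax, which is what the cross-language improvements in v1.4.0 target.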
## Development
### Setup Development Environment
```bash
git clone https://github.com/oscarvalenzuelab/semantic-copycat-minner
cd semantic-copycat-minner
pip install -e .[dev]
```
### Running Tests
```bash
pytest tests/
```
## License
GNU Affero General Public License v3.0 - see LICENSE file for details.
## Acknowledgments
- Tree-sitter for robust parsing
- TLSH for fuzzy hashing
- DataSketch for MinHash implementation
- Radon for cyclomatic complexity analysis