semantic-copycat-binarysniffer

- **Name**: semantic-copycat-binarysniffer
- **Version**: 1.10.1
- **Summary**: A high-performance CLI and library for detecting open source components in binaries through semantic signature matching
- **Upload time**: 2025-08-30 17:45:19
- **Requires Python**: >=3.8
- **License**: Apache-2.0
- **Keywords**: binary-analysis, license-compliance, signature-matching, oss-detection, semantic-analysis, semantic-copycat, code-copycat

# Semantic Copycat BinarySniffer

BinarySniffer is a high-performance CLI tool and Python library for detecting open source components and security threats in binaries through semantic signature matching. It specializes in analyzing mobile apps (APK/IPA), Java archives, ML models, and source code to identify OSS components, their licenses, and potential security risks.


## Features

### Core Analysis
- **Fuzzy Matching**: Detect modified, recompiled, or patched OSS components using TLSH
- **Deterministic Results**: Consistent analysis results across multiple runs
- **Fast Local Analysis**: SQLite-based signature storage with optimized direct matching
- **Efficient Matching**: MinHash LSH for similarity detection, trigram indexing for substring matching
- **Dual Interface**: Use as CLI tool or Python library
- **Smart Compression**: ZSTD-compressed signatures with ~90% size reduction
- **Low Memory Footprint**: Streaming analysis with <100MB memory usage
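
To make the trigram indexing idea concrete, here is a minimal sketch of how a trigram index can narrow a substring search. It illustrates the general technique only and is not BinarySniffer's implementation; the function names `trigrams`, `build_trigram_index`, and `candidates` are hypothetical.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_trigram_index(signatures):
    """Map each trigram to the set of signature strings that contain it."""
    index = defaultdict(set)
    for sig in signatures:
        for gram in trigrams(sig):
            index[gram].add(sig)
    return index


def candidates(index, query):
    """Return the signatures that contain query as a substring.

    The index narrows the search to signatures sharing every trigram of
    the query; the final `in` check discards false candidates.
    """
    grams = trigrams(query)
    if not grams:  # query shorter than 3 characters: nothing to look up
        return set()
    pools = [index.get(gram, set()) for gram in grams]
    shared = set.intersection(*pools)
    return {sig for sig in shared if query in sig}


index = build_trigram_index(["av_codec_open", "curl_easy_init", "SSL_CTX_new"])
print(candidates(index, "easy_in"))
```

Only signatures that share all of the query's trigrams are compared in full, which keeps the expensive substring check off most of the database.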

### SBOM Export Support
- **CycloneDX Format**: Industry-standard SBOM export for security and compliance toolchains
- **File Path Tracking**: Evidence includes file paths for component location tracking
- **Feature Extraction**: Optional feature dump for signature recreation
- **Confidence Scores**: All detections include confidence levels in SBOM
- **Multi-file Support**: Aggregate SBOM for entire projects

### Package Inventory Extraction
- **Comprehensive File Enumeration**: Extract complete file listings from archives
- **Rich Metadata**: MIME types, compression ratios, file sizes, timestamps
- **Hash Calculation**: MD5, SHA1, SHA256 for integrity verification
- **Fuzzy Hashing**: TLSH and ssdeep for similarity analysis
- **Component Detection**: Run OSS detection on individual files within packages
- **Multiple Export Formats**: JSON, CSV, tree visualization, summary reports
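
As a rough illustration of what an inventory entry involves, the sketch below lists the files in a ZIP archive and computes MD5, SHA1, and SHA256 in a single streaming pass. It uses only the standard library; the function names are hypothetical, and the compression ratio is assumed here to mean compressed size divided by uncompressed size, which may differ from the tool's definition.

```python
import hashlib
import io
import zipfile


def file_hashes(stream, chunk_size=65536):
    """Compute MD5, SHA1, and SHA256 in one pass over a binary stream."""
    digests = {name: hashlib.new(name) for name in ("md5", "sha1", "sha256")}
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        for digest in digests.values():
            digest.update(chunk)
    return {name: digest.hexdigest() for name, digest in digests.items()}


def inventory(archive):
    """List the files in a ZIP archive with size, ratio, and hashes."""
    entries = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            with zf.open(info) as member:
                hashes = file_hashes(member)
            # Assumed definition: compressed size / uncompressed size.
            ratio = info.compress_size / info.file_size if info.file_size else 1.0
            entries.append({"path": info.filename, "size": info.file_size,
                            "compression_ratio": ratio, "hashes": hashes})
    return entries


# Build a small in-memory archive to exercise the sketch.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("lib/data.txt", "a" * 1000)
for entry in inventory(buffer):
    print(entry["path"], entry["size"], f"{entry['compression_ratio']:.1%}")
```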

### Binary Analysis
- **Advanced Format Support**: ELF, PE, Mach-O analysis with symbol and import extraction via LIEF
- **Static Library Support**: Parse and analyze .a archives, examining each object file separately
- **Android DEX Support**: Specialized extractor for DEX bytecode files
- **Improved Detection**: 25+ components detected in APK files with 152K+ features extracted
- **Substring Matching**: Detects components even with partial pattern matches
- **Progress Indication**: Real-time progress bars for long analysis operations

### Archive Support
- **Mobile Applications**: Android APK and iOS IPA with manifest parsing and native library analysis
- **Java Archives**: JAR/WAR files with MANIFEST.MF parsing and package detection
- **Python Packages**: Wheels (.whl) and eggs (.egg) with metadata extraction
- **Linux Packages**: DEB (Debian/Ubuntu) and RPM (Red Hat/Fedora) packages
- **Extended Formats**: 7z, RAR, Zstandard (.zst, .tar.zst), CPIO
- **Nested Archives**: Handle archives containing other archives (up to 5 levels deep)
- **Intelligent Extraction**: Prioritizes binaries, bytecode, and source files for analysis
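
The nested-archive limit can be pictured as a depth-bounded recursive walk. The following sketch handles ZIP-based containers only and is an illustration under stated assumptions, not the tool's extractor: `walk_archive` and `make_zip` are hypothetical names, and counting the outermost archive as level 1 is one possible reading of "5 levels deep".

```python
import io
import zipfile

MAX_DEPTH = 5  # assumed reading of the "up to 5 levels deep" limit


def walk_archive(data, depth=1, prefix=""):
    """Yield (path, bytes) for each file, descending into nested ZIPs."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            payload = zf.read(info)
            path = prefix + info.filename
            nested = zipfile.is_zipfile(io.BytesIO(payload))
            if nested and depth < MAX_DEPTH:
                yield from walk_archive(payload, depth + 1, path + "!/")
            else:
                # Regular file, or an archive past the depth limit,
                # which is returned as an opaque blob.
                yield path, payload


def make_zip(files):
    """Build an in-memory ZIP from a {name: bytes} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for name, content in files.items():
            zf.writestr(name, content)
    return buffer.getvalue()


inner = make_zip({"classes.dex": b"dex\n035"})
outer = make_zip({"lib/inner.jar": inner, "README": b"hello"})
print([path for path, _ in walk_archive(outer)])
```

Bounding the recursion also guards against archives that nest very deeply, whether by accident or as a deliberate attempt to exhaust resources.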

### Source Code Analysis
- **CTags Integration**: Advanced source code analysis when universal-ctags is available
- **Multi-language Support**: C/C++, Python, Java, JavaScript, Go, Rust, PHP, Swift, Kotlin
- **Semantic Symbol Extraction**: Functions, classes, structs, constants, and dependencies
- **Graceful Fallback**: Regex-based extraction when CTags is unavailable

### ML Model Security Analysis (v1.10.0+)
- **Comprehensive Security Module**: Deep analysis of ML models for security threats
- **MITRE ATT&CK Integration**: Maps threats to ATT&CK framework techniques
- **Multi-Level Risk Assessment**: SAFE, LOW, MEDIUM, HIGH, CRITICAL risk levels
- **Pickle File Parser**: Safe analysis of Python pickle files without code execution
- **ONNX Model Parser**: Comprehensive analysis of ONNX format models
- **SafeTensors Parser**: Validation of secure tensor storage format
- **PyTorch/TensorFlow Native**: Handles .pt, .pth, .pb, .h5 native formats
- **Malicious Detection**: 100% detection rate on real-world ML exploits
- **Framework Detection**: Identifies PyTorch (96%), TensorFlow, sklearn (94%), XGBoost (77%) origins
- **Obfuscation Detection**: Entropy analysis and pattern matching for hidden threats
- **Model Integrity Validation**: Hash verification and tampering detection
- **Architecture Recognition**: Detects ResNet, BERT, YOLO, LLaMA, ViT, etc.
- **Format Validation**: Detects tampering, injection attempts, and format violations
- **Malformed File Detection**: Identifies corrupted or invalid model files with clear warnings
- **Data Exfiltration Detection**: Flags oversized tensors and suspicious patterns
- **Supply Chain Security**: Verifies model provenance and integrity
- **SARIF Output**: CI/CD integration with GitHub Actions and security tools
- **Security-Enhanced SBOM**: CycloneDX format with ML security metadata
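
To show what analyzing a pickle without executing it can look like, here is a minimal sketch built on the standard library's `pickletools.genops`, which walks the opcode stream without unpickling anything. It is a heuristic illustration, not BinarySniffer's parser: `scan_pickle` is a hypothetical name, the module deny-list is invented for the example, and the `STACK_GLOBAL` handling assumes the module and attribute names are the two most recent string arguments.

```python
import os
import pickle
import pickletools

# Hypothetical deny-list, for illustration only.
SUSPICIOUS_MODULES = {"os", "posix", "nt", "subprocess", "builtins", "sys"}


def scan_pickle(data):
    """Walk the opcode stream without unpickling and report risky imports."""
    findings = []
    strings = []  # string arguments seen so far, used for STACK_GLOBAL
    for opcode, arg, _pos in pickletools.genops(data):
        if isinstance(arg, str):
            strings.append(arg)
        if opcode.name in ("GLOBAL", "INST"):
            # pickletools reports the argument as "module name".
            module = arg.split(" ", 1)[0]
        elif opcode.name == "STACK_GLOBAL" and len(strings) >= 2:
            # Heuristic: module and attribute were pushed just before.
            module = strings[-2]
        else:
            continue
        if module.split(".")[0] in SUSPICIOUS_MODULES:
            findings.append(module)
    return findings


class Exploit:
    def __reduce__(self):
        # Never executed here: the payload is serialized, not loaded.
        return (os.system, ("echo pwned",))


benign = pickle.dumps({"weights": [0.1, 0.2]})
malicious = pickle.dumps(Exploit())
print(scan_pickle(benign))
print(scan_pickle(malicious))
```

A plain dictionary of weights imports nothing, while the crafted object references the operating-system module that backs `os.system`, which the scan reports without ever calling `pickle.loads`.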

### Signature Database
- **188 OSS Components**: Comprehensive coverage including libraries, frameworks, ML models, and multimedia codecs
- **1,400+ Total Signatures**: High-quality patterns with improved accuracy and reduced false positives
- **Multimedia Support**: H.264/H.265, AAC, Dolby, AV1, GStreamer, GLib, FFmpeg components
- **System Libraries**: libcap, Expat XML, LZ4, XZ Utils, WebP, cURL, Cairo, Opus
- **License Detection**: Automatic license identification for detected components
- **Security Analysis**: Detection of malicious patterns with severity levels (CRITICAL, HIGH, MEDIUM, LOW)
- **Rich Metadata**: Publisher, version, and ecosystem information for each component

## Installation

### From PyPI
```bash
pip install semantic-copycat-binarysniffer
```

### From Source
```bash
git clone https://github.com/oscarvalenzuelab/semantic-copycat-binarysniffer
cd semantic-copycat-binarysniffer
pip install -e .
```

### With Performance Extras
```bash
pip install semantic-copycat-binarysniffer[fast]
```

### With Fuzzy Matching Support
```bash
# Includes TLSH for detecting modified/recompiled components
pip install semantic-copycat-binarysniffer[fuzzy]
```

### With Extended Archive Support
```bash
# Includes support for 7z, RAR, DEB, RPM formats
pip install semantic-copycat-binarysniffer[archives]
```

### With Android APK Analysis
```bash
# Includes Androguard for advanced APK analysis
pip install semantic-copycat-binarysniffer[android]
```

## Optional Tools for Enhanced Format Support

BinarySniffer uses external tools, when they are installed, to extend its analysis. These tools are **optional**: the core functionality works without them, but installing them unlocks additional features.

### Quick Reference: Archive Format Requirements

| Format | Python Package | System Tool (Alternative) | Fallback |
|--------|---------------|---------------------------|----------|
| 7z | py7zr (included) | 7-Zip | - |
| RAR | rarfile (included) | unrar | 7-Zip |
| DEB | python-debian (included) | ar | 7-Zip |
| RPM | - | rpm2cpio | 7-Zip |
| ZIP/JAR | Built-in | - | - |
| TAR/GZ | Built-in | - | - |
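
To check which of the system tools in the table are present on a machine, a short script along these lines may help. It is a sketch using `shutil.which`; the executable names are assumptions (7-Zip in particular may be installed as `7z`, `7za`, or `7zz` depending on platform and package), and BinarySniffer's own detection logic may differ.

```python
import shutil

# Executable names are assumptions; adjust them for your platform.
TOOLS = {
    "7-Zip": ("7z", "7za", "7zz"),
    "unrar": ("unrar",),
    "ar": ("ar",),
    "rpm2cpio": ("rpm2cpio",),
}


def find_tools(tools=TOOLS):
    """Return {tool name: path or None} based on the current PATH."""
    found = {}
    for name, executables in tools.items():
        found[name] = next(
            (path for path in map(shutil.which, executables) if path), None
        )
    return found


for name, path in find_tools().items():
    print(f"{name:10} {path or 'not found'}")
```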

### 7-Zip (Recommended)
**Enables**: Extraction and analysis of Windows installers, macOS packages, and additional compressed formats

```bash
# macOS
brew install p7zip

# Ubuntu/Debian
sudo apt-get install p7zip-full

# Windows
# Download from https://www.7-zip.org/
```

**Benefits**:
- Analyze Windows installers (.exe, .msi) by extracting embedded components
- Analyze macOS installers (.pkg, .dmg) to detect bundled frameworks
- Support for NSIS, InnoSetup, and other installer formats
- Extract and analyze self-extracting archives
- Support for additional archive formats (RAR, CAB, ISO, etc.)

### Tools for Extended Archive Support (Optional)

When using the `[archives]` installation option, these tools enhance format support:

#### DEB Package Analysis
```bash
# For DEB packages (Debian/Ubuntu)
# Option 1: Install python-debian (included with [archives])
pip install semantic-copycat-binarysniffer[archives]

# Option 2: Use system ar command (usually pre-installed)
# Ubuntu/Debian
which ar  # Check if available

# macOS
# ar is included with Xcode Command Line Tools
xcode-select --install  # If not already installed
```

#### RPM Package Analysis
```bash
# For RPM packages (Red Hat/Fedora/CentOS)
# Option 1: Install rpm2cpio
# Ubuntu/Debian
sudo apt-get install rpm2cpio

# macOS
brew install rpm2cpio

# Fedora/RHEL/CentOS
# rpm2cpio is usually pre-installed

# Option 2: Falls back to 7-Zip if available
```

#### Additional Archive Formats
The `[archives]` option includes Python libraries for:
- **7z files**: py7zr (pure Python, no external tools needed)
- **RAR files**: rarfile (requires unrar tool)
  ```bash
  # Install unrar for RAR support
  # Ubuntu/Debian
  sudo apt-get install unrar
  
  # macOS
  brew install unrar
  
  # Note: Falls back to 7-Zip if unrar is not available
  ```

### Universal CTags (Optional)
**Enables**: Enhanced source code analysis with semantic understanding

```bash
# macOS
brew install universal-ctags

# Ubuntu/Debian
sudo apt-get install universal-ctags

# Windows
# Download from https://github.com/universal-ctags/ctags-win32/releases
```

**Benefits**:
- Better function/class/method detection in source code
- Multi-language semantic analysis
- More accurate symbol extraction
- Improved signature matching for source code components

### Example: Analyzing Installers

Without 7-Zip:
```bash
$ binarysniffer analyze installer.exe
# Analyzes as compressed binary - limited detection
```

With 7-Zip installed:
```bash
# Windows installers
$ binarysniffer analyze installer.exe
$ binarysniffer analyze setup.msi
# Automatically extracts and analyzes contents
# Detects: Qt5, OpenSSL, SQLite, ICU, libpng, etc.

# macOS installers
$ binarysniffer analyze app.pkg
$ binarysniffer analyze app.dmg
# Automatically extracts and analyzes contents
# Detects: Qt5, WebKit, OpenCV, React Native, etc.
```

## Quick Start

### CLI Usage

```bash
# Basic analysis
binarysniffer analyze /path/to/binary
binarysniffer analyze app.apk                    # Android APK
binarysniffer analyze app.ipa                    # iOS IPA
binarysniffer analyze library.jar                # Java JAR

# ML model component detection
binarysniffer analyze model.pkl                  # Pickle files
binarysniffer analyze model.onnx                 # ONNX models
binarysniffer analyze model.safetensors          # SafeTensors format
binarysniffer analyze suspicious_model.pkl --show-features  # Detailed analysis

# ML model security scanning (v1.10.0+)
binarysniffer ml-scan model.pkl                  # Security analysis of ML models
binarysniffer ml-scan model.pkl --deep           # Deep security analysis
binarysniffer ml-scan models/ -r --format sarif  # SARIF output for CI/CD
binarysniffer ml-scan model.pkl -o report.md     # Markdown security report
binarysniffer ml-scan model.pkl --risk-threshold 0.5  # Custom risk threshold

# Analyze directories recursively
binarysniffer analyze /path/to/project -r

# Output with auto-format detection
binarysniffer analyze app.apk -o report.json     # Auto-detects JSON format
binarysniffer analyze app.apk -o report.csv      # Auto-detects CSV format
binarysniffer analyze app.apk -o app.sbom        # Auto-detects SBOM format

# Performance modes
binarysniffer analyze large.bin --fast           # Quick scan (no fuzzy matching)
binarysniffer analyze app.apk --deep             # Thorough analysis

# Custom confidence threshold
binarysniffer analyze file.exe -t 0.3            # More sensitive (30% confidence)
binarysniffer analyze file.exe -t 0.8            # More conservative (80% confidence)

# Include file hashes in output
binarysniffer analyze file.exe --with-hashes -o report.json
binarysniffer analyze file.exe --basic-hashes    # Only MD5, SHA1, SHA256

# Filter by file patterns
binarysniffer analyze project/ -r -p "*.so" -p "*.dll"

# Export as CycloneDX SBOM
binarysniffer analyze app.apk -f sbom -o app-sbom.json
binarysniffer analyze app.apk --format cyclonedx -o sbom.json

# Save features for signature creation
binarysniffer analyze binary.exe --save-features features.json --show-features

# Filter results
binarysniffer analyze lib.so --min-matches 5     # Show components with 5+ matches
binarysniffer analyze app.apk --show-evidence    # Show detailed match evidence
```
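
The auto-format behavior in the examples above amounts to a lookup on the output file's extension, with an explicit `-f`/`--format` taking precedence. The sketch below shows that logic under assumptions: `detect_format` is a hypothetical name, only the three extensions shown above are mapped, and the `table` default is a guess rather than documented behavior.

```python
from pathlib import Path

# Only the extensions shown in the CLI examples are mapped here.
EXTENSION_FORMATS = {".json": "json", ".csv": "csv", ".sbom": "sbom"}


def detect_format(output_path, explicit=None, default="table"):
    """Pick an output format: explicit flag first, then file extension."""
    if explicit:
        return explicit
    suffix = Path(output_path).suffix.lower()
    return EXTENSION_FORMATS.get(suffix, default)


print(detect_format("report.json"))
print(detect_format("app-sbom.json", explicit="sbom"))
```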

### Understanding the Output

The analysis results display a **Classification** column that shows either:
- **Software licenses** (e.g., Apache-2.0, BSD-3-Clause, MIT) for legitimate OSS components
- **Security severity levels** (CRITICAL, HIGH, MEDIUM, LOW) for detected threats

Example output:
```
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ Component        ┃ Confidence ┃ Classification ┃ Type   ┃ Evidence         ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ PyTorch-Native   │ 94.0%      │ BSD-3-Clause   │ library│ 2 patterns       │
│ SafeTensors      │ 90.0%      │ Apache-2.0     │ library│ 3 patterns       │
│ Pickle-Malicious │ 98.5%      │ CRITICAL       │ threat │ RCE risk detected│
└──────────────────┴────────────┴────────────────┴────────┴──────────────────┘
```
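
Because the Classification column mixes licenses and severity levels, downstream scripts often need to separate the two. Here is a minimal sketch that does so for rows like those in the table above; `split_by_classification` is a hypothetical helper, not part of the library.

```python
SEVERITY_LEVELS = ("CRITICAL", "HIGH", "MEDIUM", "LOW")


def split_by_classification(rows):
    """Separate (component, classification) rows into licenses and threats."""
    licensed, threats = [], []
    for component, classification in rows:
        if classification.upper() in SEVERITY_LEVELS:
            threats.append((component, classification))
        else:
            licensed.append((component, classification))
    # Most severe threats first.
    threats.sort(key=lambda row: SEVERITY_LEVELS.index(row[1].upper()))
    return licensed, threats


rows = [
    ("PyTorch-Native", "BSD-3-Clause"),
    ("SafeTensors", "Apache-2.0"),
    ("Pickle-Malicious", "CRITICAL"),
]
licensed, threats = split_by_classification(rows)
print("Licenses:", licensed)
print("Threats:", threats)
```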

### Python Library Usage

```python
from binarysniffer import EnhancedBinarySniffer

# Initialize analyzer (enhanced mode is default)
sniffer = EnhancedBinarySniffer()

# Analyze a single file
result = sniffer.analyze_file("/path/to/binary")
for match in result.matches:
    print(f"{match.component} - {match.confidence:.2%}")
    print(f"Classification: {match.license}")  # Shows license or severity level

# Analyze mobile applications
apk_result = sniffer.analyze_file("app.apk")
ipa_result = sniffer.analyze_file("app.ipa")
jar_result = sniffer.analyze_file("library.jar")

# Analyze with custom threshold (default is 0.5)
result = sniffer.analyze_file("file.exe", confidence_threshold=0.3)  # More sensitive
result = sniffer.analyze_file("file.exe", confidence_threshold=0.8)  # More conservative

# Analyze with file hashes
result = sniffer.analyze_file("file.exe", include_hashes=True, include_fuzzy_hashes=True)

# Directory analysis
results = sniffer.analyze_directory("/path/to/project", recursive=True)
for file_path, result in results.items():
    if result.matches:
        print(f"{file_path}: {len(result.matches)} components detected")

# TLSH fuzzy matching for modified components
result = sniffer.analyze_file(
    "modified_binary.exe",
    use_tlsh=True,              # Enable TLSH fuzzy matching (default)
    tlsh_threshold=50           # Lower threshold requires a closer match
)
for match in result.matches:
    if match.match_type == 'tlsh_fuzzy':
        print(f"Fuzzy match: {match.component} (similarity: {match.confidence:.0%})")
```

### SBOM Export (v1.8.6+)

Generate Software Bill of Materials in CycloneDX format for integration with security and compliance tools:

```bash
# Export single file analysis as SBOM
binarysniffer analyze app.apk --format cyclonedx -o app-sbom.json

# Export directory analysis as aggregated SBOM
binarysniffer analyze project/ -r --format cdx -o project-sbom.json

# Include extracted features for signature recreation
binarysniffer analyze binary.exe --format cyclonedx --show-features -o sbom-with-features.json
```

The SBOM includes:
- Component names, versions, and licenses
- Confidence scores for each detection
- File paths showing where components were found
- Evidence details including matched patterns
- Optional extracted features for signature recreation
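
For orientation, the sketch below builds a minimal CycloneDX-style document carrying the fields listed above. It is an illustration only: the top-level keys (`bomFormat`, `specVersion`, `version`, `components`) follow the CycloneDX JSON format, but the spec version, the property names `confidence` and `file_path`, and the input dictionary shape are assumptions and may not match what BinarySniffer emits.

```python
import json


def to_cyclonedx(matches):
    """Build a minimal CycloneDX-style BOM dict from detection results."""
    components = []
    for match in matches:
        component = {
            "type": "library",
            "name": match["component"],
            "licenses": [{"license": {"id": match["license"]}}],
            # Property names are hypothetical; values must be strings.
            "properties": [
                {"name": "confidence", "value": f"{match['confidence']:.2f}"},
                {"name": "file_path", "value": match["file_path"]},
            ],
        }
        if match.get("version"):
            component["version"] = match["version"]
        components.append(component)
    return {
        "bomFormat": "CycloneDX",
        "specVersion": "1.5",  # assumed; the tool may target another version
        "version": 1,
        "components": components,
    }


bom = to_cyclonedx([
    {"component": "SafeTensors", "license": "Apache-2.0",
     "confidence": 0.9, "file_path": "models/model.safetensors"},
])
print(json.dumps(bom, indent=2))
```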

### Package Inventory Extraction (v1.8.6+)

Extract comprehensive file inventories from packages with metadata, hashes, and component detection:

```bash
# Basic inventory summary
binarysniffer inventory app.apk

# Export full inventory with auto-format detection
binarysniffer inventory app.apk -o inventory.json
binarysniffer inventory app.jar -o files.csv

# Include file hashes (MD5, SHA1, SHA256, TLSH, ssdeep)
binarysniffer inventory app.jar --analyze --with-hashes -o files.csv

# Full analysis with component detection
binarysniffer inventory app.ipa \
  --analyze \
  --with-hashes \
  --with-components \
  -o full_inventory.json

# Export as directory tree visualization
binarysniffer inventory archive.zip --format tree -o structure.txt
```

#### Python API for Inventory Extraction

```python
from binarysniffer import EnhancedBinarySniffer

sniffer = EnhancedBinarySniffer()

# Basic inventory extraction
inventory = sniffer.extract_package_inventory("app.apk")
print(f"Total files: {inventory['summary']['total_files']}")
print(f"Package size: {inventory['package_size']:,} bytes")

# Full analysis with all features
inventory = sniffer.extract_package_inventory(
    "app.apk",
    analyze_contents=True,        # Extract and analyze file contents
    include_hashes=True,          # Calculate MD5, SHA1, SHA256
    include_fuzzy_hashes=True,    # Calculate TLSH and ssdeep
    detect_components=True        # Run OSS component detection
)

# Access comprehensive file metadata
for file_entry in inventory['files']:
    if not file_entry['is_directory']:
        print(f"File: {file_entry['path']}")
        print(f"  MIME: {file_entry['mime_type']}")
        print(f"  Size: {file_entry['size']:,} bytes")
        print(f"  Compression ratio: {file_entry['compression_ratio']:.1%}")
        
        if 'hashes' in file_entry:
            print(f"  SHA256: {file_entry['hashes']['sha256']}")
        
        if 'components' in file_entry:
            for comp in file_entry['components']:
                print(f"  Component: {comp['name']} ({comp['confidence']:.0%})")
```

#### Inventory Export Formats

- **JSON**: Complete structured data with all metadata
- **CSV**: Tabular format for data analysis (includes hashes, MIME types, components)
- **Tree**: Visual directory structure representation
- **Summary**: Quick overview with file type statistics

### License Detection (v1.8.9+)

Detect and analyze software licenses using pattern matching and SPDX identifier recognition:

```bash
# Analyze licenses in a file or directory
binarysniffer license /path/to/project

# Check license compatibility
binarysniffer license . --check-compatibility

# Show which files contain each license
binarysniffer license src/ --show-files

# Export license report
binarysniffer license app.apk -o licenses.json
binarysniffer license project/ -o report.md --format markdown
```

#### Integrated License Detection with Analysis

Combine component and license detection in a single analysis:

```bash
# Add license detection to regular analysis
binarysniffer analyze app.jar --license-focus

# Perform only license detection (skip component analysis)
binarysniffer analyze source/ --license-only
```

#### Python API for License Detection

```python
from binarysniffer import EnhancedBinarySniffer

sniffer = EnhancedBinarySniffer()

# Analyze licenses in a project
license_result = sniffer.analyze_licenses("/path/to/project")
print(f"Detected licenses: {', '.join(license_result['licenses_detected'])}")

# Check compatibility
compatibility = license_result['compatibility']
if not compatibility['compatible']:
    for warning in compatibility['warnings']:
        print(f"Warning: {warning}")
```

#### Features
- **Pattern-based detection** for common licenses (MIT, Apache-2.0, GPL, BSD, LGPL, ISC)
- **SPDX identifier support** with 100% confidence
- **License compatibility checking** to identify conflicts
- **Multiple output formats**: Table, JSON, CSV, Markdown
- **Works on**: License files, source code with embedded licenses, archives
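
SPDX identifier recognition is the simplest of these techniques to picture: source files declare their license in a `SPDX-License-Identifier:` comment tag. The sketch below extracts those declarations with a regular expression. It is a simplified illustration with a hypothetical function name, and it does not validate the expressions against the SPDX license list.

```python
import re

# Matches the tag and a license expression made of identifier characters,
# spaces, and parentheses (enough for "MIT OR Apache-2.0" style values).
SPDX_TAG = re.compile(r"SPDX-License-Identifier:[ \t]*([A-Za-z0-9.+\-() ]+)")


def find_spdx_identifiers(text):
    """Return declared SPDX expressions in order of appearance, deduplicated."""
    seen = []
    for match in SPDX_TAG.finditer(text):
        expression = match.group(1).strip()
        if expression and expression not in seen:
            seen.append(expression)
    return seen


sample = """\
// SPDX-License-Identifier: Apache-2.0
/* SPDX-License-Identifier: MIT OR Apache-2.0 */
# SPDX-License-Identifier: Apache-2.0
"""
print(find_spdx_identifiers(sample))
```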

### Creating and Contributing Signatures

#### Generate Signatures from Binaries or Source Code

Create custom signatures for components you want to detect:

```bash
# From binary files (recommended for compiled components)
binarysniffer signatures create /usr/bin/ffmpeg --name FFmpeg --version 4.4.1

# From source code directories
binarysniffer signatures create /path/to/source --name MyLibrary --license MIT

# With complete metadata for better attribution
binarysniffer signatures create binary.so \
  --name "My Component" \
  --version 2.0.0 \
  --license Apache-2.0 \
  --publisher "My Company" \
  --description "Component description" \
  --output signatures/my-component.json

# Specify minimum signature requirements
binarysniffer signatures create /path/to/library \
  --name "LibraryName" \
  --min-signatures 10  # Require at least 10 unique patterns
```

#### Collision Detection for Signature Quality

The signature generator includes automatic collision detection to identify patterns that appear in multiple existing components:

```bash
# Check for collisions with existing signatures
binarysniffer signatures create /usr/bin/myapp \
  --name "MyApp" \
  --check-collisions

# Interactive review - decide on each collision
binarysniffer signatures create /usr/bin/myapp \
  --name "MyApp" \
  --interactive

# Auto-remove patterns with high collision severity
binarysniffer signatures create /usr/bin/myapp \
  --name "MyApp" \
  --check-collisions \
  --collision-threshold high  # Remove patterns in 3+ components
```

**Collision Severity Levels:**
- **Critical**: Pattern appears in 5+ unrelated components (likely generic)
- **High**: Pattern appears in 3-4 components
- **Medium**: Pattern appears in 2 unrelated components  
- **Low**: Pattern appears in 2 related components (e.g., ffmpeg/libav)
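
The severity levels above can be expressed as a small classification function. This is a sketch of the rules as listed, not the tool's code: the function names are hypothetical, and how two components are judged "related" is not described here, so it is passed in as a flag.

```python
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def collision_severity(component_count, related=False):
    """Classify a pattern by how many existing components contain it."""
    if component_count >= 5:
        return "critical"
    if component_count >= 3:
        return "high"
    if component_count == 2:
        return "low" if related else "medium"
    return None  # fewer than two components: no collision


def keep_pattern(component_count, threshold="high", related=False):
    """Keep a pattern unless its severity reaches the given threshold."""
    severity = collision_severity(component_count, related)
    if severity is None:
        return True
    return SEVERITY_ORDER.index(severity) < SEVERITY_ORDER.index(threshold)


for count in (1, 2, 3, 5):
    print(count, collision_severity(count), keep_pattern(count))
```

With the threshold set to `high`, patterns found in three or more components are dropped, which matches the comment on `--collision-threshold high` above.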

**Features:**
- Automatic generic word filtering (100+ common programming terms)
- Smart deduplication - all signatures are unique
- Cross-signature collision detection
- Interactive and automatic filtering modes
- Preserves library-specific prefixes (av_, curl_, SSL_, etc.)

#### Contributing Signatures to the Community

Help improve detection by contributing your signatures:

1. **Generate the signature file**:
   ```bash
   binarysniffer signatures create /path/to/component \
     --name "Component Name" \
     --version "1.0.0" \
     --license "MIT" \
     --publisher "Publisher Name" \
     --output signatures/component-name.json
   ```

2. **Test your signature**:
   ```bash
   # Import locally for testing
   binarysniffer signatures import signatures/component-name.json
   
   # Verify detection works
   binarysniffer analyze /path/to/test/binary
   ```

3. **Submit via GitHub Pull Request**:
   ```bash
   # Fork the repository on GitHub, then:
   git clone https://github.com/YOUR_USERNAME/semantic-copycat-binarysniffer
   cd semantic-copycat-binarysniffer
   
   # Add your signature file
   cp /path/to/component-name.json signatures/
   
   # Commit and push
   git add signatures/component-name.json
   git commit -m "Add signatures for Component Name v1.0.0"
   git push origin main
   
   # Create a Pull Request on GitHub
   ```

For detailed contribution guidelines, see [CONTRIBUTING.md](CONTRIBUTING.md).

## Architecture

The tool uses a multi-tiered approach for efficient matching:

1. **Pattern Matching**: Direct string/symbol matching against signature database
2. **MinHash LSH**: Fast similarity search for near-duplicate detection (milliseconds)
3. **TLSH Fuzzy Matching**: Locality-sensitive hashing to detect modified/recompiled components
4. **Detailed Verification**: Precise signature verification with confidence scoring
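
To give a feel for the MinHash tier, the sketch below estimates the Jaccard similarity of two byte strings from fixed-size signatures. It uses only the standard library and omits the LSH banding step that makes lookups fast at scale; the function names, shingle size, and number of hash functions are illustrative choices, not BinarySniffer's parameters.

```python
import hashlib


def shingles(data, size=4):
    """Return the set of overlapping byte n-grams in data."""
    return {data[i:i + size] for i in range(len(data) - size + 1)}


def minhash(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all items.

    Raises ValueError if items is empty.
    """
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item, digest_size=8, salt=salt).digest(), "big"
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


original = b"av_codec_open av_codec_close av_frame_alloc av_packet_unref"
patched = original.replace(b"close", b"clos3")
unrelated = b"curl_easy_init curl_easy_setopt curl_easy_perform curl_free"

sig_original = minhash(shingles(original))
sig_patched = minhash(shingles(patched))
sig_unrelated = minhash(shingles(unrelated))
print("patched:  ", estimated_jaccard(sig_original, sig_patched))
print("unrelated:", estimated_jaccard(sig_original, sig_unrelated))
```

A lightly patched input keeps most of its shingles and therefore most of its signature, while unrelated data shares few or none, which is why this tier can flag near-duplicates cheaply before detailed verification.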

### TLSH Fuzzy Matching (v1.8.0+)

TLSH (Trend Micro Locality Sensitive Hash) enables detection of:
- **Modified Components**: Components with patches or custom modifications
- **Recompiled Binaries**: Same source code compiled with different options
- **Version Variants**: Different versions of the same library
- **Obfuscated Code**: Components with mild obfuscation or optimization

The TLSH algorithm generates a compact hash that remains similar even when files are modified, making it ideal for detecting OSS components that have been customized or rebuilt.

## Performance

- **Analysis Speed**: ~1 second per binary file (5x faster in v1.6.3)
- **Archive Processing**: ~100-500ms for APK/IPA files (depends on contents)
- **Signature Storage**: ~3.5MB database with 5,136 signatures from 131 components
- **Memory Usage**: <100MB during analysis, <200MB for large archives
- **Deterministic Results**: Consistent detection across runs (NEW in v1.6.3)

## Configuration

Configuration file location: `~/.binarysniffer/config.json`

```json
{
  "signature_sources": [
    "https://signatures.binarysniffer.io/core.xmdb"
  ],
  "cache_size_mb": 100,
  "parallel_workers": 4,
  "min_confidence": 0.5,
  "auto_update": true,
  "update_check_interval_days": 7
}
```
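
If you generate or inspect this file from your own scripts, a loader along the following lines can merge it over defaults and catch an out-of-range threshold. This is a sketch, not the tool's loader: `load_config` is a hypothetical name, and the defaults simply mirror the example values above rather than confirmed built-in defaults.

```python
import json
import tempfile
from pathlib import Path

# Values copied from the example above; assumed, not confirmed, defaults.
DEFAULTS = {
    "cache_size_mb": 100,
    "parallel_workers": 4,
    "min_confidence": 0.5,
    "auto_update": True,
    "update_check_interval_days": 7,
}


def load_config(path):
    """Merge a JSON config file over DEFAULTS and validate min_confidence."""
    config = dict(DEFAULTS)
    path = Path(path).expanduser()
    if path.exists():
        config.update(json.loads(path.read_text(encoding="utf-8")))
    if not 0.0 <= config["min_confidence"] <= 1.0:
        raise ValueError("min_confidence must be between 0 and 1")
    return config


# Exercise the loader with a temporary file instead of the real config.
with tempfile.TemporaryDirectory() as tmp:
    config_file = Path(tmp) / "config.json"
    config_file.write_text(json.dumps({"min_confidence": 0.3}), encoding="utf-8")
    config = load_config(config_file)
print(config["min_confidence"], config["parallel_workers"])
```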

## Signature Database

The tool includes a pre-built signature database with **131 OSS components** including:
- **Mobile SDKs**: Facebook Android SDK, Google Firebase, Google Ads
- **Java Libraries**: Jackson, Apache Commons, Google Guava, Netty  
- **Media Libraries**: FFmpeg, x264, x265, Vorbis, Opus
- **Crypto Libraries**: Bouncy Castle, mbedTLS variants
- **Development Tools**: Lombok, Dagger, RxJava, OkHttp

### Signature Management

Maintaining an up-to-date signature database is critical for accurate detection. BinarySniffer provides comprehensive signature management commands:

#### Viewing Signature Status

```bash
# Check current signature database status
binarysniffer signatures status
# Shows: total signatures, components, last update, database location

# View detailed statistics
binarysniffer signatures stats
# Shows: signatures per component, database size, index status
```

#### Updating Signatures

```bash
# Update signatures from GitHub repository (recommended)
binarysniffer signatures update
# Pulls latest community-contributed signatures

# Alternative update command (backward compatible)
binarysniffer update

# Force update even if current
binarysniffer signatures update --force
```

#### Rebuilding Database

```bash
# Rebuild database from packaged signatures
binarysniffer signatures rebuild
# Useful when database is corrupted or needs fresh start

# Import specific signature files
binarysniffer signatures import signatures/*.json

# Import from custom directory
binarysniffer signatures import /path/to/signatures --recursive
```

#### Creating Custom Signatures

```bash
# Create signature from binary
binarysniffer signatures create /usr/bin/curl \
  --name "curl" \
  --version 7.81.0 \
  --license "MIT" \
  --output signatures/curl.json

# Create from source code directory
binarysniffer signatures create /path/to/source \
  --name "MyLibrary" \
  --version 1.0.0 \
  --license "Apache-2.0" \
  --min-length 8  # Minimum pattern length

# Create with metadata
binarysniffer signatures create binary.so \
  --name "Custom Component" \
  --publisher "My Company" \
  --description "Custom implementation" \
  --url "https://github.com/mycompany/component"
```

#### Signature Validation

```bash
# Validate signature quality before adding
binarysniffer signatures validate signatures/new-component.json
# Checks for: generic patterns, minimum length, uniqueness

# Test signature against known files
binarysniffer signatures test signatures/component.json /path/to/test/files
```

#### Database Management

```bash
# Export signatures to JSON (for backup or sharing)
binarysniffer signatures export --output my-signatures/
# Creates one JSON file per component

# Clear database (use with caution)
binarysniffer signatures clear --confirm
# Removes all signatures from database

# Optimize database
binarysniffer signatures optimize
# Rebuilds indexes and vacuums database for better performance
```

#### Automated Updates

Configure automatic signature updates in `~/.binarysniffer/config.json`:

```json
{
  "auto_update": true,
  "update_check_interval_days": 7,
  "signature_sources": [
    "https://github.com/oscarvalenzuelab/binarysniffer-signatures"
  ]
}
```

#### Best Practices

1. **Regular Updates**: Run `binarysniffer signatures update` weekly for latest detections
2. **Custom Signatures**: Create signatures for proprietary components you want to track
3. **Validation**: Always validate new signatures to avoid false positives
4. **Backup**: Export signatures before major updates using `signatures export`
5. **Performance**: Run `signatures optimize` monthly for best performance

For detailed signature creation and management documentation, see [docs/SIGNATURE_MANAGEMENT.md](docs/SIGNATURE_MANAGEMENT.md).

## License

Apache License 2.0 - See LICENSE file for details.

## Contributing

Contributions are welcome! Please read our [Contributing Guide](CONTRIBUTING.md) for details on our code of conduct and the process for submitting pull requests.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "semantic-copycat-binarysniffer",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "binary-analysis, license-compliance, signature-matching, oss-detection, semantic-analysis, semantic-copycat, code-copycat",
    "author": null,
    "author_email": "\"Oscar Valenzuela B.\" <oscar.valenzuela.b@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/45/9a/d5153fe2c17cc7c09ef55cf4d76d0e3efe6ec0c3f24adf626ef07cba7a1f/semantic_copycat_binarysniffer-1.10.1.tar.gz",
    "platform": null,
Analysis\n```bash\n# For DEB packages (Debian/Ubuntu)\n# Option 1: Install python-debian (included with [archives])\npip install semantic-copycat-binarysniffer[archives]\n\n# Option 2: Use system ar command (usually pre-installed)\n# Ubuntu/Debian\nwhich ar  # Check if available\n\n# macOS\n# ar is included with Xcode Command Line Tools\nxcode-select --install  # If not already installed\n```\n\n#### RPM Package Analysis\n```bash\n# For RPM packages (Red Hat/Fedora/CentOS)\n# Option 1: Install rpm2cpio\n# Ubuntu/Debian\nsudo apt-get install rpm2cpio\n\n# macOS\nbrew install rpm2cpio\n\n# Fedora/RHEL/CentOS\n# rpm2cpio is usually pre-installed\n\n# Option 2: Falls back to 7-Zip if available\n```\n\n#### Additional Archive Formats\nThe `[archives]` option includes Python libraries for:\n- **7z files**: py7zr (pure Python, no external tools needed)\n- **RAR files**: rarfile (requires unrar tool)\n  ```bash\n  # Install unrar for RAR support\n  # Ubuntu/Debian\n  sudo apt-get install unrar\n  \n  # macOS\n  brew install unrar\n  \n  # Note: Falls back to 7-Zip if unrar not available\n  ```\n\n### Universal CTags (Optional)\n**Enables**: Enhanced source code analysis with semantic understanding\n\n```bash\n# macOS\nbrew install universal-ctags\n\n# Ubuntu/Debian\nsudo apt-get install universal-ctags\n\n# Windows\n# Download from https://github.com/universal-ctags/ctags-win32/releases\n```\n\n**Benefits**:\n- Better function/class/method detection in source code\n- Multi-language semantic analysis\n- More accurate symbol extraction\n- Improved signature matching for source code components\n\n### Example: Analyzing Installers\n\nWithout 7-Zip:\n```bash\n$ binarysniffer analyze installer.exe\n# Analyzes as compressed binary - limited detection\n```\n\nWith 7-Zip installed:\n```bash\n# Windows installers\n$ binarysniffer analyze installer.exe\n$ binarysniffer analyze setup.msi\n# Automatically extracts and analyzes contents\n# Detects: Qt5, OpenSSL, SQLite, ICU, libpng, 
etc.\n\n# macOS installers\n$ binarysniffer analyze app.pkg\n$ binarysniffer analyze app.dmg\n# Automatically extracts and analyzes contents\n# Detects: Qt5, WebKit, OpenCV, React Native, etc.\n```\n\n## Quick Start\n\n### CLI Usage\n\n```bash\n# Basic analysis\nbinarysniffer analyze /path/to/binary\nbinarysniffer analyze app.apk                    # Android APK\nbinarysniffer analyze app.ipa                    # iOS IPA\nbinarysniffer analyze library.jar                # Java JAR\n\n# ML model component detection\nbinarysniffer analyze model.pkl                  # Pickle files\nbinarysniffer analyze model.onnx                 # ONNX models\nbinarysniffer analyze model.safetensors          # SafeTensors format\nbinarysniffer analyze suspicious_model.pkl --show-features  # Detailed analysis\n\n# ML model security scanning (v1.10.0+)\nbinarysniffer ml-scan model.pkl                  # Security analysis of ML models\nbinarysniffer ml-scan model.pkl --deep           # Deep security analysis\nbinarysniffer ml-scan models/ -r --format sarif  # SARIF output for CI/CD\nbinarysniffer ml-scan model.pkl -o report.md     # Markdown security report\nbinarysniffer ml-scan model.pkl --risk-threshold 0.5  # Custom risk threshold\n\n# Analyze directories recursively\nbinarysniffer analyze /path/to/project -r\n\n# Output with auto-format detection\nbinarysniffer analyze app.apk -o report.json     # Auto-detects JSON format\nbinarysniffer analyze app.apk -o report.csv      # Auto-detects CSV format\nbinarysniffer analyze app.apk -o app.sbom        # Auto-detects SBOM format\n\n# Performance modes\nbinarysniffer analyze large.bin --fast           # Quick scan (no fuzzy matching)\nbinarysniffer analyze app.apk --deep             # Thorough analysis\n\n# Custom confidence threshold\nbinarysniffer analyze file.exe -t 0.3            # More sensitive (30% confidence)\nbinarysniffer analyze file.exe -t 0.8            # More conservative (80% confidence)\n\n# Include file hashes in 
output\nbinarysniffer analyze file.exe --with-hashes -o report.json\nbinarysniffer analyze file.exe --basic-hashes    # Only MD5, SHA1, SHA256\n\n# Filter by file patterns\nbinarysniffer analyze project/ -r -p \"*.so\" -p \"*.dll\"\n\n# Export as CycloneDX SBOM\nbinarysniffer analyze app.apk -f sbom -o app-sbom.json\nbinarysniffer analyze app.apk --format cyclonedx -o sbom.json\n\n# Save features for signature creation\nbinarysniffer analyze binary.exe --save-features features.json --show-features\n\n# Filter results\nbinarysniffer analyze lib.so --min-matches 5     # Show components with 5+ matches\nbinarysniffer analyze app.apk --show-evidence    # Show detailed match evidence\n```\n\n### Understanding the Output\n\nThe analysis results display a **Classification** column that shows either:\n- **Software licenses** (e.g., Apache-2.0, BSD-3-Clause, MIT) for legitimate OSS components\n- **Security severity levels** (CRITICAL, HIGH, MEDIUM, LOW) for detected threats\n\nExample output:\n```\n\u250f\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2533\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2533\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2533\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2533\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2513\n\u2503 Component        \u2503 Confidence \u2503 Classification \u2503 Type   \u2503 Evidence         
\u2503\n\u2521\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2547\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2547\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2547\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2547\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2529\n\u2502 PyTorch-Native   \u2502 94.0%      \u2502 BSD-3-Clause   \u2502 library\u2502 2 patterns       \u2502\n\u2502 SafeTensors      \u2502 90.0%      \u2502 Apache-2.0     \u2502 library\u2502 3 patterns       \u2502\n\u2502 Pickle-Malicious \u2502 98.5%      \u2502 CRITICAL       \u2502 threat \u2502 RCE risk detected\u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n```\n\n### Python Library Usage\n\n```python\nfrom binarysniffer import EnhancedBinarySniffer\n\n# Initialize analyzer (enhanced mode is default)\nsniffer = EnhancedBinarySniffer()\n\n# Analyze a single file\nresult = sniffer.analyze_file(\"/path/to/binary\")\nfor match in result.matches:\n    print(f\"{match.component} - {match.confidence:.2%}\")\n    print(f\"Classification: {match.license}\")  # Shows license or severity level\n\n# Analyze mobile applications\napk_result = sniffer.analyze_file(\"app.apk\")\nipa_result = sniffer.analyze_file(\"app.ipa\")\njar_result = sniffer.analyze_file(\"library.jar\")\n\n# Analyze with custom threshold (default is 0.5)\nresult = sniffer.analyze_file(\"file.exe\", 
confidence_threshold=0.3)  # More sensitive\nresult = sniffer.analyze_file(\"file.exe\", confidence_threshold=0.8)  # More conservative\n\n# Analyze with file hashes\nresult = sniffer.analyze_file(\"file.exe\", include_hashes=True, include_fuzzy_hashes=True)\n\n# Directory analysis\nresults = sniffer.analyze_directory(\"/path/to/project\", recursive=True)\nfor file_path, result in results.items():\n    if result.matches:\n        print(f\"{file_path}: {len(result.matches)} components detected\")\n\n# TLSH fuzzy matching for modified components\nresult = sniffer.analyze_file(\n    \"modified_binary.exe\",\n    use_tlsh=True,              # Enable TLSH fuzzy matching (default)\n    tlsh_threshold=50           # Lower threshold = more similar required\n)\nfor match in result.matches:\n    if match.match_type == 'tlsh_fuzzy':\n        print(f\"Fuzzy match: {match.component} (similarity: {match.confidence:.0%})\")\n```\n\n### SBOM Export (v1.8.6+)\n\nGenerate Software Bill of Materials in CycloneDX format for integration with security and compliance tools:\n\n```bash\n# Export single file analysis as SBOM\nbinarysniffer analyze app.apk --format cyclonedx -o app-sbom.json\n\n# Export directory analysis as aggregated SBOM\nbinarysniffer analyze project/ -r --format cdx -o project-sbom.json\n\n# Include extracted features for signature recreation\nbinarysniffer analyze binary.exe --format cyclonedx --show-features -o sbom-with-features.json\n```\n\nThe SBOM includes:\n- Component names, versions, and licenses\n- Confidence scores for each detection\n- File paths showing where components were found\n- Evidence details including matched patterns\n- Optional extracted features for signature recreation\n\n### Package Inventory Extraction (v1.8.6+)\n\nExtract comprehensive file inventories from packages with metadata, hashes, and component detection:\n\n```bash\n# Basic inventory summary\nbinarysniffer inventory app.apk\n\n# Export full inventory with auto-format 
detection\nbinarysniffer inventory app.apk -o inventory.json\nbinarysniffer inventory app.jar -o files.csv\n\n# Include file hashes (MD5, SHA1, SHA256, TLSH, ssdeep)\nbinarysniffer inventory app.jar --analyze --with-hashes -o files.csv\n\n# Full analysis with component detection\nbinarysniffer inventory app.ipa \\\n  --analyze \\\n  --with-hashes \\\n  --with-components \\\n  -o full_inventory.json\n\n# Export as directory tree visualization\nbinarysniffer inventory archive.zip --format tree -o structure.txt\n```\n\n#### Python API for Inventory Extraction\n\n```python\nfrom binarysniffer import EnhancedBinarySniffer\n\nsniffer = EnhancedBinarySniffer()\n\n# Basic inventory extraction\ninventory = sniffer.extract_package_inventory(\"app.apk\")\nprint(f\"Total files: {inventory['summary']['total_files']}\")\nprint(f\"Package size: {inventory['package_size']:,} bytes\")\n\n# Full analysis with all features\ninventory = sniffer.extract_package_inventory(\n    \"app.apk\",\n    analyze_contents=True,        # Extract and analyze file contents\n    include_hashes=True,          # Calculate MD5, SHA1, SHA256\n    include_fuzzy_hashes=True,    # Calculate TLSH and ssdeep\n    detect_components=True        # Run OSS component detection\n)\n\n# Access comprehensive file metadata\nfor file_entry in inventory['files']:\n    if not file_entry['is_directory']:\n        print(f\"File: {file_entry['path']}\")\n        print(f\"  MIME: {file_entry['mime_type']}\")\n        print(f\"  Size: {file_entry['size']:,} bytes\")\n        print(f\"  Compression ratio: {file_entry['compression_ratio']:.1%}\")\n        \n        if 'hashes' in file_entry:\n            print(f\"  SHA256: {file_entry['hashes']['sha256']}\")\n        \n        if 'components' in file_entry:\n            for comp in file_entry['components']:\n                print(f\"  Component: {comp['name']} ({comp['confidence']:.0%})\")\n```\n\n#### Inventory Export Formats\n\n- **JSON**: Complete structured data with all 
metadata\n- **CSV**: Tabular format for data analysis (includes hashes, MIME types, components)\n- **Tree**: Visual directory structure representation\n- **Summary**: Quick overview with file type statistics\n\n### License Detection (v1.8.9+)\n\nDetect and analyze software licenses using pattern matching and SPDX identifier recognition:\n\n```bash\n# Analyze licenses in a file or directory\nbinarysniffer license /path/to/project\n\n# Check license compatibility\nbinarysniffer license . --check-compatibility\n\n# Show which files contain each license\nbinarysniffer license src/ --show-files\n\n# Export license report\nbinarysniffer license app.apk -o licenses.json\nbinarysniffer license project/ -o report.md --format markdown\n```\n\n#### Integrated License Detection with Analysis\n\nCombine component and license detection in a single analysis:\n\n```bash\n# Add license detection to regular analysis\nbinarysniffer analyze app.jar --license-focus\n\n# Perform only license detection (skip component analysis)\nbinarysniffer analyze source/ --license-only\n```\n\n#### Python API for License Detection\n\n```python\nfrom binarysniffer import EnhancedBinarySniffer\n\nsniffer = EnhancedBinarySniffer()\n\n# Analyze licenses in a project\nlicense_result = sniffer.analyze_licenses(\"/path/to/project\")\nprint(f\"Detected licenses: {', '.join(license_result['licenses_detected'])}\")\n\n# Check compatibility\ncompatibility = license_result['compatibility']\nif not compatibility['compatible']:\n    for warning in compatibility['warnings']:\n        print(f\"Warning: {warning}\")\n```\n\n#### Features\n- **Pattern-based detection** for common licenses (MIT, Apache-2.0, GPL, BSD, LGPL, ISC)\n- **SPDX identifier support** with 100% confidence\n- **License compatibility checking** to identify conflicts\n- **Multiple output formats**: Table, JSON, CSV, Markdown\n- **Works on**: License files, source code with embedded licenses, archives\n\n### Creating and Contributing 
Signatures\n\n#### Generate Signatures from Binaries or Source Code\n\nCreate custom signatures for components you want to detect:\n\n```bash\n# From binary files (recommended for compiled components)\nbinarysniffer signatures create /usr/bin/ffmpeg --name FFmpeg --version 4.4.1\n\n# From source code directories\nbinarysniffer signatures create /path/to/source --name MyLibrary --license MIT\n\n# With complete metadata for better attribution\nbinarysniffer signatures create binary.so \\\n  --name \"My Component\" \\\n  --version 2.0.0 \\\n  --license Apache-2.0 \\\n  --publisher \"My Company\" \\\n  --description \"Component description\" \\\n  --output signatures/my-component.json\n\n# Specify minimum signature requirements\nbinarysniffer signatures create /path/to/library \\\n  --name \"LibraryName\" \\\n  --min-signatures 10  # Require at least 10 unique patterns\n```\n\n#### Collision Detection for Signature Quality\n\nThe signature generator includes automatic collision detection to identify patterns that appear in multiple existing components:\n\n```bash\n# Check for collisions with existing signatures\nbinarysniffer signatures create /usr/bin/myapp \\\n  --name \"MyApp\" \\\n  --check-collisions\n\n# Interactive review - decide on each collision\nbinarysniffer signatures create /usr/bin/myapp \\\n  --name \"MyApp\" \\\n  --interactive\n\n# Auto-remove patterns with high collision severity\nbinarysniffer signatures create /usr/bin/myapp \\\n  --name \"MyApp\" \\\n  --check-collisions \\\n  --collision-threshold high  # Remove patterns in 3+ components\n```\n\n**Collision Severity Levels:**\n- **Critical**: Pattern appears in 5+ unrelated components (likely generic)\n- **High**: Pattern appears in 3-4 components\n- **Medium**: Pattern appears in 2 unrelated components  \n- **Low**: Pattern appears in 2 related components (e.g., ffmpeg/libav)\n\n**Features:**\n- Automatic generic word filtering (100+ common programming terms)\n- Smart deduplication - all 
signatures are unique\n- Cross-signature collision detection\n- Interactive and automatic filtering modes\n- Preserves library-specific prefixes (av_, curl_, SSL_, etc.)\n\n#### Contributing Signatures to the Community\n\nHelp improve detection by contributing your signatures:\n\n1. **Generate the signature file**:\n   ```bash\n   binarysniffer signatures create /path/to/component \\\n     --name \"Component Name\" \\\n     --version \"1.0.0\" \\\n     --license \"MIT\" \\\n     --publisher \"Publisher Name\" \\\n     --output signatures/component-name.json\n   ```\n\n2. **Test your signature**:\n   ```bash\n   # Import locally for testing\n   binarysniffer signatures import signatures/component-name.json\n   \n   # Verify detection works\n   binarysniffer analyze /path/to/test/binary\n   ```\n\n3. **Submit via GitHub Pull Request**:\n   ```bash\n   # Fork the repository on GitHub, then:\n   git clone https://github.com/YOUR_USERNAME/semantic-copycat-binarysniffer\n   cd semantic-copycat-binarysniffer\n   \n   # Add your signature file\n   cp /path/to/component-name.json signatures/\n   \n   # Commit and push\n   git add signatures/component-name.json\n   git commit -m \"Add signatures for Component Name v1.0.0\"\n   git push origin main\n   \n   # Create a Pull Request on GitHub\n   ```\n\nFor detailed contribution guidelines, see [CONTRIBUTING.md](CONTRIBUTING.md).\n\n## Architecture\n\nThe tool uses a multi-tiered approach for efficient matching:\n\n1. **Pattern Matching**: Direct string/symbol matching against signature database\n2. **MinHash LSH**: Fast similarity search for near-duplicate detection (milliseconds)\n3. **TLSH Fuzzy Matching**: Locality-sensitive hashing to detect modified/recompiled components\n4. 
**Detailed Verification**: Precise signature verification with confidence scoring\n\n### TLSH Fuzzy Matching (v1.8.0+)\n\nTLSH (Trend Micro Locality Sensitive Hash) enables detection of:\n- **Modified Components**: Components with patches or custom modifications\n- **Recompiled Binaries**: Same source code compiled with different options\n- **Version Variants**: Different versions of the same library\n- **Obfuscated Code**: Components with mild obfuscation or optimization\n\nThe TLSH algorithm generates a compact hash that remains similar even when files are modified, making it ideal for detecting OSS components that have been customized or rebuilt.\n\n## Performance\n\n- **Analysis Speed**: ~1 second per binary file (5x faster in v1.6.3)\n- **Archive Processing**: ~100-500ms for APK/IPA files (depends on contents)\n- **Signature Storage**: ~3.5MB database with 5,136 signatures from 131 components\n- **Memory Usage**: <100MB during analysis, <200MB for large archives\n- **Deterministic Results**: Consistent detection across runs (NEW in v1.6.3)\n\n## Configuration\n\nConfiguration file location: `~/.binarysniffer/config.json`\n\n```json\n{\n  \"signature_sources\": [\n    \"https://signatures.binarysniffer.io/core.xmdb\"\n  ],\n  \"cache_size_mb\": 100,\n  \"parallel_workers\": 4,\n  \"min_confidence\": 0.5,\n  \"auto_update\": true,\n  \"update_check_interval_days\": 7\n}\n```\n\n## Signature Database\n\nThe tool includes a pre-built signature database with **131 OSS components** including:\n- **Mobile SDKs**: Facebook Android SDK, Google Firebase, Google Ads\n- **Java Libraries**: Jackson, Apache Commons, Google Guava, Netty  \n- **Media Libraries**: FFmpeg, x264, x265, Vorbis, Opus\n- **Crypto Libraries**: Bouncy Castle, mbedTLS variants\n- **Development Tools**: Lombok, Dagger, RxJava, OkHttp\n\n### Signature Management\n\nMaintaining an up-to-date signature database is critical for accurate detection. 
BinarySniffer provides comprehensive signature management commands:\n\n#### Viewing Signature Status\n\n```bash\n# Check current signature database status\nbinarysniffer signatures status\n# Shows: total signatures, components, last update, database location\n\n# View detailed statistics\nbinarysniffer signatures stats\n# Shows: signatures per component, database size, index status\n```\n\n#### Updating Signatures\n\n```bash\n# Update signatures from GitHub repository (recommended)\nbinarysniffer signatures update\n# Pulls latest community-contributed signatures\n\n# Alternative update command (backward compatible)\nbinarysniffer update\n\n# Force update even if current\nbinarysniffer signatures update --force\n```\n\n#### Rebuilding Database\n\n```bash\n# Rebuild database from packaged signatures\nbinarysniffer signatures rebuild\n# Useful when database is corrupted or needs fresh start\n\n# Import specific signature files\nbinarysniffer signatures import signatures/*.json\n\n# Import from custom directory\nbinarysniffer signatures import /path/to/signatures --recursive\n```\n\n#### Creating Custom Signatures\n\n```bash\n# Create signature from binary\nbinarysniffer signatures create /usr/bin/curl \\\n  --name \"curl\" \\\n  --version 7.81.0 \\\n  --license \"MIT\" \\\n  --output signatures/curl.json\n\n# Create from source code directory\nbinarysniffer signatures create /path/to/source \\\n  --name \"MyLibrary\" \\\n  --version 1.0.0 \\\n  --license \"Apache-2.0\" \\\n  --min-length 8  # Minimum pattern length\n\n# Create with metadata\nbinarysniffer signatures create binary.so \\\n  --name \"Custom Component\" \\\n  --publisher \"My Company\" \\\n  --description \"Custom implementation\" \\\n  --url \"https://github.com/mycompany/component\"\n```\n\n#### Signature Validation\n\n```bash\n# Validate signature quality before adding\nbinarysniffer signatures validate signatures/new-component.json\n# Checks for: generic patterns, minimum length, uniqueness\n\n# Test 
signature against known files\nbinarysniffer signatures test signatures/component.json /path/to/test/files\n```\n\n#### Database Management\n\n```bash\n# Export signatures to JSON (for backup or sharing)\nbinarysniffer signatures export --output my-signatures/\n# Creates one JSON file per component\n\n# Clear database (use with caution)\nbinarysniffer signatures clear --confirm\n# Removes all signatures from database\n\n# Optimize database\nbinarysniffer signatures optimize\n# Rebuilds indexes and vacuums database for better performance\n```\n\n#### Automated Updates\n\nConfigure automatic signature updates in `~/.binarysniffer/config.json`:\n\n```json\n{\n  \"auto_update\": true,\n  \"update_check_interval_days\": 7,\n  \"signature_sources\": [\n    \"https://github.com/oscarvalenzuelab/binarysniffer-signatures\"\n  ]\n}\n```\n\n#### Best Practices\n\n1. **Regular Updates**: Run `binarysniffer signatures update` weekly for latest detections\n2. **Custom Signatures**: Create signatures for proprietary components you want to track\n3. **Validation**: Always validate new signatures to avoid false positives\n4. **Backup**: Export signatures before major updates using `signatures export`\n5. **Performance**: Run `signatures optimize` monthly for best performance\n\nFor detailed signature creation and management documentation, see [docs/SIGNATURE_MANAGEMENT.md](docs/SIGNATURE_MANAGEMENT.md).\n\n## License\n\nApache License 2.0 - See LICENSE file for details.\n\n## Contributing\n\nContributions are welcome! Please read our [Contributing Guide](CONTRIBUTING.md) for details on our code of conduct and the process for submitting pull requests.\n",
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "A high-performance CLI and library for detecting open source components in binaries through semantic signature matching",
    "version": "1.10.1",
    "project_urls": {
        "Bug Tracker": "https://github.com/oscarvalenzuelab/semantic-copycat-binarysniffer/issues",
        "Documentation": "https://github.com/oscarvalenzuelab/semantic-copycat-binarysniffer/tree/main/docs",
        "Homepage": "https://github.com/oscarvalenzuelab/semantic-copycat-binarysniffer"
    },
    "split_keywords": [
        "binary-analysis",
        " license-compliance",
        " signature-matching",
        " oss-detection",
        " semantic-analysis",
        " semantic-copycat",
        " code-copycat"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "e0581bd36cb7ef9a52803ad5a7de2b2fe54104b2348a2b4164b592da10f1948f",
                "md5": "da26212a7fedca35d8a4096391220ac6",
                "sha256": "79d53aeaac18bfa6069e3bb8317813ef718db2e39310bd4dea7167497781311d"
            },
            "downloads": -1,
            "filename": "semantic_copycat_binarysniffer-1.10.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "da26212a7fedca35d8a4096391220ac6",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 344615,
            "upload_time": "2025-08-30T17:45:17",
            "upload_time_iso_8601": "2025-08-30T17:45:17.955525Z",
            "url": "https://files.pythonhosted.org/packages/e0/58/1bd36cb7ef9a52803ad5a7de2b2fe54104b2348a2b4164b592da10f1948f/semantic_copycat_binarysniffer-1.10.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "459ad5153fe2c17cc7c09ef55cf4d76d0e3efe6ec0c3f24adf626ef07cba7a1f",
                "md5": "5aafe38225e2f0880b16c4f0bbc0d840",
                "sha256": "f582b5c9ec7f02a76e944109c1f650bbd21ea8b47e96dd6b213a734b5d1a2eef"
            },
            "downloads": -1,
            "filename": "semantic_copycat_binarysniffer-1.10.1.tar.gz",
            "has_sig": false,
            "md5_digest": "5aafe38225e2f0880b16c4f0bbc0d840",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 250109,
            "upload_time": "2025-08-30T17:45:19",
            "upload_time_iso_8601": "2025-08-30T17:45:19.809462Z",
            "url": "https://files.pythonhosted.org/packages/45/9a/d5153fe2c17cc7c09ef55cf4d76d0e3efe6ec0c3f24adf626ef07cba7a1f/semantic_copycat_binarysniffer-1.10.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-30 17:45:19",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "oscarvalenzuelab",
    "github_project": "semantic-copycat-binarysniffer",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "semantic-copycat-binarysniffer"
}
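
Each entry under `urls` in the record above carries a `digests` object with a `sha256` value for the release file. As a minimal sketch of how a downloaded wheel or sdist could be checked against that value, the following uses only the Python standard library. The helper names `sha256_of` and `matches_published_digest` are hypothetical and not part of BinarySniffer or PyPI; the only assumption about the data is the `digests.sha256` layout shown in the record.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, release_entry):
    """Compare a local file against one entry of the record's `urls` list.

    `release_entry` is assumed to be a dict shaped like the entries above,
    i.e. it has a "digests" mapping with a "sha256" hex string.
    """
    expected = release_entry["digests"]["sha256"].lower()
    return sha256_of(path) == expected
```

With the record parsed via `json.load`, a call such as `matches_published_digest("semantic_copycat_binarysniffer-1.10.1.tar.gz", record["urls"][1])` would return `True` only when the local file's hash equals the published one.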
        