# Semantic Copycat BinarySniffer
A high-performance CLI tool and Python library for detecting open source components in binaries through semantic signature matching. It is specialized for analyzing mobile apps (APK/IPA), Java archives, and source code to identify OSS components and their licenses.
## Features
### Core Analysis
- **Deterministic Results**: Consistent analysis results across multiple runs (NEW in v1.6.3)
- **Fast Local Analysis**: SQLite-based signature storage with optimized direct matching
- **Efficient Matching**: MinHash LSH for similarity detection, trigram indexing for substring matching
- **Dual Interface**: Use as CLI tool or Python library
- **Smart Compression**: ZSTD-compressed signatures with ~90% size reduction
- **Low Memory Footprint**: Streaming analysis with <100MB memory usage
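As a rough illustration of how trigram indexing can support substring matching, here is a minimal sketch. `TrigramIndex` and its methods are hypothetical names chosen for this example and are not BinarySniffer's internal API; the real implementation is SQLite-backed and may differ substantially.

```python
from __future__ import annotations

from collections import defaultdict


def trigrams(text: str) -> set[str]:
    """Return the set of overlapping 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    """Hypothetical in-memory index mapping trigrams to signature strings."""

    def __init__(self) -> None:
        self._postings: dict[str, set[str]] = defaultdict(set)

    def add(self, signature: str) -> None:
        """Register a signature string under each of its trigrams."""
        for gram in trigrams(signature):
            self._postings[gram].add(signature)

    def search(self, needle: str) -> set[str]:
        """Return indexed signatures that contain needle as a substring."""
        grams = trigrams(needle)
        if not grams:
            return set()  # needles shorter than 3 characters cannot use the index
        # A matching signature must contain every trigram of the needle.
        candidates = set.intersection(
            *(self._postings.get(gram, set()) for gram in grams)
        )
        # Sharing all trigrams is necessary but not sufficient, so verify.
        return {sig for sig in candidates if needle in sig}


index = TrigramIndex()
index.add("SSL_CTX_new")
index.add("sqlite3_open_v2")
print(index.search("sqlite3_open"))
```

The index narrows the search to a small candidate set before the exact `in` check runs, which is what makes partial pattern matches cheap on large signature sets.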
### Enhanced Binary Analysis (NEW in v1.6.0)
- **LIEF Integration**: Advanced ELF/PE/Mach-O analysis with symbol and import extraction
- **Android DEX Support**: Specialized extractor for DEX bytecode files
- **Improved APK Detection**: 25+ components detected vs 1 previously (152K features extracted)
- **Substring Matching**: Detects components even with partial pattern matches
- **Progress Indication**: Real-time progress bars for long analysis operations
- **New Component Signatures**: OkHttp, OpenSSL, SQLite, ICU, FreeType, WebKit
### Archive Support
- **Android APK Analysis**: Extract and analyze AndroidManifest.xml, DEX files, native libraries
- **iOS IPA Analysis**: Parse Info.plist, detect frameworks, analyze executables
- **Java Archive Support**: Process JAR/WAR files with MANIFEST.MF parsing and package detection
- **Python Package Support**: Analyze wheels (.whl) and eggs (.egg) with metadata extraction
- **Nested Archive Processing**: Handle archives containing other archives
- **Comprehensive Format Support**: ZIP, TAR, 7z, and compound formats
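Nested archive processing can be pictured with the following sketch, which only uses the standard library. It covers ZIP-based containers (APK, IPA, JAR, WAR, wheels, and eggs are all ZIP files); TAR and 7z are left out. The function name, the `!/` path separator, and the `max_depth` limit are assumptions made for this example, not the tool's actual behavior.

```python
import io
import zipfile
from typing import Iterator, Tuple

# ZIP-based container extensions considered for recursion in this sketch.
ARCHIVE_SUFFIXES = (".zip", ".jar", ".war", ".apk", ".ipa", ".whl", ".egg")


def iter_archive_members(
    data: bytes, prefix: str = "", max_depth: int = 3
) -> Iterator[Tuple[str, bytes]]:
    """Yield (path, content) for each file, descending into nested ZIP archives."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            content = archive.read(info)
            path = f"{prefix}{info.filename}"
            looks_nested = info.filename.lower().endswith(ARCHIVE_SUFFIXES)
            if looks_nested and max_depth > 0 and zipfile.is_zipfile(io.BytesIO(content)):
                # Recurse into the inner archive instead of yielding it as a blob.
                yield from iter_archive_members(content, f"{path}!/", max_depth - 1)
            else:
                yield path, content
```

Calling `iter_archive_members(open("app.apk", "rb").read())` would yield every file in the APK, including files inside any bundled JARs. The depth limit guards against pathological archives that nest indefinitely.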
### Enhanced Source Analysis
- **CTags Integration**: Advanced source code analysis when universal-ctags is available
- **Multi-language Support**: C/C++, Python, Java, JavaScript, Go, Rust, PHP, Swift, Kotlin
- **Semantic Symbol Extraction**: Functions, classes, structs, constants, and dependencies
- **Graceful Fallback**: Regex-based extraction when CTags is unavailable
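The fallback idea can be sketched as follows. This is an illustration only: the executable names checked, the patterns, and the function names are assumptions for this example, and the regexes are deliberately simplistic (they also match inside comments and strings). Note that an executable named `ctags` is not necessarily Universal Ctags.

```python
import re
import shutil
from typing import List

# Simplified patterns for illustration; real extractors need per-language rules.
SYMBOL_PATTERNS = [
    re.compile(r"\bdef\s+([A-Za-z_]\w*)"),     # Python functions
    re.compile(r"\bclass\s+([A-Za-z_]\w*)"),   # Python/Java/C++ classes
    re.compile(r"\bstruct\s+([A-Za-z_]\w*)"),  # C/C++ structs
]


def ctags_available() -> bool:
    """Return True if a ctags executable is found on PATH."""
    return any(shutil.which(name) for name in ("ctags", "universal-ctags"))


def extract_symbols_regex(source: str) -> List[str]:
    """Fallback extractor: collect symbol names with plain regexes."""
    symbols = set()
    for pattern in SYMBOL_PATTERNS:
        symbols.update(pattern.findall(source))
    return sorted(symbols)


if not ctags_available():
    sample = "class Parser:\n    def parse(self):\n        pass\n"
    print(extract_symbols_regex(sample))
```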
### Signature Database
- **131 OSS Components**: Pre-loaded signatures from Facebook SDK, Jackson, FFmpeg, and more
- **Real-world Detection**: Thousands of component signatures from BSA database migration
- **License Detection**: Automatic license identification for detected components
- **Metadata Rich**: Publisher, version, and ecosystem information for each component
## Installation
### From PyPI
```bash
pip install semantic-copycat-binarysniffer
```
### From Source
```bash
git clone https://github.com/oscarvalenzuelab/semantic-copycat-binarysniffer
cd semantic-copycat-binarysniffer
pip install -e .
```
### With Performance Extras
```bash
# Quote the extras so shells such as zsh do not expand the brackets
pip install "semantic-copycat-binarysniffer[fast]"
```
## Quick Start
### CLI Usage
```bash
# Analyze a single file (enhanced detection is always enabled)
binarysniffer analyze /path/to/binary
# Analyze mobile applications
binarysniffer analyze app.apk # Android APK
binarysniffer analyze app.ipa # iOS IPA
binarysniffer analyze library.jar # Java JAR
# Analyze directories recursively
binarysniffer analyze /path/to/project --recursive
# Custom threshold (default is 0.5 for optimal detection)
binarysniffer analyze file.exe --threshold 0.3 # More sensitive
binarysniffer analyze file.exe --threshold 0.8 # More conservative
# Export results as JSON
binarysniffer analyze project/ --format json -o results.json
# Deep analysis with pattern filtering
binarysniffer analyze project/ --recursive --deep -p "*.so" -p "*.dll"
# Non-deterministic mode (if needed for performance testing)
binarysniffer --non-deterministic analyze file.bin
```
### Python Library Usage
```python
from binarysniffer import EnhancedBinarySniffer
# Initialize analyzer (enhanced mode is default)
sniffer = EnhancedBinarySniffer()
# Analyze a single file
result = sniffer.analyze_file("/path/to/binary")
for match in result.matches:
    print(f"{match.component} - {match.confidence:.2%}")
    print(f"License: {match.license}")
# Analyze mobile applications
apk_result = sniffer.analyze_file("app.apk")
ipa_result = sniffer.analyze_file("app.ipa")
jar_result = sniffer.analyze_file("library.jar")
# Analyze with custom threshold (default is 0.3)
result = sniffer.analyze_file("file.exe", confidence_threshold=0.1) # More sensitive
result = sniffer.analyze_file("file.exe", confidence_threshold=0.8) # More conservative
# Directory analysis
results = sniffer.analyze_directory("/path/to/project", recursive=True)
for file_path, result in results.items():
    if result.matches:
        print(f"{file_path}: {len(result.matches)} components detected")
```
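For license compliance work it is often useful to group detections by license. The sketch below relies only on the attributes used in the example above (`result.matches`, `match.component`, `match.license`, `match.confidence`). The helper name `summarize_licenses` is hypothetical, and the stand-in data is made up so the snippet runs without the library installed.

```python
from collections import defaultdict
from types import SimpleNamespace


def summarize_licenses(results, min_confidence=0.5):
    """Map each license to the sorted component names detected under it."""
    by_license = defaultdict(set)
    for result in results.values():
        for match in result.matches:
            if match.confidence >= min_confidence:
                # Assumption: a missing license shows up as None or "".
                by_license[match.license or "unknown"].add(match.component)
    return {name: sorted(components) for name, components in sorted(by_license.items())}


# Stand-in data shaped like the objects in the example above (values are invented).
fake_results = {
    "libs/libavcodec.so": SimpleNamespace(matches=[
        SimpleNamespace(component="FFmpeg", license="LGPL-2.1", confidence=0.92),
    ]),
    "app/classes.dex": SimpleNamespace(matches=[
        SimpleNamespace(component="OkHttp", license="Apache-2.0", confidence=0.81),
        SimpleNamespace(component="Jackson", license="Apache-2.0", confidence=0.34),
    ]),
}
print(summarize_licenses(fake_results))
```

With real data, pass the dictionary returned by `sniffer.analyze_directory(...)` in place of `fake_results`.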
### Creating Signatures
Create custom signatures for your components:
```bash
# From binary files (recommended)
binarysniffer signatures create /usr/bin/ffmpeg --name FFmpeg --version 4.4.1
# From source code
binarysniffer signatures create /path/to/source --name MyLibrary --license MIT
# With full metadata
binarysniffer signatures create binary.so \
    --name "My Component" \
    --version 2.0.0 \
    --license Apache-2.0 \
    --publisher "My Company" \
    --output signatures/my-component.json
```
## Architecture
The tool uses a multi-tiered approach for efficient matching:
1. **Tier 1 - Bloom Filters**: Quick elimination of non-matches (microseconds)
2. **Tier 2 - MinHash LSH**: Fast similarity search (milliseconds)
3. **Tier 3 - Detailed Matching**: Precise signature verification (seconds)
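The three tiers can be sketched in a few dozen lines of standard-library Python. This is a toy model under stated assumptions, not the tool's implementation: all names are hypothetical, the Bloom filter is assumed to hold known signature features, MinHash sketches are compared directly without the LSH bucketing step, and "detailed matching" is reduced to an exact Jaccard score.

```python
from __future__ import annotations

import hashlib


def _hash(value: str, seed: int) -> int:
    """Seeded 64-bit hash built on BLAKE2b."""
    digest = hashlib.blake2b(
        value.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, rare false positives."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str) -> list[int]:
        return [_hash(item, seed) % self.size_bits for seed in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def minhash(features: set[str], num_perm: int = 64) -> list[int]:
    """MinHash sketch: the minimum seeded hash per simulated permutation."""
    return [min(_hash(f, seed) for f in features) for seed in range(num_perm)]


def estimate_jaccard(a: list[int], b: list[int]) -> float:
    """Fraction of sketch slots that agree, which estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def match_component(
    features: set[str], signature: set[str], bloom: BloomFilter, threshold: float = 0.5
) -> float:
    """Return a similarity score in [0, 1], or 0.0 if an early tier rejects."""
    # Tier 1: keep only features the Bloom filter may have seen.
    candidates = {f for f in features if f in bloom}
    if not candidates or not signature:
        return 0.0
    # Tier 2: cheap similarity estimate from MinHash sketches.
    if estimate_jaccard(minhash(candidates), minhash(signature)) < threshold:
        return 0.0
    # Tier 3: exact Jaccard similarity on the surviving features.
    return len(candidates & signature) / len(candidates | signature)
```

Each tier is more expensive than the previous one, so ordering them this way means most non-matching inputs are rejected before any precise comparison runs.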
## Performance
- **Analysis Speed**: ~1 second per binary file (5x faster in v1.6.3)
- **Archive Processing**: ~100-500ms for APK/IPA files (depends on contents)
- **Signature Storage**: ~3.5MB database with 5,136 signatures from 131 components
- **Memory Usage**: <100MB during analysis, <200MB for large archives
- **Deterministic Results**: Consistent detection across runs (NEW in v1.6.3)
## Configuration
Configuration file location: `~/.binarysniffer/config.json`
```json
{
"signature_sources": [
"https://signatures.binarysniffer.io/core.xmdb"
],
"cache_size_mb": 100,
"parallel_workers": 4,
"min_confidence": 0.5,
"auto_update": true,
"update_check_interval_days": 7
}
```
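If you want to read this file from your own scripts, a loader along the following lines may help. It is a sketch under assumptions: the values from the example above are treated as defaults, `min_confidence` is assumed to lie between 0.0 and 1.0, and `merge_config` and `load_config` are hypothetical helper names that are not part of the package.

```python
import json
from pathlib import Path

# Assumption: the example values shown above act as defaults.
DEFAULT_CONFIG = {
    "signature_sources": ["https://signatures.binarysniffer.io/core.xmdb"],
    "cache_size_mb": 100,
    "parallel_workers": 4,
    "min_confidence": 0.5,
    "auto_update": True,
    "update_check_interval_days": 7,
}


def merge_config(overrides):
    """Overlay user settings on the defaults and sanity-check the result."""
    config = {**DEFAULT_CONFIG, **overrides}
    if not 0.0 <= config["min_confidence"] <= 1.0:
        raise ValueError("min_confidence must be between 0.0 and 1.0")
    if config["parallel_workers"] < 1:
        raise ValueError("parallel_workers must be at least 1")
    return config


def load_config(path=None):
    """Read the JSON config file if present, otherwise return the defaults."""
    path = Path(path) if path else Path.home() / ".binarysniffer" / "config.json"
    overrides = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    return merge_config(overrides)


print(load_config()["min_confidence"])
```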
## Signature Database
The tool includes a pre-built signature database with **131 OSS components** including:
- **Mobile SDKs**: Facebook Android SDK, Google Firebase, Google Ads
- **Java Libraries**: Jackson, Apache Commons, Google Guava, Netty
- **Media Libraries**: FFmpeg, x264, x265, Vorbis, Opus
- **Crypto Libraries**: Bouncy Castle, mbedTLS variants
- **Development Tools**: Lombok, Dagger, RxJava, OkHttp
### Signature Management
For detailed information on creating, updating, and managing signatures, see [docs/SIGNATURE_MANAGEMENT.md](docs/SIGNATURE_MANAGEMENT.md).
```bash
# View current database stats
python -c "
from binarysniffer.storage.database import SignatureDatabase
db = SignatureDatabase('data/signatures.db')
print(db.get_stats())
"
```
## Development
### Setting up development environment
```bash
git clone https://github.com/oscarvalenzuelab/semantic-copycat-binarysniffer
cd semantic-copycat-binarysniffer
pip install -e ".[dev]"
# Optional: Install CTags for enhanced source analysis
# macOS: brew install universal-ctags
# Ubuntu: apt install universal-ctags
# Optional: Install LIEF for enhanced binary analysis
pip install lief
```
### Running tests
```bash
# Run all tests
pytest tests/
# Run specific test suites
pytest tests/test_archive_extractor.py -v # Archive processing
pytest tests/test_integration.py -v # End-to-end scenarios
# Run with coverage
pytest tests/ --cov=binarysniffer
```
### Building and Testing Package
```bash
# Build package
python -m build
# Test installation
pip install dist/*.whl
# Test CLI
binarysniffer --help
```
## License
Apache License 2.0 - See LICENSE file for details.
## Contributing
Contributions are welcome! Please read our [Contributing Guide](CONTRIBUTING.md) for details on our code of conduct and the process for submitting pull requests.
Raw data
{
"_id": null,
"home_page": null,
"name": "semantic-copycat-binarysniffer",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "binary-analysis, license-compliance, signature-matching, oss-detection, semantic-analysis, semantic-copycat, code-copycat",
"author": null,
"author_email": "\"Oscar Valenzuela B.\" <oscar.valenzuela.b@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/a1/ed/7e65f02a77bafe9d3165d8a27aa237df5eca8dd2209bdebac577bbedcecf/semantic_copycat_binarysniffer-1.6.6.tar.gz",
"platform": null,
"description": "# Semantic Copycat BinarySniffer\n\nA high-performance CLI tool and Python library for detecting open source components in binaries through semantic signature matching. Specialized for analyzing mobile apps (APK/IPA), Java archives, and source code to identify OSS components and their licenses.\n\n## Features\n\n### Core Analysis\n- **Deterministic Results**: Consistent analysis results across multiple runs (NEW in v1.6.3)\n- **Fast Local Analysis**: SQLite-based signature storage with optimized direct matching\n- **Efficient Matching**: MinHash LSH for similarity detection, trigram indexing for substring matching\n- **Dual Interface**: Use as CLI tool or Python library\n- **Smart Compression**: ZSTD-compressed signatures with ~90% size reduction\n- **Low Memory Footprint**: Streaming analysis with <100MB memory usage\n\n### Enhanced Binary Analysis (NEW in v1.6.0)\n- **LIEF Integration**: Advanced ELF/PE/Mach-O analysis with symbol and import extraction\n- **Android DEX Support**: Specialized extractor for DEX bytecode files\n- **Improved APK Detection**: 25+ components detected vs 1 previously (152K features extracted)\n- **Substring Matching**: Detects components even with partial pattern matches\n- **Progress Indication**: Real-time progress bars for long analysis operations\n- **New Component Signatures**: OkHttp, OpenSSL, SQLite, ICU, FreeType, WebKit\n\n### Archive Support\n- **Android APK Analysis**: Extract and analyze AndroidManifest.xml, DEX files, native libraries\n- **iOS IPA Analysis**: Parse Info.plist, detect frameworks, analyze executables\n- **Java Archive Support**: Process JAR/WAR files with MANIFEST.MF parsing and package detection\n- **Python Package Support**: Analyze wheels (.whl) and eggs (.egg) with metadata extraction\n- **Nested Archive Processing**: Handle archives containing other archives\n- **Comprehensive Format Support**: ZIP, TAR, 7z, and compound formats\n\n### Enhanced Source Analysis\n- **CTags Integration**: 
Advanced source code analysis when universal-ctags is available\n- **Multi-language Support**: C/C++, Python, Java, JavaScript, Go, Rust, PHP, Swift, Kotlin\n- **Semantic Symbol Extraction**: Functions, classes, structs, constants, and dependencies\n- **Graceful Fallback**: Regex-based extraction when CTags is unavailable\n\n### Signature Database\n- **90+ OSS Components**: Pre-loaded signatures from Facebook SDK, Jackson, FFmpeg, and more\n- **Real-world Detection**: Thousands of component signatures from BSA database migration\n- **License Detection**: Automatic license identification for detected components\n- **Metadata Rich**: Publisher, version, and ecosystem information for each component\n\n## Installation\n\n### From PyPI\n```bash\npip install semantic-copycat-binarysniffer\n```\n\n### From Source\n```bash\ngit clone https://github.com/oscarvalenzuelab/semantic-copycat-binarysniffer\ncd semantic-copycat-binarysniffer\npip install -e .\n```\n\n### With Performance Extras\n```bash\npip install semantic-copycat-binarysniffer[fast]\n```\n\n## Quick Start\n\n### CLI Usage\n\n```bash\n# Analyze a single file (enhanced detection is always enabled)\nbinarysniffer analyze /path/to/binary\n\n# Analyze mobile applications\nbinarysniffer analyze app.apk # Android APK\nbinarysniffer analyze app.ipa # iOS IPA\nbinarysniffer analyze library.jar # Java JAR\n\n# Analyze directories recursively\nbinarysniffer analyze /path/to/project --recursive\n\n# Custom threshold (default is 0.5 for optimal detection)\nbinarysniffer analyze file.exe --threshold 0.3 # More sensitive\nbinarysniffer analyze file.exe --threshold 0.8 # More conservative\n\n# Export results as JSON\nbinarysniffer analyze project/ --format json -o results.json\n\n# Deep analysis with pattern filtering\nbinarysniffer analyze project/ --recursive --deep -p \"*.so\" -p \"*.dll\"\n\n# Non-deterministic mode (if needed for performance testing)\nbinarysniffer --non-deterministic analyze file.bin\n```\n\n### Python 
Library Usage\n\n```python\nfrom binarysniffer import EnhancedBinarySniffer\n\n# Initialize analyzer (enhanced mode is default)\nsniffer = EnhancedBinarySniffer()\n\n# Analyze a single file\nresult = sniffer.analyze_file(\"/path/to/binary\")\nfor match in result.matches:\n print(f\"{match.component} - {match.confidence:.2%}\")\n print(f\"License: {match.license}\")\n\n# Analyze mobile applications\napk_result = sniffer.analyze_file(\"app.apk\")\nipa_result = sniffer.analyze_file(\"app.ipa\")\njar_result = sniffer.analyze_file(\"library.jar\")\n\n# Analyze with custom threshold (default is 0.3)\nresult = sniffer.analyze_file(\"file.exe\", confidence_threshold=0.1) # More sensitive\nresult = sniffer.analyze_file(\"file.exe\", confidence_threshold=0.8) # More conservative\n\n# Directory analysis\nresults = sniffer.analyze_directory(\"/path/to/project\", recursive=True)\nfor file_path, result in results.items():\n if result.matches:\n print(f\"{file_path}: {len(result.matches)} components detected\")\n```\n\n### Creating Signatures\n\nCreate custom signatures for your components:\n\n```bash\n# From binary files (recommended)\nbinarysniffer signatures create /usr/bin/ffmpeg --name FFmpeg --version 4.4.1\n\n# From source code\nbinarysniffer signatures create /path/to/source --name MyLibrary --license MIT\n\n# With full metadata\nbinarysniffer signatures create binary.so \\\n --name \"My Component\" \\\n --version 2.0.0 \\\n --license Apache-2.0 \\\n --publisher \"My Company\" \\\n --output signatures/my-component.json\n```\n\n## Architecture\n\nThe tool uses a multi-tiered approach for efficient matching:\n\n1. **Tier 1 - Bloom Filters**: Quick elimination of non-matches (microseconds)\n2. **Tier 2 - MinHash LSH**: Fast similarity search (milliseconds)\n3. 
**Tier 3 - Detailed Matching**: Precise signature verification (seconds)\n\n## Performance\n\n- **Analysis Speed**: ~1 second per binary file (5x faster in v1.6.3)\n- **Archive Processing**: ~100-500ms for APK/IPA files (depends on contents)\n- **Signature Storage**: ~3.5MB database with 5,136 signatures from 131 components\n- **Memory Usage**: <100MB during analysis, <200MB for large archives\n- **Deterministic Results**: Consistent detection across runs (NEW in v1.6.3)\n\n## Configuration\n\nConfiguration file location: `~/.binarysniffer/config.json`\n\n```json\n{\n \"signature_sources\": [\n \"https://signatures.binarysniffer.io/core.xmdb\"\n ],\n \"cache_size_mb\": 100,\n \"parallel_workers\": 4,\n \"min_confidence\": 0.5,\n \"auto_update\": true,\n \"update_check_interval_days\": 7\n}\n```\n\n## Signature Database\n\nThe tool includes a pre-built signature database with **131 OSS components** including:\n- **Mobile SDKs**: Facebook Android SDK, Google Firebase, Google Ads\n- **Java Libraries**: Jackson, Apache Commons, Google Guava, Netty \n- **Media Libraries**: FFmpeg, x264, x265, Vorbis, Opus\n- **Crypto Libraries**: Bounty Castle, mbedTLS variants\n- **Development Tools**: Lombok, Dagger, RxJava, OkHttp\n\n### Signature Management\n\nFor detailed information on creating, updating, and managing signatures, see [docs/SIGNATURE_MANAGEMENT.md](docs/SIGNATURE_MANAGEMENT.md).\n\n```bash\n# View current database stats\npython -c \"\nfrom binarysniffer.storage.database import SignatureDatabase\ndb = SignatureDatabase('data/signatures.db')\nprint(db.get_stats())\n\"\n```\n\n## Development\n\n### Setting up development environment\n```bash\ngit clone https://github.com/oscarvalenzuelab/semantic-copycat-binarysniffer\ncd semantic-copycat-binarysniffer\npip install -e .[dev]\n\n# Optional: Install CTags for enhanced source analysis\n# macOS: brew install universal-ctags\n# Ubuntu: apt install universal-ctags\n\n# Optional: Install LIEF for enhanced binary 
analysis\npip install lief\n```\n\n### Running tests\n```bash\n# Run all tests\npytest tests/\n\n# Run specific test suites \npytest tests/test_archive_extractor.py -v # Archive processing\npytest tests/test_integration.py -v # End-to-end scenarios\n\n# Run with coverage\npytest tests/ --cov=binarysniffer\n```\n\n### Building and Testing Package\n```bash\n# Build package\npython -m build\n\n# Test installation\npip install dist/*.whl\n\n# Test CLI\nbinarysniffer --help\n```\n\n## License\n\nApache License 2.0 - See LICENSE file for details.\n\n## Contributing\n\nContributions are welcome! Please read our [Contributing Guide](CONTRIBUTING.md) for details on our code of conduct and the process for submitting pull requests.\n",
"bugtrack_url": null,
"license": "Apache-2.0",
"summary": "A high-performance CLI and library for detecting open source components in binaries through semantic signature matching",
"version": "1.6.6",
"project_urls": {
"Bug Tracker": "https://github.com/oscarvalenzuelab/semantic-copycat-binarysniffer/issues",
"Documentation": "https://github.com/oscarvalenzuelab/semantic-copycat-binarysniffer/tree/main/docs",
"Homepage": "https://github.com/oscarvalenzuelab/semantic-copycat-binarysniffer"
},
"split_keywords": [
"binary-analysis",
" license-compliance",
" signature-matching",
" oss-detection",
" semantic-analysis",
" semantic-copycat",
" code-copycat"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "f599e50982f51316f45e7f5cab78f9b04770c89be21b81b644431dc90c65f7e1",
"md5": "84dc00cfcbbb6d86eb19ceed4ca713d6",
"sha256": "1acf3e292cd6ad818683f2a092e824f09bbaa15f12caa91d10e27a3e7f82275d"
},
"downloads": -1,
"filename": "semantic_copycat_binarysniffer-1.6.6-py3-none-any.whl",
"has_sig": false,
"md5_digest": "84dc00cfcbbb6d86eb19ceed4ca713d6",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 249255,
"upload_time": "2025-08-06T06:44:19",
"upload_time_iso_8601": "2025-08-06T06:44:19.057369Z",
"url": "https://files.pythonhosted.org/packages/f5/99/e50982f51316f45e7f5cab78f9b04770c89be21b81b644431dc90c65f7e1/semantic_copycat_binarysniffer-1.6.6-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "a1ed7e65f02a77bafe9d3165d8a27aa237df5eca8dd2209bdebac577bbedcecf",
"md5": "e8ed1ad1592f4e3413c7bad06c4d7e23",
"sha256": "508cde1afb8e79b3463189b4d833ac137bdd2c9c24f83580dbe8b0119e04d923"
},
"downloads": -1,
"filename": "semantic_copycat_binarysniffer-1.6.6.tar.gz",
"has_sig": false,
"md5_digest": "e8ed1ad1592f4e3413c7bad06c4d7e23",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 151862,
"upload_time": "2025-08-06T06:44:20",
"upload_time_iso_8601": "2025-08-06T06:44:20.524180Z",
"url": "https://files.pythonhosted.org/packages/a1/ed/7e65f02a77bafe9d3165d8a27aa237df5eca8dd2209bdebac577bbedcecf/semantic_copycat_binarysniffer-1.6.6.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-06 06:44:20",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "oscarvalenzuelab",
"github_project": "semantic-copycat-binarysniffer",
"github_not_found": true,
"lcname": "semantic-copycat-binarysniffer"
}