semantic-copycat-binarysniffer


- **Name**: semantic-copycat-binarysniffer
- **Version**: 1.6.6
- **Summary**: A high-performance CLI and library for detecting open source components in binaries through semantic signature matching
- **Upload time**: 2025-08-06 06:44:20
- **Requires Python**: >=3.8
- **License**: Apache-2.0
- **Keywords**: binary-analysis, license-compliance, signature-matching, oss-detection, semantic-analysis, semantic-copycat, code-copycat

# Semantic Copycat BinarySniffer

A high-performance CLI tool and Python library for detecting open source components in binaries through semantic signature matching. It specializes in analyzing mobile apps (APK/IPA), Java archives, and source code to identify OSS components and their licenses.

## Features

### Core Analysis
- **Deterministic Results**: Consistent analysis results across multiple runs (NEW in v1.6.3)
- **Fast Local Analysis**: SQLite-based signature storage with optimized direct matching
- **Efficient Matching**: MinHash LSH for similarity detection, trigram indexing for substring matching
- **Dual Interface**: Use as CLI tool or Python library
- **Smart Compression**: ZSTD-compressed signatures with ~90% size reduction
- **Low Memory Footprint**: Streaming analysis with <100MB memory usage

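To illustrate the trigram indexing mentioned in the list above, here is a minimal sketch of an inverted trigram index used for substring lookup. It is not BinarySniffer's implementation: `TrigramIndex`, `trigrams`, and the sample strings are hypothetical names chosen for this example.

```python
from __future__ import annotations

from collections import defaultdict


def trigrams(text: str) -> set[str]:
    """Return every 3-character substring of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    """Toy inverted index mapping each trigram to the strings that contain it."""

    def __init__(self) -> None:
        self._items: set[str] = set()
        self._postings: dict[str, set[str]] = defaultdict(set)

    def add(self, item: str) -> None:
        self._items.add(item)
        for gram in trigrams(item):
            self._postings[gram].add(item)

    def search(self, query: str) -> list[str]:
        """Return indexed strings that contain query as a substring."""
        grams = trigrams(query)
        if grams:
            # A string can only contain the query if it contains all of its trigrams.
            candidates = set.intersection(
                *(self._postings.get(gram, set()) for gram in grams)
            )
        else:
            # Queries shorter than three characters cannot use the index.
            candidates = self._items
        # Sharing trigrams is necessary but not sufficient, so verify each candidate.
        return sorted(item for item in candidates if query in item)


# Hypothetical sample data for the demonstration.
index = TrigramIndex()
for symbol in ("sqlite3_open_v2", "FT_Init_FreeType", "okhttp3/internal/http2"):
    index.add(symbol)

print(index.search("sqlite3"))
print(index.search("openssl"))
```

The index narrows the candidate set cheaply, and the final `query in item` check removes strings that share the trigrams without containing the full query.
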
### Enhanced Binary Analysis (NEW in v1.6.0)
- **LIEF Integration**: Advanced ELF/PE/Mach-O analysis with symbol and import extraction
- **Android DEX Support**: Specialized extractor for DEX bytecode files
- **Improved APK Detection**: 25+ components detected vs 1 previously (152K features extracted)
- **Substring Matching**: Detects components even with partial pattern matches
- **Progress Indication**: Real-time progress bars for long analysis operations
- **New Component Signatures**: OkHttp, OpenSSL, SQLite, ICU, FreeType, WebKit

### Archive Support
- **Android APK Analysis**: Extract and analyze AndroidManifest.xml, DEX files, native libraries
- **iOS IPA Analysis**: Parse Info.plist, detect frameworks, analyze executables
- **Java Archive Support**: Process JAR/WAR files with MANIFEST.MF parsing and package detection
- **Python Package Support**: Analyze wheels (.whl) and eggs (.egg) with metadata extraction
- **Nested Archive Processing**: Handle archives containing other archives
- **Comprehensive Format Support**: ZIP, TAR, 7z, and compound formats

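Nested archive processing can be sketched with the standard library alone. The example below is an illustration under stated assumptions, not the tool's extractor: `iter_archive_members`, `ARCHIVE_SUFFIXES`, and the `!/` path separator are hypothetical, and only ZIP-based containers (which APK, IPA, JAR, WAR, wheel, and egg files all are) are handled. TAR and 7z are left out.

```python
from __future__ import annotations

import io
import zipfile
from typing import Iterator

# Hypothetical list of ZIP-based container extensions to descend into.
ARCHIVE_SUFFIXES = (".zip", ".jar", ".war", ".apk", ".ipa", ".whl", ".egg")


def iter_archive_members(
    data: bytes, prefix: str = "", max_depth: int = 3
) -> Iterator[tuple[str, bytes]]:
    """Yield (path, content) for each file, descending into nested ZIP archives."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            path = f"{prefix}{info.filename}"
            content = archive.read(info)
            looks_nested = info.filename.lower().endswith(ARCHIVE_SUFFIXES)
            if looks_nested and max_depth > 0 and zipfile.is_zipfile(io.BytesIO(content)):
                # Recurse, marking the boundary between outer and inner archive.
                yield from iter_archive_members(content, f"{path}!/", max_depth - 1)
            else:
                yield path, content


def make_zip(files: dict[str, bytes]) -> bytes:
    """Build an in-memory ZIP archive (used only for the demonstration)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


inner = make_zip({"com/example/A.class": b"inner"})
outer = make_zip({"classes.dex": b"dex", "libs/inner.jar": inner})
found = dict(iter_archive_members(outer))
print(sorted(found))
```

The `max_depth` limit is a guard against deeply nested or maliciously crafted archives. A production extractor would also want to cap the total extracted size.
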
### Enhanced Source Analysis
- **CTags Integration**: Advanced source code analysis when universal-ctags is available
- **Multi-language Support**: C/C++, Python, Java, JavaScript, Go, Rust, PHP, Swift, Kotlin
- **Semantic Symbol Extraction**: Functions, classes, structs, constants, and dependencies
- **Graceful Fallback**: Regex-based extraction when CTags is unavailable

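The graceful fallback described above can be pictured as follows. This is a rough sketch only: the patterns cover Python source, the function names are hypothetical, and the assumption that the ctags executable is called `ctags` may not hold on every system. The real extractor is likely more thorough.

```python
from __future__ import annotations

import re
import shutil

# Hypothetical patterns, one per symbol kind; far simpler than what ctags reports.
PATTERNS = {
    "function": re.compile(r"^[ \t]*def[ \t]+([A-Za-z_]\w*)[ \t]*\(", re.MULTILINE),
    "class": re.compile(r"^[ \t]*class[ \t]+([A-Za-z_]\w*)", re.MULTILINE),
    "constant": re.compile(r"^([A-Z][A-Z0-9_]+)[ \t]*=", re.MULTILINE),
}


def ctags_available() -> bool:
    """Assumption: universal-ctags is installed under the name 'ctags'."""
    return shutil.which("ctags") is not None


def extract_symbols(source: str) -> dict[str, list[str]]:
    """Regex-based extraction, usable when no ctags binary is found."""
    return {kind: pattern.findall(source) for kind, pattern in PATTERNS.items()}


sample = '''
MAX_RETRIES = 3

class HttpClient:
    def get(self, url):
        return url

def main():
    pass
'''

symbols = extract_symbols(sample)
print(symbols)
```

A caller would check `ctags_available()` first and use `extract_symbols` only when it returns `False`.
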
### Signature Database
- **131 OSS Components**: Pre-loaded signatures for components such as Facebook SDK, Jackson, FFmpeg, and more
- **Real-world Detection**: Thousands of component signatures from BSA database migration
- **License Detection**: Automatic license identification for detected components
- **Metadata Rich**: Publisher, version, and ecosystem information for each component

## Installation

### From PyPI
```bash
pip install semantic-copycat-binarysniffer
```

### From Source
```bash
git clone https://github.com/oscarvalenzuelab/semantic-copycat-binarysniffer
cd semantic-copycat-binarysniffer
pip install -e .
```

### With Performance Extras
```bash
pip install "semantic-copycat-binarysniffer[fast]"
```

## Quick Start

### CLI Usage

```bash
# Analyze a single file (enhanced detection is always enabled)
binarysniffer analyze /path/to/binary

# Analyze mobile applications
binarysniffer analyze app.apk                    # Android APK
binarysniffer analyze app.ipa                    # iOS IPA
binarysniffer analyze library.jar                # Java JAR

# Analyze directories recursively
binarysniffer analyze /path/to/project --recursive

# Custom threshold (default is 0.5 for optimal detection)
binarysniffer analyze file.exe --threshold 0.3   # More sensitive
binarysniffer analyze file.exe --threshold 0.8   # More conservative

# Export results as JSON
binarysniffer analyze project/ --format json -o results.json

# Deep analysis with pattern filtering
binarysniffer analyze project/ --recursive --deep -p "*.so" -p "*.dll"

# Non-deterministic mode (if needed for performance testing)
binarysniffer --non-deterministic analyze file.bin
```

### Python Library Usage

```python
from binarysniffer import EnhancedBinarySniffer

# Initialize analyzer (enhanced mode is default)
sniffer = EnhancedBinarySniffer()

# Analyze a single file
result = sniffer.analyze_file("/path/to/binary")
for match in result.matches:
    print(f"{match.component} - {match.confidence:.2%}")
    print(f"License: {match.license}")

# Analyze mobile applications
apk_result = sniffer.analyze_file("app.apk")
ipa_result = sniffer.analyze_file("app.ipa")
jar_result = sniffer.analyze_file("library.jar")

# Analyze with custom threshold (default is 0.3)
result = sniffer.analyze_file("file.exe", confidence_threshold=0.1)  # More sensitive
result = sniffer.analyze_file("file.exe", confidence_threshold=0.8)  # More conservative

# Directory analysis
results = sniffer.analyze_directory("/path/to/project", recursive=True)
for file_path, result in results.items():
    if result.matches:
        print(f"{file_path}: {len(result.matches)} components detected")
```
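
Results from a directory scan are often rolled up into a per-license report. The sketch below assumes only the attributes used in the example above (`result.matches`, `match.component`, `match.confidence`, `match.license`). `summarize_licenses` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins with made-up component names so the snippet runs without the library installed.

```python
from __future__ import annotations

from collections import defaultdict
from types import SimpleNamespace


def summarize_licenses(results: dict, min_confidence: float = 0.5) -> dict[str, list[str]]:
    """Group detected component names by license across all analyzed files."""
    by_license: dict[str, set[str]] = defaultdict(set)
    for result in results.values():
        for match in result.matches:
            if match.confidence >= min_confidence:
                # Assumption: license may be missing, so fall back to a label.
                by_license[match.license or "unknown"].add(match.component)
    return {name: sorted(components) for name, components in sorted(by_license.items())}


# Stand-in data shaped like the mapping returned by analyze_directory().
fake_results = {
    "libs/libalpha.so": SimpleNamespace(matches=[
        SimpleNamespace(component="libalpha", confidence=0.92, license="MIT"),
        SimpleNamespace(component="libbeta", confidence=0.41, license="Apache-2.0"),
    ]),
    "app/gamma.jar": SimpleNamespace(matches=[
        SimpleNamespace(component="gamma", confidence=0.77, license="Apache-2.0"),
    ]),
}

summary = summarize_licenses(fake_results)
for license_name, components in summary.items():
    print(f"{license_name}: {', '.join(components)}")
```

With real data, pass the dictionary returned by `sniffer.analyze_directory(...)` in place of `fake_results`.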

### Creating Signatures

Create custom signatures for your components:

```bash
# From binary files (recommended)
binarysniffer signatures create /usr/bin/ffmpeg --name FFmpeg --version 4.4.1

# From source code
binarysniffer signatures create /path/to/source --name MyLibrary --license MIT

# With full metadata
binarysniffer signatures create binary.so \
  --name "My Component" \
  --version 2.0.0 \
  --license Apache-2.0 \
  --publisher "My Company" \
  --output signatures/my-component.json
```

## Architecture

The tool uses a multi-tiered approach for efficient matching:

1. **Tier 1 - Bloom Filters**: Quick elimination of non-matches (microseconds)
2. **Tier 2 - MinHash LSH**: Fast similarity search (milliseconds)
3. **Tier 3 - Detailed Matching**: Precise signature verification (seconds)

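The three tiers above can be sketched in plain Python. This is a simplified illustration rather than the tool's code: all names are hypothetical, the MinHash tier compares sketches with a linear scan instead of LSH banding, and signature sketches are recomputed on every call where a real system would precompute them.

```python
from __future__ import annotations

import hashlib


def seeded_hash(value: str, seed: int) -> int:
    """64-bit hash of value; each seed acts as an independent hash function."""
    digest = hashlib.blake2b(
        value.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


class BloomFilter:
    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str) -> list[int]:
        return [seeded_hash(item, seed) % self.size_bits for seed in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def minhash(features: set[str], num_perm: int = 64) -> list[int]:
    """MinHash sketch: the minimum hash of a non-empty set under each seed."""
    return [min(seeded_hash(f, seed) for f in features) for seed in range(num_perm)]


def estimated_jaccard(a: list[int], b: list[int]) -> float:
    """Fraction of sketch positions that agree, which estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def match(features, signatures, bloom, threshold=0.5, slack=0.2):
    # Tier 1: discard features the Bloom filter has definitely never seen.
    known = {f for f in features if f in bloom}
    if not known:
        return []
    query_sketch = minhash(known)
    hits = []
    for name, sig_features in signatures.items():
        # Tier 2: cheap similarity estimate; skip clearly dissimilar signatures.
        if estimated_jaccard(query_sketch, minhash(sig_features)) < threshold - slack:
            continue
        # Tier 3: exact Jaccard similarity on the surviving candidates.
        score = len(known & sig_features) / len(known | sig_features)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical signature data for the demonstration.
signatures = {
    "libalpha": {"alpha_init", "alpha_encode", "alpha_free"},
    "libbeta": {"beta_open", "beta_close"},
}
bloom = BloomFilter()
for feature_set in signatures.values():
    for feature in feature_set:
        bloom.add(feature)

print(match({"alpha_init", "alpha_encode", "alpha_free"}, signatures, bloom))
print(match({"unrelated_symbol"}, signatures, bloom))
```

Each tier is cheaper and less precise than the next, so most non-matching input is rejected before the exact comparison runs.
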
## Performance

- **Analysis Speed**: ~1 second per binary file (5x faster in v1.6.3)
- **Archive Processing**: ~100-500ms for APK/IPA files (depends on contents)
- **Signature Storage**: ~3.5MB database with 5,136 signatures from 131 components
- **Memory Usage**: <100MB during analysis, <200MB for large archives
- **Deterministic Results**: Consistent detection across runs (NEW in v1.6.3)

## Configuration

Configuration file location: `~/.binarysniffer/config.json`

```json
{
  "signature_sources": [
    "https://signatures.binarysniffer.io/core.xmdb"
  ],
  "cache_size_mb": 100,
  "parallel_workers": 4,
  "min_confidence": 0.5,
  "auto_update": true,
  "update_check_interval_days": 7
}
```
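
For scripts that need to read the same file, a loader along these lines may be useful. It is a sketch, not the tool's own loader: `load_config` and `DEFAULTS` are hypothetical, the default values are copied from the example above, and the validation ranges are assumptions.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path

# Values taken from the example configuration; treating them as defaults is an assumption.
DEFAULTS = {
    "signature_sources": ["https://signatures.binarysniffer.io/core.xmdb"],
    "cache_size_mb": 100,
    "parallel_workers": 4,
    "min_confidence": 0.5,
    "auto_update": True,
    "update_check_interval_days": 7,
}


def load_config(path: Path | None = None) -> dict:
    """Merge the user's config file, if present, over the defaults."""
    if path is None:
        path = Path.home() / ".binarysniffer" / "config.json"
    config = dict(DEFAULTS)
    if path.exists():
        config.update(json.loads(path.read_text(encoding="utf-8")))
    # Assumed sanity checks; the tool itself may validate differently.
    if not 0.0 <= config["min_confidence"] <= 1.0:
        raise ValueError("min_confidence must be between 0 and 1")
    if config["parallel_workers"] < 1:
        raise ValueError("parallel_workers must be at least 1")
    return config


# Demonstration with a temporary file instead of the real home directory.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(json.dumps({"min_confidence": 0.3}), encoding="utf-8")
    config = load_config(config_path)

print(config["min_confidence"], config["parallel_workers"])
```

Keys missing from the file keep their default values, so a partial config file is enough to override a single setting.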

## Signature Database

The tool includes a pre-built signature database with **131 OSS components** including:
- **Mobile SDKs**: Facebook Android SDK, Google Firebase, Google Ads
- **Java Libraries**: Jackson, Apache Commons, Google Guava, Netty  
- **Media Libraries**: FFmpeg, x264, x265, Vorbis, Opus
- **Crypto Libraries**: Bouncy Castle, mbedTLS variants
- **Development Tools**: Lombok, Dagger, RxJava, OkHttp

### Signature Management

For detailed information on creating, updating, and managing signatures, see [docs/SIGNATURE_MANAGEMENT.md](docs/SIGNATURE_MANAGEMENT.md).

```bash
# View current database stats
python -c "
from binarysniffer.storage.database import SignatureDatabase
db = SignatureDatabase('data/signatures.db')
print(db.get_stats())
"
```

## Development

### Setting up development environment
```bash
git clone https://github.com/oscarvalenzuelab/semantic-copycat-binarysniffer
cd semantic-copycat-binarysniffer
pip install -e ".[dev]"

# Optional: Install CTags for enhanced source analysis
# macOS: brew install universal-ctags
# Ubuntu: apt install universal-ctags

# Optional: Install LIEF for enhanced binary analysis
pip install lief
```

### Running tests
```bash
# Run all tests
pytest tests/

# Run specific test suites  
pytest tests/test_archive_extractor.py -v    # Archive processing
pytest tests/test_integration.py -v         # End-to-end scenarios

# Run with coverage
pytest tests/ --cov=binarysniffer
```

### Building and Testing Package
```bash
# Build package
python -m build

# Test installation
pip install dist/*.whl

# Test CLI
binarysniffer --help
```

## License

Apache License 2.0 - See LICENSE file for details.

## Contributing

Contributions are welcome! Please read our [Contributing Guide](CONTRIBUTING.md) for details on our code of conduct and the process for submitting pull requests.

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "semantic-copycat-binarysniffer",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "binary-analysis, license-compliance, signature-matching, oss-detection, semantic-analysis, semantic-copycat, code-copycat",
    "author": null,
    "author_email": "\"Oscar Valenzuela B.\" <oscar.valenzuela.b@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/a1/ed/7e65f02a77bafe9d3165d8a27aa237df5eca8dd2209bdebac577bbedcecf/semantic_copycat_binarysniffer-1.6.6.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "A high-performance CLI and library for detecting open source components in binaries through semantic signature matching",
    "version": "1.6.6",
    "project_urls": {
        "Bug Tracker": "https://github.com/oscarvalenzuelab/semantic-copycat-binarysniffer/issues",
        "Documentation": "https://github.com/oscarvalenzuelab/semantic-copycat-binarysniffer/tree/main/docs",
        "Homepage": "https://github.com/oscarvalenzuelab/semantic-copycat-binarysniffer"
    },
    "split_keywords": [
        "binary-analysis",
        " license-compliance",
        " signature-matching",
        " oss-detection",
        " semantic-analysis",
        " semantic-copycat",
        " code-copycat"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "f599e50982f51316f45e7f5cab78f9b04770c89be21b81b644431dc90c65f7e1",
                "md5": "84dc00cfcbbb6d86eb19ceed4ca713d6",
                "sha256": "1acf3e292cd6ad818683f2a092e824f09bbaa15f12caa91d10e27a3e7f82275d"
            },
            "downloads": -1,
            "filename": "semantic_copycat_binarysniffer-1.6.6-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "84dc00cfcbbb6d86eb19ceed4ca713d6",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 249255,
            "upload_time": "2025-08-06T06:44:19",
            "upload_time_iso_8601": "2025-08-06T06:44:19.057369Z",
            "url": "https://files.pythonhosted.org/packages/f5/99/e50982f51316f45e7f5cab78f9b04770c89be21b81b644431dc90c65f7e1/semantic_copycat_binarysniffer-1.6.6-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "a1ed7e65f02a77bafe9d3165d8a27aa237df5eca8dd2209bdebac577bbedcecf",
                "md5": "e8ed1ad1592f4e3413c7bad06c4d7e23",
                "sha256": "508cde1afb8e79b3463189b4d833ac137bdd2c9c24f83580dbe8b0119e04d923"
            },
            "downloads": -1,
            "filename": "semantic_copycat_binarysniffer-1.6.6.tar.gz",
            "has_sig": false,
            "md5_digest": "e8ed1ad1592f4e3413c7bad06c4d7e23",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 151862,
            "upload_time": "2025-08-06T06:44:20",
            "upload_time_iso_8601": "2025-08-06T06:44:20.524180Z",
            "url": "https://files.pythonhosted.org/packages/a1/ed/7e65f02a77bafe9d3165d8a27aa237df5eca8dd2209bdebac577bbedcecf/semantic_copycat_binarysniffer-1.6.6.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-06 06:44:20",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "oscarvalenzuelab",
    "github_project": "semantic-copycat-binarysniffer",
    "github_not_found": true,
    "lcname": "semantic-copycat-binarysniffer"
}
        