semantic-copycat-oslili


| Field | Value |
|---|---|
| Name | semantic-copycat-oslili |
| Version | 1.5.5 |
| Home page | https://github.com/oscarvalenzuelab/semantic-copycat-oslili |
| Summary | Semantic Copycat Open Source License Identification Library |
| Upload time | 2025-10-25 00:55:55 |
| Author | Oscar Valenzuela B. |
| Requires Python | >=3.8 |
| License | Apache-2.0 |
| Keywords | license, attribution, legal, sbom, spdx, copyright, compliance, open-source, package-analysis |
| Requirements | requests, pyyaml, click, colorama, fuzzywuzzy, python-Levenshtein, python-tlsh |

# semantic-copycat-oslili

A high-performance tool that identifies licenses and copyright information in local source code. It produces detailed evidence of where each license is detected and supports all 700+ SPDX license identifiers.

## What It Does

`semantic-copycat-oslili` analyzes local source code to produce evidence of:
- **License detection** - Shows which files contain which licenses with confidence scores
- **SPDX identifiers** - Detects SPDX-License-Identifier tags in ALL readable files
- **Package metadata** - Extracts licenses from package.json, pyproject.toml, METADATA files
- **Copyright statements** - Extracts copyright holders and years with intelligent filtering

The tool outputs standardized JSON evidence showing exactly where each license was detected, the detection method used, and confidence scores.

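As an illustration of the SPDX identifier item above, a tag scan can be sketched with a regular expression. The pattern and the `find_spdx_tags` helper are hypothetical and are not the library's code. Real SPDX expressions can contain operators such as `AND`, `OR`, and `WITH`, which this sketch captures only as raw text.

```python
import re
from typing import List

# Matches "SPDX-License-Identifier: <expression>" anywhere in a line.
# The expression is captured as raw text up to the end of the line.
SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*([^\n\r]+)", re.IGNORECASE)


def find_spdx_tags(text: str) -> List[str]:
    """Return the license expressions found in SPDX tags, in order of appearance."""
    tags = []
    for match in SPDX_TAG.finditer(text):
        expression = match.group(1).strip()
        # Drop a trailing comment terminator such as "*/" or "-->".
        expression = re.sub(r"\s*(\*/|-->)\s*$", "", expression)
        tags.append(expression)
    return tags


source = """\
// SPDX-License-Identifier: Apache-2.0
/* SPDX-License-Identifier: MIT OR GPL-2.0-only */
int main(void) { return 0; }
"""
print(find_spdx_tags(source))  # ['Apache-2.0', 'MIT OR GPL-2.0-only']
```
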
### Key Features

- **Evidence-based output**: Shows exact file paths, confidence scores, and detection methods
- **License hierarchy**: Categorizes licenses as declared vs detected vs referenced
- **Multiple output formats**: Evidence JSON, KissBOM, CycloneDX SBOM
- **Archive extraction**: Automatically extracts and scans zip, tar, and other archive formats
- **Caching support**: Speed up repeated scans with intelligent caching
- **Parallel processing**: Multi-threaded scanning with configurable thread count
- **Three-tier detection**: 
  - Dice-Sørensen similarity matching (97% threshold)
  - TLSH fuzzy hashing with confirmation
  - Regex pattern matching
- **Safe directory traversal**: Depth limiting and symlink loop protection
- **Smart normalization**: Handles license variations and common aliases
- **No file limits**: Processes files of any size with intelligent sampling
- **Enhanced metadata support**: Detects licenses in package.json, METADATA, pyproject.toml
- **False positive filtering**: Advanced filtering for code patterns and invalid matches

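The depth limiting and symlink loop protection listed above can be pictured with a short sketch. It shows one reasonable approach (tracking resolved directory paths) and is not the tool's actual traversal code. The `walk_files` helper and its `max_depth` semantics (0 means only the root directory) are hypothetical.

```python
import os
from typing import Iterator


def walk_files(root: str, max_depth: int) -> Iterator[str]:
    """Yield file paths under root, descending at most max_depth levels.

    Directories are tracked by resolved real path, so a symlink that points
    back to an already visited directory is skipped instead of looping.
    """
    visited = set()
    stack = [(root, 0)]
    while stack:
        directory, depth = stack.pop()
        real = os.path.realpath(directory)
        if real in visited:
            continue  # symlink loop or duplicate directory
        visited.add(real)
        try:
            entries = sorted(os.scandir(directory), key=lambda e: e.name)
        except OSError:
            continue  # unreadable directory: skip it
        for entry in entries:
            if entry.is_dir(follow_symlinks=True):
                if depth < max_depth:
                    stack.append((entry.path, depth + 1))
            elif entry.is_file(follow_symlinks=True):
                yield entry.path
```

Calling `list(walk_files("./my-project", max_depth=2))` would return the files in the root directory and in directories up to two levels below it.
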
### How It Works

#### Three-Tier License Detection System

The tool combines three detection tiers to improve accuracy:

1. **Tier 1: Dice-Sørensen Similarity with TLSH Confirmation**
   - Compares license text using Dice-Sørensen coefficient (97% threshold)
   - Confirms matches using TLSH fuzzy hashing to prevent false positives
   - Achieves 97-100% accuracy on standard SPDX licenses

2. **Tier 2: TLSH Fuzzy Hash Matching**
   - Uses Trend Micro Locality Sensitive Hashing for variant detection
   - Catches license variants, such as MIT-0, and closely related texts, such as BSD-2-Clause vs BSD-3-Clause
   - Pre-computed hashes for all 700+ SPDX licenses

3. **Tier 3: Pattern Recognition**
   - Regex-based detection for license references and identifiers
   - Extracts from comments, headers, and documentation

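To make Tier 1 concrete, here is a minimal sketch of the Dice-Sørensen coefficient, 2·|A ∩ B| / (|A| + |B|), computed over word bigrams. The tokenization, the choice of bigrams, and the function names are assumptions for illustration, and the library's own normalization and implementation may differ. Only the 0.97 threshold comes from the text above.

```python
import re
from collections import Counter


def word_bigrams(text: str) -> Counter:
    """Lowercase the text, split it into words, and count adjacent word pairs."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(zip(words, words[1:]))


def dice_similarity(a: str, b: str) -> float:
    """Dice-Sørensen coefficient over word-bigram multisets: 2|A∩B| / (|A|+|B|)."""
    grams_a, grams_b = word_bigrams(a), word_bigrams(b)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        return 0.0  # hypothetical convention: nothing to compare
    overlap = sum((grams_a & grams_b).values())  # multiset intersection
    return 2 * overlap / total


THRESHOLD = 0.97  # the similarity threshold quoted above

reference = "Permission is hereby granted, free of charge, to any person obtaining a copy"
candidate = "permission is hereby granted free of charge to any person obtaining a copy"
unrelated = "Redistribution and use in source and binary forms are permitted"

print(dice_similarity(reference, candidate) >= THRESHOLD)  # True
print(dice_similarity(reference, unrelated) >= THRESHOLD)  # False
```

Case and punctuation differences disappear in tokenization, so the first pair scores 1.0, while the second pair shares no bigrams and scores 0.0.
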
#### Additional Detection Methods

- **Package Metadata Scanning**: Detects licenses from package.json, composer.json, pyproject.toml, etc.
- **Copyright Extraction**: Advanced pattern matching with validation and deduplication
- **SPDX Identifier Detection**: Finds SPDX-License-Identifier tags in source files

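The copyright extraction step, with its validation and deduplication, might look roughly like the sketch below. The regular expression and the `extract_copyrights` helper are hypothetical and far simpler than what a production scanner needs. They only show how holders and years could be pulled from a statement and how repeated statements could be collapsed.

```python
import re
from typing import List, Tuple

COPYRIGHT = re.compile(
    r"copyright\s+(?:\(c\)\s*|©\s*)?"   # "Copyright", optionally followed by (c) or ©
    r"(\d{4})(?:\s*-\s*(\d{4}))?\s+"     # start year and optional end year
    r"([^\n\r]+)",                       # holder: the rest of the line
    re.IGNORECASE,
)


def extract_copyrights(text: str) -> List[Tuple[str, List[int]]]:
    """Return unique (holder, years) pairs in order of first appearance."""
    seen = set()
    results = []
    for match in COPYRIGHT.finditer(text):
        start, end, holder = match.groups()
        holder = holder.strip().rstrip(".")
        years = [int(start)] if end is None else [int(start), int(end)]
        key = (holder.lower(), tuple(years))
        if holder and key not in seen:  # basic validation and deduplication
            seen.add(key)
            results.append((holder, years))
    return results


text = """\
# Copyright 2023-2024 Example Corp
# Copyright (c) 2023-2024 Example Corp.
# copyright = get_copyright()
"""
print(extract_copyrights(text))  # [('Example Corp', [2023, 2024])]
```

The third line is a code pattern rather than a statement. It is rejected because no year follows the word "copyright".
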
## Installation

```bash
pip install semantic-copycat-oslili
```

### Required Dependencies

Installing the package pulls in all required dependencies, including `python-tlsh` for fuzzy hash matching, which is essential for accurate license detection and false-positive prevention.

## Usage

### CLI Usage

```bash
# Scan a directory and see evidence (default format)
oslili /path/to/project

# Generate different output formats
oslili ./my-project -f kissbom -o kissbom.json
oslili ./my-project -f cyclonedx-json -o sbom.json
oslili ./my-project -f cyclonedx-xml -o sbom.xml

# Scan with parallel processing (4 threads)
oslili ./my-project --threads 4

# Scan with limited depth (only 2 levels deep)
oslili ./my-project --max-depth 2

# Extract and scan archives
oslili package.tar.gz --max-extraction-depth 2

# Use caching for faster repeated scans
oslili ./my-project --cache-dir ~/.cache/oslili

# Check version
oslili --version

# Save results to file
oslili ./my-project -o license-evidence.json

# With custom configuration and verbose output
oslili ./src --config config.yaml --verbose

# Debug mode for detailed logging
oslili ./project --debug
```
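
When driving the CLI from a script, one option is to assemble the argument list in Python and pass it to `subprocess.run`. The `build_oslili_command` helper below is a hypothetical convenience and is not part of the package. It uses only flags shown above (`-f`, `-o`, `--threads`, `--max-depth`).

```python
import subprocess
from typing import List, Optional


def build_oslili_command(
    path: str,
    output_format: Optional[str] = None,
    output_file: Optional[str] = None,
    threads: Optional[int] = None,
    max_depth: Optional[int] = None,
) -> List[str]:
    """Assemble an `oslili` argument list from the flags shown above."""
    command = ["oslili", path]
    if output_format is not None:
        command += ["-f", output_format]
    if output_file is not None:
        command += ["-o", output_file]
    if threads is not None:
        command += ["--threads", str(threads)]
    if max_depth is not None:
        command += ["--max-depth", str(max_depth)]
    return command


command = build_oslili_command("./my-project", "cyclonedx-json", "sbom.json", threads=4)
print(command)
# ['oslili', './my-project', '-f', 'cyclonedx-json', '-o', 'sbom.json', '--threads', '4']

# To actually run the scan (requires the package to be installed):
# subprocess.run(command, check=True)
```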

### Example Output

```json
{
  "scan_results": [{
    "path": "./project",
    "license_evidence": [
      {
        "file": "/path/to/project/LICENSE",
        "detected_license": "Apache-2.0",
        "confidence": 0.988,
        "detection_method": "dice-sorensen",
        "category": "declared",
        "match_type": "text_similarity",
        "description": "Text matches Apache-2.0 license (98.8% similarity)"
      },
      {
        "file": "/path/to/project/package.json",
        "detected_license": "Apache-2.0",
        "confidence": 1.0,
        "detection_method": "tag",
        "category": "declared",
        "match_type": "spdx_identifier",
        "description": "SPDX-License-Identifier: Apache-2.0 found"
      }
    ],
    "copyright_evidence": [
      {
        "file": "/path/to/project/src/main.py",
        "holder": "Example Corp",
        "years": [2023, 2024],
        "statement": "Copyright 2023-2024 Example Corp"
      }
    ]
  }],
  "summary": {
    "total_files_scanned": 42,
    "declared_licenses": {"Apache-2.0": 2},
    "detected_licenses": {},
    "referenced_licenses": {},
    "copyright_holders": ["Example Corp"]
  }
}
```
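
Because the evidence is plain JSON, downstream checks can be scripted. The sketch below assumes the structure shown in the example above. The helper names and the 0.99 cutoff are illustrative choices and are not part of the tool.

```python
import json
from typing import Dict, List

# A trimmed copy of the example output above.
evidence_json = """
{
  "scan_results": [{
    "path": "./project",
    "license_evidence": [
      {"file": "/path/to/project/LICENSE", "detected_license": "Apache-2.0",
       "confidence": 0.988, "detection_method": "dice-sorensen", "category": "declared"},
      {"file": "/path/to/project/package.json", "detected_license": "Apache-2.0",
       "confidence": 1.0, "detection_method": "tag", "category": "declared"}
    ]
  }]
}
"""


def licenses_by_file(evidence: dict) -> Dict[str, List[str]]:
    """Map each file path to the SPDX identifiers detected in it."""
    result: Dict[str, List[str]] = {}
    for scan in evidence.get("scan_results", []):
        for item in scan.get("license_evidence", []):
            result.setdefault(item["file"], []).append(item["detected_license"])
    return result


def low_confidence(evidence: dict, cutoff: float) -> List[dict]:
    """Return evidence entries whose confidence is below the cutoff."""
    return [
        item
        for scan in evidence.get("scan_results", [])
        for item in scan.get("license_evidence", [])
        if item["confidence"] < cutoff
    ]


evidence = json.loads(evidence_json)
print(licenses_by_file(evidence))
for item in low_confidence(evidence, cutoff=0.99):
    print(item["file"], item["detected_license"], item["confidence"])
# /path/to/project/LICENSE Apache-2.0 0.988
```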


### Library Usage

```python
from semantic_copycat_oslili import LicenseCopyrightDetector

# Initialize detector
detector = LicenseCopyrightDetector()

# Process a local directory
result = detector.process_local_path("/path/to/source")

# Process a single file (this overwrites the directory result above)
result = detector.process_local_path("/path/to/LICENSE")

# Generate different output formats
evidence = detector.generate_evidence([result])
kissbom = detector.generate_kissbom([result])
cyclonedx = detector.generate_cyclonedx([result], format_type="json")
cyclonedx_xml = detector.generate_cyclonedx([result], format_type="xml")

# Access results directly
for lic in result.licenses:  # "lic" avoids shadowing the built-in name "license"
    print(f"License: {lic.spdx_id} ({lic.confidence:.0%} confidence)")
    print(f"  Category: {lic.category}")  # declared, detected, or referenced
for holder_info in result.copyrights:
    print(f"Copyright: © {holder_info.holder}")
```


## Output Format

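Each license evidence entry in the JSON output contains:
- **File path**: Where the license was found
- **Detected license**: The SPDX identifier of the license
- **Confidence**: How confident the detection is (0.0 to 1.0)
- **Detection method**: Which technique produced the match (for example, `dice-sorensen` or `tag`)
- **Category**: Whether the license is `declared`, `detected`, or `referenced`
- **Match type**: How the license was detected (`license_text`, `spdx_identifier`, `license_reference`, `text_similarity`)
- **Description**: Human-readable description of what was found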

## Configuration

Create a `config.yaml` file and pass it with `--config`:

```yaml
similarity_threshold: 0.97
max_recursion_depth: 10
max_extraction_depth: 10
thread_count: 4
cache_dir: "~/.cache/oslili"
custom_aliases:
  "Apache 2": "Apache-2.0"
  "MIT License": "MIT"
```
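
One way to picture how `custom_aliases` might be applied: the sketch below looks up a raw license string in a case- and whitespace-insensitive alias table and falls back to the input unchanged. The `build_alias_table` and `normalize_license` helpers and their matching rules are assumptions for illustration, and the tool's own normalization may work differently. The settings are given as a Python dict that mirrors the YAML above, such as the result of `yaml.safe_load`.

```python
from typing import Dict

# Mirrors part of the example config above, e.g. as returned by yaml.safe_load().
config = {
    "similarity_threshold": 0.97,
    "custom_aliases": {
        "Apache 2": "Apache-2.0",
        "MIT License": "MIT",
    },
}


def build_alias_table(aliases: Dict[str, str]) -> Dict[str, str]:
    """Index aliases by a case- and whitespace-insensitive key."""
    return {" ".join(name.lower().split()): spdx_id for name, spdx_id in aliases.items()}


def normalize_license(raw: str, table: Dict[str, str]) -> str:
    """Return the SPDX identifier for a known alias, else the input stripped."""
    key = " ".join(raw.lower().split())
    return table.get(key, raw.strip())


table = build_alias_table(config["custom_aliases"])
print(normalize_license("apache  2", table))     # Apache-2.0
print(normalize_license("MIT License", table))   # MIT
print(normalize_license("BSD-3-Clause", table))  # BSD-3-Clause
```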

## Documentation

- [Full Usage Guide](docs/USAGE.md) - Comprehensive usage examples and configuration
- [API Reference](docs/API.md) - Python API documentation and examples
- [Changelog](CHANGELOG.md) - Version history and changes

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/oscarvalenzuelab/semantic-copycat-oslili",
    "name": "semantic-copycat-oslili",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "\"Oscar Valenzuela B.\" <oscar.valenzuela.b@gmail.com>",
    "keywords": "license, attribution, legal, sbom, spdx, copyright, compliance, open-source, package-analysis",
    "author": "Oscar Valenzuela B.",
    "author_email": "\"Oscar Valenzuela B.\" <oscar.valenzuela.b@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/93/6b/fe557b23884a6439ccfaadc7fbdf34630af5acb615d100c5109a30dafd20/semantic_copycat_oslili-1.5.5.tar.gz",
    "platform": null,
    "description": "# semantic-copycat-oslili\n\nA high-performance tool for identifying licenses and copyright information in local source code, producing detailed evidence of where licenses are detected with support for all 700+ SPDX license identifiers.\n\n## What It Does\n\n`semantic-copycat-oslili` analyzes local source code to produce evidence of:\n- **License detection** - Shows which files contain which licenses with confidence scores\n- **SPDX identifiers** - Detects SPDX-License-Identifier tags in ALL readable files\n- **Package metadata** - Extracts licenses from package.json, pyproject.toml, METADATA files\n- **Copyright statements** - Extracts copyright holders and years with intelligent filtering\n\nThe tool outputs standardized JSON evidence showing exactly where each license was detected, the detection method used, and confidence scores.\n\n### Key Features\n\n- **Evidence-based output**: Shows exact file paths, confidence scores, and detection methods\n- **License hierarchy**: Categorizes licenses as declared vs detected vs referenced\n- **Multiple output formats**: Evidence JSON, KissBOM, CycloneDX SBOM\n- **Archive extraction**: Automatically extracts and scans zip, tar, and other archive formats\n- **Caching support**: Speed up repeated scans with intelligent caching\n- **Parallel processing**: Multi-threaded scanning with configurable thread count\n- **Three-tier detection**: \n  - Dice-S\u00f8rensen similarity matching (97% threshold)\n  - TLSH fuzzy hashing with confirmation\n  - Regex pattern matching\n- **Safe directory traversal**: Depth limiting and symlink loop protection\n- **Smart normalization**: Handles license variations and common aliases\n- **No file limits**: Processes files of any size with intelligent sampling\n- **Enhanced metadata support**: Detects licenses in package.json, METADATA, pyproject.toml\n- **False positive filtering**: Advanced filtering for code patterns and invalid matches\n\n### How It Works\n\n#### Three-Tier 
License Detection System\n\nThe tool uses a sophisticated multi-tier approach for maximum accuracy:\n\n1. **Tier 1: Dice-S\u00f8rensen Similarity with TLSH Confirmation**\n   - Compares license text using Dice-S\u00f8rensen coefficient (97% threshold)\n   - Confirms matches using TLSH fuzzy hashing to prevent false positives\n   - Achieves 97-100% accuracy on standard SPDX licenses\n\n2. **Tier 2: TLSH Fuzzy Hash Matching**\n   - Uses Trend Micro Locality Sensitive Hashing for variant detection\n   - Catches license variants like MIT-0, BSD-2-Clause vs BSD-3-Clause\n   - Pre-computed hashes for all 700+ SPDX licenses\n\n3. **Tier 3: Pattern Recognition**\n   - Regex-based detection for license references and identifiers\n   - Extracts from comments, headers, and documentation\n\n#### Additional Detection Methods\n\n- **Package Metadata Scanning**: Detects licenses from package.json, composer.json, pyproject.toml, etc.\n- **Copyright Extraction**: Advanced pattern matching with validation and deduplication\n- **SPDX Identifier Detection**: Finds SPDX-License-Identifier tags in source files\n\n## Installation\n\n```bash\npip install semantic-copycat-oslili\n```\n\n### Required Dependencies\n\nThe package includes all necessary dependencies including `python-tlsh` for fuzzy hash matching, which is essential for accurate license detection and false positive prevention.\n\n## Usage\n\n### CLI Usage\n\n```bash\n# Scan a directory and see evidence (default format)\noslili /path/to/project\n\n# Generate different output formats\noslili ./my-project -f kissbom -o kissbom.json\noslili ./my-project -f cyclonedx-json -o sbom.json\noslili ./my-project -f cyclonedx-xml -o sbom.xml\n\n# Scan with parallel processing (4 threads)\noslili ./my-project --threads 4\n\n# Scan with limited depth (only 2 levels deep)\noslili ./my-project --max-depth 2\n\n# Extract and scan archives\noslili package.tar.gz --max-extraction-depth 2\n\n# Use caching for faster repeated scans\noslili 
./my-project --cache-dir ~/.cache/oslili\n\n# Check version\noslili --version\n\n# Save results to file\noslili ./my-project -o license-evidence.json\n\n# With custom configuration and verbose output\noslili ./src --config config.yaml --verbose\n\n# Debug mode for detailed logging\noslili ./project --debug\n```\n\n### Example Output\n\n```json\n{\n  \"scan_results\": [{\n    \"path\": \"./project\",\n    \"license_evidence\": [\n      {\n        \"file\": \"/path/to/project/LICENSE\",\n        \"detected_license\": \"Apache-2.0\",\n        \"confidence\": 0.988,\n        \"detection_method\": \"dice-sorensen\",\n        \"category\": \"declared\",\n        \"match_type\": \"text_similarity\",\n        \"description\": \"Text matches Apache-2.0 license (98.8% similarity)\"\n      },\n      {\n        \"file\": \"/path/to/project/package.json\",\n        \"detected_license\": \"Apache-2.0\",\n        \"confidence\": 1.0,\n        \"detection_method\": \"tag\",\n        \"category\": \"declared\",\n        \"match_type\": \"spdx_identifier\",\n        \"description\": \"SPDX-License-Identifier: Apache-2.0 found\"\n      }\n    ],\n    \"copyright_evidence\": [\n      {\n        \"file\": \"/path/to/project/src/main.py\",\n        \"holder\": \"Example Corp\",\n        \"years\": [2023, 2024],\n        \"statement\": \"Copyright 2023-2024 Example Corp\"\n      }\n    ]\n  }],\n  \"summary\": {\n    \"total_files_scanned\": 42,\n    \"declared_licenses\": {\"Apache-2.0\": 2},\n    \"detected_licenses\": {},\n    \"referenced_licenses\": {},\n    \"copyright_holders\": [\"Example Corp\"]\n  }\n}\n```\n\n\n### Library Usage\n\n```python\nfrom semantic_copycat_oslili import LicenseCopyrightDetector\n\n# Initialize detector\ndetector = LicenseCopyrightDetector()\n\n# Process a local directory\nresult = detector.process_local_path(\"/path/to/source\")\n\n# Process a single file  \nresult = detector.process_local_path(\"/path/to/LICENSE\")\n\n# Generate different output 
formats\nevidence = detector.generate_evidence([result])\nkissbom = detector.generate_kissbom([result])\ncyclonedx = detector.generate_cyclonedx([result], format_type=\"json\")\ncyclonedx_xml = detector.generate_cyclonedx([result], format_type=\"xml\")\n\n# Access results directly\nfor license in result.licenses:\n    print(f\"License: {license.spdx_id} ({license.confidence:.0%} confidence)\")\n    print(f\"  Category: {license.category}\")  # declared, detected, or referenced\nfor copyright in result.copyrights:\n    print(f\"Copyright: \u00a9 {copyright.holder}\")\n```\n\n\n## Output Format\n\nThe tool outputs JSON evidence showing:\n- **File path**: Where the license was found\n- **Detected license**: The SPDX identifier of the license\n- **Confidence**: How confident the detection is (0.0 to 1.0)\n- **Match type**: How the license was detected (license_text, spdx_identifier, license_reference, text_similarity)\n- **Description**: Human-readable description of what was found\n\n## Configuration\n\nCreate a `config.yaml` file:\n\n```yaml\nsimilarity_threshold: 0.97\nmax_recursion_depth: 10\nmax_extraction_depth: 10\nthread_count: 4\ncache_dir: \"~/.cache/oslili\"\ncustom_aliases:\n  \"Apache 2\": \"Apache-2.0\"\n  \"MIT License\": \"MIT\"\n```\n\n## Documentation\n\n- [Full Usage Guide](docs/USAGE.md) - Comprehensive usage examples and configuration\n- [API Reference](docs/API.md) - Python API documentation and examples\n- [Changelog](CHANGELOG.md) - Version history and changes\n",
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "Semantic Copycat Open Source License Identification Library",
    "version": "1.5.5",
    "project_urls": {
        "Changelog": "https://github.com/oscarvalenzuelab/semantic-copycat-oslili/blob/main/CHANGELOG.md",
        "Documentation": "https://github.com/oscarvalenzuelab/semantic-copycat-oslili#readme",
        "Homepage": "https://github.com/oscarvalenzuelab/semantic-copycat-oslili",
        "Issues": "https://github.com/oscarvalenzuelab/semantic-copycat-oslili/issues",
        "Repository": "https://github.com/oscarvalenzuelab/semantic-copycat-oslili.git"
    },
    "split_keywords": [
        "license",
        " attribution",
        " legal",
        " sbom",
        " spdx",
        " copyright",
        " compliance",
        " open-source",
        " package-analysis"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "316b3d8fe0e9988b3290a7704b9ecbe398d91dcc273b4b6c014ad830f0a34ea3",
                "md5": "1965b753d9e03a7ac24240723046a5a3",
                "sha256": "df85394247d62269f2f8e3aa961215e5d1963797c9b1f93a2dce6107b1c93fdc"
            },
            "downloads": -1,
            "filename": "semantic_copycat_oslili-1.5.5-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "1965b753d9e03a7ac24240723046a5a3",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 386860,
            "upload_time": "2025-10-25T00:55:53",
            "upload_time_iso_8601": "2025-10-25T00:55:53.764281Z",
            "url": "https://files.pythonhosted.org/packages/31/6b/3d8fe0e9988b3290a7704b9ecbe398d91dcc273b4b6c014ad830f0a34ea3/semantic_copycat_oslili-1.5.5-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "936bfe557b23884a6439ccfaadc7fbdf34630af5acb615d100c5109a30dafd20",
                "md5": "a35ce75123b6df171a85306ea014a1cf",
                "sha256": "44589312a8572978933bdd66875efb63fbe617a73da8ca32424e16c01c611677"
            },
            "downloads": -1,
            "filename": "semantic_copycat_oslili-1.5.5.tar.gz",
            "has_sig": false,
            "md5_digest": "a35ce75123b6df171a85306ea014a1cf",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 379213,
            "upload_time": "2025-10-25T00:55:55",
            "upload_time_iso_8601": "2025-10-25T00:55:55.158428Z",
            "url": "https://files.pythonhosted.org/packages/93/6b/fe557b23884a6439ccfaadc7fbdf34630af5acb615d100c5109a30dafd20/semantic_copycat_oslili-1.5.5.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-25 00:55:55",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "oscarvalenzuelab",
    "github_project": "semantic-copycat-oslili",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "requests",
            "specs": [
                [
                    ">=",
                    "2.28.0"
                ]
            ]
        },
        {
            "name": "pyyaml",
            "specs": [
                [
                    ">=",
                    "6.0"
                ]
            ]
        },
        {
            "name": "click",
            "specs": [
                [
                    ">=",
                    "8.0.0"
                ]
            ]
        },
        {
            "name": "colorama",
            "specs": [
                [
                    ">=",
                    "0.4.0"
                ]
            ]
        },
        {
            "name": "fuzzywuzzy",
            "specs": [
                [
                    ">=",
                    "0.18.0"
                ]
            ]
        },
        {
            "name": "python-Levenshtein",
            "specs": [
                [
                    ">=",
                    "0.20.0"
                ]
            ]
        },
        {
            "name": "python-tlsh",
            "specs": [
                [
                    ">=",
                    "4.5.0"
                ]
            ]
        }
    ],
    "lcname": "semantic-copycat-oslili"
}
        