# semantic-copycat-oslili
A high-performance tool for identifying licenses and copyright information in local source code, producing detailed evidence of where licenses are detected with support for all 700+ SPDX license identifiers.
## What It Does
`semantic-copycat-oslili` analyzes local source code to produce evidence of:
- **License detection** - Shows which files contain which licenses with confidence scores
- **SPDX identifiers** - Detects SPDX-License-Identifier tags in ALL readable files
- **Package metadata** - Extracts licenses from package.json, pyproject.toml, METADATA files
- **Copyright statements** - Extracts copyright holders and years with intelligent filtering
The tool outputs standardized JSON evidence showing exactly where each license was detected, the detection method used, and confidence scores.
## Key Features
- **Evidence-based output**: Shows exact file paths, confidence scores, and detection methods
- **Parallel processing**: Multi-threaded scanning with configurable thread count
- **Three-tier detection**:
  - Dice-Sørensen similarity matching (97% threshold)
  - TLSH fuzzy hashing (optional)
  - Regex pattern matching
- **Smart normalization**: Handles license variations and common aliases
- **No file limits**: Processes files of any size with intelligent sampling
- **Enhanced metadata support**: Detects licenses in package.json, METADATA, pyproject.toml
- **False positive filtering**: Advanced filtering for code patterns and invalid matches
## Installation
```bash
pip install semantic-copycat-oslili
```
### Required Dependencies
Installing the package pulls in all required dependencies, including `python-tlsh` for fuzzy hash matching, which the tool relies on for accurate license detection and false-positive prevention.
## Usage
### CLI Usage
```bash
# Scan a directory and see evidence
oslili /path/to/project
# Scan with parallel processing (4 threads)
oslili ./my-project --threads 4
# Scan a specific file
oslili /path/to/LICENSE
# Save results to file
oslili ./my-project -o license-evidence.json
# With custom configuration and verbose output
oslili ./src --config config.yaml --verbose
# Debug mode for detailed logging
oslili ./project --debug
```
### Example Output
```json
{
  "scan_results": [{
    "path": "./project",
    "license_evidence": [
      {
        "file": "/path/to/project/LICENSE",
        "detected_license": "Apache-2.0",
        "confidence": 0.988,
        "detection_method": "dice-sorensen",
        "match_type": "text_similarity",
        "description": "Text matches Apache-2.0 license (98.8% similarity)"
      },
      {
        "file": "/path/to/project/package.json",
        "detected_license": "Apache-2.0",
        "confidence": 1.0,
        "detection_method": "tag",
        "match_type": "spdx_identifier",
        "description": "SPDX-License-Identifier: Apache-2.0 found"
      }
    ],
    "copyright_evidence": [
      {
        "file": "/path/to/project/src/main.py",
        "holder": "Example Corp",
        "years": [2023, 2024],
        "statement": "Copyright 2023-2024 Example Corp"
      }
    ]
  }],
  "summary": {
    "total_files_scanned": 42,
    "licenses_found": {
      "Apache-2.0": 2
    },
    "copyrights_found": 1
  }
}
```
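The evidence report is plain JSON, so it can be post-processed with the standard library. The sketch below assumes the report has the shape shown in the example above (`scan_results`, `license_evidence`, `detected_license`, `file`). `files_by_license` is a hypothetical helper written for illustration, not part of the package.

```python
import json
from collections import defaultdict


def files_by_license(report: dict) -> dict:
    """Group evidence file paths by detected license identifier."""
    grouped = defaultdict(list)
    for scan in report.get("scan_results", []):
        for item in scan.get("license_evidence", []):
            grouped[item["detected_license"]].append(item["file"])
    return dict(grouped)


# Inline sample mirroring the example output. In practice the report would
# be read from the file written with `-o`.
sample = json.loads("""
{"scan_results": [{"path": "./project", "license_evidence": [
  {"file": "/path/to/project/LICENSE", "detected_license": "Apache-2.0"},
  {"file": "/path/to/project/package.json", "detected_license": "Apache-2.0"}
]}]}
""")

for spdx_id, paths in files_by_license(sample).items():
    print(spdx_id, len(paths))
```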
## How It Works
### Three-Tier License Detection System
The tool uses a sophisticated multi-tier approach for maximum accuracy:
1. **Tier 1: Dice-Sørensen Similarity with TLSH Confirmation**
   - Compares license text using Dice-Sørensen coefficient (97% threshold)
   - Confirms matches using TLSH fuzzy hashing to prevent false positives
   - Achieves 97-100% accuracy on standard SPDX licenses
2. **Tier 2: TLSH Fuzzy Hash Matching**
   - Uses Trend Micro Locality Sensitive Hashing for variant detection
   - Catches license variants like MIT-0, BSD-2-Clause vs BSD-3-Clause
   - Pre-computed hashes for all 700+ SPDX licenses
3. **Tier 3: Pattern Recognition**
   - Regex-based detection for license references and identifiers
   - Extracts from comments, headers, and documentation
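To make the Tier 1 comparison concrete, here is a minimal sketch of a Dice-Sørensen coefficient over word bigrams. The tokenization (lowercased alphanumeric words, adjacent pairs) is an assumption made for illustration; this README does not specify how the tool normalizes or tokenizes text, and the function names are hypothetical.

```python
import re
from collections import Counter


def bigrams(text: str) -> Counter:
    """Return a multiset of adjacent word pairs from lowercased text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(zip(words, words[1:]))


def dice_similarity(a: str, b: str) -> float:
    """Dice-Sørensen coefficient: 2 * |A ∩ B| / (|A| + |B|)."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        return 0.0  # neither text has a bigram to compare
    overlap = sum((grams_a & grams_b).values())  # multiset intersection
    return 2 * overlap / total


THRESHOLD = 0.97  # the threshold quoted for Tier 1

reference = "Permission is hereby granted, free of charge, to any person"
candidate = "Permission is hereby granted free of charge to any person"
score = dice_similarity(reference, candidate)
print(f"{score:.3f}", score >= THRESHOLD)
```

Because punctuation is discarded before comparison, the two snippets above differ only in commas and score as identical. A text that shares only half of its bigrams with the reference would score 0.5 and fall well below the threshold.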
### Additional Detection Methods
- **Package Metadata Scanning**: Detects licenses from package.json, composer.json, pyproject.toml, etc.
- **Copyright Extraction**: Advanced pattern matching with validation and deduplication
- **SPDX Identifier Detection**: Finds SPDX-License-Identifier tags in source files
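As an illustration of SPDX identifier detection, the following sketch extracts `SPDX-License-Identifier` tags with a regular expression. The pattern is a simplification chosen for this example (it handles plain identifiers joined by `AND`, `OR`, or `WITH`, but not parenthesized expressions) and is not taken from the tool's source. `find_spdx_tags` is a hypothetical name.

```python
import re

# One license identifier, e.g. "Apache-2.0" or "GPL-2.0-only".
_ID = r"[A-Za-z0-9.+-]+"

# The tag, followed by one identifier and optional AND/OR/WITH operands,
# all on the same line.
SPDX_TAG = re.compile(
    rf"SPDX-License-Identifier:[ \t]*({_ID}(?:[ \t]+(?:AND|OR|WITH)[ \t]+{_ID})*)"
)


def find_spdx_tags(text: str) -> list:
    """Return every SPDX expression found after a tag in the given text."""
    return [match.group(1) for match in SPDX_TAG.finditer(text)]


source = (
    "# SPDX-License-Identifier: MIT\n"
    "/* SPDX-License-Identifier: GPL-2.0-only OR MIT */\n"
)
print(find_spdx_tags(source))
```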
### Library Usage
```python
from semantic_copycat_oslili import LegalAttributionGenerator
# Initialize generator
generator = LegalAttributionGenerator()
# Process a local directory
result = generator.process_local_path("/path/to/source")
# Process a single file
result = generator.process_local_path("/path/to/LICENSE")
# Generate evidence output
evidence = generator.generate_evidence([result])
print(evidence)
# Access results
for license in result.licenses:
    print(f"License: {license.spdx_id} ({license.confidence:.0%} confidence)")
for copyright in result.copyrights:
    print(f"Copyright: © {copyright.holder}")
```
## License Detection
The package uses a three-tier license detection system:
1. **Tier 1**: Dice-Sørensen similarity (97% threshold)
2. **Tier 2**: TLSH fuzzy hashing (97% threshold)
3. **Tier 3**: Regex pattern matching
## Output Format
The tool outputs JSON evidence showing:
- **File path**: Where the license was found
- **Detected license**: The SPDX identifier of the license
- **Confidence**: How confident the detection is (0.0 to 1.0)
- **Match type**: How the license was detected (license_text, spdx_identifier, license_reference, text_similarity)
- **Description**: Human-readable description of what was found
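One way to use the confidence field is to flag weaker detections for manual review. The sketch below assumes evidence entries are dictionaries with the fields listed above. The `needs_review` helper, the 0.97 cutoff, and the sample entries are all made up for illustration.

```python
def needs_review(evidence: list, min_confidence: float = 0.97) -> list:
    """Return evidence entries whose confidence falls below the cutoff."""
    return [
        entry for entry in evidence
        if entry.get("confidence", 0.0) < min_confidence
    ]


# Made-up sample entries using the documented field names.
evidence = [
    {"file": "LICENSE", "detected_license": "Apache-2.0",
     "confidence": 0.988, "match_type": "text_similarity"},
    {"file": "README.md", "detected_license": "MIT",
     "confidence": 0.80, "match_type": "license_reference"},
]

for entry in needs_review(evidence):
    print(entry["file"], entry["detected_license"], entry["confidence"])
```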
## Configuration
Create a `config.yaml` file:
```yaml
similarity_threshold: 0.97
max_extraction_depth: 10
thread_count: 4
custom_aliases:
  "Apache 2": "Apache-2.0"
  "MIT License": "MIT"
```
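The `custom_aliases` table maps free-form license names to SPDX identifiers. How the tool applies it internally is not described here, so the following is only a sketch of the idea, assuming a case-insensitive, whitespace-tolerant lookup. `normalize_license_name` is a hypothetical helper.

```python
def normalize_license_name(name: str, aliases: dict) -> str:
    """Map a free-form license name to an SPDX identifier via an alias table.

    Names without a matching alias are returned with whitespace collapsed.
    """
    lookup = {key.casefold(): value for key, value in aliases.items()}
    cleaned = " ".join(name.split())  # collapse runs of whitespace
    return lookup.get(cleaned.casefold(), cleaned)


# Same entries as the custom_aliases section of the config above.
custom_aliases = {"Apache 2": "Apache-2.0", "MIT License": "MIT"}

print(normalize_license_name("  mit   license ", custom_aliases))
print(normalize_license_name("BSD-3-Clause", custom_aliases))
```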
## Documentation
- [Full Usage Guide](docs/USAGE.md) - Comprehensive usage examples and configuration
- [API Reference](docs/API.md) - Python API documentation and examples
- [Changelog](CHANGELOG.md) - Version history and changes
## Raw data
{
"_id": null,
"home_page": "https://github.com/oscarvalenzuelab/semantic-copycat-oslili",
"name": "semantic-copycat-oslili",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "\"Oscar Valenzuela B.\" <oscar.valenzuela.b@gmail.com>",
"keywords": "license, attribution, legal, sbom, spdx, copyright, compliance, open-source, package-analysis",
"author": "Oscar Valenzuela B.",
"author_email": "\"Oscar Valenzuela B.\" <oscar.valenzuela.b@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/56/8b/d7e26b450677b228e0a4d40b7b9967ad1a8e3bf552c15df1b92a41f35ed8/semantic_copycat_oslili-1.2.6.tar.gz",
"platform": null,
"description": "# semantic-copycat-oslili\n\nA high-performance tool for identifying licenses and copyright information in local source code, producing detailed evidence of where licenses are detected with support for all 700+ SPDX license identifiers.\n\n## What It Does\n\n`semantic-copycat-oslili` analyzes local source code to produce evidence of:\n- **License detection** - Shows which files contain which licenses with confidence scores\n- **SPDX identifiers** - Detects SPDX-License-Identifier tags in ALL readable files\n- **Package metadata** - Extracts licenses from package.json, pyproject.toml, METADATA files\n- **Copyright statements** - Extracts copyright holders and years with intelligent filtering\n\nThe tool outputs standardized JSON evidence showing exactly where each license was detected, the detection method used, and confidence scores.\n\n## Key Features\n\n- **Evidence-based output**: Shows exact file paths, confidence scores, and detection methods\n- **Parallel processing**: Multi-threaded scanning with configurable thread count\n- **Three-tier detection**: \n - Dice-S\u00f8rensen similarity matching (97% threshold)\n - TLSH fuzzy hashing (optional)\n - Regex pattern matching\n- **Smart normalization**: Handles license variations and common aliases\n- **No file limits**: Processes files of any size with intelligent sampling\n- **Enhanced metadata support**: Detects licenses in package.json, METADATA, pyproject.toml\n- **False positive filtering**: Advanced filtering for code patterns and invalid matches\n\n## Installation\n\n```bash\npip install semantic-copycat-oslili\n```\n\n### Required Dependencies\n\nThe package includes all necessary dependencies including `python-tlsh` for fuzzy hash matching, which is essential for accurate license detection and false positive prevention.\n\n## Usage\n\n### CLI Usage\n\n```bash\n# Scan a directory and see evidence\noslili /path/to/project\n\n# Scan with parallel processing (4 threads)\noslili ./my-project 
--threads 4\n\n# Scan a specific file\noslili /path/to/LICENSE\n\n# Save results to file\noslili ./my-project -o license-evidence.json\n\n# With custom configuration and verbose output\noslili ./src --config config.yaml --verbose\n\n# Debug mode for detailed logging\noslili ./project --debug\n```\n\n### Example Output\n\n```json\n{\n \"scan_results\": [{\n \"path\": \"./project\",\n \"license_evidence\": [\n {\n \"file\": \"/path/to/project/LICENSE\",\n \"detected_license\": \"Apache-2.0\",\n \"confidence\": 0.988,\n \"detection_method\": \"dice-sorensen\",\n \"match_type\": \"text_similarity\",\n \"description\": \"Text matches Apache-2.0 license (98.8% similarity)\"\n },\n {\n \"file\": \"/path/to/project/package.json\",\n \"detected_license\": \"Apache-2.0\",\n \"confidence\": 1.0,\n \"detection_method\": \"tag\",\n \"match_type\": \"spdx_identifier\",\n \"description\": \"SPDX-License-Identifier: Apache-2.0 found\"\n }\n ],\n \"copyright_evidence\": [\n {\n \"file\": \"/path/to/project/src/main.py\",\n \"holder\": \"Example Corp\",\n \"years\": [2023, 2024],\n \"statement\": \"Copyright 2023-2024 Example Corp\"\n }\n ]\n }],\n \"summary\": {\n \"total_files_scanned\": 42,\n \"licenses_found\": {\n \"Apache-2.0\": 2\n },\n \"copyrights_found\": 1\n }\n}\n```\n\n## How It Works\n\n### Three-Tier License Detection System\n\nThe tool uses a sophisticated multi-tier approach for maximum accuracy:\n\n1. **Tier 1: Dice-S\u00f8rensen Similarity with TLSH Confirmation**\n - Compares license text using Dice-S\u00f8rensen coefficient (97% threshold)\n - Confirms matches using TLSH fuzzy hashing to prevent false positives\n - Achieves 97-100% accuracy on standard SPDX licenses\n\n2. **Tier 2: TLSH Fuzzy Hash Matching**\n - Uses Trend Micro Locality Sensitive Hashing for variant detection\n - Catches license variants like MIT-0, BSD-2-Clause vs BSD-3-Clause\n - Pre-computed hashes for all 700+ SPDX licenses\n\n3. 
**Tier 3: Pattern Recognition**\n - Regex-based detection for license references and identifiers\n - Extracts from comments, headers, and documentation\n\n### Additional Detection Methods\n\n- **Package Metadata Scanning**: Detects licenses from package.json, composer.json, pyproject.toml, etc.\n- **Copyright Extraction**: Advanced pattern matching with validation and deduplication\n- **SPDX Identifier Detection**: Finds SPDX-License-Identifier tags in source files\n\n### Library Usage\n\n```python\nfrom semantic_copycat_oslili import LegalAttributionGenerator\n\n# Initialize generator\ngenerator = LegalAttributionGenerator()\n\n# Process a local directory\nresult = generator.process_local_path(\"/path/to/source\")\n\n# Process a single file \nresult = generator.process_local_path(\"/path/to/LICENSE\")\n\n# Generate evidence output\nevidence = generator.generate_evidence([result])\nprint(evidence)\n\n# Access results\nfor license in result.licenses:\n print(f\"License: {license.spdx_id} ({license.confidence:.0%} confidence)\")\nfor copyright in result.copyrights:\n print(f\"Copyright: \u00a9 {copyright.holder}\")\n```\n\n## License Detection\n\nThe package uses a three-tier license detection system:\n\n1. **Tier 1**: Dice-S\u00f8rensen similarity (97% threshold)\n2. **Tier 2**: TLSH fuzzy hashing (97% threshold)\n3. 
**Tier 3**: Machine learning or regex pattern matching\n\n## Output Format\n\nThe tool outputs JSON evidence showing:\n- **File path**: Where the license was found\n- **Detected license**: The SPDX identifier of the license\n- **Confidence**: How confident the detection is (0.0 to 1.0)\n- **Match type**: How the license was detected (license_text, spdx_identifier, license_reference, text_similarity)\n- **Description**: Human-readable description of what was found\n\n## Configuration\n\nCreate a `config.yaml` file:\n\n```yaml\nsimilarity_threshold: 0.97\nmax_extraction_depth: 10\nthread_count: 4\ncustom_aliases:\n \"Apache 2\": \"Apache-2.0\"\n \"MIT License\": \"MIT\"\n```\n\n## Documentation\n\n- [Full Usage Guide](docs/USAGE.md) - Comprehensive usage examples and configuration\n- [API Reference](docs/API.md) - Python API documentation and examples\n- [Changelog](CHANGELOG.md) - Version history and changes\n",
"bugtrack_url": null,
"license": "Apache-2.0",
"summary": "Semantic Copycat Open Source License Identification Library",
"version": "1.2.6",
"project_urls": {
"Changelog": "https://github.com/oscarvalenzuelab/semantic-copycat-oslili/blob/main/CHANGELOG.md",
"Documentation": "https://github.com/oscarvalenzuelab/semantic-copycat-oslili#readme",
"Homepage": "https://github.com/oscarvalenzuelab/semantic-copycat-oslili",
"Issues": "https://github.com/oscarvalenzuelab/semantic-copycat-oslili/issues",
"Repository": "https://github.com/oscarvalenzuelab/semantic-copycat-oslili.git"
},
"split_keywords": [
"license",
" attribution",
" legal",
" sbom",
" spdx",
" copyright",
" compliance",
" open-source",
" package-analysis"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "5fd3c242769ebb30f284c991f750ea4f5857b8fc1e7e4d1f4eb945c1f996633d",
"md5": "891c68830e4b00d2876cdc251cd51671",
"sha256": "e9974bfc1829b95962befaca730d088160b439dd3459b5cb774a54673fae15ca"
},
"downloads": -1,
"filename": "semantic_copycat_oslili-1.2.6-py3-none-any.whl",
"has_sig": false,
"md5_digest": "891c68830e4b00d2876cdc251cd51671",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 348531,
"upload_time": "2025-08-17T02:01:41",
"upload_time_iso_8601": "2025-08-17T02:01:41.828901Z",
"url": "https://files.pythonhosted.org/packages/5f/d3/c242769ebb30f284c991f750ea4f5857b8fc1e7e4d1f4eb945c1f996633d/semantic_copycat_oslili-1.2.6-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "568bd7e26b450677b228e0a4d40b7b9967ad1a8e3bf552c15df1b92a41f35ed8",
"md5": "187ed5f8008a63acfc3fbb8ebef4d699",
"sha256": "2f7cf7488441416901195703295bc15f5120ab7e94e13a7c3e413860081be032"
},
"downloads": -1,
"filename": "semantic_copycat_oslili-1.2.6.tar.gz",
"has_sig": false,
"md5_digest": "187ed5f8008a63acfc3fbb8ebef4d699",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 344923,
"upload_time": "2025-08-17T02:01:43",
"upload_time_iso_8601": "2025-08-17T02:01:43.635779Z",
"url": "https://files.pythonhosted.org/packages/56/8b/d7e26b450677b228e0a4d40b7b9967ad1a8e3bf552c15df1b92a41f35ed8/semantic_copycat_oslili-1.2.6.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-17 02:01:43",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "oscarvalenzuelab",
"github_project": "semantic-copycat-oslili",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "requests",
"specs": [
[
">=",
"2.28.0"
]
]
},
{
"name": "pyyaml",
"specs": [
[
">=",
"6.0"
]
]
},
{
"name": "click",
"specs": [
[
">=",
"8.0.0"
]
]
},
{
"name": "colorama",
"specs": [
[
">=",
"0.4.0"
]
]
},
{
"name": "fuzzywuzzy",
"specs": [
[
">=",
"0.18.0"
]
]
},
{
"name": "python-Levenshtein",
"specs": [
[
">=",
"0.20.0"
]
]
},
{
"name": "python-tlsh",
"specs": [
[
">=",
"4.5.0"
]
]
}
],
"lcname": "semantic-copycat-oslili"
}