# UPMEX - Universal Package Metadata Extractor
Extract metadata and license information from various package formats with a single tool.
## Features
### Core Capabilities
- **Universal Package Support**: Extract metadata from 13 package ecosystems
- **Multi-Format Detection**: Automatic package type identification
- **Standardized Output**: Consistent JSON structure across all formats
- **Native Extraction**: No dependency on external package managers
- **High Performance**: Process packages up to 500MB in under 10 seconds
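
As a rough illustration of what automatic package type identification can look like, the sketch below guesses an ecosystem from the file name alone. The `SUFFIX_TO_TYPE` table and the `guess_package_type` function are hypothetical names used for illustration, not part of the UPMEX API. The real detector may also inspect archive contents, which is needed for ambiguous formats such as `.tar.gz`, `.tgz`, and `.zip`.

```python
from pathlib import Path

# Hypothetical suffix table built from the formats listed in this README.
# Ambiguous suffixes (.tar.gz, .tgz, .zip) are left out on purpose because
# they cannot be resolved without looking inside the archive.
SUFFIX_TO_TYPE = {
    ".whl": "python",
    ".jar": "java",
    ".war": "java",
    ".ear": "java",
    ".gem": "ruby",
    ".crate": "rust",
    ".nupkg": "nuget",
    ".conda": "conda",
    ".podspec": "cocoapods",
}


def guess_package_type(path: str) -> str:
    """Return a best-effort ecosystem name based on the file name only."""
    name = Path(path).name.lower()
    # Exact file names come first, then multi-part suffixes.
    if name in ("build.gradle", "build.gradle.kts"):
        return "gradle"
    if name == "go.mod":
        return "go"
    if name.endswith(".tar.bz2"):
        return "conda"
    if name.endswith(".podspec.json"):
        return "cocoapods"
    return SUFFIX_TO_TYPE.get(Path(name).suffix, "unknown")
```

A name-only check like this returns `"unknown"` for `.tar.gz` archives, which is why content-based detection is still required for Python sdists, NPM tarballs, and CPAN distributions.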
### Supported Ecosystems
- **Python**: wheel (.whl), sdist (.tar.gz, .zip)
- **NPM/Node.js**: .tgz, .tar.gz packages
- **Java/Maven**: .jar, .war, .ear with POM support
- **Gradle**: build.gradle, build.gradle.kts files
- **CocoaPods**: .podspec, .podspec.json files
- **Conda**: .conda (zip), .tar.bz2 packages
- **Perl/CPAN**: .tar.gz, .zip with META.json/yml
- **Conan C/C++**: conanfile.py, conanfile.txt, .tgz packages
- **Ruby Gems**: .gem packages
- **Rust Crates**: .crate packages
- **Go Modules**: .zip archives, go.mod files
- **NuGet/.NET**: .nupkg packages
- **Linux**: (Planned) Debian .deb, RPM .rpm
### Enhanced License Detection Engine
- **Comprehensive SPDX Support**: 400+ official SPDX license texts with fuzzy matching
- **Multi-Layer Detection**:
  - SPDX-License-Identifier exact matching
  - Fuzzy hash (LSH) matching against normalized license texts
  - Dice-Sørensen coefficient for similarity matching
  - Regex-based pattern matching with alias support (GPL-3.0, GPLv3, etc.)
  - Full text similarity comparison using SequenceMatcher
  - Confidence scoring (0.0-1.0) with detection method tracking
- **Smart File Discovery**: Automatic LICENSE/COPYING/COPYRIGHT/NOTICE file extraction
- **Text Normalization**: Removes variables, dates, and copyright notices for better matching
- **Multi-license Support**: Detects dual/multiple licensing with individual confidence scores
- **Provenance Tracking**: Records detection method and source for attestation
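
To make the normalization and similarity layers more concrete, here is a minimal sketch of text normalization followed by a Dice-Sørensen score over word bigrams and a `difflib.SequenceMatcher` ratio. The function names, the bigram tokenization, and the normalization rules are assumptions chosen for illustration; the engine's actual rules and thresholds are not documented here.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Drop copyright lines and years, lowercase, and collapse punctuation."""
    lines = [
        line for line in text.splitlines()
        if not line.strip().lower().startswith("copyright")
    ]
    text = " ".join(lines).lower()
    text = re.sub(r"\b(19|20)\d{2}\b", " ", text)  # strip four-digit years
    text = re.sub(r"[^a-z0-9]+", " ", text)        # punctuation to spaces
    return text.strip()


def bigrams(text: str) -> set:
    """Return the set of adjacent word pairs in the text."""
    words = text.split()
    return set(zip(words, words[1:]))


def dice(a: str, b: str) -> float:
    """Dice-Sørensen coefficient: 2 * |A & B| / (|A| + |B|) over bigram sets."""
    x, y = bigrams(normalize(a)), bigrams(normalize(b))
    if not x or not y:
        return 0.0  # too short to compare in this sketch
    return 2 * len(x & y) / (len(x) + len(y))


def sequence_ratio(a: str, b: str) -> float:
    """Character-level similarity of the normalized texts."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()
```

Because copyright lines and years are removed before comparison, two copies of the same license that differ only in holder and year score 1.0 with both measures.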
### API Integrations
- **ClearlyDefined**: License and compliance data enrichment
- **Ecosyste.ms**: Package registry metadata and dependencies
- **Maven Central**: Parent POM resolution and inheritance
- **Offline-First**: Extraction runs without internet connectivity by default; parent POM fetching and API enrichment require online mode
### Advanced Features
- **NO-ASSERTION Handling**: Clear indication for unavailable data
- **Parent POM Resolution**: Automatic Maven inheritance processing
- **Dependency Mapping**: Full dependency tree with version constraints
- **Author Parsing**: Intelligent name/email extraction and normalization
- **Repository Detection**: Automatic VCS URL extraction
- **Platform Support**: Architecture and OS requirement detection
- **Package URL (PURL)**: Generate standard Package URLs for all packages
- **File Hashing**: SHA-1, MD5, and fuzzy hash (TLSH/LSH) for package files
- **JSON Organization**: Structured output with package, metadata, people, licensing sections
- **Data Provenance**: Track source of each data field for attestation
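
The file hashing and Package URL features can be approximated with the standard library alone. The sketch below is an assumption about the general approach, not UPMEX's implementation: `file_hashes` and `simple_purl` are hypothetical names, fuzzy hashes (TLSH/LSH) are omitted because they need third-party libraries, and the type-specific normalization rules of the PURL specification (for example, lowercasing PyPI names) are not applied.

```python
import hashlib
from urllib.parse import quote


def file_hashes(path, chunk_size=65536):
    """Compute SHA-1 and MD5 digests of a file, reading it in chunks."""
    sha1, md5 = hashlib.sha1(), hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            sha1.update(chunk)
            md5.update(chunk)
    return {"sha1": sha1.hexdigest(), "md5": md5.hexdigest()}


def simple_purl(pkg_type, name, version, namespace=None):
    """Build a basic pkg:type/namespace/name@version string."""
    # Percent-encode each segment; the namespace is optional.
    parts = [quote(part, safe="") for part in (namespace, name) if part]
    return f"pkg:{pkg_type}/{'/'.join(parts)}@{quote(version, safe='')}"
```

For example, `simple_purl("maven", "commons-lang3", "3.12.0", namespace="org.apache.commons")` yields `pkg:maven/org.apache.commons/commons-lang3@3.12.0`.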
## Installation
```bash
# Install from source
git clone https://github.com/oscarvalenzuelab/semantic-copycat-upmex.git
cd semantic-copycat-upmex
pip install -e .
# Install with all features
pip install -e ".[all]"
# Install for development
pip install -e ".[dev]"
```
## Quick Start
```python
import json

from upmex import PackageExtractor

# Create extractor
extractor = PackageExtractor()

# Extract metadata from a package
metadata = extractor.extract("path/to/package.whl")

# Access metadata
print(f"Package: {metadata.name} v{metadata.version}")
print(f"Type: {metadata.package_type.value}")
print(f"License: {metadata.licenses[0].spdx_id if metadata.licenses else 'Unknown'}")

# Convert to JSON
print(json.dumps(metadata.to_dict(), indent=2))
```
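
When processing several packages, a small wrapper can keep one failure from stopping the whole run. The sketch below relies only on the `extract()` method and the `name`, `version`, and `licenses[...].spdx_id` attributes shown above. The helper names `primary_license` and `summarize` are hypothetical, and because the exceptions raised by `extract()` are not documented here, the sketch catches `Exception` broadly.

```python
def primary_license(metadata, default="Unknown"):
    """Return the first detected SPDX identifier, or a default if none."""
    licenses = getattr(metadata, "licenses", None) or []
    return licenses[0].spdx_id if licenses else default


def summarize(extractor, paths):
    """Extract each package and collect one summary row per path."""
    rows = []
    for path in paths:
        try:
            metadata = extractor.extract(path)
        except Exception as exc:  # assumption: error types are not documented
            rows.append({"path": path, "error": str(exc)})
            continue
        rows.append({
            "path": path,
            "name": metadata.name,
            "version": metadata.version,
            "license": primary_license(metadata),
        })
    return rows
```

Calling `summarize(PackageExtractor(), ["a.whl", "b.jar"])` would then return a list of dictionaries that can be passed straight to `json.dumps`.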
## CLI Usage
```bash
# Basic extraction (offline mode - default)
upmex extract package.whl
# Online mode - fetches parent POMs and queries APIs
upmex extract --online package.jar
# With pretty JSON output
upmex extract --pretty package.whl
# Output to file
upmex extract package.whl -o metadata.json
# Text format output
upmex extract --format text package.tar.gz
# Detect package type
upmex detect package.jar
# Extract license information with confidence scores
upmex license package.tgz --confidence
```
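
The CLI can also be driven from a script. The sketch below builds an `upmex extract` command from the flags shown above and runs it with `subprocess`. It assumes that these flags can be combined in a single invocation and that JSON is written to standard output when no `-o` file is given; neither is confirmed by this README. The helper names are hypothetical.

```python
import json
import subprocess


def build_extract_command(package, online=False, pretty=False, output=None, fmt=None):
    """Assemble the argument list for `upmex extract`."""
    cmd = ["upmex", "extract"]
    if online:
        cmd.append("--online")
    if pretty:
        cmd.append("--pretty")
    if fmt:
        cmd += ["--format", fmt]
    cmd.append(package)
    if output:
        cmd += ["-o", output]
    return cmd


def run_extract(package, **options):
    """Run the CLI and parse stdout as JSON (assumed output format)."""
    result = subprocess.run(
        build_extract_command(package, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(result.stdout)
```

Passing the arguments as a list avoids shell quoting problems with package paths that contain spaces.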
## Configuration
Configure UPMEX through a JSON file or environment variables:
### Environment Variables
```bash
# API Keys
export PME_CLEARLYDEFINED_API_KEY=your-api-key
export PME_ECOSYSTEMS_API_KEY=your-api-key
# Settings
export PME_LOG_LEVEL=DEBUG
export PME_CACHE_DIR=/path/to/cache
export PME_LICENSE_METHODS=regex,dice_sorensen
export PME_OUTPUT_FORMAT=json
```
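
For scripts that need to mirror these settings, the variables can be read with `os.environ`. The sketch below is an assumption about how the values might be interpreted, in particular that `PME_LICENSE_METHODS` is a comma-separated list; `read_pme_settings` is a hypothetical helper, not a UPMEX function.

```python
import os


def read_pme_settings(env=None):
    """Collect the PME_* settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    methods = env.get("PME_LICENSE_METHODS", "")
    return {
        "log_level": env.get("PME_LOG_LEVEL"),
        "cache_dir": env.get("PME_CACHE_DIR"),
        # Split on commas and drop empty entries and stray whitespace.
        "license_methods": [m.strip() for m in methods.split(",") if m.strip()],
        "output_format": env.get("PME_OUTPUT_FORMAT"),
    }
```

Accepting an optional mapping makes the helper easy to test without touching the real process environment.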
### Configuration File
Create a `config.json`:
```json
{
  "api": {
    "clearlydefined": {
      "enabled": true,
      "api_key": null
    }
  },
  "license_detection": {
    "methods": ["regex", "dice_sorensen"],
    "confidence_threshold": 0.85
  },
  "output": {
    "format": "json",
    "pretty_print": true
  }
}
```
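
One plausible reading of `confidence_threshold` is that detections scoring below it are discarded. The sketch below loads a configuration file with this layout and applies that rule to a list of detections. The detection dictionaries (`spdx_id`, `confidence`) and the inclusive `>=` comparison are assumptions for illustration, not documented UPMEX behavior.

```python
import json


def load_config(path):
    """Read a JSON configuration file into a dictionary."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def filter_by_confidence(detections, config):
    """Keep detections whose confidence meets the configured threshold."""
    threshold = config.get("license_detection", {}).get("confidence_threshold", 0.0)
    return [d for d in detections if d["confidence"] >= threshold]
```

With a threshold of 0.85, a detection scored at 0.92 is kept and one scored at 0.60 is dropped.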
## Supported Package Types
| Ecosystem | Formats | Detection | Metadata | Online Mode | Tested |
|-----------|---------|-----------|----------|-------------|--------|
| Python | .whl, .tar.gz, .zip | ✓ | ✓ | API enrichment | ✓ |
| NPM | .tgz, .tar.gz | ✓ | ✓ | API enrichment | ✓ |
| Java | .jar, .war, .ear | ✓ | ✓ | Parent POM fetch | ✓ |
| Maven | .jar with POM | ✓ | ✓ | Parent POM fetch | ✓ |
| Gradle | build.gradle(.kts) | ✓ | ✓ | API enrichment | ✓ |
| CocoaPods | .podspec(.json) | ✓ | ✓ | API enrichment | ✓ |
| Conda | .conda, .tar.bz2 | ✓ | ✓ | API enrichment | ✓ |
| Perl/CPAN | .tar.gz, .zip | ✓ | ✓ | API enrichment | ✓ |
| Conan | conanfile.py/.txt | ✓ | ✓ | - | ✓ |
| Ruby | .gem | ✓ | ✓ | API enrichment | ✓ |
| Rust | .crate | ✓ | ✓ | API enrichment | ✓ |
| Go | .zip, .mod, go.mod | ✓ | ✓ | API enrichment | ✓ |
| NuGet | .nupkg | ✓ | ✓ | API enrichment | ✓ |
## Changelog
See [CHANGELOG.md](CHANGELOG.md) for a detailed history of changes.
## License
MIT License - see LICENSE file for details.
## Package Metadata
- **Name**: semantic-copycat-upmex
- **Version**: 1.5.0
- **Summary**: Universal Package Metadata Extractor - Extract metadata from various package formats
- **License**: MIT
- **Requires Python**: >=3.8
- **Maintainer**: Oscar Valenzuela B. <oscar.valenzuela.b@gmail.com>
- **Uploaded**: 2025-08-11
- **Release files**: semantic_copycat_upmex-1.5.0-py3-none-any.whl (wheel, 1294810 bytes), semantic_copycat_upmex-1.5.0.tar.gz (sdist, 745734 bytes)
- **Keywords**: package, metadata, extractor, license, detection, python, npm, maven, jar, wheel, pypi
- **Repository**: https://github.com/oscarvalenzuelab/semantic-copycat-upmex
- **Issues**: https://github.com/oscarvalenzuelab/semantic-copycat-upmex/issues