# ELF Binary Labeler

| Field | Value |
|-------|-------|
| Name | `pyelflabeler` |
| Version | 0.2.1 |
| Summary | ELF Binary Analysis Tool - Label malware or benignware datasets |
| Upload time | 2025-10-28 06:38:58 |
| Requires Python | >=3.10 |
| License | MIT |
| Keywords | binary-analysis, dataset, elf, labeling, malware, security |
| Requirements | tqdm (>=4.65.0), pyelftools (>=0.29), avclass-malicialab (>=2.8.10) |

[Chinese version (中文版本)](README_zh-TW.md)

A Python tool for analyzing and labeling ELF binary datasets for malware and benignware classification. It extracts metadata from each binary, including CPU architecture, endianness, packing information, and malware family.

## Features

- **Dual Mode Operation**
  - **Malware Mode**: Analyze VirusTotal JSON reports combined with binary files
  - **Benignware Mode**: Direct binary analysis without JSON reports

- **Comprehensive Binary Analysis**
  - ELF header information (CPU, architecture, endianness, file type)
  - Binary metadata (bits, load segments, section headers)
  - File hashing (MD5, SHA256)
  - Packing detection using DiE (Detect It Easy)
  - Malware family classification using AVClass

- **Performance Optimized**
  - Multi-process parallel processing
  - Progress tracking with tqdm
  - Efficient single-pass file reading

- **Modern Architecture**
  - Modular design with separation of concerns
  - Factory pattern for extensibility
  - Abstract base class for easy extension
  - Managed by modern Python tooling (uv, pyproject.toml)
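As a rough illustration of the ELF header fields listed above, the sketch below decodes the word size, byte order, file type, and machine ID from the first 20 bytes of a file using only the standard library. It is not the project's implementation (the package depends on `pyelftools`); the function name `read_elf_header` and the returned keys are assumptions made for this example.

```python
import struct

# e_type values defined by the ELF specification
ELF_TYPES = {1: "REL", 2: "EXEC", 3: "DYN", 4: "CORE"}


def read_elf_header(data: bytes) -> dict:
    """Decode class, byte order, file type, and machine from an ELF header."""
    if len(data) < 20 or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    ei_class, ei_data = data[4], data[5]
    if ei_class not in (1, 2) or ei_data not in (1, 2):
        raise ValueError("invalid ELF identification bytes")
    order = "<" if ei_data == 1 else ">"
    # e_type and e_machine are two 16-bit fields right after the 16-byte e_ident
    e_type, e_machine = struct.unpack_from(order + "HH", data, 16)
    return {
        "bits": 32 if ei_class == 1 else 64,
        "endianness": "little" if ei_data == 1 else "big",
        "file_type": ELF_TYPES.get(e_type, hex(e_type)),
        "machine": e_machine,
    }


# Synthetic 20-byte header: 64-bit, little endian, ET_EXEC, EM_X86_64 (62)
sample = b"\x7fELF" + bytes([2, 1, 1]) + bytes(9) + struct.pack("<HH", 2, 62)
print(read_elf_header(sample))
```

For a real file, pass the first 20 bytes read in binary mode, for example `read_elf_header(open(path, "rb").read(20))`.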

## Prerequisites

### Required Tools

1. **Python 3.10+**

2. **DiE (Detect It Easy)** - for packing detection
   - Download from: https://github.com/horsicq/Detect-It-Easy
   - Ensure `diec` command is available in PATH

3. **AVClass** - for malware family classification (malware mode)
   - Automatically installed via Python dependencies
   - Or manually install: `pip install avclass-malicialab`

## Installation

### Method 1: Install from PyPI (Recommended)

```bash
pip install pyelflabeler
```

After installation, you can run the tool using the `pyelflabeler` command:

```bash
pyelflabeler --help
```

### Method 2: Install from source with uv

[uv](https://github.com/astral-sh/uv) is a fast Python package installer and resolver.

1. Install uv:
   ```bash
   curl -LsSf https://astral.sh/uv/install.sh | sh
   ```

2. Clone and install:
   ```bash
   git clone https://github.com/louiskyee/pyelflabeler.git
   cd pyelflabeler
   uv sync
   ```

3. Run the tool:
   ```bash
   uv run pyelflabeler --help
   # Or use Python module directly
   uv run python -m src.main --help
   ```

### Method 3: Install from source with pip

1. Clone this repository:
   ```bash
   git clone https://github.com/louiskyee/pyelflabeler.git
   cd pyelflabeler
   ```

2. Install in editable mode:
   ```bash
   pip install -e .
   ```

3. Verify installation:
   ```bash
   pyelflabeler --help
   diec --version
   ```

## Usage

### Malware Mode

Analyze VirusTotal JSON reports combined with binary files:

```bash
pyelflabeler --mode malware \
    -i /path/to/json_reports \
    -b /path/to/malware/binaries \
    -o malware_output.csv
```

**Expected Directory Structure:**

Both JSON reports and binaries are organized by SHA256 hash prefix:

```
/path/to/json_reports/
├── 00/
│   ├── 0000002158d35c2bb5e7d96a39ff464ea4c83de8c5fd72094736f79125aaca11.json
│   ├── 0000002a10959ec38b808d8252eed2e814294fbb25d2cd016b24bf853a44857e.json
│   └── ...
├── 01/
│   └── ...
└── ...

/path/to/malware/binaries/
├── 00/
│   ├── 0000002158d35c2bb5e7d96a39ff464ea4c83de8c5fd72094736f79125aaca11
│   ├── 0000002a10959ec38b808d8252eed2e814294fbb25d2cd016b24bf853a44857e
│   └── ...
├── 01/
│   └── ...
└── ...
```

Files are organized in subdirectories named by the first two characters of their SHA256 hash.
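The layout above can be expressed as a small path helper. This is a minimal sketch, assuming the file name is the lowercase SHA256 hex digest and the report uses a `.json` suffix as in the tree; `sharded_path` is a hypothetical name, not part of the package.

```python
from pathlib import PurePosixPath


def sharded_path(root: str, sha256_hex: str, suffix: str = "") -> PurePosixPath:
    """Return root/<first two hex chars>/<sha256><suffix>."""
    return PurePosixPath(root) / sha256_hex[:2] / (sha256_hex + suffix)


digest = "0000002158d35c2bb5e7d96a39ff464ea4c83de8c5fd72094736f79125aaca11"
report = sharded_path("/path/to/json_reports", digest, ".json")
binary = sharded_path("/path/to/malware/binaries", digest)
print(report)
print(binary)
```

Pairing a report with its binary then reduces to computing both paths from the same digest.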

### Benignware Mode

Analyze binary files directly without JSON reports:

```bash
pyelflabeler --mode benignware \
    -b /path/to/benignware/binaries \
    -o benignware_output.csv
```

### Command Line Options

| Option | Short | Description | Required |
|--------|-------|-------------|----------|
| `--mode` | `-m` | Analysis mode: `malware` or `benignware` | No (default: malware) |
| `--input_folder` | `-i` | Folder containing JSON reports | Yes (malware mode only) |
| `--binary_folder` | `-b` | Folder containing binary files | Yes (both modes) |
| `--output` | `-o` | Output CSV file path | No (auto-generated) |
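To make the conditional requirement in the table concrete, here is one way such a parser could be declared with `argparse`. This is a sketch under the assumptions in the table, not the project's actual CLI code; `build_parser` and `parse_args` are hypothetical names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="pyelflabeler")
    parser.add_argument("-m", "--mode", choices=["malware", "benignware"],
                        default="malware", help="analysis mode")
    parser.add_argument("-i", "--input_folder",
                        help="folder containing JSON reports (malware mode)")
    parser.add_argument("-b", "--binary_folder", required=True,
                        help="folder containing binary files")
    parser.add_argument("-o", "--output", help="output CSV file path")
    return parser


def parse_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # The JSON report folder is only mandatory in malware mode
    if args.mode == "malware" and not args.input_folder:
        parser.error("--input_folder is required in malware mode")
    return args


args = parse_args(["--mode", "benignware", "-b", "/path/to/benignware/binaries"])
print(args.mode, args.binary_folder, args.output)
```

Checking the mode-dependent flag after parsing keeps `--input_folder` optional for benignware runs while still failing early for malware runs.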

## Output Format

The tool generates a CSV file with the following columns:

| Column | Description |
|--------|-------------|
| `file_name` | SHA256 hash of the binary |
| `md5` | MD5 hash |
| `label` | Classification: `Malware` or `Benignware` |
| `file_type` | ELF file type (EXEC, DYN, REL, CORE) |
| `CPU` | CPU architecture (e.g., x86-64, ARM) |
| `bits` | Binary bits (32 or 64) |
| `endianness` | Byte order (little/big endian) |
| `load_segments` | Number of PT_LOAD segments |
| `has_section_name` | Whether section headers exist |
| `family` | Malware family (malware mode only) |
| `first_seen` | First seen timestamp (malware mode) |
| `size` | File size in bytes |
| `diec_is_packed` | Whether binary is packed (True/False) |
| `diec_packer_info` | Packer name and version |
| `diec_packing_method` | Packing method details |

### Example Output

```csv
file_name,md5,label,file_type,CPU,bits,endianness,load_segments,has_section_name,family,first_seen,size,diec_is_packed,diec_packer_info,diec_packing_method
01a2b3c4...,5e6f7a8b...,Malware,EXEC,Advanced Micro Devices X86-64,64,2's complement little endian,2,True,mirai,2024-01-15,45678,True,UPX(3.95),NRV
```
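Once the CSV exists, it can be summarized with the standard library. The sketch below uses the column names from the table above; both data rows are made-up placeholders for illustration, and leaving `family` empty for benignware is an assumption about how the tool fills that column.

```python
import csv
import io
from collections import Counter

# Illustrative rows only; hashes are truncated placeholders, not real samples
SAMPLE = """\
file_name,md5,label,file_type,CPU,bits,endianness,load_segments,has_section_name,family,first_seen,size,diec_is_packed,diec_packer_info,diec_packing_method
01a2b3c4...,5e6f7a8b...,Malware,EXEC,Advanced Micro Devices X86-64,64,2's complement little endian,2,True,mirai,2024-01-15,45678,True,UPX(3.95),NRV
0f9e8d7c...,1a2b3c4d...,Benignware,DYN,Advanced Micro Devices X86-64,64,2's complement little endian,4,True,,,18232,False,,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
# Boolean columns arrive as the strings "True" / "False"
packed = [row for row in rows if row["diec_is_packed"] == "True"]
families = Counter(row["family"] for row in rows if row["family"])

print(f"{len(rows)} rows, {len(packed)} packed")
print(dict(families))
```

For a file on disk, replace `io.StringIO(SAMPLE)` with `open("malware_output.csv", newline="")`.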

## Error Handling

- Errors and warnings are logged to `{output_filename}_errors.log`
- A failed file analysis does not stop the run; the remaining files are still processed
- Detailed debug information available in log files
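The log-and-continue behavior described above might look like the following. This is a sketch only: the helper names are hypothetical, and deriving the log name from the output file's stem (`malware_output.csv` to `malware_output_errors.log`) is an assumption about what `{output_filename}` means.

```python
import logging
from pathlib import Path


def error_log_path(output_csv: str) -> Path:
    """Assumed naming: '<output stem>_errors.log' next to the CSV."""
    out = Path(output_csv)
    return out.with_name(out.stem + "_errors.log")


def analyze_all(paths, analyze, logger):
    """Run analyze() on every path; log failures and keep going."""
    results = []
    for path in paths:
        try:
            results.append(analyze(path))
        except Exception:
            logger.exception("analysis failed for %s", path)
    return results


def fake_analyze(path):
    # Stand-in analyzer that fails on one input to exercise the error path
    if path == "broken.bin":
        raise ValueError("truncated ELF header")
    return {"file_name": path}


logger = logging.getLogger("labeler_sketch")
logger.addHandler(logging.NullHandler())  # swap for a FileHandler in real use
results = analyze_all(["a.bin", "broken.bin", "b.bin"], fake_analyze, logger)
print(error_log_path("malware_output.csv"))
print(results)
```

In real use, `logging.FileHandler(error_log_path(output_csv))` would replace the `NullHandler` so that failures end up in the log file.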

## Performance

- Utilizes all available CPU cores for parallel processing
- Optimized single-pass file reading for ELF analysis
- Progress bars for real-time status updates

Example performance (tested on 8-core system):
- ~1000 files processed in ~5-10 minutes (depending on binary sizes and analysis depth)
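Single-pass reading can be illustrated with hashing: both digests and the file size are computed from one sequential read instead of opening the file once per hash. This is a minimal sketch of the idea, not the project's code; `hash_stream` and the chunk size are assumptions.

```python
import hashlib
import io


def hash_stream(stream, chunk_size: int = 1 << 20) -> dict:
    """Compute MD5, SHA256, and size while reading the stream once."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    size = 0
    while chunk := stream.read(chunk_size):
        md5.update(chunk)
        sha256.update(chunk)
        size += len(chunk)
    return {"md5": md5.hexdigest(), "sha256": sha256.hexdigest(), "size": size}


# In-memory stand-in for a binary; use open(path, "rb") for a real file
info = hash_stream(io.BytesIO(b"\x7fELF" + bytes(60)))
print(info)
```

Reading in fixed-size chunks keeps memory use flat regardless of binary size.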

## Project Structure

The project follows modern Python best practices with a modular architecture:

```
dataset_labeler/
├── main.py                    # CLI entry point
├── pyproject.toml             # Project configuration (uv)
├── requirements.txt           # Legacy pip support
├── src/
│   ├── main.py                # Main CLI logic
│   ├── config.py              # Configuration management
│   ├── constants.py           # CSV field definitions
│   ├── factory.py             # Factory pattern for analyzer creation
│   ├── analyzers/
│   │   ├── base_analyzer.py       # Abstract base class
│   │   ├── malware_analyzer.py    # Malware analysis
│   │   └── benignware_analyzer.py # Benignware analysis
│   └── utils/
│       ├── elf_utils.py       # ELF binary utilities
│       ├── hash_utils.py      # File hashing
│       └── packer_utils.py    # Packer detection & AVClass
└── tests/                     # Unit tests (coming soon)
```

### Extensibility

Adding a new analyzer type is straightforward:

1. Create a new analyzer class in `src/analyzers/` inheriting from `BaseAnalyzer`
2. Implement `collect_files()` and `process_single_file()` methods
3. Register it in the factory (`src/factory.py`)

Example:
```python
from src.analyzers.base_analyzer import BaseAnalyzer

class CustomAnalyzer(BaseAnalyzer):
    def collect_files(self):
        # Your implementation
        pass

    def process_single_file(self, file_path):
        # Your implementation
        pass
```
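Step 3, registering the analyzer in the factory, is not shown above. The self-contained sketch below illustrates one common way to do it with a registry dictionary. The `BaseAnalyzer` here is a stand-in so the example runs on its own, and `register`, `create_analyzer`, and the constructor argument are hypothetical; the real `src/factory.py` may be organized differently.

```python
from abc import ABC, abstractmethod


class BaseAnalyzer(ABC):
    """Stand-in for the project's base class; the real one may differ."""

    @abstractmethod
    def collect_files(self): ...

    @abstractmethod
    def process_single_file(self, file_path): ...


_REGISTRY = {}


def register(mode):
    """Class decorator that maps a mode name to an analyzer class."""
    def decorator(cls):
        _REGISTRY[mode] = cls
        return cls
    return decorator


def create_analyzer(mode, *args, **kwargs):
    try:
        cls = _REGISTRY[mode]
    except KeyError:
        raise ValueError(f"unknown mode: {mode!r}") from None
    return cls(*args, **kwargs)


@register("custom")
class CustomAnalyzer(BaseAnalyzer):
    def __init__(self, files):
        self.files = list(files)

    def collect_files(self):
        return self.files

    def process_single_file(self, file_path):
        return {"file_name": file_path, "label": "Custom"}


analyzer = create_analyzer("custom", ["a.bin", "b.bin"])
print([analyzer.process_single_file(p) for p in analyzer.collect_files()])
```

With a registry, adding a mode does not require touching the factory function itself.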

## Troubleshooting

### Common Issues

1. **"AVClass not found"**
   - Ensure AVClass is installed and in your PATH
   - Malware mode requires AVClass for family classification

2. **"readelf failed"**
   - Verify binutils is installed: `which readelf`
   - Some non-ELF files will skip readelf analysis

3. **"diec command failed"**
   - Ensure DiE is properly installed
   - Check `diec` is accessible: `which diec`

4. **Permission Denied**
   - Ensure read permissions on input directories
   - Ensure write permissions for output CSV location
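The `which` checks above can be scripted before a long run. This is a small sketch using `shutil.which`; the tool names are the ones mentioned in this section, and `missing_tools` is a hypothetical helper.

```python
import shutil


def missing_tools(tools=("diec", "readelf")):
    """Return the names of external commands that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found in PATH:", ", ".join(missing))
else:
    print("All external tools found")
```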

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## License

This project is open source and available under the [MIT License](LICENSE).

## Citation

If you use this tool in your research, please cite:

```bibtex
@software{pyelflabeler,
  title={PyELFLabeler: A Tool for ELF Binary Dataset Analysis},
  author={louiskyee},
  year={2024},
  url={https://github.com/louiskyee/pyelflabeler}
}
```

## Acknowledgments

- [AVClass](https://github.com/malicialab/avclass) - Malware family classification
- [Detect It Easy](https://github.com/horsicq/Detect-It-Easy) - Packer detection
- [tqdm](https://github.com/tqdm/tqdm) - Progress bars

## Contact

For questions, issues, or suggestions, please open an issue on GitHub.

---

**Note**: This tool is designed for security research and educational purposes. Use responsibly and ethically.

            
