# IOCParser
A tool for extracting Indicators of Compromise (IOCs) from security reports in HTML, PDF, and plain text formats.
Author: Marc Rivero | @seifreed
Version: 1.0.0
## Features
- Extraction of multiple types of IOCs:
  - Hashes (MD5, SHA1, SHA256, SHA512)
  - Domains
  - IP Addresses
  - URLs
  - Bitcoin addresses
  - Email addresses
  - Hosts
  - CVEs
  - Windows Registry entries
  - Filenames
  - Filepaths
  - Yara rules
- Automatic defanging of domains and IPs
- Support for HTML, PDF, and plain text formats
- Support for direct analysis from URLs
- Output in JSON and plain text format
- Checking against MISP warning lists to identify false positives
- Can be used as a command-line tool or as a Python library
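
To make the defanging feature concrete, here is a minimal sketch of the general technique. It is not IOCParser's own implementation: the helper names `defang` and `refang` and the bracket convention are assumptions for illustration, and the tool's actual output format may differ.

```python
import re

def defang(indicator: str) -> str:
    """Make a domain, IP, or URL non-clickable (illustrative convention only)."""
    # Neutralize the scheme so the string is not auto-linked.
    result = re.sub(r"^http", "hxxp", indicator, flags=re.IGNORECASE)
    # Bracket every dot so the value is no longer a valid host or address.
    return result.replace(".", "[.]")

def refang(indicator: str) -> str:
    """Reverse the transformation applied by defang()."""
    result = indicator.replace("[.]", ".")
    return re.sub(r"^hxxp", "http", result, flags=re.IGNORECASE)

print(defang("evil.com"))
print(defang("http://evil.com/payload.exe"))
```

Defanged indicators can be pasted into reports or tickets without the risk of someone clicking them by accident.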
## Installation
### From PyPI (Recommended)
```bash
pip install iocparser-tool
```
### From Source
```bash
# Clone the repository
git clone https://github.com/seifreed/iocparser.git
cd iocparser
# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install as a package with all dependencies
pip install -e .
# Or install just the requirements
pip install -r requirements.txt
```
## Quick Start
```bash
# Initialize and download MISP warning lists (do this first)
iocparser --init
# Analyze a PDF file
iocparser -f report.pdf
# Analyze an HTML file
iocparser -f report.html
# Analyze a text file
iocparser -f report.txt
```
## Command Line Usage
### File Type Options
```bash
# Force specific file type (pdf, html, text)
iocparser -f report -t pdf
iocparser -f report -t html
iocparser -f report -t text
```
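
Forcing the type matters when a file has no extension or a misleading one. As a rough illustration of what automatic detection involves, the sketch below guesses a type from the extension and then from the leading bytes. The `guess_file_type` function is hypothetical and is not the tool's own detection logic.

```python
from pathlib import Path

def guess_file_type(path: str, head: bytes) -> str:
    """Guess 'pdf', 'html', or 'text' from the extension, then the first bytes."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix in (".html", ".htm"):
        return "html"
    if suffix == ".txt":
        return "text"
    # No usable extension: inspect the content instead.
    if head.startswith(b"%PDF-"):
        return "pdf"
    lowered = head.lstrip().lower()
    if lowered.startswith((b"<!doctype html", b"<html")):
        return "html"
    # Fall back to plain text when nothing else matches.
    return "text"

print(guess_file_type("report", b"%PDF-1.7 ..."))
```

When a heuristic like this guesses wrong, passing the type explicitly with `-t` avoids the problem.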
### Output Options
```bash
# Save outputs to a specific file
iocparser -f report.pdf -o results.json
iocparser -f report.pdf -o results.txt
# Print results to screen only
iocparser -f report.pdf -o -
# Use JSON format (default is text)
iocparser -f report.pdf --json
```
### Analyzing from URL
```bash
# Analyze a report from a URL
iocparser -u https://example.com/report.html
# Specify content type for a URL
iocparser -u https://example.com/report -t html
```
### Additional Options
```
--no-defang            Disable automatic defanging of IOCs
--no-check-warnings    Don't check IOCs against MISP warning lists
--force-update         Force update of MISP warning lists
--init                 Download and initialize MISP warning lists
-h, --help             Show help message
```
## Using as a Library
You can use IOCParser as a library in your Python projects:
```python
# Example 1: Extract IOCs from a file
from iocparser import extract_iocs_from_file
# Process a file (automatically detects file type)
normal_iocs, warning_iocs = extract_iocs_from_file('path/to/report.pdf')
print(f"Found {len(normal_iocs.get('domains', []))} normal domains")
print(f"Found {len(warning_iocs.get('domains', []))} potential false positive domains")
# With additional options
normal_iocs, warning_iocs = extract_iocs_from_file(
    'path/to/report.html',
    check_warnings=True,   # Check against MISP warning lists
    force_update=False,    # Don't force update MISP lists
    file_type='html',      # Force file type (optional)
    defang=True            # Defang the IOCs
)
# Example 2: Extract IOCs from text content directly
from iocparser import extract_iocs_from_text
text = "This sample malware contacts evil.com with IP 192.168.1.1 and uses hash 5f4dcc3b5aa765d61d8327deb882cf99"
normal_iocs, warning_iocs = extract_iocs_from_text(text)
# Print the extracted IOCs
for ioc_type, iocs_list in normal_iocs.items():
    print(f"{ioc_type}: {iocs_list}")
```
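
Both helper functions return a pair of dictionaries keyed by IOC type. A small sketch of post-processing such a result follows. The `summarize` helper, the sample data, and the `md5` key name are hypothetical, and each dictionary is assumed to map a type name to a list of strings.

```python
import json

def summarize(normal_iocs: dict, warning_iocs: dict) -> dict:
    """Count IOCs per type, split into normal and warning-list hits."""
    types = sorted(set(normal_iocs) | set(warning_iocs))
    return {
        t: {
            "normal": len(normal_iocs.get(t, [])),
            "warnings": len(warning_iocs.get(t, [])),
        }
        for t in types
    }

# Sample data standing in for real extraction results (key names are illustrative).
normal = {"domains": ["evil[.]com"], "md5": ["5f4dcc3b5aa765d61d8327deb882cf99"]}
warnings = {"domains": ["google[.]com"]}

print(json.dumps(summarize(normal, warnings), indent=2))
```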
### Using the Low-Level Components
If you need more control, you can use the individual components directly:
```python
from iocparser import IOCExtractor, PDFParser, HTMLParser, MISPWarningLists
# Extract text from a PDF or HTML file
parser = PDFParser("path/to/report.pdf")
# or
# parser = HTMLParser("path/to/report.html")
text_content = parser.extract_text()
# Extract IOCs
extractor = IOCExtractor(defang=True)
iocs = extractor.extract_all(text_content)
# Check against warning lists
warning_lists = MISPWarningLists()
normal_iocs, warning_iocs = warning_lists.separate_iocs_by_warnings(iocs)
```
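
Conceptually, the warning-list step partitions the extracted IOCs into those that match a known-benign list and those that do not. The following is a simplified sketch of that idea using a hard-coded set. It is not how `MISPWarningLists` is implemented, and the `separate_by_warnings` name and the sample list are made up for illustration.

```python
def separate_by_warnings(iocs: dict, benign: set) -> tuple:
    """Split each IOC list into (normal, warning) by membership in `benign`."""
    normal, warning = {}, {}
    for ioc_type, values in iocs.items():
        # Compare case-insensitively; `benign` is expected to be lowercase.
        hits = [v for v in values if v.lower() in benign]
        rest = [v for v in values if v.lower() not in benign]
        if rest:
            normal[ioc_type] = rest
        if hits:
            warning[ioc_type] = hits
    return normal, warning

benign_domains = {"google.com", "microsoft.com"}
iocs = {"domains": ["evil.com", "Google.com"], "ips": ["203.0.113.7"]}
normal_iocs, warning_iocs = separate_by_warnings(iocs, benign_domains)
print(normal_iocs)
print(warning_iocs)
```

Entries that land in the warning dictionary are likely false positives and usually deserve a manual look before being discarded.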
### Available Extraction Methods
```python
from iocparser import IOCExtractor
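extractor = IOCExtractor(defang=True)
# Sample input; replace with the text of your own report
text = "This sample malware contacts evil.com with IP 192.168.1.1 and uses hash 5f4dcc3b5aa765d61d8327deb882cf99"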
# Extract specific IOC types
md5_hashes = extractor.extract_md5(text)
sha1_hashes = extractor.extract_sha1(text)
sha256_hashes = extractor.extract_sha256(text)
sha512_hashes = extractor.extract_sha512(text)
domains = extractor.extract_domains(text)
ips = extractor.extract_ips(text)
urls = extractor.extract_urls(text)
bitcoin = extractor.extract_bitcoin(text)
yara_rules = extractor.extract_yara_rules(text)
hosts = extractor.extract_hosts(text)
emails = extractor.extract_emails(text)
cves = extractor.extract_cves(text)
registry_keys = extractor.extract_registry(text)
filenames = extractor.extract_filenames(text)
filepaths = extractor.extract_filepaths(text)
# Extract all IOC types at once
all_iocs = extractor.extract_all(text) # Returns a dictionary with all IOCs
```
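
Hash extraction is a good example of how pattern-based extraction works: each hash type is a hexadecimal string of fixed length (32 characters for MD5, 40 for SHA1, 64 for SHA256, 128 for SHA512). The sketch below shows the general technique with plain regular expressions. The patterns and the `extract_hashes` function are assumptions for illustration, not the ones `IOCExtractor` uses.

```python
import re

# Hex digest length for each algorithm.
HASH_LENGTHS = {"md5": 32, "sha1": 40, "sha256": 64, "sha512": 128}

def extract_hashes(text: str) -> dict:
    """Return unique hex strings grouped by the hash type their length implies."""
    found = {}
    for name, length in HASH_LENGTHS.items():
        # Lookarounds stop a longer hex run from matching as a shorter hash.
        pattern = rf"(?<![0-9a-fA-F])[0-9a-fA-F]{{{length}}}(?![0-9a-fA-F])"
        matches = sorted({m.lower() for m in re.findall(pattern, text)})
        if matches:
            found[name] = matches
    return found

sample = "dropper md5 5f4dcc3b5aa765d61d8327deb882cf99, sha1 " + "a" * 40
print(extract_hashes(sample))
```

Length alone cannot tell a real digest from any other hex string of the same size, so results from a pattern like this still need context to confirm.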
## Examples
### Extract IOCs from a local PDF report
```bash
iocparser -f reports/APT28_report.pdf
```
### Extract IOCs from a URL and save in JSON format
```bash
iocparser -u https://example.com/security-report.pdf --json
```
### Extract IOCs from an HTML file without defanging
```bash
iocparser -f report.html --no-defang
```
### Use in a Python script to process multiple files
```python
from iocparser import extract_iocs_from_file
import os
reports_dir = "path/to/reports"
for filename in os.listdir(reports_dir):
    if filename.endswith(".pdf") or filename.endswith(".html"):
        file_path = os.path.join(reports_dir, filename)
        print(f"Processing {filename}...")
        normal_iocs, warning_iocs = extract_iocs_from_file(file_path)
        # Do something with the extracted IOCs
        print(f"Found {sum(len(iocs) for iocs in normal_iocs.values())} IOCs")
```
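
When processing many reports, the per-file results usually need to be combined. Here is one possible way to merge them with de-duplication. `merge_iocs` is a hypothetical helper, and the sample dictionaries stand in for real extraction results.

```python
from collections import defaultdict

def merge_iocs(results: list) -> dict:
    """Merge per-file IOC dictionaries into one, dropping duplicates."""
    merged = defaultdict(set)
    for iocs in results:
        for ioc_type, values in iocs.items():
            merged[ioc_type].update(values)
    # Sort each list so the output is stable between runs.
    return {ioc_type: sorted(values) for ioc_type, values in merged.items()}

# Sample per-file results (illustrative data only).
per_file = [
    {"domains": ["evil[.]com"], "ips": ["203[.]0[.]113[.]7"]},
    {"domains": ["evil[.]com", "bad[.]example"]},
]
print(merge_iocs(per_file))
```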
## License
This project is available under the MIT License. You are free to use, modify, and distribute it, provided that you include the original copyright notice and attribution to the original author.
**Required Attribution:**
- Original Author: Marc Rivero | @seifreed
- Repository: https://github.com/seifreed/iocparser
When using this project in your own work, please include a clear reference to the original author and repository.