<p style="display:flex;align-items:center;justify-content:center">
<img src="docs/assets/images/logo_transparent.png" width="300px" />
</p>
> A fast spam filter written in Python inspired by SpamAssassin integrated with machine learning.
[![test workflow](https://github.com/matteospanio/spam-analyzer/actions/workflows/test.yml/badge.svg)](https://github.com/matteospanio/spam-analyzer/actions/workflows/test.yml/badge.svg)
![CircleCI](https://img.shields.io/circleci/build/github/matteospanio/spam-analyzer?label=circleci-build&logo=CIRCLECI)
[![Coverage Status](https://coveralls.io/repos/github/matteospanio/spam-analyzer/badge.svg?branch=master)](https://coveralls.io/github/matteospanio/spam-analyzer?branch=master)
[![PyPI version](https://badge.fury.io/py/spam-analyzer.svg)](https://badge.fury.io/py/spam-analyzer)
![PyPI - Status](https://img.shields.io/pypi/status/spam-analyzer)
[![Python version](https://img.shields.io/badge/python-3.10%20%7C%203.11-blue)](https://img.shields.io/badge/python-3.10%20%7C%203.7%20%7C%203.11-blue)
[![Downloads](https://pepy.tech/badge/spam-analyzer)](https://pepy.tech/project/spam-analyzer)
[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
[![Dependencies](https://img.shields.io/librariesio/github/matteospanio/spam-analyzer)](https://libraries.io/github/matteospanio/spam-analyzer)
# Table of Contents
- [Table of Contents](#table-of-contents)
- [What is spam-analyzer?](#what-is-spam-analyzer)
- [Installation](#installation)
- [Usage](#usage)
  * [CLI](#cli)
  * [Python](#python)
- [Contributing](#contributing)
- [License](#license)
# What is spam-analyzer?
spam-analyzer is a CLI (Command Line Interface) application that aims to be a viable alternative to spam filter services.
It classifies the emails given as input as spam or non-spam using a machine learning algorithm (Random Forest); the model is trained on a dataset of 19900 emails. The classifier can still be wrong sometimes: to improve its accuracy, you can train the model on your own personalized dataset.
The main features of spam-analyzer are:
1. spam recognition with the option to display a detailed analysis of the email
2. JSON output
3. it can be used as a library in your Python project to extract features from an email
4. it is written in Python with its most modern features to ensure software correctness
5. extensible with plugins
6. 100% containerized with Docker
## What is spam and how does spam-analyzer know it?
The analysis takes into consideration the following main aspects:
- the headers of the email
- the body of the email
- the attachments of the email
The most significant parts are the headers and the body of the email. The headers are analyzed to extract the following features:
- SPF (Sender Policy Framework)
- DKIM (DomainKeys Identified Mail)
- DMARC (Domain-based Message Authentication, Reporting & Conformance)
- Whether the sender domain matches the first one in the `Received` headers
- The subject of the email
- The send date
- Whether the send date is compliant with RFC 2822 and was sent from a valid time zone
- The date of the first received header
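
Some of the header checks listed above can be sketched with Python's standard `email` package. This is only an illustration, not spam-analyzer's actual implementation: the function name `header_features`, the returned keys, and the -12:00 to +14:00 range used for a "valid" time zone are assumptions made for this example, and the SPF, DKIM and DMARC checks are left out.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime


def header_features(raw_email: str) -> dict:
    """Extract a few header-based features (hypothetical helper)."""
    msg = message_from_string(raw_email)

    # Domain part of the From address, e.g. "example.com"
    _, sender = parseaddr(msg.get("From", ""))
    sender_domain = sender.rpartition("@")[2].lower()

    # Each hop prepends its own Received header, so the last one
    # in the list belongs to the first (oldest) hop.
    received = msg.get_all("Received") or []
    first_received = received[-1].lower() if received else ""

    # parsedate_to_datetime raises on dates it cannot parse
    try:
        sent_at = parsedate_to_datetime(msg.get("Date", ""))
    except (TypeError, ValueError):
        sent_at = None

    # Assumption: offsets in use today lie between -12:00 and +14:00
    valid_tz = False
    if sent_at is not None and sent_at.utcoffset() is not None:
        hours = sent_at.utcoffset().total_seconds() / 3600
        valid_tz = -12 <= hours <= 14

    return {
        "subject": msg.get("Subject", ""),
        "sender_domain_in_first_received": bool(sender_domain)
        and sender_domain in first_received,
        "date_is_rfc2822": sent_at is not None,
        "date_has_valid_tz": valid_tz,
    }
```

Each value in the returned dictionary becomes one input column for the classifier, so no hand-written rule has to decide how much a malformed date or a mismatching domain should weigh.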
The body is analyzed to extract the following features:
- If there are links
- If there are images
- If links are only http or https
- The percentage of the body that is written in uppercase
- The percentage of the body that contains blacklisted words
- The polarity of the body calculated with TextBlob
- The subjectivity of the body calculated with TextBlob
- If it contains mailto links
- If it contains javascript code
- If it contains html code
- If it contains html forms
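
Several of the body features above reduce to simple counting and pattern matching. The sketch below uses only the standard library and is not taken from spam-analyzer's code base: `body_features`, its keys, and the regular expressions are assumptions for illustration, the percentages are computed over the raw text (markup included) for brevity, and the TextBlob polarity and subjectivity scores are omitted.

```python
import re


def body_features(body: str, blacklist: set[str]) -> dict:
    """Compute a few body-based features (hypothetical helper)."""
    letters = [c for c in body if c.isalpha()]
    words = re.findall(r"[a-z']+", body.lower())

    # Targets of quoted href attributes, e.g. href="http://..."
    hrefs = re.findall(r'href\s*=\s*["\']([^"\']+)["\']', body, flags=re.IGNORECASE)
    hrefs = [h.lower() for h in hrefs]

    def has_tag(name: str) -> bool:
        # True if an opening tag with this name appears in the body
        return re.search(rf"<{name}\b", body, flags=re.IGNORECASE) is not None

    return {
        "has_links": bool(hrefs),
        "has_images": has_tag("img"),
        "links_only_http": bool(hrefs)
        and all(h.startswith(("http://", "https://")) for h in hrefs),
        "has_mailto": any(h.startswith("mailto:") for h in hrefs),
        "has_script": has_tag("script"),
        "has_form": has_tag("form"),
        "has_html": re.search(r"<[a-z][^>]*>", body, flags=re.IGNORECASE) is not None,
        # Share of alphabetic characters that are uppercase
        "uppercase_pct": 100 * sum(c.isupper() for c in letters) / len(letters)
        if letters
        else 0.0,
        # Share of words that appear in the blacklist
        "blacklisted_pct": 100 * sum(w in blacklist for w in words) / len(words)
        if words
        else 0.0,
    }
```

For example, `body_features("BUY viagra now", {"viagra"})` reports 3 uppercase letters out of 12 and 1 blacklisted word out of 3.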
The task could be solved programmatically by chaining a long set of `if` statements over the features extracted from the email. However, this approach does not scale and is hard to maintain. Moreover, the accuracy could not be improved without changing the code and, most importantly, the analysis would be based on the programmer's knowledge rather than on the data. Since we live in the data era, we should let the data solve the problem. So I decided to use a machine learning algorithm that works on all the features extracted from the email.
# Installation
spam-analyzer is available on PyPI, so you can install it with pip:
```bash
pip install spam-analyzer
```
For the latest version, you can install it from the source code:
```bash
git clone https://github.com/matteospanio/spam-analyzer.git
cd spam-analyzer
pip install .
```
# Usage
## CLI
spam-analyzer can be used as a CLI application:
```
Usage: spam-analyzer [OPTIONS] COMMAND [ARGS]...
A simple program to analyze emails.
Options:
-h, --help Show this message and exit.
-v, --verbose Enables verbose mode.
--version Show the version and exit.
-C, --config CONFIG_PATH Location of the configuration file. Supports glob
pattern of local path and remote URL.
Commands:
analyze Analyze emails from a file or directory.
configure Configure the program.
plugins Show all available plugins.
```
- `spam-analyzer analyze <file>`: classify the given email
- `spam-analyzer -v analyze <file>`: classify the given email and display a detailed analysis[^1]
- `spam-analyzer analyze -fmt json <file>`: classify the given email and display the result in JSON format (useful for integration with other programs)
- `spam-analyzer analyze -fmt json -o <outpath> <file>`: classify the given email and write the result in JSON format to the file at `<outpath>`[^2]
- `spam-analyzer analyze -l <wordlist> <file>`: classify the given email using the given wordlist
### Configuration
`spam-analyzer` is designed to be highly configurable: on its first execution it creates a configuration file in `~/.config/spamanalyzer/`, along with some other default files. You can edit the configuration file to customize the behavior of the program. At the time of writing it only contains the paths to the wordlist and the model, but more options are planned (e.g. sender blacklists and whitelists, a default path where classified emails are copied, ...).
[^1]: The `--verbose` option is available only for the first use case; it does not work in combination with the `--output-format` option.
[^2]: Use the `--output-file` option instead of the `>` operator to write the output to a file, because `spam-analyzer` prints additional information to standard output while processing the email(s).
## Python
```python
import asyncio
from spamanalyzer import SpamAnalyzer
analyzer = SpamAnalyzer(forbidden_words=["viagra", "cialis"])
# analyze() must be awaited, so outside async code run it in an event loop
analysis = asyncio.run(analyzer.analyze("path/to/email.txt"))
```
The `spamanalyzer` library provides a simple interface to extract features from an email. The `SpamAnalyzer` class provides the `analyze` method, which takes the path to the email as input and returns a `MailAnalysis` object containing the analysis of the email.
The `MailAnalysis` class in turn provides the `is_spam` method, which returns `True` if the email is spam and `False` otherwise. Further examples are available in the `examples` folder of the source code.
# Contributing
Contributions are welcome! Please read the [contribution guidelines](CONTRIBUTING.md) first.
# License
spam-analyzer is licensed under the [GPLv3](LICENSE) license.
Raw data
{
"_id": null,
"home_page": "http://docs.spamanalyzer.tech/",
"name": "spam-analyzer",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.10,<4.0",
"maintainer_email": "",
"keywords": "spam,spam-analyzer,cybersecurity",
"author": "Matteo Spanio",
"author_email": "spanio@dei.unipd.it",
"download_url": "https://files.pythonhosted.org/packages/7a/b4/eb0d4cc1e3e8a858c3bbf96aceab66cd94732b19c2eed3f31ea32c41c98a/spam_analyzer-1.0.11.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "GPLv3",
"summary": "A simple email analyzer",
"version": "1.0.11",
"project_urls": {
"Homepage": "http://docs.spamanalyzer.tech/",
"Repository": "https://github.com/matteospanio/spam-analyzer"
},
"split_keywords": [
"spam",
"spam-analyzer",
"cybersecurity"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "dc30f19cf919f04743f7497b10ffa84f7c5a058c0abc2decd9ba75d9b53cb8a0",
"md5": "b47c9667918dc32451bb4cbb6d582875",
"sha256": "e3d37aea325e1639b3ed174931e49ba15e1441967a0e1f8160b34d211a8b83a4"
},
"downloads": -1,
"filename": "spam_analyzer-1.0.11-py3-none-any.whl",
"has_sig": false,
"md5_digest": "b47c9667918dc32451bb4cbb6d582875",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10,<4.0",
"size": 13513100,
"upload_time": "2023-11-22T21:51:29",
"upload_time_iso_8601": "2023-11-22T21:51:29.426838Z",
"url": "https://files.pythonhosted.org/packages/dc/30/f19cf919f04743f7497b10ffa84f7c5a058c0abc2decd9ba75d9b53cb8a0/spam_analyzer-1.0.11-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "7ab4eb0d4cc1e3e8a858c3bbf96aceab66cd94732b19c2eed3f31ea32c41c98a",
"md5": "31fe267d632ce6ea13cefa58789c7fa4",
"sha256": "7ab50eec4fb82695f92a1727ee7c23a572607114ddb72f8f2cdbad498de05905"
},
"downloads": -1,
"filename": "spam_analyzer-1.0.11.tar.gz",
"has_sig": false,
"md5_digest": "31fe267d632ce6ea13cefa58789c7fa4",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10,<4.0",
"size": 13012126,
"upload_time": "2023-11-22T21:51:33",
"upload_time_iso_8601": "2023-11-22T21:51:33.150090Z",
"url": "https://files.pythonhosted.org/packages/7a/b4/eb0d4cc1e3e8a858c3bbf96aceab66cd94732b19c2eed3f31ea32c41c98a/spam_analyzer-1.0.11.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-11-22 21:51:33",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "matteospanio",
"github_project": "spam-analyzer",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"circle": true,
"lcname": "spam-analyzer"
}