avclass-malicialab

Name	avclass-malicialab
Version	2.8.9
home_page	None
Summary	AVClass is a Python package and command line tool to tag / label malware samples.
upload_time	2024-09-05 13:10:00
maintainer	None
docs_url	None
author	MaliciaLab
requires_python	None
license	MIT License Copyright (c) 2016-2020 MaliciaLab @ IMDEA Software Institute Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
keywords	malware malware family tag av label
VCS
bugtrack_url
requirements	No requirements were recorded.

# AVClass

AVClass is a Python package and command line tool to tag / label
malware samples.
You input the AV labels for a large number of malware samples
(e.g., VirusTotal JSON reports)
and it outputs a list of tags extracted from the AV labels of each sample.

By default, AVClass outputs the most likely family name for each sample,
but it can also output other tags that capture
the malware class (e.g., *worm*, *ransomware*, *grayware*),
behaviors (e.g., *spam*, *ddos*), and
file properties (e.g., *packed*, *themida*, *bundle*, *nsis*).

If you are wondering whether this is AVClass or AVClass2,
this is the right place for both.
The old AVClass code has been deprecated and
AVClass2 has been renamed as AVClass.
A longer explanation is below.

## Installation
```shell
pip install avclass-malicialab
```

## Examples

To obtain the most likely family name for each sample, run:

```shell
avclass -f examples/vtv2_sample.json
```

The output on stdout will be:

```
602695c8f2ad76564bddcaf47b76edff zeroaccess
f117cc1477513cb181cc2e9fcaab39b2 winwebsec
```

which simply reports the most common family name for each sample.

For some samples, AVClass may return:

```
5e31d16d6bf35ea117d6d2c4d42ea879 SINGLETON:5e31d16d6bf35ea117d6d2c4d42ea879
```

This means that AVClass was not able to identify a family name for that sample.
AVClass uses the SINGLETON:hash terminology
(e.g., instead of an empty string or NULL)
so that the second column can be used as a cluster identifier where
each unlabeled sample is placed in its own cluster.
This prevents considering that all unlabeled samples are part of the
same family / cluster.
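
To illustrate why this matters, here is a minimal sketch (not part of AVClass)
that groups samples by the second column of the default output.
The function name `cluster_by_family` is our own, and the sketch assumes the
two columns are separated by whitespace.

```python
from collections import defaultdict


def cluster_by_family(lines):
    """Group sample hashes by the label in the second column."""
    clusters = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        sample_hash, label = parts[0], parts[1]
        clusters[label].append(sample_hash)
    return dict(clusters)


# The three output lines shown above.
output_lines = [
    "602695c8f2ad76564bddcaf47b76edff zeroaccess",
    "f117cc1477513cb181cc2e9fcaab39b2 winwebsec",
    "5e31d16d6bf35ea117d6d2c4d42ea879 SINGLETON:5e31d16d6bf35ea117d6d2c4d42ea879",
]

clusters = cluster_by_family(output_lines)
# Unlabeled samples are the clusters whose identifier starts with SINGLETON:
unlabeled = [label for label in clusters if label.startswith("SINGLETON:")]
print(len(clusters), len(unlabeled))
```

Because the SINGLETON identifier embeds the sample hash, each unlabeled sample
ends up as its own dictionary key rather than sharing an empty label with the
other unlabeled samples.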

To extract all tags for each sample, run:

```shell
avclass -f examples/vtv2_sample.json -t
```

The output on stdout will be:

```
602695c8f2ad76564bddcaf47b76edff 52 FAM:zeroaccess|19,FILE:os:windows|16,BEH:server|8,CLASS:backdoor|8,FILE:packed|7
f117cc1477513cb181cc2e9fcaab39b2 39 CLASS:rogueware|15,BEH:alertuser|15,FILE:os:windows|11,FAM:winwebsec|4,CLASS:grayware|4,CLASS:grayware:tool|3,FILE:packed|3
```
This means that sample *602695c8f2ad76564bddcaf47b76edff*
was flagged by 52 AV engines and that
19 of them mention it belongs to the *zeroaccess* family,
16 that it runs on *windows*,
8 that it has the *server* behavior,
8 that it is a *backdoor*, and
7 that it is a *packed* file.
Sample *f117cc1477513cb181cc2e9fcaab39b2* is flagged by 39 AV engines and
15 of them mention its class to be *rogueware*,
15 that it has the *alertuser* behavior,
11 that it runs on *windows*,
4 that it belongs to the *winwebsec* family,
and so on.
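
If you want to post-process this output, the following sketch (not part of
AVClass) parses one line into its hash, detection count, and tags.
The function name `parse_tag_line` is our own. The sketch assumes
whitespace-separated columns and the `CATEGORY:name|count` layout shown above.

```python
def parse_tag_line(line):
    """Split one line of `avclass -t` output into (hash, detections, tags)."""
    fields = line.split()
    sample_hash = fields[0]
    detections = int(fields[1])
    tags = []
    # The third column is absent when no tags were extracted.
    if len(fields) > 2:
        for item in fields[2].split(","):
            path, count = item.rsplit("|", 1)
            # The category is the part before the first colon,
            # e.g. FILE for FILE:os:windows.
            category, _, name = path.partition(":")
            tags.append((category, name, int(count)))
    return sample_hash, detections, tags


line = (
    "602695c8f2ad76564bddcaf47b76edff 52 "
    "FAM:zeroaccess|19,FILE:os:windows|16,BEH:server|8,"
    "CLASS:backdoor|8,FILE:packed|7"
)
sample_hash, detections, tags = parse_tag_line(line)
families = [name for category, name, _ in tags if category == "FAM"]
print(detections, families)
```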

You can also place the output in a file of your choosing with the _-o_ option:

```shell
avclass -f examples/vtv2_sample.json -o output.txt
```

## Why is AVClass useful?

Security researchers often want to extract family and other
information from AV labels, but this process is not as simple as it looks,
especially if you need to do it for large numbers (e.g., millions) of samples.
Some advantages of AVClass are:

1. *Automatic.* It avoids manual work that does not scale for large datasets.

2. *Vendor-agnostic.* It operates on the labels of any available set of AV
engines, which can vary from sample to sample.

3. *Cross-platform.* It can be used for any platform supported by AV
engines, e.g., Windows or Android malware.

4. *Does not require executables.* AV labels can be obtained from online
services like VirusTotal using a sample's hash,
even when the executable is not available.

5. *Quantified accuracy.* We have evaluated AVClass on millions of
samples and publicly available malware datasets with ground truth.
Evaluation details are in the RAID 2016 and ACSAC 2020 papers
(see References section).

6. *Open source.* The code is available and we are happy to incorporate
suggestions and improvements so that the security community benefits from
the tool.

## Limitations

The main limitation of AVClass is that its output depends
on the input AV labels.
AVClass tries to compensate for the noise in the AV labels,
but it cannot identify tags if AV engines do not provide non-generic tokens
in the labels of a sample.
In particular, it only outputs tags that appear in the labels of
at least 2 AV engines.
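
The effect of that threshold can be sketched as follows.
This is only an illustration and not the AVClass implementation:
the function name `tags_with_support`, the engine names, and the per-engine
tags are all made up for the example.

```python
from collections import Counter


def tags_with_support(tags_per_engine, min_engines=2):
    """Keep only the tags reported by at least `min_engines` engines."""
    counts = Counter()
    for tags in tags_per_engine.values():
        # Count each engine at most once per tag.
        counts.update(set(tags))
    return {tag: n for tag, n in counts.items() if n >= min_engines}


# Hypothetical tags already extracted from the labels of three engines.
tags_per_engine = {
    "EngineA": ["zeroaccess", "backdoor"],
    "EngineB": ["zeroaccess"],
    "EngineC": ["packed"],
}

kept = tags_with_support(tags_per_engine)
print(kept)
```

In this made-up input only *zeroaccess* is supported by two engines,
so the other tags would be dropped.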

Still, there are many samples that can be tagged
and thus we believe you will find it useful.

## Is this AVClass or AVClass2?

The short answer is that the current code in this repo is
based on the code of AVClass2.
The original AVClass code has been deprecated.
Below, we detail this process.

We originally published AVClass in RAID 2016 and made its code
available in this repository in July 2016.
AVClass extracted only the family names from the input samples.

We published AVClass2 in ACSAC 2020 and made its code
available in this repository in September 2020.
AVClass2 extracted all tags from the input samples and included a
compatibility option to provide only the family names in the
same format as the original AVClass.

For 2.5 years, both tools were available in this repository in
separate directories.
In February 2023, we decided to
deprecate the original AVClass code,
rename AVClass2 as AVClass,
release a PyPI package to ease installation, and
clean the command line options.

## Input formats

AVClass supports four input JSONL formats
(i.e., one JSON object per line).

1. VirusTotal v3 API reports,
where each line in the input *file* should be the full JSON of a
VirusTotal API version 3 response with a *File* object report,
e.g., obtained by querying https://www.virustotal.com/api/v3/files/{hash}.
There is an example VirusTotal v3 input file in examples/vtv3_sample.json.

```shell
avclass -f examples/vtv3_sample.json -o output.txt
```

2. VirusTotal v2 API reports,
where each line in the input *file* should be the full JSON of a
VirusTotal v2 API response to the */file/report* endpoint,
e.g., obtained by querying https://www.virustotal.com/vtapi/v2/file/report?apikey={apikey}&resource={hash}.
There is an example VirusTotal v2 input file in examples/vtv2_sample.json.

```shell
avclass -f examples/vtv2_sample.json -o output.txt
```

3. OPSWAT MetaDefender reports,
where each line in the input *file* should be the full JSON
obtained from OPSWAT MetaDefender.
There is an example OPSWAT MetaDefender input file in
examples/opswat_md_sample.json

```shell
avclass -f examples/opswat_md_sample.json -o output.txt
```

4. Simplified format,
where each line in the input *file* should be a JSON
with (at least) these fields:
{md5, sha1, sha256, av_labels}.
There is an example of such an input file in *examples/malheurReference_lb.json*.
If you are obtaining AV labels from sources other than VirusTotal you
may want to convert them to this format.

```shell
avclass -f examples/malheurReference_lb.json -o output.txt
```
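
If you need to convert labels from another source, a conversion could look
like the sketch below.
It is an assumption on our side that `av_labels` is a list of
`[engine, label]` pairs; please check *examples/malheurReference_lb.json*
for the exact layout before relying on it.
The function name, the hash placeholders, the engine names, and the labels
are hypothetical.

```python
import json


def to_simplified_record(md5, sha1, sha256, av_labels):
    """Serialize one sample as a single JSON line (JSONL)."""
    record = {
        "md5": md5,
        "sha1": sha1,
        "sha256": sha256,
        # Assumed layout: one [engine, label] pair per AV engine.
        "av_labels": [[engine, label] for engine, label in av_labels],
    }
    return json.dumps(record)


# Placeholder values; replace them with the real hashes and labels.
line = to_simplified_record(
    md5="<md5 of the sample>",
    sha1="<sha1 of the sample>",
    sha256="<sha256 of the sample>",
    av_labels=[("EngineA", "Trojan.Example.A"), ("EngineB", "Win32.Example.B")],
)
print(line)
```

Writing one such line per sample to a file gives the one-object-per-line
layout that the formats above share.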

**Multiple input files and different formats**

AVClass can handle multiple input files, putting the results in the
same output file
(if you want results in separate files, process each input file separately).
AVClass automatically detects the format of each file,
so it is possible to mix input files.

For example, you can provide as input the four test files
(each of a different format) in the examples directory:

```shell
avclass -f examples/vtv3_sample.json -f examples/vtv2_sample.json -f examples/malheurReference_lb.json -f examples/opswat_md_sample.json -o output.txt
```

output.txt will have 3135 lines: 3130 samples from malheurReference_lb.json,
3 samples from vtv2_sample.json, 1 sample from vtv3_sample.json, and
1 sample from opswat_md_sample.json.

You can also provide as input a directory with the -d option and
AVClass will process all files in that directory.

```shell
avclass -d <directory>
```

It is also possible to combine -f with -d.
Thus, this command works:

```shell
avclass -f <file> -d <directory>
```

At this point you have read the most important information on
how to use AVClass.
The following sections describe steps that most users will not need.

## Labeling: Using only Selected AV Engines

By default, AVClass will use the labels of all AV engines that appear in
the input reports.
If you want to limit AVClass to use only the labels of certain AV engines,
you can use the -av option to pass it a file where each line has the name of
an AV engine (case-sensitive).

For example, you could create a file engines.txt with three lines,
one engine name per line (BitDefender, F-Secure, and Sophos).
With that file, the command

```shell
avclass -av engines.txt -f examples/vtv2_sample.json -t -o output.txt
```

would output into output.txt:
```
602695c8f2ad76564bddcaf47b76edff 3 FAM:zeroaccess|2
f117cc1477513cb181cc2e9fcaab39b2 3
```

where only the labels of BitDefender, F-Secure, and Sophos have been used
to extract tags.
The output states all three selected engines flag both samples as malicious.
Note that the number of detections is with respect to the provided engines,
i.e., even if the first sample has 52 detections,
the number of detections is a maximum of 3 in this case.
For the first sample, two AV engines identify the family as *zeroaccess* but
for the second sample no tags are identified in the labels
of the three selected AV engines.

## Labeling: Ground Truth Evaluation

If you have family ground truth for some malware samples,
i.e., you know the true family for those samples,
you can evaluate the accuracy of the family tags output by AVClass on
those samples with respect to that ground truth.
The evaluation metrics used are precision, recall, and F1 measure.
See our
[RAID 2016 paper](https://software.imdea.org/~juanca/papers/avclass_raid16.pdf) for their definition.
Note that the ground truth evaluation does not apply to non-family tags,
i.e., it only evaluates family labeling.

```shell
avclass -f examples/malheurReference_lb.json -gt examples/malheurReference_gt.tsv -o malheurReference.labels
```

The output includes these lines:

```
Calculating precision and recall
3131 out of 3131
Precision: 90.81 Recall: 93.95 F1-Measure: 92.35
```

Each line in the *examples/malheurReference_gt.tsv* file has
three **tab-separated** columns (hash, AVClass family, GT family):

```
afdd8f086dfcb8d2cf26c566e784476dd899ec10 adrotator ADROTATOR
```

which indicates that sample afdd8f086dfcb8d2cf26c566e784476dd899ec10
is identified as *adrotator* by AVClass and
its ground truth family is *ADROTATOR*.
Each sample in the input file should also appear in the ground truth file.
Note that the particular label assigned to each family does not matter.
What matters is that all samples in the same family are assigned
the same family name (i.e., the same string in the second column).
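
The sketch below shows why the label strings themselves do not matter.
It assumes the usual clustering definitions of precision and recall
(each cluster is matched to the cluster it overlaps most on the other side);
the RAID 2016 paper remains the authoritative definition, and this is not
the code AVClass runs.
The function names, sample identifiers, and family names are hypothetical.

```python
from collections import defaultdict


def clusters_from(labels):
    """Turn a {sample: label} mapping into a list of sample sets."""
    clusters = defaultdict(set)
    for sample, label in labels.items():
        clusters[label].add(sample)
    return list(clusters.values())


def precision_recall_f1(predicted, truth):
    """Clustering precision, recall, and F1 over the shared samples."""
    samples = set(predicted) & set(truth)
    if not samples:
        raise ValueError("no samples in common")
    pred_clusters = clusters_from({s: predicted[s] for s in samples})
    gt_clusters = clusters_from({s: truth[s] for s in samples})
    n = len(samples)
    # Precision: each predicted cluster scores its best overlap with a GT cluster.
    precision = sum(max(len(c & d) for d in gt_clusters) for c in pred_clusters) / n
    # Recall: each GT cluster scores its best overlap with a predicted cluster.
    recall = sum(max(len(c & d) for c in pred_clusters) for d in gt_clusters) / n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


predicted = {"s1": "famA", "s2": "famA", "s3": "famB", "s4": "SINGLETON:s4"}
truth = {"s1": "X", "s2": "X", "s3": "X", "s4": "Y"}
precision, recall, f1 = precision_recall_f1(predicted, truth)
print(precision, recall, round(f1, 4))
```

Here the predicted clusters are pure, so precision is 1.0, but the
ground truth family *X* is split across two predicted clusters,
which lowers recall to 0.75.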

The ground truth can be obtained from publicly available malware datasets.
The one in *examples/malheurReference_gt.tsv* comes from the
[Malheur](http://www.mlsec.org/malheur/) dataset.
There are other public datasets with ground truth such as
[Drebin](https://www.sec.cs.tu-bs.de/~danarp/drebin/) or
[Malicia](http://malicia-project.com/dataset.html).

## Update Module

The update module can be used to suggest additions and changes to the input
taxonomy, tagging rules, and expansion rules.
By default, AVClass uses the default taxonomy, tagging, and expansion files
included in the repository.
Thus, we expect that most users will not need to run the update module.
Below, we explain how to run it in case you need to.

Using the update module comprises two steps.
The first step is obtaining an alias file:

```shell
avclass -f examples/malheurReference_lb.json -aliasdetect -o /dev/null
```

The above command will create a file named \<file\>.alias,
malheurReference_lb.alias in our example. This file has 7 columns:

1. t1: token that is an alias
2. t2: tag for which t1 is an alias
3. |t1|: number of input samples where t1 was observed
4. |t2|: number of input samples where t2 was observed
5. |t1^t2|: number of input samples where both t1 and t2 were observed
6. |t1^t2|/|t1|: ratio of input samples where both t1 and t2 were observed over the number of input samples where t1 was observed.
7. |t1^t2|/|t2|: ratio of input samples where both t1 and t2 were observed over the number of input samples where t2 was observed.
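
To make the seven columns concrete, the sketch below computes them for one
(t1, t2) pair from a list of per-sample token sets.
It is only an illustration of the column definitions, not the code AVClass
uses to build the alias file; the function name `alias_stats` and the token
sets are hypothetical.

```python
def alias_stats(samples_tokens, t1, t2):
    """Return the seven alias-file columns for the pair (t1, t2)."""
    n_t1 = sum(1 for tokens in samples_tokens if t1 in tokens)
    n_t2 = sum(1 for tokens in samples_tokens if t2 in tokens)
    n_both = sum(1 for tokens in samples_tokens if t1 in tokens and t2 in tokens)
    # Guard against division by zero when a token was never observed.
    ratio_t1 = n_both / n_t1 if n_t1 else 0.0
    ratio_t2 = n_both / n_t2 if n_t2 else 0.0
    return (t1, t2, n_t1, n_t2, n_both, ratio_t1, ratio_t2)


# Hypothetical tokens observed in the labels of four samples.
samples_tokens = [
    {"zaccess", "zeroaccess"},
    {"zaccess", "zeroaccess"},
    {"zeroaccess"},
    {"winwebsec"},
]

stats = alias_stats(samples_tokens, "zaccess", "zeroaccess")
print(stats)
```

In this made-up input, every sample containing *zaccess* also contains
*zeroaccess*, so column 6 is 1.0, which is the kind of evidence that suggests
t1 is an alias of t2.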

The Update Module takes the above file as input with the -alias option,
as well as the default taxonomy, tagging, and expansion files
in the data directory.
It outputs updated taxonomy, tagging, and expansion files that include the
suggested additions and changes.

```shell
avclass-update -alias malheurReference_lb.alias -o output_prefix
```

This will produce three files:
output_prefix.taxonomy, output_prefix.tagging, output_prefix.expansion.
You can diff the output and input files to analyze the proposed changes.

You can also modify the input taxonomy, tagging, and expansion rules in place,
rather than producing new files:

```shell
avclass-update -alias malheurReference_lb.alias -update
```

## Customizing AVClass

AVClass is fully customizable:
Tagging, Expansion and Taxonomy files can be easily modified by the analyst
either manually or by running the update module.

If you change those files manually, we recommend running
afterwards the normalization script to keep them tidy.
It sorts the tags in the taxonomy and performs some basic cleaning like
removing redundant entries:

```shell
avclass-normalize -tax mytaxonomy -tag mytagging -exp myexpansions
```

If the modifications are in the default files in the data directory you can
simply run:

```shell
avclass-normalize
```

## Evaluating and comparing with AVClass

Other researchers may want to independently evaluate AVClass/AVClass2 and
to compare it with their own approaches.
We encourage such evaluation, feedback on limitations, and proposals for
improvement.
However, we have observed a number of common errors in such evaluations that
should be avoided.
Thus, if you need to compare your approach with AVClass/AVClass2,
please read the [evaluation page](EVALUATION.md).

## Dependencies

AVClass is written in Python.
It should run on Python versions above 2.7 and 3.0.

It does not require installing any dependencies.

## Support and Contributing

If you have issues or want to contribute, please file an issue or open a
pull request through GitHub.

## License

AVClass is released under the MIT license.

## References

The design and evaluation of AVClass are detailed in our
[RAID 2016 paper](https://software.imdea.org/~juanca/papers/avclass_raid16.pdf):

> Marcos Sebastián, Richard Rivera, Platon Kotzias, and Juan Caballero.<br>
AVClass: A Tool for Massive Malware Labeling.<br>
In Proceedings of the International Symposium on Research in
Attacks, Intrusions and Defenses,
September 2016.

The design and evaluation of AVClass2 are detailed in our
[ACSAC 2020 paper](https://arxiv.org/pdf/2006.10615.pdf):

> Silvia Sebastián, Juan Caballero.<br>
AVClass2: Massive Malware Tag Extraction from AV Labels.<br>
In Proceedings of the Annual Computer Security Applications Conference,
December 2020.

## Contributors

Several members of the MaliciaLab at the
[IMDEA Software Institute](http://software.imdea.org)
have contributed to AVClass:
Marcos Sebastián, Richard Rivera, Platon Kotzias, Srdjan Matic,
Silvia Sebastián, Kevin van Liebergen, and Juan Caballero.

GitHub users with significant contributions to AVClass include
(let us know if you believe you should be listed here):
[eljeffeg](https://github.com/eljeffeg)

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "avclass-malicialab",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "malware, malware family, tag, AV label",
    "author": "MaliciaLab",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/64/32/4cea5c955478b1290bf56a225ffb43968f0009b861a013cca2673caee9db/avclass_malicialab-2.8.9.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License  Copyright (c) 2016-2020 MaliciaLab @ IMDEA Software Institute  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.  ",
    "summary": "AVClass is a Python package and command line tool to tag / label malware samples.",
    "version": "2.8.9",
    "project_urls": {
        "Homepage": "https://github.com/malicialab/avclass"
    },
    "split_keywords": [
        "malware",
        " malware family",
        " tag",
        " av label"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "c52f2e5490996ef8fdf631dc9f578e29152e62635ccf5d5b4a88ea745a4fa57a",
                "md5": "f006bc74ef131ea316af99582c9961db",
                "sha256": "6365373b54581f6f7918ef91f14f7f081c68ac38d67e1ea92863c91ef4704376"
            },
            "downloads": -1,
            "filename": "avclass_malicialab-2.8.9-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "f006bc74ef131ea316af99582c9961db",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 68371,
            "upload_time": "2024-09-05T13:09:58",
            "upload_time_iso_8601": "2024-09-05T13:09:58.835080Z",
            "url": "https://files.pythonhosted.org/packages/c5/2f/2e5490996ef8fdf631dc9f578e29152e62635ccf5d5b4a88ea745a4fa57a/avclass_malicialab-2.8.9-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "64324cea5c955478b1290bf56a225ffb43968f0009b861a013cca2673caee9db",
                "md5": "a1c6893efafd35dc28f19944e8a314fc",
                "sha256": "2a87f3981399995c494ca61cb2f6175e155584cd5376b2fd7108138ec8f48559"
            },
            "downloads": -1,
            "filename": "avclass_malicialab-2.8.9.tar.gz",
            "has_sig": false,
            "md5_digest": "a1c6893efafd35dc28f19944e8a314fc",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 71529,
            "upload_time": "2024-09-05T13:10:00",
            "upload_time_iso_8601": "2024-09-05T13:10:00.611279Z",
            "url": "https://files.pythonhosted.org/packages/64/32/4cea5c955478b1290bf56a225ffb43968f0009b861a013cca2673caee9db/avclass_malicialab-2.8.9.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-09-05 13:10:00",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "malicialab",
    "github_project": "avclass",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "avclass-malicialab"
}

MaliciaLab