cesnet-datazoo


- Name: cesnet-datazoo
- Version: 0.1.5
- Summary: A toolkit for large network traffic datasets
- Upload time: 2024-04-30 12:14:45
- Requires Python: >=3.10
- License: BSD-3-Clause
- Keywords: traffic classification, datasets, machine learning

<p align="center">
    <img src="https://raw.githubusercontent.com/CESNET/cesnet-datazoo/main/docs/images/datazoo.svg" width="450">
</p>

[![](https://img.shields.io/badge/license-BSD-blue.svg)](https://github.com/CESNET/cesnet-datazoo/blob/main/LICENCE)
[![](https://img.shields.io/badge/docs-cesnet--datazoo-blue.svg)](https://cesnet.github.io/cesnet-datazoo/)
[![](https://img.shields.io/badge/python->=3.10-blue.svg)](https://pypi.org/project/cesnet-datazoo/)
[![](https://img.shields.io/pypi/v/cesnet-datazoo)](https://pypi.org/project/cesnet-datazoo/)


The goal of this project is to provide tools for working with large network traffic datasets and to facilitate research in the traffic classification area. The core functions of the `cesnet-datazoo` package are:

- A common API for downloading, configuring, and loading three public datasets of encrypted network traffic.
- Extensive configuration options for:
    - Selection of train, validation, and test periods.
    - Selection of application classes and splitting classes between *known* and *unknown*.
    - Data transformations, such as feature scaling.
- Built on suitable data structures for experiments with large datasets. There are several caching mechanisms to make repeated runs faster, for example, when searching for the best model configuration.
- Datasets are offered in multiple sizes so that users can start experiments at a smaller scale, with a faster download and lower disk usage. The default is the `S` size, which contains 25 million samples.

:brain: :brain: See a related project [CESNET Models](https://github.com/CESNET/cesnet-models) providing pre-trained neural networks for traffic classification. :brain: :brain:

:notebook: :notebook: Example Jupyter notebooks are included in a separate [CESNET Traffic Classification Examples](https://github.com/CESNET/cesnet-tcexamples) repo. :notebook: :notebook:

## Datasets
The following datasets are available in the `cesnet-datazoo` package:

| Name                               | CESNET-TLS22                                                                                                                                                                                   | CESNET-QUIC22                                                                                                                                             | CESNET-TLS-Year22                                                                                                                                                                              |
| ---------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| _Protocol_                         | TLS                                                                                                                                                                                            | QUIC                                                                                                                                                      | TLS                                                                                                                                                                                            |
| _Published in_                     | 2022                                                                                                                                                                                           | 2023                                                                                                                                                      | 2023                                                                                                                                                                                           |
| _Collection duration_              | 2 weeks                                                                                                                                                                                        | 4 weeks                                                                                                                                                   | 1 year                                                                                                                                                                                         |
| _Collection period_                | 4.10.2021 - 17.10.2021 | 31.10.2022 - 27.11.2022 | 1.1.2022 - 31.12.2022 |
|                                    |  | ID, SRC_IP, DST_IP, DST_ASN, SRC_PORT, DST_PORT, PROTOCOL, QUIC_VERSION, QUIC_SNI, QUIC_USERAGENT, TIME_FIRST, TIME_LAST | ID, SRC_IP, DST_IP, DST_ASN, DST_PORT, PROTOCOL, TLS_SNI, TLS_JA3, TIME_FIRST, TIME_LAST |
| _Application count_                | 191                                                                                                                                                                                            | 102                                                                                                                                                       | 180                                                                                                                                                                                            |
| _Available samples_                | 141392195                                                                                                                                                                                      | 153226273                                                                                                                                                 | 507739073                                                                                                                                                                                      |
| _Available dataset sizes_          | XS, S, M, L                                                                                                                                                                                    | XS, S, M, L                                                                                                                                               | XS, S, M, L                                                                                                                                                                                    |
| _Cite_                             | [https://doi.org/10.1016/j.comnet.2022.109467](https://doi.org/10.1016/j.comnet.2022.109467)                                                                                                   | [https://doi.org/10.1016/j.dib.2023.108888](https://doi.org/10.1016/j.dib.2023.108888)                                                                    |                                                                                                                                                                                                |
| _Zenodo URL_                       | [https://zenodo.org/record/7965515](https://zenodo.org/record/7965515)                                                                                                                         | [https://zenodo.org/record/7963302](https://zenodo.org/record/7963302)                                                                                    |                                                                                                                                                                                                |
| _Related papers_                   |                                                                                                                                                                                                | [https://doi.org/10.23919/TMA58422.2023.10199052](https://doi.org/10.23919/TMA58422.2023.10199052)                                                        |                                                                                                                                                                                                |

## Installation

Install the package from PyPI with:

```bash
pip install cesnet-datazoo
```

or, for an editable install, with:

```bash
pip install -e "git+https://github.com/CESNET/cesnet-datazoo#egg=cesnet-datazoo"
```
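
The package metadata declares `requires_python >=3.10`, so it can help to check the interpreter before installing. The snippet below is a minimal sketch of such a check; `MIN_PYTHON` and `python_is_supported` are hypothetical names used only for illustration and are not part of `cesnet-datazoo`.

```python
import sys

# Minimum interpreter version, taken from the package metadata (requires_python >=3.10).
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


# Report whether the running interpreter is new enough.
status = "supported" if python_is_supported() else "too old for cesnet-datazoo"
print(f"Python {sys.version_info.major}.{sys.version_info.minor} is {status}")
```

pip performs the same check on its own and refuses to install the package on an older interpreter, so this is only a convenience for scripts that want a clearer message.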

## Examples
#### Initialize dataset to create train, validation, and test dataframes

```py
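from cesnet_datazoo.datasets import CESNET_QUIC22
from cesnet_datazoo.config import DatasetConfig, AppSelection

# Use the XS size of CESNET-QUIC22 stored under the given directory.
dataset = CESNET_QUIC22("/datasets/CESNET-QUIC22/", size="XS")
# Treat all application classes as known and pick the train and test periods.
dataset_config = DatasetConfig(
    dataset=dataset,
    apps_selection=AppSelection.ALL_KNOWN,
    train_period_name="W-2022-44",
    test_period_name="W-2022-45",
)
# Initialize the train, validation, and test sets with this configuration.
dataset.set_dataset_config_and_initialize(dataset_config)
# Read each set into a Pandas DataFrame.
train_dataframe = dataset.get_train_df()
val_dataframe = dataset.get_val_df()
test_dataframe = dataset.get_test_df()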
```

The [`DatasetConfig`](https://cesnet.github.io/cesnet-datazoo/reference_dataset_config/) class handles the configuration of datasets, and calling `set_dataset_config_and_initialize` initializes train, validation, and test sets with the desired configuration.
Data can be read into Pandas DataFrames, as shown here, or via PyTorch DataLoaders. See the [`CesnetDataset`](https://cesnet.github.io/cesnet-datazoo/reference_cesnet_dataset/) reference.

See more examples in the [documentation](https://cesnet.github.io/cesnet-datazoo/getting_started/).
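
The period names in the example, such as `W-2022-44`, appear to follow a `W-<year>-<week>` pattern. The CESNET-QUIC22 collection period (31.10.2022 - 27.11.2022) lines up with ISO weeks 44 to 47 of 2022, so the sketch below assumes ISO week numbering. That reading is an assumption, and `period_to_dates` is a hypothetical helper, not part of the `cesnet-datazoo` API.

```python
from datetime import date, timedelta


def period_to_dates(period_name: str) -> tuple[date, date]:
    """Map a name such as "W-2022-44" to the (Monday, Sunday) of that week.

    Assumes the name encodes an ISO year and ISO week number.
    """
    prefix, year, week = period_name.split("-")
    if prefix != "W":
        raise ValueError(f"unsupported period name: {period_name!r}")
    monday = date.fromisocalendar(int(year), int(week), 1)
    return monday, monday + timedelta(days=6)


# Print the date ranges of the train and test periods from the example.
for name in ("W-2022-44", "W-2022-45"):
    start, end = period_to_dates(name)
    print(name, start.isoformat(), end.isoformat())
```

Under this assumption, `W-2022-44` covers 2022-10-31 to 2022-11-06, which matches the first week of the CESNET-QUIC22 collection period.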

## Papers

* [DataZoo: Streamlining Traffic Classification Experiments](https://doi.org/10.1145/3630050.3630176) <br>
Jan Luxemburk and Karel Hynek <br>
CoNEXT Workshop on Explainable and Safety Bounded, Fidelitous, Machine Learning for Networking (SAFE), 2023

## Acknowledgments

This project was supported by the Ministry of the Interior of the Czech Republic, grant No. VJ02010024: Flow-Based Encrypted Traffic Analysis.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "cesnet-datazoo",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": "Jan Luxemburk <luxemburk@cesnet.cz>, Karel Hynek <hynekkar@cesnet.cz>",
    "keywords": "traffic classification, datasets, machine learning",
    "author": null,
    "author_email": "Jan Luxemburk <luxemburk@cesnet.cz>, Karel Hynek <hynekkar@cesnet.cz>",
    "download_url": "https://files.pythonhosted.org/packages/15/73/e858bb8b7da08a850b9541154d9a82f0afe22a8f7aa0cb061e397c274f8e/cesnet_datazoo-0.1.5.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "BSD-3-Clause",
    "summary": "A toolkit for large network traffic datasets",
    "version": "0.1.5",
    "project_urls": {
        "Bug Tracker": "https://github.com/CESNET/cesnet-datazoo/issues",
        "Documentation": "https://cesnet.github.io/cesnet-datazoo/",
        "Homepage": "https://github.com/CESNET/cesnet-datazoo"
    },
    "split_keywords": [
        "traffic classification",
        " datasets",
        " machine learning"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ed35427e60b6e48a7face8bf386c390b9353f24cd9d394d5cb19f79609cad45c",
                "md5": "fa9493ae0756ca9512f4413a54b84c7a",
                "sha256": "b5a751b9be69909cef29fe91a5bb5a7c37f1e6b915b2ff479ce059cdb03d7939"
            },
            "downloads": -1,
            "filename": "cesnet_datazoo-0.1.5-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "fa9493ae0756ca9512f4413a54b84c7a",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 51374,
            "upload_time": "2024-04-30T12:14:43",
            "upload_time_iso_8601": "2024-04-30T12:14:43.642132Z",
            "url": "https://files.pythonhosted.org/packages/ed/35/427e60b6e48a7face8bf386c390b9353f24cd9d394d5cb19f79609cad45c/cesnet_datazoo-0.1.5-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "1573e858bb8b7da08a850b9541154d9a82f0afe22a8f7aa0cb061e397c274f8e",
                "md5": "255b671ec2241e4addf8431c6173e8e2",
                "sha256": "18fffb6d7edefa8cf35f2cc597d45413daa2252836c02f4c8f12d42d2320ad25"
            },
            "downloads": -1,
            "filename": "cesnet_datazoo-0.1.5.tar.gz",
            "has_sig": false,
            "md5_digest": "255b671ec2241e4addf8431c6173e8e2",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 47841,
            "upload_time": "2024-04-30T12:14:45",
            "upload_time_iso_8601": "2024-04-30T12:14:45.113473Z",
            "url": "https://files.pythonhosted.org/packages/15/73/e858bb8b7da08a850b9541154d9a82f0afe22a8f7aa0cb061e397c274f8e/cesnet_datazoo-0.1.5.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-04-30 12:14:45",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "CESNET",
    "github_project": "cesnet-datazoo",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "cesnet-datazoo"
}
        