cesnet-datazoo

Name	cesnet-datazoo
Version	0.1.10
home_page	None
Summary	A toolkit for large network traffic datasets
upload_time	2024-11-06 11:20:00
maintainer	None
docs_url	None
author	None
requires_python	>=3.10
license	BSD-3-Clause
keywords	traffic classification datasets machine learning

<p align="center">
    <img src="https://raw.githubusercontent.com/CESNET/cesnet-datazoo/main/docs/images/datazoo.svg" width="450">
</p>

[![](https://img.shields.io/badge/license-BSD-blue.svg)](https://github.com/CESNET/cesnet-datazoo/blob/main/LICENCE)
[![](https://img.shields.io/badge/docs-cesnet--datazoo-blue.svg)](https://cesnet.github.io/cesnet-datazoo/)
[![](https://img.shields.io/badge/python->=3.10-blue.svg)](https://pypi.org/project/cesnet-datazoo/)
[![](https://img.shields.io/pypi/v/cesnet-datazoo)](https://pypi.org/project/cesnet-datazoo/)


The goal of this project is to provide tools for working with large network traffic datasets and to facilitate research in the traffic classification area. The core functions of the `cesnet-datazoo` package are:

- A common API for downloading, configuring, and loading three public datasets of encrypted network traffic.
- Extensive configuration options for:
    - Selection of train, validation, and test periods.
    - Selection of application classes and splitting classes between *known* and *unknown*.
    - Data transformations, such as feature scaling.
- Data structures suited to experiments with large datasets, plus several caching mechanisms that make repeated runs faster, for example, when searching for the best model configuration.
- Datasets are offered in multiple sizes so that users can start their experiments at a smaller scale, with a faster download and lower disk usage. The default is the `S` size containing 25 million samples.
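
To make the *known*/*unknown* class split mentioned above more concrete, here is a minimal, library-independent sketch. It does not use or reproduce the `cesnet-datazoo` API: the function `split_known_unknown`, its parameters, and the application labels are hypothetical and only illustrate the idea of holding out some application classes as unknown.

```python
import random


def split_known_unknown(classes, unknown_fraction=0.2, seed=0):
    """Randomly hold out a fraction of classes as 'unknown' (illustrative only)."""
    if not 0.0 <= unknown_fraction <= 1.0:
        raise ValueError("unknown_fraction must be between 0 and 1")
    # Sort first so the result depends only on the seed, not on input order.
    shuffled = sorted(classes)
    random.Random(seed).shuffle(shuffled)
    n_unknown = round(len(shuffled) * unknown_fraction)
    unknown = sorted(shuffled[:n_unknown])
    known = sorted(shuffled[n_unknown:])
    return known, unknown


# Hypothetical application labels, not taken from any of the datasets.
apps = ["app-a", "app-b", "app-c", "app-d", "app-e"]
known, unknown = split_known_unknown(apps, unknown_fraction=0.4, seed=42)
print("known:", known)
print("unknown:", unknown)
```

Fixing the seed keeps the split reproducible across runs, which matters when comparing model configurations on the same set of unknown classes.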

:brain: :brain: See a related project [CESNET Models](https://github.com/CESNET/cesnet-models) providing pre-trained neural networks for traffic classification. :brain: :brain:

:notebook: :notebook: Example Jupyter notebooks are included in a separate [CESNET Traffic Classification Examples](https://github.com/CESNET/cesnet-tcexamples) repo. :notebook: :notebook:

## Datasets
The `cesnet-datazoo` package currently provides three datasets, detailed in the following table (you might need to scroll the table horizontally to see all datasets).

1. CESNET-TLS22
2. CESNET-QUIC22
3. CESNET-TLS-Year22

| Name                               | CESNET-TLS22                                                                                                                                                                                   | CESNET-QUIC22                                                                                                                                             | CESNET-TLS-Year22                                                                                                                                                                              |
| ---------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| _Protocol_                         | TLS                                                                                                                                                                                            | QUIC                                                                                                                                                      | TLS                                                                                                                                                                                            |
| _Published in_                     | 2022                                                                                                                                                                                           | 2023                                                                                                                                                      | 2023                                                                                                                                                                                           |
| _Collection duration_              | 2 weeks                                                                                                                                                                                        | 4 weeks                                                                                                                                                   | 1 year                                                                                                                                                                                         |
| _Collection period_                | 4.10.2021 - 17.10.2021                                                                                                                                                                         | 31.10.2022 - 27.11.2022                                                                                                                                   | 1.1.2022 - 31.12.2022                                                                                                                                                                          |
| _Application count_                | 191                                                                                                                                                                                            | 102                                                                                                                                                       | 180                                                                                                                                                                                            |
| _Available samples_                | 141392195                                                                                                                                                                                      | 153226273                                                                                                                                                 | 507739073                                                                                                                                                                                      |
| _Available dataset sizes_          | XS, S, M, L                                                                                                                                                                                    | XS, S, M, L                                                                                                                                               | XS, S, M, L                                                                                                                                                                                    |
| _Cite_                             | [https://doi.org/10.1016/j.comnet.2022.109467](https://doi.org/10.1016/j.comnet.2022.109467)                                                                                                   | [https://doi.org/10.1016/j.dib.2023.108888](https://doi.org/10.1016/j.dib.2023.108888)                                                                    | [https://doi.org/10.1038/s41597-024-03927-4](https://doi.org/10.1038/s41597-024-03927-4)                                                                                                       |
| _Zenodo URL_                       | [https://zenodo.org/record/7965515](https://zenodo.org/record/7965515)                                                                                                                         | [https://zenodo.org/record/7963302](https://zenodo.org/record/7963302)                                                                                    | [https://zenodo.org/records/10608607](https://zenodo.org/records/10608607)                                                                                                                     |
| _Related papers_                   |                                                                                                                                                                                                | [https://doi.org/10.23919/TMA58422.2023.10199052](https://doi.org/10.23919/TMA58422.2023.10199052)                                                        |                                                                                                                                                                                                |
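
As a quick sanity check of the table, the sketch below recomputes each collection duration from the collection period and derives a rough average number of samples per day. It assumes the dates are written as day.month.year and that both end dates are included; the names `DATASETS` and `collection_days` are made up for this example and are not part of the package.

```python
from datetime import date

# Collection periods and sample counts copied from the table above.
DATASETS = {
    "CESNET-TLS22": (date(2021, 10, 4), date(2021, 10, 17), 141_392_195),
    "CESNET-QUIC22": (date(2022, 10, 31), date(2022, 11, 27), 153_226_273),
    "CESNET-TLS-Year22": (date(2022, 1, 1), date(2022, 12, 31), 507_739_073),
}


def collection_days(start, end):
    """Number of calendar days in the period, counting both end dates."""
    return (end - start).days + 1


for name, (start, end, samples) in DATASETS.items():
    days = collection_days(start, end)
    print(f"{name}: {days} days, about {samples / days:,.0f} samples per day")
```

Under these assumptions the periods come out at 14, 28, and 365 days, matching the stated durations of 2 weeks, 4 weeks, and 1 year.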

## Installation

Install the package from PyPI with:

```bash
pip install cesnet-datazoo
```

or, for an editable install from the Git repository, with:

```bash
pip install -e "git+https://github.com/CESNET/cesnet-datazoo#egg=cesnet-datazoo"
```
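
After installing, it can be useful to confirm that the interpreter meets the package's `>=3.10` requirement and that the distribution is visible to Python. The following is a small sketch using only the standard library; the helper names `installed_version` and `python_is_supported` are hypothetical and not provided by `cesnet-datazoo`.

```python
import sys
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def python_is_supported(minimum=(3, 10)):
    """Check the running interpreter against the package's `>=3.10` requirement."""
    return sys.version_info[:2] >= minimum


print("Python supported:", python_is_supported())
print("cesnet-datazoo version:", installed_version("cesnet-datazoo"))
```

If the second line prints `None`, the package is not installed in the environment that is running the script.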

## Examples
#### Initialize dataset to create train, validation, and test dataframes

```py
from cesnet_datazoo.datasets import CESNET_QUIC22
from cesnet_datazoo.config import DatasetConfig, AppSelection

# Point the dataset at a local data directory and select the XS size.
dataset = CESNET_QUIC22("/datasets/CESNET-QUIC22/", size="XS")

# Treat all application classes as known; train and test on separate periods.
dataset_config = DatasetConfig(
    dataset=dataset,
    apps_selection=AppSelection.ALL_KNOWN,
    train_period_name="W-2022-44",
    test_period_name="W-2022-45",
)

# Apply the configuration, then read each split as a pandas DataFrame.
dataset.set_dataset_config_and_initialize(dataset_config)
train_dataframe = dataset.get_train_df()
val_dataframe = dataset.get_val_df()
test_dataframe = dataset.get_test_df()
```

The [`DatasetConfig`](https://cesnet.github.io/cesnet-datazoo/reference_dataset_config/) class handles the configuration of datasets, and calling `set_dataset_config_and_initialize` initializes the train, validation, and test sets with the desired configuration.
Data can be read into pandas DataFrames as shown here or via PyTorch DataLoaders. See the [`CesnetDataset`](https://cesnet.github.io/cesnet-datazoo/reference_cesnet_dataset/) reference for details.
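
The period names used above, such as `W-2022-44`, look like week identifiers. The sketch below maps such a name to a date range under the assumption that names of the form `W-<year>-<week>` follow ISO 8601 week numbering. This README does not state that; the assumption is only consistent with the CESNET-QUIC22 collection period starting on 31.10.2022, the Monday of ISO week 44. The helper `week_period_to_dates` is hypothetical and not part of the package.

```python
import re
from datetime import date, timedelta


def week_period_to_dates(period_name):
    """Map a name like 'W-2022-44' to (Monday, Sunday), assuming ISO weeks."""
    match = re.fullmatch(r"W-(\d{4})-(\d{1,2})", period_name)
    if match is None:
        raise ValueError(f"unsupported period name: {period_name!r}")
    year, week = int(match.group(1)), int(match.group(2))
    monday = date.fromisocalendar(year, week, 1)
    return monday, monday + timedelta(days=6)


for name in ("W-2022-44", "W-2022-45"):
    start, end = week_period_to_dates(name)
    print(name, start.isoformat(), "to", end.isoformat())
```

Under this assumption, the example trains on 31 October to 6 November 2022 and tests on the following week, so the two periods do not overlap.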

See more examples in the [documentation](https://cesnet.github.io/cesnet-datazoo/getting_started/).

## Papers

* [DataZoo: Streamlining Traffic Classification Experiments](https://doi.org/10.1145/3630050.3630176) <br>
Jan Luxemburk and Karel Hynek <br>
CoNEXT Workshop on Explainable and Safety Bounded, Fidelitous, Machine Learning for Networking (SAFE), 2023

## Acknowledgments

This project was supported by the Ministry of the Interior of the Czech Republic, grant No. VJ02010024: Flow-Based Encrypted Traffic Analysis.

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "cesnet-datazoo",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": "Jan Luxemburk <luxemburk@cesnet.cz>, Karel Hynek <hynekkar@cesnet.cz>",
    "keywords": "traffic classification, datasets, machine learning",
    "author": null,
    "author_email": "Jan Luxemburk <luxemburk@cesnet.cz>, Karel Hynek <hynekkar@cesnet.cz>",
    "download_url": "https://files.pythonhosted.org/packages/54/c7/55a26543f66701f73e6cf90250f6c9735a66a6e29bb6b0a5059648c4b511/cesnet_datazoo-0.1.10.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "BSD-3-Clause",
    "summary": "A toolkit for large network traffic datasets",
    "version": "0.1.10",
    "project_urls": {
        "Bug Tracker": "https://github.com/CESNET/cesnet-datazoo/issues",
        "Documentation": "https://cesnet.github.io/cesnet-datazoo/",
        "Homepage": "https://github.com/CESNET/cesnet-datazoo"
    },
    "split_keywords": [
        "traffic classification",
        " datasets",
        " machine learning"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "81b6f4bf35bf6e094939a03c0613e174a3ec5146e2e00567e1eb10650367aefb",
                "md5": "d601429890b55e5cea23780c0fc8f221",
                "sha256": "45fc8f10e56ee9d957a3c0fffe72e48cdfbe6e2b5aecd03fe4282337f12a4bf1"
            },
            "downloads": -1,
            "filename": "cesnet_datazoo-0.1.10-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "d601429890b55e5cea23780c0fc8f221",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 51741,
            "upload_time": "2024-11-06T11:19:58",
            "upload_time_iso_8601": "2024-11-06T11:19:58.934744Z",
            "url": "https://files.pythonhosted.org/packages/81/b6/f4bf35bf6e094939a03c0613e174a3ec5146e2e00567e1eb10650367aefb/cesnet_datazoo-0.1.10-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "54c755a26543f66701f73e6cf90250f6c9735a66a6e29bb6b0a5059648c4b511",
                "md5": "62ffba222096e12f98141f1a944936f2",
                "sha256": "5f55c74412b9d0dec0f840c89728177cf1ef9d5ff4e2d8ade447ccc35f622af4"
            },
            "downloads": -1,
            "filename": "cesnet_datazoo-0.1.10.tar.gz",
            "has_sig": false,
            "md5_digest": "62ffba222096e12f98141f1a944936f2",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 48228,
            "upload_time": "2024-11-06T11:20:00",
            "upload_time_iso_8601": "2024-11-06T11:20:00.396869Z",
            "url": "https://files.pythonhosted.org/packages/54/c7/55a26543f66701f73e6cf90250f6c9735a66a6e29bb6b0a5059648c4b511/cesnet_datazoo-0.1.10.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-11-06 11:20:00",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "CESNET",
    "github_project": "cesnet-datazoo",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "cesnet-datazoo"
}
