<p align="center">
<img src="https://raw.githubusercontent.com/CESNET/cesnet-datazoo/main/docs/images/datazoo.svg" width="450">
</p>
[![](https://img.shields.io/badge/license-BSD-blue.svg)](https://github.com/CESNET/cesnet-datazoo/blob/main/LICENCE)
[![](https://img.shields.io/badge/docs-cesnet--datazoo-blue.svg)](https://cesnet.github.io/cesnet-datazoo/)
[![](https://img.shields.io/badge/python->=3.10-blue.svg)](https://pypi.org/project/cesnet-datazoo/)
[![](https://img.shields.io/pypi/v/cesnet-datazoo)](https://pypi.org/project/cesnet-datazoo/)
The goal of this project is to provide tools for working with large network traffic datasets and to facilitate research in the traffic classification area. The core functions of the `cesnet-datazoo` package are:
- A common API for downloading, configuring, and loading three public datasets of encrypted network traffic.
- Extensive configuration options for:
    - Selection of train, validation, and test periods.
    - Selection of application classes and splitting classes between *known* and *unknown*.
    - Data transformations, such as feature scaling.
- Built on suitable data structures for experiments with large datasets. There are several caching mechanisms to make repeated runs faster, for example, when searching for the best model configuration.
- Datasets are offered in multiple sizes so that users can start experiments at a smaller scale, with faster downloads and lower disk usage. The default is the `S` size containing 25 million samples.
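
To make two of the configuration ideas above concrete, here is a minimal sketch of splitting classes between *known* and *unknown* and of scaling a feature with training statistics only. It is plain Python and does not use or mirror the `cesnet-datazoo` API; the helper names, the `"unknown"` label, and the toy application names are all assumptions made for illustration.

```python
from collections import Counter

UNKNOWN_LABEL = "unknown"  # hypothetical placeholder label, not taken from cesnet-datazoo


def split_known_unknown(labels, known_apps):
    """Relabel every class outside `known_apps` as UNKNOWN_LABEL."""
    known = set(known_apps)
    return [label if label in known else UNKNOWN_LABEL for label in labels]


def standardize(train_values, values):
    """Scale `values` with the mean and standard deviation of the training data only."""
    mean = sum(train_values) / len(train_values)
    variance = sum((v - mean) ** 2 for v in train_values) / len(train_values)
    std = variance ** 0.5 or 1.0  # fall back to 1.0 for constant features
    return [(v - mean) / std for v in values]


# Toy labels and one toy numeric feature, invented for this sketch.
labels = ["youtube", "spotify", "dropbox", "youtube", "zoom"]
relabeled = split_known_unknown(labels, known_apps=["youtube", "spotify"])
print(Counter(relabeled))

train_sizes = [100.0, 200.0, 300.0]
print(standardize(train_sizes, [200.0, 400.0]))
```

Computing the scaling statistics on the training values alone keeps information from the validation and test periods out of the model inputs.
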
:brain: :brain: See a related project [CESNET Models](https://github.com/CESNET/cesnet-models) providing pre-trained neural networks for traffic classification. :brain: :brain:
:notebook: :notebook: Example Jupyter notebooks are included in a separate [CESNET Traffic Classification Examples](https://github.com/CESNET/cesnet-tcexamples) repo. :notebook: :notebook:
## Datasets
The `cesnet-datazoo` package currently provides three datasets with details in the following table (you might need to scroll the table horizontally to see all datasets).
1. CESNET-TLS22
2. CESNET-QUIC22
3. CESNET-TLS-Year22
| Name | CESNET-TLS22 | CESNET-QUIC22 | CESNET-TLS-Year22 |
| ---------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| _Protocol_ | TLS | QUIC | TLS |
| _Published in_ | 2022 | 2023 | 2023 |
| _Collection duration_ | 2 weeks | 4 weeks | 1 year |
| _Collection period_ | 4.10.2021 - 17.10.2021 | 31.10.2022 - 27.11.2022 | 1.1.2022 - 31.12.2022 |
| _Application count_ | 191 | 102 | 180 |
| _Available samples_ | 141392195 | 153226273 | 507739073 |
| _Available dataset sizes_ | XS, S, M, L | XS, S, M, L | XS, S, M, L |
| _Cite_ | [https://doi.org/10.1016/j.comnet.2022.109467](https://doi.org/10.1016/j.comnet.2022.109467) | [https://doi.org/10.1016/j.dib.2023.108888](https://doi.org/10.1016/j.dib.2023.108888) | [https://doi.org/10.1038/s41597-024-03927-4](https://doi.org/10.1038/s41597-024-03927-4) |
| _Zenodo URL_ | [https://zenodo.org/record/7965515](https://zenodo.org/record/7965515) | [https://zenodo.org/record/7963302](https://zenodo.org/record/7963302) | [https://zenodo.org/records/10608607](https://zenodo.org/records/10608607) |
| _Related papers_ | | [https://doi.org/10.23919/TMA58422.2023.10199052](https://doi.org/10.23919/TMA58422.2023.10199052) | |
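
As a quick sanity check of the table, the collection durations can be recomputed from the collection periods. This is a small standard-library sketch; the `inclusive_days` helper is a name introduced here, and the dates are copied from the table, read as day.month.year.

```python
from datetime import date

# Collection periods copied from the table above (day.month.year in the table).
periods = {
    "CESNET-TLS22": (date(2021, 10, 4), date(2021, 10, 17)),
    "CESNET-QUIC22": (date(2022, 10, 31), date(2022, 11, 27)),
    "CESNET-TLS-Year22": (date(2022, 1, 1), date(2022, 12, 31)),
}


def inclusive_days(start, end):
    """Number of calendar days in the period, counting both endpoints."""
    return (end - start).days + 1


durations = {name: inclusive_days(start, end) for name, (start, end) in periods.items()}
for name, days in durations.items():
    print(f"{name}: {days} days")
```

The results are 14, 28, and 365 days, which matches the stated durations of 2 weeks, 4 weeks, and 1 year.
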
## Installation
Install the package from PyPI with:
```bash
pip install cesnet-datazoo
```
or, for an editable install from the GitHub repository, with:
```bash
pip install -e "git+https://github.com/CESNET/cesnet-datazoo#egg=cesnet-datazoo"
```
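
After installing, a short check like the following can confirm that the interpreter meets the `>=3.10` requirement shown in the badge and that the distribution is visible to Python. It uses only the standard library; `installed_version` is a helper name introduced for this sketch.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution_name):
    """Return the installed version string, or None if the distribution is missing."""
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        return None


python_ok = sys.version_info >= (3, 10)
print("Python >= 3.10:", python_ok)
print("cesnet-datazoo version:", installed_version("cesnet-datazoo"))
```
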
## Examples
#### Initialize dataset to create train, validation, and test dataframes
```py
from cesnet_datazoo.datasets import CESNET_QUIC22
from cesnet_datazoo.config import DatasetConfig, AppSelection

# Use the XS size of CESNET-QUIC22, stored under the given local directory
dataset = CESNET_QUIC22("/datasets/CESNET-QUIC22/", size="XS")

# Treat all applications as known; train on week 44 and test on week 45 of 2022
dataset_config = DatasetConfig(
    dataset=dataset,
    apps_selection=AppSelection.ALL_KNOWN,
    train_period_name="W-2022-44",
    test_period_name="W-2022-45",
)
dataset.set_dataset_config_and_initialize(dataset_config)

# Read each split into a Pandas DataFrame
train_dataframe = dataset.get_train_df()
val_dataframe = dataset.get_val_df()
test_dataframe = dataset.get_test_df()
```
The [`DatasetConfig`](https://cesnet.github.io/cesnet-datazoo/reference_dataset_config/) class handles the configuration of datasets, and calling `set_dataset_config_and_initialize` initializes train, validation, and test sets with the desired configuration.
Data can be read into Pandas DataFrames, as shown here, or via PyTorch DataLoaders. See the [`CesnetDataset`](https://cesnet.github.io/cesnet-datazoo/reference_cesnet_dataset/) reference.
See more examples in the [documentation](https://cesnet.github.io/cesnet-datazoo/getting_started/).
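
The period names in the example above (`W-2022-44`, `W-2022-45`) look like week labels. Assuming they follow ISO 8601 week numbering with a two-digit week (an assumption not confirmed in this README, although it is consistent with the CESNET-QUIC22 collection starting on 31.10.2022), the following standard-library sketch converts between dates and such names. Both helper functions are hypothetical and are not part of `cesnet-datazoo`.

```python
from datetime import date


def week_period_name(day):
    """Format a date as W-<ISO year>-<ISO week>, assuming a two-digit week number."""
    iso = day.isocalendar()
    return f"W-{iso[0]}-{iso[1]:02d}"


def week_start(period_name):
    """Return the Monday of a W-<year>-<week> period name (ISO week numbering assumed)."""
    _, year, week = period_name.split("-")
    return date.fromisocalendar(int(year), int(week), 1)


print(week_period_name(date(2022, 10, 31)))  # first day of the CESNET-QUIC22 collection
print(week_start("W-2022-45"))
```

Under this assumption, the four-week CESNET-QUIC22 collection period spans `W-2022-44` through `W-2022-47`.
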
## Papers
* [DataZoo: Streamlining Traffic Classification Experiments](https://doi.org/10.1145/3630050.3630176) <br>
Jan Luxemburk and Karel Hynek <br>
CoNEXT Workshop on Explainable and Safety Bounded, Fidelitous, Machine Learning for Networking (SAFE), 2023
## Acknowledgments
This project was supported by the Ministry of the Interior of the Czech Republic, grant No. VJ02010024: Flow-Based Encrypted Traffic Analysis.