nids-datasets

Name	nids-datasets
Version	0.1.5
home_page	https://github.com/rdpahalavan/nids-datasets
Summary	Download and utilize specially curated and extracted datasets from the original UNSW-NB15 and CIC-IDS2017 datasets
upload_time	2023-08-02 19:29:31
author	Pahalavan R D
requires_python	>=3.7.0
license	Apache License 2.0
keywords	dataset nids unsw-nb15 cic-ids2017

# NIDS Datasets

The `nids-datasets` package downloads and loads curated datasets extracted from the original CIC-IDS2017 and UNSW-NB15 datasets. The originals were distributed as flow-level data only; these versions add packet-level information extracted from the raw PCAP files. Together they provide both packet-level and flow-level data for over 230 million packets: about 179 million from UNSW-NB15 and about 54 million from CIC-IDS2017.

## Installation

Install the `nids-datasets` package using pip:

```shell
pip install nids-datasets
```

Import the package in your Python script:

```python
from nids_datasets import Dataset, DatasetInfo
```

## Dataset Information

The `nids-datasets` package currently supports two datasets: [UNSW-NB15](https://research.unsw.edu.au/projects/unsw-nb15-dataset) and [CIC-IDS2017](https://www.unb.ca/cic/datasets/ids-2017.html). Each dataset contains a mix of normal traffic and several types of attack traffic, identified by their labels. The UNSW-NB15 dataset has 10 unique class labels, and the CIC-IDS2017 dataset has 15 unique class labels.

- UNSW-NB15 Labels: 'normal', 'exploits', 'dos', 'fuzzers', 'generic', 'reconnaissance', 'worms', 'shellcode', 'backdoor', 'analysis'
- CIC-IDS2017 Labels: 'BENIGN', 'FTP-Patator', 'SSH-Patator', 'DoS slowloris', 'DoS Slowhttptest', 'DoS Hulk', 'DoS GoldenEye', 'Heartbleed', 'Web Attack – Brute Force', 'Web Attack – XSS', 'Web Attack – SQL Injection', 'Infiltration', 'Bot', 'PortScan', 'DDoS'

## Subsets of the Dataset

Each dataset consists of four subsets:

1. Network-Flows - Contains flow-level data.
2. Packet-Fields - Contains packet header information.
3. Packet-Bytes - Contains packet byte information in the range (0-255).
4. Payload-Bytes - Contains payload byte information in the range (0-255).

Every subset except Network-Flows is split into 18 files; Network-Flows is a single file. All data is stored in Parquet format. Across both datasets, the package provides access to 110 files. You can download all subsets, or select specific subsets or files, depending on your analysis requirements.
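The total of 110 files follows from the per-subset counts, assuming both datasets share the same layout. A quick check of the arithmetic:

```python
# File counts per subset as described above (assumed identical for both datasets).
FILES_PER_SUBSET = {
    "Network-Flows": 1,
    "Packet-Fields": 18,
    "Packet-Bytes": 18,
    "Payload-Bytes": 18,
}
DATASETS = ["UNSW-NB15", "CIC-IDS2017"]

files_per_dataset = sum(FILES_PER_SUBSET.values())  # 1 + 3 * 18 = 55
total_files = files_per_dataset * len(DATASETS)     # 55 * 2 = 110
print(files_per_dataset, total_files)
```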

## Getting Information on the Datasets

The `DatasetInfo` function returns a summary of the dataset as a pandas DataFrame. It shows the number of packets for each class label across all 18 files in the dataset. Use this overview to decide which files to download and analyze.

```python
df = DatasetInfo(dataset='UNSW-NB15') # or dataset='CIC-IDS2017'
df
```

## Downloading the Datasets

The `Dataset` class lets you specify the dataset, subsets, and files you are interested in. Calling `download()` then fetches the specified data.

```python
dataset = 'UNSW-NB15' # or 'CIC-IDS2017'
subset = ['Network-Flows', 'Packet-Fields', 'Payload-Bytes'] # or 'all' for all subsets
files = [3, 5, 10] # or 'all' for all files

data = Dataset(dataset=dataset, subset=subset, files=files)
data.download()
```

The directory structure after downloading files:

```
UNSW-NB15
│
├───Network-Flows
│   └───UNSW_Flow.parquet
│
├───Packet-Fields
│   ├───Packet_Fields_File_3.parquet
│   ├───Packet_Fields_File_5.parquet
│   └───Packet_Fields_File_10.parquet
│
└───Payload-Bytes
    ├───Payload_Bytes_File_3.parquet
    ├───Payload_Bytes_File_5.parquet
    └───Payload_Bytes_File_10.parquet
```

You can then load the parquet files using pandas:

```python
import pandas as pd
df = pd.read_parquet('UNSW-NB15/Packet-Fields/Packet_Fields_File_10.parquet')
```
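The example tree above suggests a regular naming pattern for the per-file subsets. The sketch below builds the expected paths and reports how many exist locally. `expected_paths` is a hypothetical helper, not part of the package, and the pattern is inferred from the tree rather than from the package's documentation.

```python
from pathlib import Path

def expected_paths(dataset, subsets, files, root="."):
    """Return the parquet paths suggested by the example tree for per-file subsets.

    Hypothetical helper: the naming pattern is inferred from the tree shown
    above. Network-Flows is skipped because it is stored as a single file
    whose name does not follow the per-file pattern.
    """
    paths = []
    for subset in subsets:
        if subset == "Network-Flows":
            continue
        stem = subset.replace("-", "_")  # 'Packet-Fields' -> 'Packet_Fields'
        for number in files:
            paths.append(Path(root) / dataset / subset / f"{stem}_File_{number}.parquet")
    return paths

paths = expected_paths("UNSW-NB15", ["Packet-Fields", "Payload-Bytes"], [3, 5, 10])
missing = [p for p in paths if not p.exists()]
print(f"{len(paths) - len(missing)} of {len(paths)} expected files found")
```

A check like this can be useful before loading files in a loop, since a missing file otherwise only surfaces when `pd.read_parquet` fails.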

## Merging Subsets

The `merge()` method combines the data for each packet across the selected subsets, so that flow-level and packet-level information is available in a single file for each file number.

```python
data.merge()
```

By default, `merge()` uses the subsets and files specified when the `Dataset` object was created. To merge a different selection, pass `subset` (a list of subsets) and `files` (a list of file numbers).

The directory structure after merging files:

```
UNSW-NB15
│
├───Network-Flows
│   └───UNSW_Flow.parquet
│
├───Packet-Fields
│   ├───Packet_Fields_File_3.parquet
│   ├───Packet_Fields_File_5.parquet
│   └───Packet_Fields_File_10.parquet
│
├───Payload-Bytes
│   ├───Payload_Bytes_File_3.parquet
│   ├───Payload_Bytes_File_5.parquet
│   └───Payload_Bytes_File_10.parquet
│
└───Network-Flows+Packet-Fields+Payload-Bytes
    ├───Network_Flows+Packet_Fields+Payload_Bytes_File_3.parquet
    ├───Network_Flows+Packet_Fields+Payload_Bytes_File_5.parquet
    └───Network_Flows+Packet_Fields+Payload_Bytes_File_10.parquet
```
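Conceptually, a merge of this kind attaches flow-level and payload data to each packet record. The toy sketch below illustrates the idea with plain dictionaries. The keys `packet_id`, `flow_id`, `ip_ttl`, and `payload_byte_1` are hypothetical names chosen for illustration; they are not the package's real schema, and this is not how `merge()` is implemented.

```python
# Toy records; every key name here is a hypothetical stand-in.
flows = {7: {"flow_id": 7, "label": "exploits"}}
packet_fields = [
    {"packet_id": 0, "flow_id": 7, "ip_ttl": 64},
    {"packet_id": 1, "flow_id": 7, "ip_ttl": 63},
]
payload_bytes = {
    0: {"payload_byte_1": 71},
    1: {"payload_byte_1": 80},
}

merged = []
for packet in packet_fields:
    row = dict(packet)                              # start from the packet header fields
    row.update(payload_bytes[packet["packet_id"]])  # add the payload bytes of this packet
    row.update(flows[packet["flow_id"]])            # add the flow the packet belongs to
    merged.append(row)

print(merged[0])
```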

## Extracting Bytes

The Packet-Bytes and Payload-Bytes subsets contain only the first 1500-1600 bytes. To retrieve more bytes (up to 65535) for either subset, use the `bytes()` method. This method requires the files of the Packet-Fields subset to operate. Use the `max_bytes` parameter to set how many bytes to extract.

```python
data.bytes(payload=True, max_bytes=2500)
```

Pass `packet=True` to extract packet bytes. You can also pass `files` (a list of file numbers) to choose which files to extract bytes from.

The directory structure after extracting bytes:

```
UNSW-NB15
│
├───Network-Flows
│   └───UNSW_Flow.parquet
│
├───Packet-Fields
│   ├───Packet_Fields_File_3.parquet
│   ├───Packet_Fields_File_5.parquet
│   └───Packet_Fields_File_10.parquet
│
├───Payload-Bytes
│   ├───Payload_Bytes_File_3.parquet
│   ├───Payload_Bytes_File_5.parquet
│   └───Payload_Bytes_File_10.parquet
│
├───Network-Flows+Packet-Fields+Payload-Bytes
│   ├───Network_Flows+Packet_Fields+Payload_Bytes_File_3.parquet
│   ├───Network_Flows+Packet_Fields+Payload_Bytes_File_5.parquet
│   └───Network_Flows+Packet_Fields+Payload_Bytes_File_10.parquet
│
└───Payload-Bytes-2500
    ├───Payload_Bytes_File_3.parquet
    ├───Payload_Bytes_File_5.parquet
    └───Payload_Bytes_File_10.parquet
```

## Reading the Datasets

The `read()` method reads files using Hugging Face's [`load_dataset`](https://huggingface.co/docs/datasets/loading) method, one subset at a time. You can call it directly without calling `download()` first, because downloading happens automatically. The `dataset` and `files` parameters are optional if they match the values used to instantiate the `Dataset` class.

```python
dataset = data.read(dataset='UNSW-NB15', subset='Packet-Fields', files=[1,2])
```

The `read()` method returns a dataset that you can convert to a pandas dataframe or save to a CSV, parquet, or any other desired file format:

```python
df = dataset.to_pandas()
dataset.to_csv('file_path_to_save.csv')
dataset.to_parquet('file_path_to_save.parquet')
```

To get specific packets using their index, you can use the `packets` parameter.

```python
dataset = data.read(dataset='UNSW-NB15', subset='Packet-Fields', files=[1,2], packets='100:600')
# Returns the packets at indices 100 through 599 (500 packets in total).

dataset = data.read(dataset='UNSW-NB15', subset='Packet-Fields', files=[1,2], packets=':10%')
# This will return the first 10 percent of packets.

dataset = data.read(dataset='UNSW-NB15', subset='Packet-Fields', files=[1,2], packets='20%:30%')
# Returns the packets between the 20% and 30% marks.
```
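To make the slice syntax concrete, the sketch below converts a `packets` string into start and stop indices for a given total. `resolve_packets` is a hypothetical helper written for illustration. It floors percentages, which may differ from the rounding that the package or Hugging Face applies.

```python
def resolve_packets(spec, total):
    """Turn a spec such as '100:600' or '20%:30%' into (start, stop) indices.

    Hypothetical helper: percentages are floored here, which is an assumption
    and may not match the package's actual rounding.
    """
    def bound(text, default):
        text = text.strip()
        if not text:
            return default  # an omitted bound means "from the start" or "to the end"
        if text.endswith("%"):
            return total * int(text[:-1]) // 100
        return int(text)

    start_text, _, stop_text = spec.partition(":")
    start = bound(start_text, 0)
    stop = bound(stop_text, total)
    # Clamp both bounds to the valid range [0, total].
    return max(0, min(start, total)), max(0, min(stop, total))

print(resolve_packets("100:600", 10_000))  # (100, 600)
print(resolve_packets(":10%", 10_000))     # (0, 1000)
print(resolve_packets("20%:30%", 10_000))  # (2000, 3000)
```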

To use multiprocessing, pass the number of processes as the `num_proc` parameter.

```python
dataset = data.read(dataset='UNSW-NB15', subset='Packet-Fields', files=[1,2], num_proc=2)
```

To process one packet at a time, pass `stream=True`:

```python
dataset = data.read(dataset='UNSW-NB15', subset='Packet-Fields', files=[1,2], stream=True)
print(next(iter(dataset)))
```
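A streamed dataset can be consumed record by record without holding everything in memory. The sketch below tallies labels from an iterator and stops early after a fixed number of records. `fake_stream` is a stand-in for the object returned by `read(..., stream=True)`, and the `label` key is a hypothetical column name.

```python
from collections import Counter
from itertools import islice

def fake_stream():
    """Stand-in for a streamed dataset; the 'label' key is a hypothetical column."""
    for label in ["normal", "normal", "exploits", "dos", "normal"]:
        yield {"label": label}

def count_labels(stream, limit=None):
    """Tally labels one record at a time, optionally stopping after `limit` records."""
    counts = Counter()
    for record in islice(stream, limit):
        counts[record["label"]] += 1
    return counts

print(count_labels(fake_stream(), limit=4))
```

Because `islice` stops pulling from the iterator once the limit is reached, records beyond the limit are never produced by the stream.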

## Notes

These datasets are large, and depending on the subsets selected and the number of bytes extracted, operations can be resource-intensive. Make sure you have sufficient disk space and RAM before using this package.
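One simple precaution is to check free disk space before starting a large download or extraction. The sketch below uses only the standard library. The 50 GB threshold is an arbitrary placeholder, not a measured requirement of the package.

```python
import shutil

def has_free_space(path=".", required_gb=50.0):
    """Return True if the filesystem holding `path` has at least `required_gb` free.

    The default threshold is an arbitrary placeholder; choose a value that
    matches the subsets and byte counts you plan to work with.
    """
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= required_gb

if not has_free_space(".", required_gb=50.0):
    print("Warning: less free disk space than the chosen threshold.")
```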
