dandi-s3-log-parser

- Name: dandi-s3-log-parser
- Version: 0.4.2
- Summary: Parse S3 logs to more easily calculate usage metrics per asset.
- Upload time: 2024-09-14 07:03:53
- Requires Python: >=3.12
- License: BSD 3-Clause License (Copyright (c) 2024, CatalystNeuro)
- Keywords: aws, download tracking, log, s3

            <p align="center">
  <h1 align="center">DANDI S3 Log Parser</h3>
  <p align="center">
    <a href="https://pypi.org/project/dandi_s3_log_parser/"><img alt="Ubuntu" src="https://img.shields.io/badge/Ubuntu-E95420?style=flat&logo=ubuntu&logoColor=white"></a>
    <a href="https://pypi.org/project/dandi_s3_log_parser/"><img alt="Supported Python versions" src="https://img.shields.io/pypi/pyversions/dandi_s3_log_parser.svg"></a>
    <a href="https://codecov.io/github/CatalystNeuro/dandi_s3_log_parser?branch=main"><img alt="codecov" src="https://codecov.io/github/CatalystNeuro/dandi_s3_log_parser/coverage.svg?branch=main"></a>
  </p>
  <p align="center">
    <a href="https://pypi.org/project/dandi_s3_log_parser/"><img alt="PyPI latest release version" src="https://badge.fury.io/py/dandi_s3_log_parser.svg?id=py&kill_cache=1"></a>
    <a href="https://github.com/catalystneuro/dandi_s3_log_parser/blob/main/license.txt"><img alt="License: BSD-3" src="https://img.shields.io/pypi/l/dandi_s3_log_parser.svg"></a>
  </p>
  <p align="center">
    <a href="https://github.com/psf/black"><img alt="Python code style: Black" src="https://img.shields.io/badge/python_code_style-black-000000.svg"></a>
    <a href="https://github.com/astral-sh/ruff"><img alt="Python code style: Ruff" src="https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json"></a>
  </p>
</p>

Extraction of minimal information from consolidated raw S3 logs for public sharing and plotting.

Developed for the [DANDI Archive](https://dandiarchive.org/).

Read more about [S3 logging on AWS](https://web.archive.org/web/20240807191829/https://docs.aws.amazon.com/AmazonS3/latest/userguide/LogFormat.html).
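
As a rough illustration of what parsing such a line involves, the following standalone sketch splits one raw S3 server access log line into its fields. It is not this package's internal parser; it only assumes the standard space-delimited format in which the timestamp is bracketed and the request URI and user agent are quoted.

```python
import re

# Match, in order of preference: a bracketed timestamp, a quoted string, or any
# other run of non-whitespace characters.
FIELD_PATTERN = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')


def split_log_line(line: str) -> list[str]:
    """Return the raw fields of a single S3 server access log line, in order."""
    return FIELD_PATTERN.findall(line.strip())
```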

A few summary facts as of 2024:

- A single line of a raw S3 log file can range from roughly 400 to more than 1,000 bytes.
- Some of the busiest daily logs on the archive contain around 5 million (5,014,386) lines.
- There are more than 6 TB of log files collected in total.
- This parser reduces that total to less than 25 GB of final essential information on NWB assets (Zarr size TBD).



## Installation

```bash
pip install dandi_s3_log_parser
```
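
To confirm which version was installed, a standard-library check such as the following works (it does not rely on any of this package's own API):

```python
from importlib.metadata import version

# Prints the installed distribution version, e.g. "0.4.2".
print(version("dandi_s3_log_parser"))
```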



## Workflow

The process consists of three modular steps.

### 1. **Reduction**

Filter out:

- Non-success status codes.
- Excluded IP addresses.
- Operation types other than the one specified (`REST.GET.OBJECT` by default).

Then, limit data extraction to a handful of specified fields from each full line of the raw logs: by default, `object_key`, `timestamp`, `ip_address`, and `bytes_sent`.
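
Conceptually, the per-line reduction looks like the following sketch. The function and field names are hypothetical (they are not this package's API), and any 2xx status code is treated as a success here purely for illustration.

```python
EXCLUDED_IPS: set[str] = set()  # fill with the IPs passed via --excluded_ips


def reduce_line(fields: dict[str, str]) -> dict[str, str] | None:
    """Return the reduced fields for one parsed log line, or None if filtered out."""
    if not fields["status_code"].startswith("2"):  # non-success status codes
        return None
    if fields["ip_address"] in EXCLUDED_IPS:  # excluded IP addresses
        return None
    if fields["operation"] != "REST.GET.OBJECT":  # other operation types
        return None
    # Keep only the default handful of fields.
    return {name: fields[name] for name in ("object_key", "timestamp", "ip_address", "bytes_sent")}
```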

In the summer of 2024, this reduced 6 TB of raw logs to less than 170 GB.

The process is designed to be easily parallelized and interruptible: you can kill the processes while they are running and restart later without losing most of the progress.

### 2. **Binning**

To make the mapping to Dandisets more efficient, the reduced logs are binned by their object keys (asset blob IDs) for fast lookup. Zarr assets are grouped by their parent blob ID; *e.g.*, a request for `zarr/abcdefg/group1/dataset1/0` is binned under `zarr/abcdefg`.
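
A minimal sketch of how a bin key can be derived from an object key is shown below (a hypothetical helper, not the package's actual function):

```python
def get_bin_key(object_key: str) -> str:
    """Return the key under which a reduced log entry is binned."""
    parts = object_key.split("/")
    if parts[0] == "zarr" and len(parts) > 2:
        # Zarr requests are grouped by the parent blob ID.
        return "/".join(parts[:2])
    # Other assets (e.g. "blobs/...") are binned by their full object key.
    return object_key


assert get_bin_key("zarr/abcdefg/group1/dataset1/0") == "zarr/abcdefg"
```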

This step reduces the total file sizes from step (1) even further by reducing repeated object keys, though it does create a large number of small files.

In the summer of 2024, this brought 170 GB of reduced logs down to less than 80 GB (20 GB of `blobs` spread across 253,676 files and 60 GB of `zarr` spread across 4,775 files).

### 3. **Mapping**

The final step, which should be run periodically to keep the per-Dandiset usage logs up to date, scans all currently known Dandisets and their versions, maps the asset blob IDs to their filenames, and generates the most recently parsed usage logs in a form that can be shared publicly.
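
At its core, this step is a join between the binned logs and an index of known assets, along the lines of the sketch below. The data layout here is hypothetical; the actual workflow is driven by the `map_binned_s3_logs_to_dandisets` command described under Usage.

```python
from collections import defaultdict


def map_to_dandisets(
    binned: dict[str, list[tuple[str, str, int]]],  # blob ID -> (timestamp, ip_address, bytes_sent) rows
    asset_index: dict[str, tuple[str, str]],        # blob ID -> (dandiset_id, asset_path)
) -> dict[str, list[tuple[str, str, int]]]:
    """Group reduced log rows by Dandiset, attaching each asset's filename."""
    per_dandiset: dict[str, list[tuple[str, str, int]]] = defaultdict(list)
    for blob_id, rows in binned.items():
        if blob_id not in asset_index:
            continue  # blob is not attached to any known Dandiset version
        dandiset_id, asset_path = asset_index[blob_id]
        for timestamp, _ip_address, bytes_sent in rows:
            per_dandiset[dandiset_id].append((asset_path, timestamp, bytes_sent))
    return dict(per_dandiset)
```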

In the summer of 2024, this brought 80 GB of binned logs down to around 20 GB of Dandiset logs.



## Usage

### Reduction

To reduce:

```bash
reduce_all_dandi_raw_s3_logs \
  --raw_s3_logs_folder_path < base raw S3 logs folder > \
  --reduced_s3_logs_folder_path < reduced S3 logs folder path > \
  --maximum_number_of_workers < number of workers to use > \
  --maximum_buffer_size_in_mb < approximate amount of RAM to use > \
  --excluded_ips < comma-separated list of known IPs to exclude >
```

For example, on Drogon:

```bash
reduce_all_dandi_raw_s3_logs \
  --raw_s3_logs_folder_path /mnt/backup/dandi/dandiarchive-logs \
  --reduced_s3_logs_folder_path /mnt/backup/dandi/dandiarchive-logs-reduced \
  --maximum_number_of_workers 3 \
  --maximum_buffer_size_in_mb 3000 \
  --excluded_ips < Drogon's IP >
```

In the summer of 2024, this process took less than 10 hours to process all 6 TB of raw log data (using 3 workers at 3 GB buffer size).

### Binning

To bin:

```bash
bin_all_reduced_s3_logs_by_object_key \
  --reduced_s3_logs_folder_path < reduced S3 logs folder path > \
  --binned_s3_logs_folder_path < binned S3 logs folder path >
```

For example, on Drogon:

```bash
bin_all_reduced_s3_logs_by_object_key \
  --reduced_s3_logs_folder_path /mnt/backup/dandi/dandiarchive-logs-reduced \
  --binned_s3_logs_folder_path /mnt/backup/dandi/dandiarchive-logs-binned
```

This process is not as friendly to random interruption as the reduction step is. If corruption is detected, the target binning folder will have to be cleaned before re-attempting.

The `--file_processing_limit < integer >` flag can be used to limit the number of files processed in a single run, which can be useful for breaking the process up into smaller pieces, such as:

```bash
bin_all_reduced_s3_logs_by_object_key \
  --reduced_s3_logs_folder_path /mnt/backup/dandi/dandiarchive-logs-reduced \
  --binned_s3_logs_folder_path /mnt/backup/dandi/dandiarchive-logs-binned \
  --file_processing_limit < integer >
```

In the summer of 2024, this process took less than 5 hours to bin all 170 GB of reduced logs into the 80 GB of data per object key.

### Mapping

To map:

```bash
map_binned_s3_logs_to_dandisets \
  --binned_s3_logs_folder_path < binned S3 logs folder path > \
  --mapped_s3_logs_folder_path < mapped Dandiset logs folder > \
  --excluded_dandisets < comma-separated list of six-digit IDs to exclude > \
  --restrict_to_dandisets < comma-separated list of six-digit IDs to restrict mapping to >
```

For example, on Drogon:

```bash
map_binned_s3_logs_to_dandisets \
  --binned_s3_logs_folder_path /mnt/backup/dandi/dandiarchive-logs-binned \
  --mapped_s3_logs_folder_path /mnt/backup/dandi/dandiarchive-logs-mapped \
  --excluded_dandisets 000108
```

In the summer of 2024, this process for the `blobs` assets took less than 8 hours to complete with one worker (with caches; 10 hours without caches).

Some Dandisets may take disproportionately longer than others to process. For this reason, the command also accepts `--excluded_dandisets` and `--restrict_to_dandisets`.

Using these flags to skip `000108` in the main run and process it separately (possibly on a different CRON cycle altogether) is strongly suggested:

```bash
map_binned_s3_logs_to_dandisets \
  --binned_s3_logs_folder_path /mnt/backup/dandi/dandiarchive-logs-binned \
  --mapped_s3_logs_folder_path /mnt/backup/dandi/dandiarchive-logs-mapped \
  --restrict_to_dandisets 000108
```

In the summer of 2024, this took less than 15 hours to complete.

The mapping process could theoretically be parallelized (and thus run much faster), but this would take some effort to design. If interested, please open an issue to request this feature.



## Submit line decoding errors

Please email line decoding errors collected in your local configuration folder (located in `~/.dandi_s3_log_parser/errors`) to the core maintainer before raising issues or submitting PRs that contribute them as examples; this makes it easier to correct any aspects that might require anonymization.

            
