cloudcatalog

Name: cloudcatalog
Version: 1.0.2
Summary: API for accessing the generalized CloudCatalog (cloudcatalog) specification for sharing data in and across clouds
Author email: Johns Hopkins University Applied Physics Laboratory LLC <sandy.antunes@jhuapl.edu>
Homepage: https://heliocloud.org
Repository: https://github.com/heliocloud-data/cloudcatalog
Upload time: 2025-02-19 15:11:21
Requires Python: >=3.8
License: MIT License
Keywords: cloud, index, catalog, AWS
Requirements: boto3, botocore, pandas, python-dateutil, requests

# CloudCatalog (cloudcatalog) API

CloudCatalog is a generalized indexing specification for large cloud datasets.
The push to open science means many more published datasets, and finding and accessing them is an important problem to solve. CloudCatalog is an indexing method for sharing big datasets in cloud systems. It is scientist-friendly, and generating a set of indices is easy. It uses static index files in time-ordered CSV format that are easy to fetch, easy to access via an API, and very low cost in both the money and bandwidth needed to support them. Metadata is kept in a simple JSON schema. We also provide a Python client toolset for scientists to access datasets that use CloudCatalog.

The CloudCatalog specification and tools are open source, created by the HelioCloud project, and already used for 2 petabytes of publicly available NASA and scientist-contributed data. We hope the community continues to adopt this CloudCatalog standard (on GitHub, linked from heliocloud.org).

* For sharing datasets across cloud frameworks
* Decentralized: data owners control their own data and access
* RESTful & serverless (indices are flat files alongside their datasets)
* Removes the need for slow/expensive disk ‘ls’ operations on large holdings
* Global registry JSON points to owner-controlled ‘buckets’
* Uses minimal JSON to list metadata, CSV files for indices
* Searchable
* Public specification here on GitHub.

[The Specification](docs/cloudcatalog-spec.md) enables anyone to index a public dataset such that other users can find it and retrieve file listings in a cost-effective serverless fashion.

The API is designed for retrieving file catalog (index) files for a specific ID entry in a catalog within a bucket. It also includes search functionality that spans all of the data index catalogs found in the bucket list.

## Use Case
Suppose there is a mission on S3 that follows the HelioCloud 'CloudCatalog' specification, and you want to obtain specific files from this mission.

### Initial Setup and Global Catalog
First, install the tool if it has not already been installed, then import it into a script or shell. You will likely want to search the global catalog to find the specific bucket/catalog containing the data catalog files. Start by creating a CatalogRegistry object to pull from the default global catalog. This lists buckets, not datasets; each bucket owner retains direct ownership over which of their datasets they wish to expose to the public.
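
The package is published on PyPI under the name shown in the listing above, so a standard pip install should work:

```bash
pip install cloudcatalog
```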

```python
import cloudcatalog

cr = cloudcatalog.CatalogRegistry()
print(cr.get_catalog())

print(cr.get_entries())
```

### Finding and Requesting the File Catalog
At this point, you should have found the bucket containing the data of interest. Next, you will want to search the bucket-specific catalog (data catalog) for the ID representing the mission you want to obtain data for.

```python
bucket_name = cr.get_endpoint('e.g. Bucket Mnemonic') # or hard-code, e.g. 's3://mybucket'
# If not a public bucket, pass access_key or boto S3 client params to access it
fr = cloudcatalog.CloudCatalog(bucket_name)  

# Print out the entire local catalog (datasets)
print(fr.get_catalog())

# To find a specific ID, we can also list the ID + Title pairs:
print(fr.get_entries())

# Now, with the ID, we can request the catalog index files as a Pandas DataFrame
fr_id = 'a_dataset_id_from_the_catalog'
start_date = '2007-02-01T00:00:00Z'  # An ISO 8601 standard time
stop_date = None  # An ISO 8601 standard time, or None for everything after start_date
myfiles = fr.request_cloud_catalog(fr_id, start_date=start_date, end_date=stop_date, overwrite=False)
```
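
The return value is an ordinary Pandas DataFrame, so the usual Pandas operations apply. A minimal sketch, assuming the `filesize` column described in the streaming section below:

```python
# Quick look at the returned catalog (a Pandas DataFrame)
print(len(myfiles), "files in the requested window")
print(myfiles.head())

# Total data volume, using the 'filesize' column described later in this README
print(f"{myfiles['filesize'].sum() / 1e9:.2f} GB total")
```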

### Searching the Entire Catalog
You can use the EntireCatalogSearch class to find a catalog entry:

```python
search = cloudcatalog.EntireCatalogSearch()
top_search_result = search.search_by_keywords(['vector', 'mission', 'useful'])[0]
print(top_search_result)
```
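
A matching entry can then be fed back into the dataset-level workflow above. The sketch below is only illustrative: the exact fields in each search result depend on the catalog metadata, and the `'id'` key used here is an assumption, so print a result first to confirm the field names.

```python
# Hypothetical follow-up: take the top match's dataset ID ('id' is an assumed field name)
# and request its file catalog from the bucket-level CloudCatalog created earlier.
dataset_id = top_search_result.get('id')
if dataset_id is not None:
    files = fr.request_cloud_catalog(dataset_id, start_date=start_date, end_date=stop_date)
```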

### Specific example: an SDO fetch of the file list for all the 94 Å EUV images (1,624,900 files)
```python
import cloudcatalog
fr = cloudcatalog.CloudCatalog("s3://gov-nasa-hdrl-data1/")
dataid = "aia_0094"
start, stop = fr.get_entry(dataid)['start'], fr.get_entry(dataid)['stop']
mySDOlist = fr.request_cloud_catalog(dataid, start, stop)
```
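
As a quick sanity check, the file count in the section title can be compared against the returned DataFrame:

```python
print(len(mySDOlist))  # on the order of 1,624,900 files, per the section title above
```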

### Add-on example: an MMS fetch of the file list for a specific MMS item (64,383 files)
```python
dataid = "mms1_feeps_brst_electron"
start, stop = fr.get_entry(dataid)['start'], fr.get_entry(dataid)['stop']
myMMSlist = fr.request_cloud_catalog(dataid, start, stop)
```

### Streaming Data from the File Catalog
You now have a pandas DataFrame with startdate, stopdate, key, and filesize for all the files of the mission within your specified start and end dates. From here, you can use the key to stream some of the data through EC2, a Lambda, or other processing methods.
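
One way to act on those keys (a minimal sketch outside the cloudcatalog API itself, assuming the `key` column holds a full `s3://` URI and the bucket is readable from your environment):

```python
import boto3

s3 = boto3.client('s3')

# Read the first file listed in the catalog DataFrame obtained earlier
uri = myfiles.iloc[0]['key']
bucket, _, obj_key = uri.replace('s3://', '', 1).partition('/')
body = s3.get_object(Bucket=bucket, Key=obj_key)['Body'].read()
print(len(body), "bytes read from", uri)
```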

This tool also offers a simple function for streaming the data once the file catalog is obtained:

```python
# cloud_catalog is the DataFrame returned by request_cloud_catalog(); bfile is a readable file-like object
cloudcatalog.CloudCatalog.stream(cloud_catalog, lambda bfile, startdate, stopdate, filesize: print(len(bfile.read()), filesize))
```

## Full Notebook Tutorial

For an in-depth walkthrough of using CloudCatalog on NASA datasets, see [CloudCatalog-Demo.ipynb](https://github.com/heliocloud-data/science-tutorials/blob/main/CloudCatalog-Demo.ipynb).
