HerdingCats


NameHerdingCats JSON
Version 0.1.7 PyPI version JSON
download
home_pagehttps://github.com/CHRISCARLON/Herding-CATs
SummaryA tool for exploring open data catalogues
upload_time2025-01-21 15:10:22
maintainerNone
docs_urlNone
authorchristophercarlon
requires_python>=3.11
licenseMIT
keywords open data data catalogues datastores ckan open datasoft
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # HerdingCATs 🐈‍⬛

> [!NOTE]  
> Version 0.1.7 Documentation

[![codecov](https://codecov.io/gh/CHRISCARLON/Herding-CATs/graph/badge.svg?token=Y9Z0QA39S3)](https://codecov.io/gh/CHRISCARLON/Herding-CATs)

## Purpose

**The aim of this project is simple: create a basic Python library to explore and interact with open data catalogues**.

This will improve and speed up how users:

- Navigate open data catalogues
- Find the data that they need
- Get that data into a format and/or location for further analysis

Simply...

```bash
pip install HerdingCats
```

or

```bash
poetry add HerdingCats
```

> [!NOTE]
> Herding-CATs is currently under active development. Features may change as the project evolves.
>
> Due to slight variations in how organisations set up and deploy their opendata catalogues, methods may not work 100% of the time for all catalogues.
>
> We will do our best to ensure that most methods work across all catalogues and that a good variety of data catalogues is present.

## Current Default Open Data Catalogues

Herding-CATs supports the following catalogues by default:

### Supported Catalogues

| Catalogue Name                          | Website                          | Catalogue Backend |
| --------------------------------------- | -------------------------------- | ----------------- |
| London Datastore                        | data.london.gov.uk               | CKAN              |
| Subak Data Catalogue                    | data.subak.org                   | CKAN              |
| UK Gov Open Data                        | data.gov.uk                      | CKAN              |
| Humanitarian Data Exchange              | data.humdata.org                 | CKAN              |
| UK Power Networks                       | ukpowernetworks.opendatasoft.com | Open Datasoft     |
| Infrabel                                | opendata.infrabel.be             | Open Datasoft     |
| Paris                                   | opendata.paris.fr                | Open Datasoft     |
| Toulouse                                | data.toulouse-metropole.fr       | Open Datasoft     |
| Elia Belgian Energy                     | opendata.elia.be                 | Open Datasoft     |
| EDF Energy                              | opendata.edf.fr                  | Open Datasoft     |
| Cadent Gas                              | cadentgas.opendatasoft.com       | Open Datasoft     |
| French Gov Open Data                    | data.gouv.fr                     | Bespoke backend   |
| Gestionnaire de Réseaux de Distribution | opendata.agenceore.fr            | Open Datasoft     |

## Overview

This Python library provides a way to explore and interact with CKAN, OpenDataSoft, and French Government data catalogues.

HerdingCATs follows a Session -> Explorer -> Loader pattern.

It includes six main classes:

1. `CkanCatExplorer`: For exploring CKAN-based data catalogues
2. `OpenDataSoftCatExplorer`: For exploring OpenDataSoft-based data catalogues
3. `FrenchGouvCatExplorer`: For exploring the French Government data catalogue
4. `CkanCatResourceLoader`: For loading and transforming CKAN catalogue data
5. `OpenDataSoftResourceLoader`: For loading and transforming OpenDataSoft catalogue data
6. `FrenchGouvResourceLoader`: For loading and transforming French Government catalogue data

All explorer classes work with a `CatSession` object that handles the connection to the chosen data catalogue.

## Usage

### CKAN Components

#### CkanCatExplorer

```python
import HerdingCats as hc

def main():
    with hc.CatSession(hc.CkanDataCatalogues.LONDON_DATA_STORE) as session:
        explore = hc.CkanCatExplorer(session)

if __name__ == "__main__":
    main()
```

##### Methods

1. `check_site_health()`: Checks the health of the CKAN site
2. `get_package_count()`: Returns the total number of packages in a catalogue
3. `get_package_list()`: Returns a dictionary of all available packages
4. `get_package_list_dataframe(df_type: Literal["pandas", "polars"])`: Returns a dataframe of all available packages
5. `get_package_list_extra()`: Returns a list with extra package information
6. `get_package_list_dataframe_extra(df_type: Literal["pandas", "polars"])`: Returns a dataframe with extra package information
7. `get_organisation_list()`: Returns total number of organizations and their details
8. `show_package_info(package_name: Union[str, dict, Any])`: Returns package metadata including resource information
9. `show_package_info_dataframe(package_name: Union[str, dict, Any], df_type: Literal["pandas", "polars"])`: Returns package metadata as a dataframe
10. `package_search(search_query: str, num_rows: int)`: Searches for packages and returns results
11. `package_search_condense(search_query: str, num_rows: int)`: Returns a condensed view of package information
12. `package_search_condense_dataframe(search_query: str, num_rows: int, df_type: Literal["pandas", "polars"])`: Returns a condensed view with packed resources as a dataframe
13. `package_search_condense_dataframe_unpack(search_query: str, num_rows: int, df_type: Literal["pandas", "polars"])`: Returns a condensed view with unpacked resources as a dataframe
14. `extract_resource_url(package_info: List[Dict])`: Extracts resource URLs and metadata from package info. This is used to get the resource URL and format for the CKAN data loader class.

### OpenDataSoft Components

#### OpenDataSoftCatExplorer

```python
import HerdingCats as hc

def main():
    with hc.CatSession(hc.OpenDataSoftDataCatalogues.UK_POWER_NETWORKS) as session:
        explore = hc.OpenDataSoftCatExplorer(session)

if __name__ == "__main__":
    main()
```

##### Methods

1. `check_site_health()`: Checks the health of the OpenDataSoft site
2. `fetch_all_datasets()`: Retrieves all datasets from an OpenDataSoft catalogue
3. `show_dataset_info(dataset_id)`: Returns detailed metadata about a specific dataset
4. `show_dataset_export_options(dataset_id)`: Returns available export formats and download URLs

### French Government Components

#### FrenchGouvCatExplorer

```python
import HerdingCats as hc

def main():
    with hc.CatSession(hc.FrenchGouvCatalogue.GOUV_FR) as session:
        explore = hc.FrenchGouvCatExplorer(session)

if __name__ == "__main__":
    main()
```

##### Methods

1. `check_health_check()`: Checks the health of the French Government data portal
2. `get_all_datasets()`: Returns a dictionary of all available datasets
3. `get_dataset_meta(identifier: str)`: Returns metadata for a specific dataset
4. `get_dataset_meta_dataframe(identifier: str, df_type: Literal["pandas", "polars"])`: Returns dataset metadata as a dataframe
5. `get_multiple_datasets_meta(identifiers: list)`: Fetches metadata for multiple datasets
6. `get_dataset_resource_meta(data: dict)`: Returns metadata for dataset resources
7. `get_dataset_resource_meta_dataframe(data: dict, df_type: Literal["pandas", "polars"])`: Returns resource metadata as a dataframe
8. `get_all_orgs()`: Returns all organizations in the catalogue

### Resource Loaders

All three resource loader classes (`CkanCatResourceLoader`, `OpenDataSoftResourceLoader`, and `FrenchGouvResourceLoader`) support the following methods:

#### DataFrame Loaders

- `polars_data_loader()`: Loads data into a Polars DataFrame
- `pandas_data_loader()`: Loads data into a Pandas DataFrame

#### Database Loaders

- `duckdb_data_loader()`: Loads data into a DuckDB database
- `motherduck_data_loader()`: Loads data into MotherDuck (CKAN only - this will change in the future)

#### Cloud Storage Loaders

- `aws_s3_data_loader()`: Loads data into AWS S3 as either raw data (depending on the format) or parquet file (if you choose to load as parquet)

## Examples

### CKAN Example

```python
import HerdingCats as hc

def main():
    with hc.CatSession(hc.CkanDataCatalogues.HUMANITARIAN_DATA_STORE) as session:
        explore = hc.CkanCatExplorer(session)
        loader = hc.CkanCatResourceLoader()

        # Get list of all packages
        packages = explore.get_package_list()

        # Get info for a specific package
        data = explore.show_package_info("package_name")

        # Extract resource URLs
        resources = explore.extract_resource_url(data)

        # Load into different formats
        df_polars = loader.polars_data_loader(resources)

        # Specify the desired format if you want to otherwise it will defaul to the first dataset in the list
        df_pandas = loader.pandas_data_loader(resources, desired_format="parquet")

if __name__ == "__main__":
    main()
```

### OpenDataSoft Example

```python
import HerdingCats as hc

def main():
    with hc.CatSession(hc.OpenDataSoftDataCatalogues.UK_POWER_NETWORKS) as session:
        explore = hc.OpenDataSoftCatExplorer(session)
        loader = hc.OpenDataSoftResourceLoader()

        # Get export options for a dataset
        data = explore.show_dataset_export_options("package_name")

        # Load into Polars DataFrame (some catalogues require an API key)
        df = loader.polars_data_loader(data, format_type="parquet", api_key="your_api_key")

if __name__ == "__main__":
    main()
```

### French Government Example

```python
import HerdingCats as hc

def main():
    with hc.CatSession(hc.FrenchGouvCatalogue.GOUV_FR) as session:
        explore = hc.FrenchGouvCatExplorer(session)
        loader = hc.FrenchGouvResourceLoader()

        # Get all datasets
        datasets = explore.get_all_datasets()

        # Get metadata for a specific dataset
        meta_data = explore.get_dataset_meta("dataset-id")

        # Get resource metadata for a specific dataset
        resource_meta = explore.get_dataset_resource_meta(meta_data)

        # Load resource metadata into Polars DataFrame and specify the format of the data you want to load
        df = loader.polars_data_loader(resource_meta, "csv")

if __name__ == "__main__":
    main()
```

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

For major changes, please open an issue first to discuss what you would like to change.

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/CHRISCARLON/Herding-CATs",
    "name": "HerdingCats",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.11",
    "maintainer_email": null,
    "keywords": "open data, data catalogues, datastores, ckan, open datasoft",
    "author": "christophercarlon",
    "author_email": "chris@enmeshed.dev",
    "download_url": "https://files.pythonhosted.org/packages/9a/dc/d7db7f5aae60d4e658e1480b76f14f7a97975105d8905f7fd6ca11cdd4c9/herdingcats-0.1.7.tar.gz",
    "platform": null,
    "description": "# HerdingCATs \ud83d\udc08\u200d\u2b1b\n\n> [!NOTE]  \n> Version 0.1.7 Documentation\n\n[![codecov](https://codecov.io/gh/CHRISCARLON/Herding-CATs/graph/badge.svg?token=Y9Z0QA39S3)](https://codecov.io/gh/CHRISCARLON/Herding-CATs)\n\n## Purpose\n\n**The aim of this project is simple: create a basic Python library to explore and interact with open data catalogues**.\n\nThis will improve and speed up how users:\n\n- Navigate open data catalogues\n- Find the data that they need\n- Get that data into a format and/or location for further analysis\n\nSimply...\n\n```bash\npip install HerdingCats\n```\n\nor\n\n```bash\npoetry add HerdingCats\n```\n\n> [!NOTE]\n> Herding-CATs is currently under active development. Features may change as the project evolves.\n>\n> Due to slight variations in how organisations set up and deploy their opendata catalogues, methods may not work 100% of the time for all catalogues.\n>\n> We will do our best to ensure that most methods work across all catalogues and that a good variety of data catalogues is present.\n\n## Current Default Open Data Catalogues\n\nHerding-CATs supports the following catalogues by default:\n\n### Supported Catalogues\n\n| Catalogue Name                          | Website                          | Catalogue Backend |\n| --------------------------------------- | -------------------------------- | ----------------- |\n| London Datastore                        | data.london.gov.uk               | CKAN              |\n| Subak Data Catalogue                    | data.subak.org                   | CKAN              |\n| UK Gov Open Data                        | data.gov.uk                      | CKAN              |\n| Humanitarian Data Exchange              | data.humdata.org                 | CKAN              |\n| UK Power Networks                       | ukpowernetworks.opendatasoft.com | Open Datasoft     |\n| Infrabel                                | opendata.infrabel.be             | Open Datasoft     |\n| Paris                                   | opendata.paris.fr                | Open Datasoft     |\n| Toulouse                                | data.toulouse-metropole.fr       | Open Datasoft     |\n| Elia Belgian Energy                     | opendata.elia.be                 | Open Datasoft     |\n| EDF Energy                              | opendata.edf.fr                  | Open Datasoft     |\n| Cadent Gas                              | cadentgas.opendatasoft.com       | Open Datasoft     |\n| French Gov Open Data                    | data.gouv.fr                     | Bespoke backend   |\n| Gestionnaire de R\u00e9seaux de Distribution | opendata.agenceore.fr            | Open Datasoft     |\n\n## Overview\n\nThis Python library provides a way to explore and interact with CKAN, OpenDataSoft, and French Government data catalogues.\n\nHerdingCATs follows a Session -> Explorer -> Loader pattern.\n\nIt includes six main classes:\n\n1. `CkanCatExplorer`: For exploring CKAN-based data catalogues\n2. `OpenDataSoftCatExplorer`: For exploring OpenDataSoft-based data catalogues\n3. `FrenchGouvCatExplorer`: For exploring the French Government data catalogue\n4. `CkanCatResourceLoader`: For loading and transforming CKAN catalogue data\n5. `OpenDataSoftResourceLoader`: For loading and transforming OpenDataSoft catalogue data\n6. `FrenchGouvResourceLoader`: For loading and transforming French Government catalogue data\n\nAll explorer classes work with a `CatSession` object that handles the connection to the chosen data catalogue.\n\n## Usage\n\n### CKAN Components\n\n#### CkanCatExplorer\n\n```python\nimport HerdingCats as hc\n\ndef main():\n    with hc.CatSession(hc.CkanDataCatalogues.LONDON_DATA_STORE) as session:\n        explore = hc.CkanCatExplorer(session)\n\nif __name__ == \"__main__\":\n    main()\n```\n\n##### Methods\n\n1. `check_site_health()`: Checks the health of the CKAN site\n2. `get_package_count()`: Returns the total number of packages in a catalogue\n3. `get_package_list()`: Returns a dictionary of all available packages\n4. `get_package_list_dataframe(df_type: Literal[\"pandas\", \"polars\"])`: Returns a dataframe of all available packages\n5. `get_package_list_extra()`: Returns a list with extra package information\n6. `get_package_list_dataframe_extra(df_type: Literal[\"pandas\", \"polars\"])`: Returns a dataframe with extra package information\n7. `get_organisation_list()`: Returns total number of organizations and their details\n8. `show_package_info(package_name: Union[str, dict, Any])`: Returns package metadata including resource information\n9. `show_package_info_dataframe(package_name: Union[str, dict, Any], df_type: Literal[\"pandas\", \"polars\"])`: Returns package metadata as a dataframe\n10. `package_search(search_query: str, num_rows: int)`: Searches for packages and returns results\n11. `package_search_condense(search_query: str, num_rows: int)`: Returns a condensed view of package information\n12. `package_search_condense_dataframe(search_query: str, num_rows: int, df_type: Literal[\"pandas\", \"polars\"])`: Returns a condensed view with packed resources as a dataframe\n13. `package_search_condense_dataframe_unpack(search_query: str, num_rows: int, df_type: Literal[\"pandas\", \"polars\"])`: Returns a condensed view with unpacked resources as a dataframe\n14. `extract_resource_url(package_info: List[Dict])`: Extracts resource URLs and metadata from package info. This is used to get the resource URL and format for the CKAN data loader class.\n\n### OpenDataSoft Components\n\n#### OpenDataSoftCatExplorer\n\n```python\nimport HerdingCats as hc\n\ndef main():\n    with hc.CatSession(hc.OpenDataSoftDataCatalogues.UK_POWER_NETWORKS) as session:\n        explore = hc.OpenDataSoftCatExplorer(session)\n\nif __name__ == \"__main__\":\n    main()\n```\n\n##### Methods\n\n1. `check_site_health()`: Checks the health of the OpenDataSoft site\n2. `fetch_all_datasets()`: Retrieves all datasets from an OpenDataSoft catalogue\n3. `show_dataset_info(dataset_id)`: Returns detailed metadata about a specific dataset\n4. `show_dataset_export_options(dataset_id)`: Returns available export formats and download URLs\n\n### French Government Components\n\n#### FrenchGouvCatExplorer\n\n```python\nimport HerdingCats as hc\n\ndef main():\n    with hc.CatSession(hc.FrenchGouvCatalogue.GOUV_FR) as session:\n        explore = hc.FrenchGouvCatExplorer(session)\n\nif __name__ == \"__main__\":\n    main()\n```\n\n##### Methods\n\n1. `check_health_check()`: Checks the health of the French Government data portal\n2. `get_all_datasets()`: Returns a dictionary of all available datasets\n3. `get_dataset_meta(identifier: str)`: Returns metadata for a specific dataset\n4. `get_dataset_meta_dataframe(identifier: str, df_type: Literal[\"pandas\", \"polars\"])`: Returns dataset metadata as a dataframe\n5. `get_multiple_datasets_meta(identifiers: list)`: Fetches metadata for multiple datasets\n6. `get_dataset_resource_meta(data: dict)`: Returns metadata for dataset resources\n7. `get_dataset_resource_meta_dataframe(data: dict, df_type: Literal[\"pandas\", \"polars\"])`: Returns resource metadata as a dataframe\n8. `get_all_orgs()`: Returns all organizations in the catalogue\n\n### Resource Loaders\n\nAll three resource loader classes (`CkanCatResourceLoader`, `OpenDataSoftResourceLoader`, and `FrenchGouvResourceLoader`) support the following methods:\n\n#### DataFrame Loaders\n\n- `polars_data_loader()`: Loads data into a Polars DataFrame\n- `pandas_data_loader()`: Loads data into a Pandas DataFrame\n\n#### Database Loaders\n\n- `duckdb_data_loader()`: Loads data into a DuckDB database\n- `motherduck_data_loader()`: Loads data into MotherDuck (CKAN only - this will change in the future)\n\n#### Cloud Storage Loaders\n\n- `aws_s3_data_loader()`: Loads data into AWS S3 as either raw data (depending on the format) or parquet file (if you choose to load as parquet)\n\n## Examples\n\n### CKAN Example\n\n```python\nimport HerdingCats as hc\n\ndef main():\n    with hc.CatSession(hc.CkanDataCatalogues.HUMANITARIAN_DATA_STORE) as session:\n        explore = hc.CkanCatExplorer(session)\n        loader = hc.CkanCatResourceLoader()\n\n        # Get list of all packages\n        packages = explore.get_package_list()\n\n        # Get info for a specific package\n        data = explore.show_package_info(\"package_name\")\n\n        # Extract resource URLs\n        resources = explore.extract_resource_url(data)\n\n        # Load into different formats\n        df_polars = loader.polars_data_loader(resources)\n\n        # Specify the desired format if you want to otherwise it will defaul to the first dataset in the list\n        df_pandas = loader.pandas_data_loader(resources, desired_format=\"parquet\")\n\nif __name__ == \"__main__\":\n    main()\n```\n\n### OpenDataSoft Example\n\n```python\nimport HerdingCats as hc\n\ndef main():\n    with hc.CatSession(hc.OpenDataSoftDataCatalogues.UK_POWER_NETWORKS) as session:\n        explore = hc.OpenDataSoftCatExplorer(session)\n        loader = hc.OpenDataSoftResourceLoader()\n\n        # Get export options for a dataset\n        data = explore.show_dataset_export_options(\"package_name\")\n\n        # Load into Polars DataFrame (some catalogues require an API key)\n        df = loader.polars_data_loader(data, format_type=\"parquet\", api_key=\"your_api_key\")\n\nif __name__ == \"__main__\":\n    main()\n```\n\n### French Government Example\n\n```python\nimport HerdingCats as hc\n\ndef main():\n    with hc.CatSession(hc.FrenchGouvCatalogue.GOUV_FR) as session:\n        explore = hc.FrenchGouvCatExplorer(session)\n        loader = hc.FrenchGouvResourceLoader()\n\n        # Get all datasets\n        datasets = explore.get_all_datasets()\n\n        # Get metadata for a specific dataset\n        meta_data = explore.get_dataset_meta(\"dataset-id\")\n\n        # Get resource metadata for a specific dataset\n        resource_meta = explore.get_dataset_resource_meta(meta_data)\n\n        # Load resource metadata into Polars DataFrame and specify the format of the data you want to load\n        df = loader.polars_data_loader(resource_meta, \"csv\")\n\nif __name__ == \"__main__\":\n    main()\n```\n\n## Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request.\n\nFor major changes, please open an issue first to discuss what you would like to change.\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A tool for exploring open data catalogues",
    "version": "0.1.7",
    "project_urls": {
        "Homepage": "https://github.com/CHRISCARLON/Herding-CATs",
        "Repository": "https://github.com/CHRISCARLON/Herding-CATs"
    },
    "split_keywords": [
        "open data",
        " data catalogues",
        " datastores",
        " ckan",
        " open datasoft"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "d5171556f295d98f57631ec5f8073dfa436bbd85f3cd49f0b4fe3ad517f0bdca",
                "md5": "aa18296d2de432780b42259e102958fa",
                "sha256": "abeeb08010403317d283f25d858dfd51694ae7f2a837c20ab23174aecaba2628"
            },
            "downloads": -1,
            "filename": "herdingcats-0.1.7-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "aa18296d2de432780b42259e102958fa",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.11",
            "size": 27407,
            "upload_time": "2025-01-21T15:10:19",
            "upload_time_iso_8601": "2025-01-21T15:10:19.934529Z",
            "url": "https://files.pythonhosted.org/packages/d5/17/1556f295d98f57631ec5f8073dfa436bbd85f3cd49f0b4fe3ad517f0bdca/herdingcats-0.1.7-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "9adcd7db7f5aae60d4e658e1480b76f14f7a97975105d8905f7fd6ca11cdd4c9",
                "md5": "0a979d6aae21a25f0160ae24571c1205",
                "sha256": "179feff14653d30ccaf5fe080169a0ad77d77b6dcada505bdd837fc3a368b505"
            },
            "downloads": -1,
            "filename": "herdingcats-0.1.7.tar.gz",
            "has_sig": false,
            "md5_digest": "0a979d6aae21a25f0160ae24571c1205",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.11",
            "size": 24740,
            "upload_time": "2025-01-21T15:10:22",
            "upload_time_iso_8601": "2025-01-21T15:10:22.264161Z",
            "url": "https://files.pythonhosted.org/packages/9a/dc/d7db7f5aae60d4e658e1480b76f14f7a97975105d8905f7fd6ca11cdd4c9/herdingcats-0.1.7.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-01-21 15:10:22",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "CHRISCARLON",
    "github_project": "Herding-CATs",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "herdingcats"
}
        
Elapsed time: 0.79017s