datacatalog-storage


Name: datacatalog-storage
Version: 1.0.0
Summary: Catalog-based data storage system with pluggable serializers
Upload time: 2025-08-17 21:08:40
Home page: None
Maintainer: None
Docs URL: None
Author: None
Requires Python: >=3.8
License: None
Keywords: data, storage, serialization, catalog, numpy, scipy
Requirements: numpy, scipy, duckdb, pytest
Travis-CI: No Travis.
Coveralls test coverage: No coveralls.

# DataCat - Data Storage System

[![PyPI version](https://badge.fury.io/py/datacat.svg)](https://pypi.org/project/datacat/)
[![GitHub tag](https://img.shields.io/github/v/tag/papasaidfine/datacat?sort=semver)](https://github.com/papasaidfine/datacat/tags)

A data storage system with catalog storage and pluggable serializers.

## Features

- **CatalogStorage**: Manages DuckDB catalog with hashed file paths
- **Serializer Interface**: Pluggable serialization system
- **SparseMatrixSerializer**: Handles scipy sparse matrices and numpy arrays
- **NumpyArraySerializer**: Pure numpy arrays without pickle dependency
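
The two ideas in this list, hashed file paths and a pluggable serializer, can be illustrated with a minimal, standard-library-only sketch. Nothing below is taken from the DataCat source: the `Serializer` protocol, `JsonSerializer`, and `hashed_path` are hypothetical names, and the SHA-256 hashing and two-character directory sharding are assumptions made for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Protocol


class Serializer(Protocol):
    """Hypothetical interface: any object with save/load can be plugged in."""

    def save(self, data: Dict[str, Any], path: Path) -> None: ...

    def load(self, path: Path) -> Dict[str, Any]: ...


class JsonSerializer:
    """Toy serializer, present only to make the sketch runnable."""

    def save(self, data: Dict[str, Any], path: Path) -> None:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(data))

    def load(self, path: Path) -> Dict[str, Any]:
        return json.loads(path.read_text())


def hashed_path(root: Path, keys: Dict[str, str], suffix: str = ".json") -> Path:
    """Derive a deterministic file path from the catalog key values."""
    # Sort the keys so the same values always produce the same digest.
    canonical = json.dumps(keys, sort_keys=True)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # Shard by the first two hex characters to keep directories small.
    return root / digest[:2] / f"{digest}{suffix}"


# Round trip: the catalog keys decide where the payload lives on disk.
with tempfile.TemporaryDirectory() as tmp:
    serializer: Serializer = JsonSerializer()
    target = hashed_path(Path(tmp), {"dim1": "v1", "dim2": "v2", "date": "2024-01-01"})
    serializer.save({"weights": [0.5, 0.5]}, target)
    restored = serializer.load(target)
```

Because the path depends only on the key values, the same keys always map to the same file, so a catalog table only needs to store the keys alongside the path.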

## Installation

```bash
pip install datacatalog-storage  # published package; or, from a repository clone:
pip install -r requirements.txt
```

## Usage

### With Sparse Matrices

```python
from datacat import CatalogStorage, SparseMatrixSerializer
import numpy as np
import scipy.sparse as sp

# Initialize with sparse matrix support
serializer = SparseMatrixSerializer()
storage = CatalogStorage(
    catalog_columns=['dim1', 'dim2', 'date'],
    serializer=serializer
)

# Save mixed data
data = {
    'returns': sp.csr_matrix([[1, 2, 0], [0, 0, 3]]),
    'stock_names': np.array(['AAPL', 'MSFT']),
    'weights': np.array([0.5, 0.5])
}
storage.save(data, dim1="v1", dim2="v2", date="2024-01-01")
```
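
How `SparseMatrixSerializer` stores a sparse matrix is not documented here. One common approach, shown below as a sketch rather than as DataCat's implementation, is to split a CSR matrix into its plain component arrays so that it can be written next to ordinary numpy arrays. The helper names `csr_to_arrays` and `arrays_to_csr` and the `name__part` key convention are hypothetical.

```python
import io

import numpy as np
import scipy.sparse as sp


def csr_to_arrays(name, matrix):
    """Split a sparse matrix into plain numpy arrays keyed by a common prefix."""
    csr = sp.csr_matrix(matrix)
    return {
        f"{name}__data": csr.data,
        f"{name}__indices": csr.indices,
        f"{name}__indptr": csr.indptr,
        f"{name}__shape": np.array(csr.shape),
    }


def arrays_to_csr(name, arrays):
    """Rebuild the CSR matrix from the arrays produced by csr_to_arrays."""
    shape = tuple(int(n) for n in arrays[f"{name}__shape"])
    parts = (
        arrays[f"{name}__data"],
        arrays[f"{name}__indices"],
        arrays[f"{name}__indptr"],
    )
    return sp.csr_matrix(parts, shape=shape)


original = sp.csr_matrix([[1, 2, 0], [0, 0, 3]])

# Write the component arrays to an in-memory .npz archive.
buffer = io.BytesIO()
np.savez(buffer, **csr_to_arrays("returns", original))
buffer.seek(0)

# Read them back with pickle disabled and reassemble the matrix.
with np.load(buffer, allow_pickle=False) as archive:
    loaded = {key: archive[key] for key in archive.files}
restored = arrays_to_csr("returns", loaded)
```

SciPy also ships `scipy.sparse.save_npz` and `scipy.sparse.load_npz`, which handle a single sparse matrix per file; the manual split above is useful when sparse and dense arrays share one archive.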

### With Pure NumPy Arrays

```python
from datacat import CatalogStorage, NumpyArraySerializer
import numpy as np

# Initialize with numpy-only support (no pickle)
serializer = NumpyArraySerializer()
storage = CatalogStorage(
    catalog_columns=['experiment', 'model', 'date'],
    serializer=serializer
)

# Save pure numpy data
data = {
    'features': np.random.rand(100, 10).astype(np.float32),
    'labels': np.array(['class_A', 'class_B'] * 50),
    'timestamps': np.array(['2024-01-01', '2024-01-02'], dtype='datetime64[D]')
}
storage.save(data, experiment="classification", model="cnn", date="2024-01-01")
```
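
To make the "no pickle" point concrete, the sketch below round-trips the same kinds of arrays through NumPy's own `.npz` format with `allow_pickle=False`. It assumes nothing about how `NumpyArraySerializer` is implemented and uses only standard NumPy calls. Float, fixed-width string, and `datetime64` arrays all survive without pickle, while an object-dtype array is refused.

```python
import io

import numpy as np

data = {
    "features": np.arange(6, dtype=np.float32).reshape(2, 3),
    "labels": np.array(["class_A", "class_B"]),
    "timestamps": np.array(["2024-01-01", "2024-01-02"], dtype="datetime64[D]"),
}

# Write every array into one in-memory .npz archive.
buffer = io.BytesIO()
np.savez(buffer, **data)
buffer.seek(0)

# Loading with allow_pickle=False guarantees no pickled objects are executed.
with np.load(buffer, allow_pickle=False) as archive:
    restored = {key: archive[key] for key in archive.files}

# Object arrays cannot be stored without pickle, so NumPy rejects them.
try:
    np.save(io.BytesIO(), np.array([{"a": 1}], dtype=object), allow_pickle=False)
    pickle_refused = False
except ValueError:
    pickle_refused = True
```

Keeping pickle out of the storage path matters mainly when files come from an untrusted source, since unpickling can run arbitrary code.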

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "datacatalog-storage",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "DataCat Team <datacat@example.com>",
    "keywords": "data, storage, serialization, catalog, numpy, scipy",
    "author": null,
    "author_email": "DataCat Team <datacat@example.com>",
    "download_url": "https://files.pythonhosted.org/packages/ef/f1/4081dd62e7e33aa853ef212e62e9f6c9ce788bc6f047fd5a6cd15f6dd95e/datacatalog_storage-1.0.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Catalog-based data storage system with pluggable serializers",
    "version": "1.0.0",
    "project_urls": {
        "Bug Reports": "https://github.com/papasaidfine/datacat/issues",
        "Documentation": "https://github.com/papasaidfine/datacat#readme",
        "Homepage": "https://github.com/papasaidfine/datacat",
        "Repository": "https://github.com/papasaidfine/datacat"
    },
    "split_keywords": [
        "data",
        " storage",
        " serialization",
        " catalog",
        " numpy",
        " scipy"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "e7682ae169e4d086b1af47d332bfeb07eb5c96ae33b940bed75dbc4e5aee4f80",
                "md5": "4c619699b32bd5c7fa23482c071bc3a5",
                "sha256": "43add7cb627d3654b97650e14ed564b07948901136b1ad1eb0d172970ab620be"
            },
            "downloads": -1,
            "filename": "datacatalog_storage-1.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "4c619699b32bd5c7fa23482c071bc3a5",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 11361,
            "upload_time": "2025-08-17T21:08:37",
            "upload_time_iso_8601": "2025-08-17T21:08:37.732093Z",
            "url": "https://files.pythonhosted.org/packages/e7/68/2ae169e4d086b1af47d332bfeb07eb5c96ae33b940bed75dbc4e5aee4f80/datacatalog_storage-1.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "eff14081dd62e7e33aa853ef212e62e9f6c9ce788bc6f047fd5a6cd15f6dd95e",
                "md5": "8d51034fa755cde05a1f512c14f3395d",
                "sha256": "107d422ce36c2cdcbba09ddc71a6dc3c35b45829a59eb550bd559161c2da2503"
            },
            "downloads": -1,
            "filename": "datacatalog_storage-1.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "8d51034fa755cde05a1f512c14f3395d",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 11751,
            "upload_time": "2025-08-17T21:08:40",
            "upload_time_iso_8601": "2025-08-17T21:08:40.243009Z",
            "url": "https://files.pythonhosted.org/packages/ef/f1/4081dd62e7e33aa853ef212e62e9f6c9ce788bc6f047fd5a6cd15f6dd95e/datacatalog_storage-1.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-17 21:08:40",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "papasaidfine",
    "github_project": "datacat",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "numpy",
            "specs": [
                [
                    ">=",
                    "1.20.0"
                ]
            ]
        },
        {
            "name": "scipy",
            "specs": [
                [
                    ">=",
                    "1.7.0"
                ]
            ]
        },
        {
            "name": "duckdb",
            "specs": [
                [
                    ">=",
                    "0.8.0"
                ]
            ]
        },
        {
            "name": "pytest",
            "specs": [
                [
                    ">=",
                    "7.0.0"
                ]
            ]
        }
    ],
    "lcname": "datacatalog-storage"
}
        