# DataCat - Data Storage System
[PyPI](https://pypi.org/project/datacat/) | [GitHub tags](https://github.com/papasaidfine/datacat/tags)

A data storage system with catalog storage and pluggable serializers.
## Features
- **CatalogStorage**: Manages a DuckDB catalog that maps catalog column values to hashed file paths
- **Serializer Interface**: Pluggable serialization system
- **SparseMatrixSerializer**: Handles SciPy sparse matrices and NumPy arrays
- **NumpyArraySerializer**: Handles pure NumPy arrays without a pickle dependency
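
The README does not describe how `CatalogStorage` computes its hashed file paths. As an illustration only, the sketch below shows one plausible scheme: the catalog values are serialized in a canonical order and hashed, and the digest becomes the file name. The function `hashed_path`, the use of SHA-256, the `.bin` suffix, and the two-character subdirectory are all assumptions made for this example, not DataCat's actual layout.

```python
import hashlib
import json
from pathlib import Path


def hashed_path(root: str, suffix: str = ".bin", **catalog_values: str) -> Path:
    """Map catalog values to a deterministic file path (hypothetical scheme)."""
    # Sort keys so the same values always yield the same digest,
    # regardless of the order in which keyword arguments are passed.
    canonical = json.dumps(catalog_values, sort_keys=True)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # Use the first two hex characters as a subdirectory to keep directories small.
    return Path(root) / digest[:2] / f"{digest}{suffix}"


path_a = hashed_path("data", dim1="v1", dim2="v2", date="2024-01-01")
path_b = hashed_path("data", date="2024-01-01", dim2="v2", dim1="v1")
print(path_a == path_b)  # True: keyword order does not affect the path
```

A scheme like this keeps file names fixed-length and filesystem-safe whatever the catalog values contain, which is presumably why a separate catalog is needed to look files up by their column values.
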
## Installation
```bash
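pip install -r requirements.txt  # from a checkout of the repository
pip install datacatalog-storage  # or from PyPI, where the package is published under this name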
```
## Usage
### With Sparse Matrices
```python
from datacat import CatalogStorage, SparseMatrixSerializer
import numpy as np
import scipy.sparse as sp

# Initialize with sparse matrix support
serializer = SparseMatrixSerializer()
storage = CatalogStorage(
    catalog_columns=['dim1', 'dim2', 'date'],
    serializer=serializer
)

# Save mixed data
data = {
    'returns': sp.csr_matrix([[1, 2, 0], [0, 0, 3]]),
    'stock_names': np.array(['AAPL', 'MSFT']),
    'weights': np.array([0.5, 0.5])
}
storage.save(data, dim1="v1", dim2="v2", date="2024-01-01")
```
### With Pure NumPy Arrays
```python
from datacat import CatalogStorage, NumpyArraySerializer
import numpy as np

# Initialize with numpy-only support (no pickle)
serializer = NumpyArraySerializer()
storage = CatalogStorage(
    catalog_columns=['experiment', 'model', 'date'],
    serializer=serializer
)

# Save pure numpy data
data = {
    'features': np.random.rand(100, 10).astype(np.float32),
    'labels': np.array(['class_A', 'class_B'] * 50),
    'timestamps': np.array(['2024-01-01', '2024-01-02'], dtype='datetime64[D]')
}
storage.save(data, experiment="classification", model="cnn", date="2024-01-01")
```
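
Both examples above pass a serializer object to `CatalogStorage`, which is what makes the serialization pluggable. The README does not document the interface a serializer must implement, so the following is only a sketch of what such a plug-in point could look like. The `Serializer` base class, its `save`/`load` method names and signatures, and the `JsonSerializer` subclass are hypothetical and may not match DataCat's real API.

```python
import json
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Dict


class Serializer(ABC):
    """Hypothetical serializer interface; DataCat's real base class may differ."""

    @abstractmethod
    def save(self, data: Dict[str, Any], path: Path) -> None:
        """Write the data dictionary to the given path."""

    @abstractmethod
    def load(self, path: Path) -> Dict[str, Any]:
        """Read a data dictionary back from the given path."""


class JsonSerializer(Serializer):
    """Illustrative implementation for JSON-compatible values only."""

    def save(self, data: Dict[str, Any], path: Path) -> None:
        path.write_text(json.dumps(data), encoding="utf-8")

    def load(self, path: Path) -> Dict[str, Any]:
        return json.loads(path.read_text(encoding="utf-8"))


def round_trip(serializer: Serializer, data: Dict[str, Any]) -> Dict[str, Any]:
    """Save and reload data through any serializer that follows the interface."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "payload"
        serializer.save(data, path)
        return serializer.load(path)


original = {"weights": [0.5, 0.5], "stock_names": ["AAPL", "MSFT"]}
restored = round_trip(JsonSerializer(), original)
print(restored == original)  # True
```

Because `round_trip` depends only on the abstract interface, a different storage format can be swapped in without changing the calling code, which is the general idea behind passing `serializer=` to the storage object.
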