Name: docset
Version: 0.5.5
Home page: https://github.com/XoriieInpottn/docset
Summary: A dataset format/utilities used to store document objects based on BSON.
Upload time: 2023-02-03 07:19:16
Author: xi
License: BSD-3-Clause License
Keywords: dataset, file format
Requirements: No requirements were recorded.

# DocSet (Document Dataset)

This project provides writer and reader utilities that store and retrieve documents (Python dicts) in a single ".ds" file, as illustrated below:

```mermaid
graph LR
classDef nobg fill:none,stroke:none;
classDef box fill:none,stroke:#000;
ds[(person.ds)]:::box
writer[DocSetWriter]:::box
reader[DocSetReader]:::box
doc1("{'name': 'Alice', 'age': 17, ...}<br>{'name': 'Bob', 'age': 18, ...}<br>...<br>{'name': 'Tony', 'age': 20, ...}"):::nobg
doc1_("{'name': 'Alice', 'age': 17, ...}<br>{'name': 'Bob', 'age': 18, ...}<br>...<br>{'name': 'Tony', 'age': 20, ...}"):::nobg
doc1 --> writer --> ds
ds --> reader --> doc1_
```
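
In code, the round trip in the diagram looks roughly like this (a minimal sketch using the `DocSet` read/write API demonstrated in the tutorial below; the file name and fields are just placeholders):

```python
from docset import DocSet

# write a few documents (plain Python dicts) into one ".ds" file
with DocSet('person.ds', 'w') as ds:
    ds.write({'name': 'Alice', 'age': 17})
    ds.write({'name': 'Bob', 'age': 18})

# read them back
with DocSet('person.ds', 'r') as ds:
    print(len(ds), 'documents')
    print(ds[0])  # a dict like {'name': 'Alice', 'age': 17}
```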

## Installation

* Install from the PyPI repository.

   ```bash
   pip3 install docset
   ```


## Tutorial

Here we give an example based on a very simple machine learning task. Suppose we want to train a model for image classification; one important step is to construct a training dataset. To do this, we usually use a text file (e.g., CSV or JSON) to store the "image path" and "label" information, and store the actual image files in a separate folder. This "text + folder" layout works fine in most situations, but its performance degrades when the total number of samples is very large, because the file system is not good at reading and writing huge numbers of tiny files scattered across the disk. Storing the whole dataset in a single file is one way to solve this problem.

The following examples first show how to build a DocSet file from the "text + folder" storage, and then how to read and iterate over the resulting dataset.

### Make a Dataset

```python
import csv
import os

import cv2 as cv
import numpy as np

from docset import DocSet

csv_path = 'train.csv'
image_dir_path = 'images'
ds_path = 'train.ds'

with open(csv_path, newline='') as csv_file, DocSet(ds_path, 'w') as ds:
    reader = csv.DictReader(csv_file)
    for row in reader:
        image_path = os.path.join(image_dir_path, row['filename'])
        image = cv.imread(image_path, cv.IMREAD_COLOR)  # load the image as ndarray
        doc = {
            'feature': image.astype(np.float32) / 255.0,  # scale the image into the [0, 1] range
            'label': row['label']
        }  # a data sample is represented by a dict; ndarrays can be stored directly
        ds.write(doc)

```
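
Because each document is stored as BSON (see the package summary), a document's values are presumably anything BSON can encode, in addition to the ndarrays shown above. A hypothetical sketch (the field names are made up, and nested-dict support is assumed from the BSON format rather than confirmed by this README):

```python
import numpy as np

from docset import DocSet

with DocSet('train.ds', 'w') as ds:
    ds.write({
        'feature': np.zeros((224, 224, 3), dtype=np.float32),  # ndarray, as in the example above
        'label': 'cat',
        'meta': {'source': 'camera_0', 'width': 224, 'height': 224},  # nested dict (assumed BSON-encodable)
    })
```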

### Read the Dataset

```python
from docset import DocSet

ds_path = 'train.ds'

with DocSet(ds_path, 'r') as ds:
    print(len(ds), 'samples')
    doc = ds[0]  # a sample can be accessed by subscript; a dict is returned
    print('feature shape', doc['feature'].shape)
    print('feature dtype', doc['feature'].dtype)
    print('label', doc['label'])

```
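
As mentioned above, the dataset can also be iterated; since a `DocSet` supports `len()` and subscripting, a plain index loop is enough to walk every sample, for example to collect label statistics:

```python
from docset import DocSet

ds_path = 'train.ds'

label_counts = {}
with DocSet(ds_path, 'r') as ds:
    # walk all samples using len() and subscripting only
    for i in range(len(ds)):
        doc = ds[i]
        label_counts[doc['label']] = label_counts.get(doc['label'], 0) + 1

print(label_counts)
```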

### View the Dataset in Console

```bash
docset /path/to/the/file.ds
```



### Integrate with PyTorch

```python
from torch.utils.data import Dataset, DataLoader

from docset import DocSet


class MyDataset(Dataset):

    def __init__(self, path):
        self._impl = DocSet(path, 'r')

    def __len__(self):
        return len(self._impl)

    def __getitem__(self, i: int):
        doc = self._impl[i]
        # perform any data processing
        return doc


ds_path = 'train.ds'

train_loader = DataLoader(
    MyDataset(ds_path),
    batch_size=256,
    shuffle=True,
    num_workers=5,
    pin_memory=True
)

for epoch in range(50):
    for doc in train_loader:
        feature = doc['feature']
        label = doc['label']
        # invoke the train function of the model
        # model.train(feature, label)

```
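
One detail worth noting: `label` was written from the CSV as a string and `feature` is an HWC ndarray, so you may want to convert them in the "perform any data processing" step of `__getitem__` before the default `collate_fn` batches them. A hypothetical sketch (the label vocabulary below is made up):

```python
import numpy as np

LABEL_TO_INDEX = {'cat': 0, 'dog': 1}  # hypothetical label vocabulary


def process(doc):
    # HWC float32 image -> CHW, string label -> int index, so the default
    # collate_fn can stack both fields into batched tensors
    return {
        'feature': np.transpose(doc['feature'], (2, 0, 1)),
        'label': LABEL_TO_INDEX[doc['label']],
    }
```

Calling `process(doc)` inside `MyDataset.__getitem__` keeps the `DataLoader` setup above unchanged.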

## Technical Details

To be added...


            
