- Name: endofactory
- Version: 0.1.4
- Home page: https://github.com/your-org/EndoFactory
- Summary: Revolutionary EndoVQA dataset construction tool for rapid dataset mixing and configuration
- Upload time: 2025-08-21 08:08:17
- Author: TiramisuQiao
- Requires Python: >=3.9
- License: MIT
- Keywords: endoscopy, medical, vqa, dataset, polars, yaml, cli

# EndoFactory 🏭

![EndoFactory Logo](assets/logo.png)

EndoFactory is a tool for constructing EndoVQA datasets from a YAML configuration: it mixes existing datasets by weight and controls the proportion of each task type.

[![codecov](https://codecov.io/github/TiramisuQiao/EndoFactory/graph/badge.svg?token=N4SZ3BLO4P)](https://codecov.io/github/TiramisuQiao/EndoFactory)

## Quick Start

### 1. Installation

If you only need the CLI, install it from PyPI:

```bash
pip install endofactory -i https://pypi.org/simple
```

Or, if you want to contribute, clone the repository and install it with Poetry:

```bash
git clone <repository-url>
cd EndoFactory
poetry install
```

### 2. Generate Test Data

```bash
poetry run python tests/test_data_generator.py
```

### 3. Create Configuration

```bash
poetry run python -m endofactory.cli create-config --output config.yaml
```

### 4. Edit Configuration

```yaml
datasets:
  - name: endoscopy_vqa_v1
    image_path: /path/to/images
    parquet_path: /path/to/metadata.parquet
    weight: 0.6
  - name: medical_vqa_v2
    image_path: /path/to/images2
    parquet_path: /path/to/metadata2.parquet
    weight: 0.4

columns:
  - uuid
  - question
  - answer
  - options
  - task
  - category

task_proportions:
  task_proportions:
    classification: 0.5
    detection: 0.3
    segmentation: 0.2

export:
  output_path: /path/to/output
  format: parquet
```

### 5. Build Dataset

```bash
poetry run python -m endofactory.cli build config.yaml --verbose
```

### 6. View Results

```bash
poetry run python -m endofactory.cli view output/endovqa_dataset.parquet
```

## CLI Commands

### `create-config`

Generate an example configuration file.

```bash
endofactory create-config [--output CONFIG_PATH]
```

### `build`

Build a mixed dataset from a configuration file.

```bash
endofactory build CONFIG_PATH [--verbose]
```

### `stats`

Show dataset statistics.

```bash
endofactory stats CONFIG_PATH
```

### `view`

Inspect the structure and contents of a Parquet file.

```bash
endofactory view PARQUET_FILE [--rows N] [--columns]
```

## Configuration Options

### Dataset Weights

Control the proportion of each dataset in the final mix:

```yaml
datasets:
  - name: dataset_a
    weight: 0.7  # 70% from dataset_a
  - name: dataset_b  
    weight: 0.3  # 30% from dataset_b
```
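
One way to read these weights is as shares of a target row count. The following is a minimal sketch of that arithmetic, not EndoFactory's actual implementation: the function name `rows_per_dataset` and the `total_rows` parameter are hypothetical, and it assumes weights are normalized before use and that rounding leftovers go to the largest remainders.

```python
def rows_per_dataset(weights, total_rows):
    """Split total_rows across datasets in proportion to their weights.

    Hypothetical helper for illustration; not part of the endofactory API.
    """
    if not weights:
        raise ValueError("no datasets given")
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("at least one weight must be positive")

    # Exact (fractional) share of each dataset after normalizing the weights.
    exact = {name: total_rows * w / weight_sum for name, w in weights.items()}
    # Round down first, then give the leftover rows to the largest remainders
    # so that the counts always add up to total_rows.
    counts = {name: int(share) for name, share in exact.items()}
    leftover = total_rows - sum(counts.values())
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


print(rows_per_dataset({"dataset_a": 0.7, "dataset_b": 0.3}, 1000))
```

Normalizing inside the function means weights such as `7` and `3` behave the same as `0.7` and `0.3`; whether the tool itself accepts unnormalized weights is not stated here.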

### Task Proportions

Control the distribution of task types, and optionally of subtasks within a task:

```yaml
task_proportions:
  task_proportions:
    classification: 0.4
    detection: 0.4
    segmentation: 0.2
  subtask_proportions:
    classification:
      organ_classification: 0.6
      disease_classification: 0.4
```
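
To make the nesting concrete, the sketch below multiplies each task share by its subtask shares to get the overall share of every (task, subtask) pair. It is an illustration under assumptions, not the tool's code: the helper names are hypothetical, and the requirement that each group of proportions sums to 1 is assumed rather than documented.

```python
import math


def check_proportions(proportions, label):
    """Raise ValueError unless the values are non-negative and sum to 1."""
    if any(p < 0 for p in proportions.values()):
        raise ValueError(f"{label}: proportions must be non-negative")
    total = sum(proportions.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"{label}: proportions sum to {total}, expected 1.0")


def effective_shares(task_proportions, subtask_proportions):
    """Return the overall share of each (task, subtask) pair.

    Tasks without a subtask entry are kept whole under the subtask None.
    Hypothetical helper for illustration; not part of the endofactory API.
    """
    check_proportions(task_proportions, "task_proportions")
    shares = {}
    for task, task_share in task_proportions.items():
        subtasks = subtask_proportions.get(task)
        if not subtasks:
            shares[(task, None)] = task_share
            continue
        check_proportions(subtasks, f"subtask_proportions.{task}")
        for subtask, sub_share in subtasks.items():
            shares[(task, subtask)] = task_share * sub_share
    return shares


shares = effective_shares(
    {"classification": 0.4, "detection": 0.4, "segmentation": 0.2},
    {"classification": {"organ_classification": 0.6, "disease_classification": 0.4}},
)
for key, share in shares.items():
    print(key, round(share, 4))
```

With the values from the YAML above, organ classification ends up at roughly 24% of the final mix (0.4 × 0.6) and disease classification at roughly 16% (0.4 × 0.4).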

### Global Columns

Specify the columns to extract; columns missing from a dataset are filled with null:

```yaml
columns:
  - uuid
  - question
  - answer
  - task
```
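
The null-filling behavior can be pictured with plain Python dictionaries. This is only a sketch of the idea: EndoFactory processes data with Polars, and the function name `select_columns` and the sample rows below are made up for illustration.

```python
def select_columns(rows, columns):
    """Keep only `columns` in each row, using None where a column is absent.

    Hypothetical helper for illustration; not part of the endofactory API.
    """
    return [{col: row.get(col) for col in columns} for row in rows]


# Made-up rows: the second one has no "task" column and carries an extra one.
rows = [
    {"uuid": "a1", "question": "Is a polyp visible?", "answer": "yes", "task": "detection"},
    {"uuid": "b2", "question": "Which organ is shown?", "answer": "stomach", "source": "demo"},
]
for row in select_columns(rows, ["uuid", "question", "answer", "task"]):
    print(row)
```

Every output row has the same four keys in the same order, which is what lets datasets with different schemas be stacked into one table.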

## Features

- **🚀 Fast Dataset Mixing**: YAML-based configuration for dataset blending
- **📊 Task Proportion Control**: Precise control over task/subtask distribution
- **💾 Multiple Export Formats**: Support for Parquet and JSONL output
- **🔍 Data Visualization**: One-command parquet file inspection
- **🛡️ Schema Flexibility**: Automatic handling of different column structures
- **⚡ High Performance**: Polars-powered data processing
- **🎯 Reproducible**: Configurable random seeds
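
As an illustration of why a configurable seed makes a build reproducible, the sketch below draws a sample with a private seeded generator from Python's standard library. The function `sample_rows` is hypothetical and says nothing about how EndoFactory itself samples.

```python
import random


def sample_rows(rows, k, seed):
    """Draw k rows without replacement using a private, seeded generator.

    Hypothetical helper for illustration; not part of the endofactory API.
    """
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    return rng.sample(rows, k)


rows = list(range(100))
first = sample_rows(rows, 5, seed=42)
second = sample_rows(rows, 5, seed=42)
print(first == second)  # the same seed yields the same selection
```

Using a dedicated `random.Random` instance rather than the module-level functions keeps the result independent of any other code that touches the global generator.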

## Project Structure

```text
EndoFactory/
├── src/endofactory/
│   ├── __init__.py
│   ├── cli.py
│   ├── config.py
│   ├── core.py
│   └── yaml_loader.py
├── tests/
├── assets/
├── example_config.yaml
└── pyproject.toml
```

## Requirements

- Python >= 3.9
- Poetry (recommended) or pip

## License

MIT License

            
