# EndoFactory 🏭

EndoFactory is a tool for constructing EndoVQA datasets by mixing source datasets according to a YAML configuration.
## Quick Start
### 1. Installation
If you only need the CLI, install it with pip:
```bash
pip install endofactory -i https://pypi.org/simple
```
If you want to contribute, clone the repository and install it with Poetry:
```bash
git clone <repository-url>
cd EndoFactory
poetry install
```
### 2. Generate Test Data
```bash
poetry run python tests/test_data_generator.py
```
### 3. Create Configuration
```bash
poetry run python -m endofactory.cli create-config --output config.yaml
```
### 4. Edit Configuration
```yaml
datasets:
  - name: endoscopy_vqa_v1
    image_path: /path/to/images
    parquet_path: /path/to/metadata.parquet
    weight: 0.6
  - name: medical_vqa_v2
    image_path: /path/to/images2
    parquet_path: /path/to/metadata2.parquet
    weight: 0.4

columns:
  - uuid
  - question
  - answer
  - options
  - task
  - category

task_proportions:
  task_proportions:
    classification: 0.5
    detection: 0.3
    segmentation: 0.2

export:
  output_path: /path/to/output
  format: parquet
```
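Before building, it can help to sanity-check the edited configuration. The sketch below is not part of EndoFactory: `check_config` is a hypothetical helper, and it assumes that dataset weights and task proportions are each meant to sum to 1.0 (the examples in this README do, but the tool itself may normalize or validate them differently). It works on a plain dict that mirrors the YAML above.

```python
import math

# Plain-dict mirror of the YAML above. In practice the file would be loaded
# with a YAML parser; the dict keeps this sketch dependency-free.
config = {
    "datasets": [
        {"name": "endoscopy_vqa_v1", "weight": 0.6},
        {"name": "medical_vqa_v2", "weight": 0.4},
    ],
    "task_proportions": {
        "task_proportions": {
            "classification": 0.5,
            "detection": 0.3,
            "segmentation": 0.2,
        }
    },
}


def check_config(cfg, tol=1e-9):
    """Return a list of human-readable problems; an empty list means OK.

    Assumption: weights and task proportions should each sum to 1.0.
    """
    problems = []
    weight_sum = sum(d["weight"] for d in cfg["datasets"])
    if not math.isclose(weight_sum, 1.0, abs_tol=tol):
        problems.append(f"dataset weights sum to {weight_sum}, expected 1.0")
    tasks = cfg.get("task_proportions", {}).get("task_proportions", {})
    if tasks:
        task_sum = sum(tasks.values())
        if not math.isclose(task_sum, 1.0, abs_tol=tol):
            problems.append(f"task proportions sum to {task_sum}, expected 1.0")
    return problems


print(check_config(config) or "config looks consistent")
```

Returning a list of problems rather than raising on the first one lets you report every inconsistency in a single pass.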
### 5. Build Dataset
```bash
poetry run python -m endofactory.cli build config.yaml --verbose
```
### 6. View Results
```bash
poetry run python -m endofactory.cli view output/endovqa_dataset.parquet
```
## CLI Commands
### `create-config`
Generate an example configuration file.
```bash
endofactory create-config [--output CONFIG_PATH]
```
### `build`
Build a mixed dataset from a configuration file.
```bash
endofactory build CONFIG_PATH [--verbose]
```
### `stats`
Show statistics for the datasets described in a configuration file.
```bash
endofactory stats CONFIG_PATH
```
### `view`
Inspect the structure and contents of a Parquet file.
```bash
endofactory view PARQUET_FILE [--rows N] [--columns]
```
## Configuration Options
### Dataset Weights
Control the proportion of each dataset in the final mix:
```yaml
datasets:
  - name: dataset_a
    weight: 0.7  # 70% from dataset_a
  - name: dataset_b
    weight: 0.3  # 30% from dataset_b
```
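To make the effect of weights concrete, here is a minimal sketch of turning weights into per-dataset sample counts. This is an illustration, not EndoFactory's actual sampling code: `allocate_counts` and the `total` parameter are hypothetical, and the largest-remainder rounding is one reasonable choice among several.

```python
def allocate_counts(weights, total):
    """Split `total` samples across datasets in proportion to `weights`.

    Uses largest-remainder rounding so the counts always add up to `total`.
    """
    weight_sum = sum(weights.values())
    exact = {name: total * w / weight_sum for name, w in weights.items()}
    counts = {name: int(x) for name, x in exact.items()}
    leftover = total - sum(counts.values())
    # Hand the remaining samples to the datasets with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


print(allocate_counts({"dataset_a": 0.7, "dataset_b": 0.3}, total=1000))
```

Dividing by the weight sum means the function also behaves sensibly if the weights do not add up to exactly 1.0.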
### Task Proportions
Control the distribution of task types and, optionally, of subtasks within a task:
```yaml
task_proportions:
  task_proportions:
    classification: 0.4
    detection: 0.4
    segmentation: 0.2
  subtask_proportions:
    classification:
      organ_classification: 0.6
      disease_classification: 0.4
```
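One way to read the nested proportions is that each subtask value is a share of its parent task. The README does not state this explicitly, so treat it as an assumption. Under that reading, the sketch below flattens the example into overall shares of the final dataset; `effective_shares` is a hypothetical helper, not an EndoFactory function.

```python
task_proportions = {"classification": 0.4, "detection": 0.4, "segmentation": 0.2}
subtask_proportions = {
    "classification": {"organ_classification": 0.6, "disease_classification": 0.4}
}


def effective_shares(tasks, subtasks):
    """Flatten task/subtask proportions into overall shares of the dataset.

    Assumption: a subtask's proportion is relative to its parent task.
    Tasks without listed subtasks keep their share under the task name.
    """
    shares = {}
    for task, p in tasks.items():
        children = subtasks.get(task)
        if not children:
            shares[task] = p
            continue
        for sub, q in children.items():
            shares[f"{task}/{sub}"] = p * q
    return shares


for name, share in effective_shares(task_proportions, subtask_proportions).items():
    print(f"{name}: {share:.2f}")
```

With the values above, organ classification would make up 0.4 × 0.6 = 0.24 of the final dataset and disease classification 0.4 × 0.4 = 0.16.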
### Global Columns
Specify the columns to extract; columns missing from a dataset are filled with null:
```yaml
columns:
  - uuid
  - question
  - answer
  - task
```
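The null-filling behaviour can be illustrated with plain Python dicts. EndoFactory itself processes data with Polars, so this is only a sketch of the idea: `project` and the toy records are hypothetical.

```python
columns = ["uuid", "question", "answer", "task"]

# Two toy records with different schemas: the first has an extra column,
# the second has no "task" column.
rows = [
    {"uuid": "a1", "question": "Which organ is shown?", "answer": "stomach",
     "task": "classification", "extra": 1},
    {"uuid": "b2", "question": "Is a polyp visible?", "answer": "yes"},
]


def project(row, cols):
    """Keep only `cols`, filling any column the row lacks with None."""
    return {c: row.get(c) for c in cols}


aligned = [project(r, columns) for r in rows]
for r in aligned:
    print(r)
```

After projection every record has the same keys in the same order, which is what allows datasets with different schemas to be concatenated.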
## Features
- **🚀 Fast Dataset Mixing**: YAML-based configuration for dataset blending
- **📊 Task Proportion Control**: Precise control over task/subtask distribution
- **💾 Multiple Export Formats**: Support for Parquet and JSONL output
- **🔍 Data Visualization**: One-command parquet file inspection
- **🛡️ Schema Flexibility**: Automatic handling of different column structures
- **⚡ High Performance**: Polars-powered data processing
- **🎯 Reproducible**: Configurable random seeds
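
As a rough illustration of what a configurable seed buys you, the sketch below draws the same rows on every run when given the same seed. It uses only the standard library; `sample_rows` and its `seed` parameter are hypothetical, and the README does not show which configuration key EndoFactory uses for seeding.

```python
import random


def sample_rows(rows, k, seed):
    """Draw k rows without replacement using a private, seeded RNG."""
    rng = random.Random(seed)  # local RNG: does not touch global random state
    return rng.sample(rows, k)


rows = list(range(100))
first = sample_rows(rows, 5, seed=42)
second = sample_rows(rows, 5, seed=42)
print(first == second)  # same seed, same selection
```

Using a local `random.Random` instance rather than the module-level functions keeps the result independent of any other code that consumes random numbers.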
## Project Structure
```bash
EndoFactory/
├── src/endofactory/
│   ├── __init__.py
│   ├── cli.py
│   ├── config.py
│   ├── core.py
│   └── yaml_loader.py
├── tests/
├── assets/
├── example_config.yaml
└── pyproject.toml
```
## Requirements
- Python >= 3.9
- Poetry (recommended) or pip
## License
MIT License
Raw data
{
"_id": null,
"home_page": "https://github.com/your-org/EndoFactory",
"name": "endofactory",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "endoscopy, medical, VQA, dataset, polars, yaml, cli",
"author": "TiramisuQiao",
"author_email": "tlmsq@outlook.com",
"download_url": "https://files.pythonhosted.org/packages/4e/a9/14c5f4873abf1e6f29016ad0a821cdecacd1fbbf80dac35731f23b4e90e1/endofactory-0.1.4.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Revolutionary EndoVQA dataset construction tool for rapid dataset mixing and configuration",
"version": "0.1.4",
"project_urls": {
"Homepage": "https://github.com/your-org/EndoFactory",
"Issues": "https://github.com/your-org/EndoFactory/issues",
"Repository": "https://github.com/your-org/EndoFactory"
},
"split_keywords": [
"endoscopy",
" medical",
" vqa",
" dataset",
" polars",
" yaml",
" cli"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "d1e4b767396c19db5570987e618b2a3c6a35b3c05f498df17a8304da0514ac54",
"md5": "34058343de0d6fda4f0b2e1fb97e35cd",
"sha256": "426b16079408c0ff8f0283dd217e7c2cc006975fb7d1243b1580766a2698235e"
},
"downloads": -1,
"filename": "endofactory-0.1.4-py3-none-any.whl",
"has_sig": false,
"md5_digest": "34058343de0d6fda4f0b2e1fb97e35cd",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 18797,
"upload_time": "2025-08-21T08:08:14",
"upload_time_iso_8601": "2025-08-21T08:08:14.642712Z",
"url": "https://files.pythonhosted.org/packages/d1/e4/b767396c19db5570987e618b2a3c6a35b3c05f498df17a8304da0514ac54/endofactory-0.1.4-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "4ea914c5f4873abf1e6f29016ad0a821cdecacd1fbbf80dac35731f23b4e90e1",
"md5": "434a0148aaee96a89413f3506be96b66",
"sha256": "d7e14e25d5a808a8fb9ef3e8706f5c9f073a4774be3b4674c3c2877b6c05b0be"
},
"downloads": -1,
"filename": "endofactory-0.1.4.tar.gz",
"has_sig": false,
"md5_digest": "434a0148aaee96a89413f3506be96b66",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 17634,
"upload_time": "2025-08-21T08:08:17",
"upload_time_iso_8601": "2025-08-21T08:08:17.087296Z",
"url": "https://files.pythonhosted.org/packages/4e/a9/14c5f4873abf1e6f29016ad0a821cdecacd1fbbf80dac35731f23b4e90e1/endofactory-0.1.4.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-21 08:08:17",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "your-org",
"github_project": "EndoFactory",
"github_not_found": true,
"lcname": "endofactory"
}