# Dataset Toolkit
A powerful, easy-to-use Python toolkit for working with computer vision datasets. Supports loading, merging, converting, and exporting multiple data formats.
## ✨ Features
- 🔄 **Multi-format support**: load YOLO, COCO, and other common formats
- 🔗 **Dataset merging**: merge multiple datasets, with support for category remapping
- 📤 **Flexible export**: export to COCO JSON, TXT, and more
- 🛠️ **Utilities**: coordinate conversion and other practical helpers
- 📦 **Standardized data model**: a unified internal representation that is easy to extend
## 📦 Installation
### From PyPI (recommended)
```bash
pip install dataset-toolkit
```
### From source
```bash
git clone https://github.com/yourusername/dataset-toolkit.git
cd dataset-toolkit
pip install -e .
```
### Development install
```bash
pip install -e ".[dev]"
```
## 🚀 Quick Start
### Basic usage
```python
from dataset_toolkit import load_yolo_from_local, merge_datasets, export_to_coco

# 1. Load YOLO-format datasets
dataset1 = load_yolo_from_local(
    "/path/to/dataset1",
    categories={0: 'cat', 1: 'dog'}
)

dataset2 = load_yolo_from_local(
    "/path/to/dataset2",
    categories={0: 'car', 1: 'bicycle'}
)

# 2. Merge the datasets (with category remapping)
final_categories = {0: 'animal', 1: 'vehicle'}
category_mapping = {
    'cat': 'animal',
    'dog': 'animal',
    'car': 'vehicle',
    'bicycle': 'vehicle'
}

merged = merge_datasets(
    datasets=[dataset1, dataset2],
    category_mapping=category_mapping,
    final_categories=final_categories,
    new_dataset_name="merged_dataset"
)

# 3. Export to COCO format
export_to_coco(merged, "output/merged.json")
```
### Chained (pipeline) API
```python
from dataset_toolkit import DatasetPipeline

# Process datasets with the pipeline pattern
pipeline = DatasetPipeline()
result = (pipeline
    .load_yolo("/path/to/dataset1", {0: 'cat', 1: 'dog'})
    .load_yolo("/path/to/dataset2", {0: 'car'})
    .merge(
        category_mapping={'cat': 'animal', 'dog': 'animal', 'car': 'vehicle'},
        final_categories={0: 'animal', 1: 'vehicle'}
    )
    .export_coco("output/merged.json")
    .execute())
```
## 📚 API Reference
### Loaders
#### `load_yolo_from_local(dataset_path, categories)`
Loads a YOLO-format dataset from the local file system.
**Parameters:**
- `dataset_path` (str): path to the dataset root directory, which should contain `images/` and `labels/` subdirectories
- `categories` (Dict[int, str]): mapping from category ID to category name
**Returns:**
- `Dataset`: a standardized dataset object
**Example:**
```python
dataset = load_yolo_from_local(
    "/data/my_dataset",
    categories={0: 'person', 1: 'car'}
)
```
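Each file under `labels/` is expected to hold one line per object in the standard YOLO layout, `class_id x_center y_center width height`, with all coordinates normalized to `[0, 1]`. A minimal parsing sketch (the helper name is illustrative and not part of the toolkit API):

```python
def parse_yolo_line(line):
    """Parse one YOLO label line: 'class_id cx cy w h' (normalized floats)."""
    parts = line.split()
    class_id = int(parts[0])
    cx, cy, w, h = (float(v) for v in parts[1:5])
    return class_id, cx, cy, w, h

# A box of class 0 centered in the image, 25% wide and 30% tall.
parse_yolo_line("0 0.5 0.5 0.25 0.3")  # -> (0, 0.5, 0.5, 0.25, 0.3)
```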
### Processors
#### `merge_datasets(datasets, category_mapping, final_categories, new_dataset_name)`
Merges multiple datasets, with support for category remapping.
**Parameters:**
- `datasets` (List[Dataset]): the datasets to merge
- `category_mapping` (Dict[str, str]): mapping from old category names to new category names
- `final_categories` (Dict[int, str]): the final category scheme
- `new_dataset_name` (str, optional): name for the merged dataset
**Returns:**
- `Dataset`: the merged dataset object
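Conceptually, remapping resolves each old category ID to its name, maps that name through `category_mapping`, then looks the new name up in `final_categories`. A self-contained sketch of that step (not the toolkit's actual implementation):

```python
def remap_category_id(old_id, old_categories, category_mapping, final_categories):
    """Map an old category ID to its ID in the merged category scheme."""
    name = old_categories[old_id]                # e.g. 0 -> 'cat'
    new_name = category_mapping.get(name, name)  # 'cat' -> 'animal'
    # Invert final_categories (id -> name) into a name -> id lookup.
    name_to_id = {v: k for k, v in final_categories.items()}
    return name_to_id[new_name]

remap_category_id(0, {0: 'cat', 1: 'dog'},
                  {'cat': 'animal', 'dog': 'animal'},
                  {0: 'animal', 1: 'vehicle'})  # -> 0
```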
### Exporters
#### `export_to_coco(dataset, output_path)`
Exports to COCO JSON format.
**Parameters:**
- `dataset` (Dataset): the dataset to export
- `output_path` (str): output file path
#### `export_to_txt(dataset, output_path, use_relative_paths, base_path)`
Exports to TXT format.
**Parameters:**
- `dataset` (Dataset): the dataset to export
- `output_path` (str): output file path
- `use_relative_paths` (bool, optional): whether to use relative paths
- `base_path` (str, optional): base directory for relative paths
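The file produced by `export_to_coco` follows the standard COCO detection layout, with `images`, `annotations`, and `categories` as its top-level keys. A minimal sketch of that structure, using only the standard library (the sample values are illustrative):

```python
import json

# Minimal COCO detection file: three required top-level sections.
coco = {
    "images": [{"id": 1, "file_name": "img_001.jpg", "width": 640, "height": 480}],
    "annotations": [{"id": 1, "image_id": 1, "category_id": 0,
                     "bbox": [100.0, 120.0, 50.0, 80.0],  # [x, y, width, height]
                     "area": 4000.0, "iscrowd": 0}],
    "categories": [{"id": 0, "name": "animal"}, {"id": 1, "name": "vehicle"}],
}

serialized = json.dumps(coco, indent=2)
```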
## 🏗️ Architecture
```
dataset_toolkit/
├── models.py             # Data model definitions
├── loaders/              # Data loaders
│   ├── local_loader.py   # Local file system loader
│   └── remote_loader.py  # Remote data source loader (planned)
├── processors/           # Data processors
│   ├── merger.py         # Dataset merging
│   └── filter.py         # Data filtering (planned)
├── exporters/            # Data exporters
│   ├── coco_exporter.py  # COCO export
│   └── txt_exporter.py   # TXT export
└── utils/                # Utilities
    └── coords.py         # Coordinate conversion
```
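As an illustration of what `utils/coords.py` covers, converting a normalized YOLO box to COCO's absolute `[x_min, y_min, width, height]` form works like this (the function name is illustrative, not the module's actual API):

```python
def yolo_to_coco_bbox(cx, cy, w, h, img_w, img_h):
    """Convert a normalized YOLO box (center x/y, width, height in [0, 1])
    to a COCO [x_min, y_min, width, height] box in absolute pixels."""
    abs_w = w * img_w
    abs_h = h * img_h
    x_min = cx * img_w - abs_w / 2  # shift from center to top-left corner
    y_min = cy * img_h - abs_h / 2
    return [x_min, y_min, abs_w, abs_h]

yolo_to_coco_bbox(0.5, 0.5, 0.25, 0.3, 640, 480)  # -> [240.0, 168.0, 160.0, 144.0]
```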
## 🔧 Advanced Usage
### Custom loaders
```python
from dataset_toolkit.models import Dataset, ImageAnnotation
from dataset_toolkit.loaders import BaseLoader

class CustomLoader(BaseLoader):
    def load(self, path, **kwargs):
        # Implement your custom loading logic
        dataset = Dataset(name="custom")
        # ... load data ...
        return dataset
```
### Batch processing
```python
from pathlib import Path
from dataset_toolkit import load_yolo_from_local, export_to_coco

# Process several datasets in a loop
dataset_dirs = [
    "/data/dataset1",
    "/data/dataset2",
    "/data/dataset3"
]

categories = {0: 'object'}

for dataset_dir in dataset_dirs:
    ds = load_yolo_from_local(dataset_dir, categories)
    output_name = Path(dataset_dir).name + ".json"
    export_to_coco(ds, f"output/{output_name}")
```
## 🧪 Testing
Run the tests:
```bash
pytest
```
Generate a coverage report:
```bash
pytest --cov=dataset_toolkit --cov-report=html
```
## 📝 Roadmap
- [ ] Support more data formats (Pascal VOC, YOLOv8, etc.)
- [ ] Add data augmentation
- [ ] Support remote data sources (S3, HTTP, etc.)
- [ ] Add dataset statistics and visualization
- [ ] Provide a command-line tool
- [ ] Support video datasets
## 🤝 Contributing
Issues and pull requests are welcome!
## 📄 License
MIT License
## 📧 Contact
If you have questions, reach out via:
- Email: your.email@example.com
- GitHub Issues: https://github.com/yourusername/dataset-toolkit/issues