# DataChunker

An efficient data chunking module that splits large lists of dictionaries into fixed-size chunks and saves each chunk as a JSON file.

## Features

- Sequential and parallel processing modes
- Streaming data support
- Configurable chunk size and output directory
- Detailed logging
- Exception handling and error recovery
- File management utilities
## Installation

### From PyPI

```bash
pip install py-data-chunker
```

### From source

1. Clone the repository or download the source code

```bash
git clone https://gitee.com/ugly-xue/data_chunker.git
```

2. Enter the project directory

```bash
cd data_chunker
```

3. Run the install command

```bash
pip install .
```
## Quick start

### Basic usage

```python
from data_chunker import DataChunker, generate_sample_data

# Create a data chunker
chunker = DataChunker(chunk_size=500, output_dir="output")

# Generate sample data
data = generate_sample_data(3000)

# Process the data
results = chunker.process(data)

print(f"Done: {len(results)} files saved")
```
### Parallel processing

```python
from data_chunker import DataChunker, generate_sample_data

# Create an instance with parallel processing enabled
chunker = DataChunker(parallel=True, workers=4)

# Process the data (a list of dicts)
data = generate_sample_data(3000)
results = chunker.process(data)
```
### Stream processing

```python
from data_chunker import DataChunker, json_stream_reader

# Create an instance
chunker = DataChunker()

# Process data from a stream
results = chunker.process_stream(json_stream_reader("large_data.json"))
```
## API reference

### The DataChunker class

#### Constructor parameters

- `chunk_size` (int): size of each chunk; defaults to 500
- `output_dir` (str): output directory; defaults to "output"
- `prefix` (str): filename prefix; defaults to "chunk"
- `timestamp` (bool): whether to add a timestamp to filenames; defaults to True
- `parallel` (bool): whether to process chunks in parallel; defaults to False
- `workers` (int): number of parallel worker threads; defaults to 4
- `logger` (Logger): custom logger; defaults to None
#### Main methods

- `process(data)`: split the data and save the chunks as JSON files
- `process_stream(stream_generator)`: process streaming data
- `split(data)`: split the data into fixed-size chunks (generator)
- `set_config(**kwargs)`: update configuration parameters at runtime
- `get_file_list()`: list the files saved so far
- `clear_output()`: empty the output directory
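Conceptually, `split(data)` is ordinary fixed-size slicing over a list: each yielded chunk has `chunk_size` items except possibly the last. The minimal generator below is a sketch of that behavior, not the library's actual implementation:

```python
def split_into_chunks(data, chunk_size=500):
    """Yield successive fixed-size slices of a list; the last slice may be shorter."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


records = [{"id": i} for i in range(1250)]
chunks = list(split_into_chunks(records, chunk_size=500))
print([len(c) for c in chunks])  # → [500, 500, 250]
```

Because it is a generator, only one chunk's worth of slicing happens per iteration, which is what makes streaming over large inputs practical.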
### Utility functions

- `setup_logger()`: initialize a logger
- `generate_sample_data(size=3000)`: generate sample data
- `json_stream_reader(file_path)`: read a JSON file as a stream
- `read_json_file(file_path)`: read a JSON file
- `write_json_file(data, file_path, indent=2)`: write data to a JSON file
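The read/write helpers can be thought of as thin wrappers around the standard `json` module. The sketch below shows an equivalent round trip using only the standard library (the function names here are illustrative, not the package's internals):

```python
import json
import tempfile
from pathlib import Path


def write_json(data, file_path, indent=2):
    """Serialize `data` to a JSON file (UTF-8, non-ASCII kept readable)."""
    Path(file_path).write_text(
        json.dumps(data, indent=indent, ensure_ascii=False), encoding="utf-8"
    )


def read_json(file_path):
    """Load and return the JSON content of a file."""
    return json.loads(Path(file_path).read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "chunk_0001.json"
    write_json([{"id": 1}, {"id": 2}], path)
    print(read_json(path))  # → [{'id': 1}, {'id': 2}]
```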
## Advanced usage

### Custom processing logic

Subclass DataChunker and override the `_process_chunk` method to customize how each chunk is processed before it is saved.

```python
from datetime import datetime
from data_chunker import DataChunker


class CustomDataChunker(DataChunker):
    def _process_chunk(self, chunk, chunk_index):
        # Custom per-chunk processing
        for item in chunk:
            item["processed"] = True
            item["process_time"] = datetime.now().isoformat()
        # Delegate to the parent method to save the chunk
        return super()._process_chunk(chunk, chunk_index)


# Use the custom chunker; `data` is a list of dicts prepared earlier
custom_chunker = CustomDataChunker()
results = custom_chunker.process(data)
```
### Integrating into a data-processing pipeline

Plug DataChunker into a data-processing pipeline to process and persist data in one step.

```python
from data_chunker import DataChunker


def data_processing_pipeline():
    # Create the chunker
    chunker = DataChunker(output_dir="processed_data")
    # Fetch data from the source (get_data_from_source is your own loader)
    raw_data = get_data_from_source()
    # Split and save the data
    results = chunker.process(raw_data)
    # Post-process each saved file (process_result_file is your own handler)
    for file_path in results:
        process_result_file(file_path)
    return results
```
## Exception handling

The module defines the following exception classes:

- `DataChunkerError`: base exception class
- `FileSaveError`: raised when saving a file fails
- `ChunkSizeError`: raised for an invalid chunk size
- `InvalidDataError`: raised for invalid input data
### Usage

```python
from data_chunker import DataChunker, ChunkSizeError, FileSaveError

try:
    chunker = DataChunker(chunk_size=0)  # raises ChunkSizeError
except ChunkSizeError as e:
    print(f"Configuration error: {e}")

chunker = DataChunker()
try:
    results = chunker.process(data)  # `data` is a list of dicts prepared earlier
except FileSaveError as e:
    print(f"Failed to save file: {e}")
```
## Performance tips

1. **Small datasets**: use sequential mode (`parallel=False`)
2. **Large datasets**: use parallel mode (`parallel=True`) and increase the worker count as needed
3. **Very large datasets**: use `process_stream` to avoid loading everything into memory
4. **I/O-bound workloads**: consider SSD storage or more parallel workers
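A simple way to reason about these choices is from the chunk count: the number of output files is `ceil(len(data) / chunk_size)`, and parallelism only pays off when there are enough chunks to keep every worker busy. The heuristic below is a sketch; the threshold is an illustrative assumption, not a library default:

```python
import math


def suggest_mode(n_records, chunk_size=500, workers=4):
    """Heuristic: go parallel only when each worker would get several chunks."""
    n_chunks = math.ceil(n_records / chunk_size)
    parallel = n_chunks >= workers * 2  # illustrative threshold
    return n_chunks, parallel


print(suggest_mode(3000))     # → (6, False)
print(suggest_mode(100_000))  # → (200, True)
```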
## Contributing

1. Fork the project
2. Create a feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## License

This project is released under the MIT License. See the [LICENSE](LICENSE) file for details.

## Support

If you run into problems or have suggestions, please reach out via:

- Email: basui6996@gmail.com
- Gitee: [open an issue](https://gitee.com/ugly-xue/data_chunker/issues)
## Version history

- 1.0.0 (2025-09-18)
  - Initial release
  - Basic data chunking
  - Sequential and parallel processing
  - Streaming support
- 1.0.1 (2025-09-18)
  - Changed the way the download module works