py-data-chunker


Name: py-data-chunker
Version: 1.0.1
Home page: https://gitee.com/ugly-xue/data_chunker
Summary: An efficient data-chunking module for splitting large lists of dictionaries into fixed-size chunks
Upload time: 2025-09-18 05:31:22
Author: anjiu
Requires Python: >=3.6
License: MIT
Keywords: data, chunk, split, json, processing
# DataChunker

An efficient data-chunking module that splits large lists of dictionaries into fixed-size chunks and saves each chunk as a JSON file.

## Features

- Sequential and parallel processing modes
- Streaming data processing
- Configurable chunk size and output directory
- Detailed logging
- Exception handling and error recovery
- File management utilities

## Installation

### From PyPI

```bash
pip install py-data-chunker
```

### From source

1. Clone the repository or download the source code

```bash
git clone https://gitee.com/ugly-xue/data_chunker.git
```

2. Enter the project directory

```bash
cd data_chunker
```

3. Run the install command

```bash
pip install .
```

## Quick start

### Basic usage

```python
from data_chunker import DataChunker, generate_sample_data

# Create a data chunker
chunker = DataChunker(chunk_size=500, output_dir="output")

# Generate sample data
data = generate_sample_data(3000)

# Process the data
results = chunker.process(data)

print(f"Processing complete: saved {len(results)} files")
```

### Parallel processing

```python
from data_chunker import DataChunker, generate_sample_data

# Create an instance with parallel processing enabled
chunker = DataChunker(parallel=True, workers=4)

# Process the data (sample data stands in for your own)
data = generate_sample_data(3000)
results = chunker.process(data)
```

### Streaming

```python
from data_chunker import DataChunker, json_stream_reader

# Create an instance
chunker = DataChunker()

# Process streamed data without loading the whole file into memory
results = chunker.process_stream(json_stream_reader("large_data.json"))
```

## API reference

### The DataChunker class

#### Constructor parameters

- `chunk_size` (int): Size of each chunk; defaults to 500
- `output_dir` (str): Output directory; defaults to "output"
- `prefix` (str): Filename prefix; defaults to "chunk"
- `timestamp` (bool): Whether to add a timestamp to filenames; defaults to True
- `parallel` (bool): Whether to process chunks in parallel; defaults to False
- `workers` (int): Number of parallel worker threads; defaults to 4
- `logger` (Logger): Custom logger; defaults to None
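
A minimal sketch combining all of the parameters above (the values are illustrative, not recommendations):

```python
import logging

from data_chunker import DataChunker

# Build a custom logger to pass via the `logger` parameter
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("chunker_demo")

chunker = DataChunker(
    chunk_size=1000,       # items per chunk
    output_dir="exports",  # where the JSON files are written
    prefix="batch",        # filenames start with "batch"
    timestamp=False,       # omit the timestamp from filenames
    parallel=True,         # process chunks in parallel
    workers=8,             # number of worker threads
    logger=log,            # use the custom logger
)
```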

#### Main methods

- `process(data)`: Split the data and save each chunk as a JSON file
- `process_stream(stream_generator)`: Process streamed data
- `split(data)`: Split the data into fixed-size chunks (a generator)
- `set_config(**kwargs)`: Update configuration parameters at runtime
- `get_file_list()`: List the files saved so far
- `clear_output()`: Empty the output directory
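
A short sketch exercising these methods together (it assumes the signatures above; the exact contents of the `get_file_list()` return value are an assumption):

```python
from data_chunker import DataChunker, generate_sample_data

chunker = DataChunker(chunk_size=200)
data = generate_sample_data(1000)

# `split` is a generator: iterate over chunks without writing any files
for i, chunk in enumerate(chunker.split(data)):
    print(f"chunk {i}: {len(chunk)} items")

# Reconfigure at runtime, then process and inspect the output
chunker.set_config(chunk_size=250, prefix="part")
chunker.process(data)
print(chunker.get_file_list())  # paths of the saved JSON files

# Empty the output directory when done
chunker.clear_output()
```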

### Utility functions

- `setup_logger()`: Initialize a logger
- `generate_sample_data(size=3000)`: Generate sample data
- `json_stream_reader(file_path)`: Read a JSON file as a stream
- `read_json_file(file_path)`: Read a JSON file
- `write_json_file(data, file_path, indent=2)`: Write data to a JSON file
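
For example, the file helpers can round-trip data through JSON (a sketch assuming the signatures above, and that `setup_logger()` returns a standard `logging.Logger`):

```python
from data_chunker import (
    generate_sample_data,
    read_json_file,
    setup_logger,
    write_json_file,
)

logger = setup_logger()

# Round-trip sample data through a JSON file
data = generate_sample_data(size=100)
write_json_file(data, "sample.json", indent=2)
restored = read_json_file("sample.json")
logger.info("restored %d records", len(restored))
```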

## Advanced usage

### Custom processing logic

Subclass DataChunker and override the `_process_chunk` method to customize how each chunk is handled.

```python
from datetime import datetime
from data_chunker import DataChunker, generate_sample_data


class CustomDataChunker(DataChunker):
    def _process_chunk(self, chunk, chunk_index):
        # Custom processing logic: annotate every item in the chunk
        for item in chunk:
            item["processed"] = True
            item["process_time"] = datetime.now().isoformat()
        # Delegate to the parent class to save the chunk
        return super()._process_chunk(chunk, chunk_index)


# Use the custom chunker (sample data stands in for your own)
custom_chunker = CustomDataChunker()
data = generate_sample_data(3000)
results = custom_chunker.process(data)
```

### Integrating into a data-processing pipeline

DataChunker slots into a larger pipeline to handle the chunk-and-save step.

```python
from data_chunker import DataChunker, generate_sample_data


def get_data_from_source():
    # Placeholder data source; replace with your own loader
    return generate_sample_data(3000)


def process_result_file(file_path):
    # Placeholder post-processing step; replace with your own logic
    print(f"post-processing {file_path}")


def data_processing_pipeline():
    # Create the chunker
    chunker = DataChunker(output_dir="processed_data")
    # Fetch data from the source
    raw_data = get_data_from_source()
    # Chunk and save the data
    results = chunker.process(raw_data)
    # Post-process each saved file
    for file_path in results:
        process_result_file(file_path)
    return results
```

## Exception handling

The module defines the following exception classes:

- `DataChunkerError`: Base exception class
- `FileSaveError`: Raised when saving a file fails
- `ChunkSizeError`: Raised for an invalid chunk size
- `InvalidDataError`: Raised for invalid input data

### Usage

```python
from data_chunker import (
    ChunkSizeError,
    DataChunker,
    FileSaveError,
    generate_sample_data,
)

try:
    chunker = DataChunker(chunk_size=0)  # raises ChunkSizeError
except ChunkSizeError as e:
    print(f"Configuration error: {e}")

chunker = DataChunker(chunk_size=500)
data = generate_sample_data(1000)
try:
    results = chunker.process(data)
except FileSaveError as e:
    print(f"Failed to save file: {e}")
```

## Performance tips

1. **Small datasets**: use sequential mode (`parallel=False`)
2. **Large datasets**: use parallel mode (`parallel=True`) and raise the worker count as needed
3. **Very large datasets**: use `process_stream` for streaming to avoid exhausting memory
4. **I/O-bound workloads**: consider SSD storage or more parallel worker threads
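
These tips map directly onto the constructor flags; a rough sketch of choosing a mode by input size (the threshold is illustrative, not taken from the project):

```python
from data_chunker import DataChunker, json_stream_reader


def make_chunker(n_items):
    # Illustrative threshold; tune it for your workload
    if n_items < 10_000:
        return DataChunker(parallel=False)        # small: sequential
    return DataChunker(parallel=True, workers=8)  # large: parallel


# For very large inputs, skip loading entirely and stream instead
chunker = DataChunker()
results = chunker.process_stream(json_stream_reader("huge_data.json"))
```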

## Contributing

1. Fork the project
2. Create a feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## Support

If you run into problems or have suggestions for improvement, get in touch via:

- Email: basui6996@gmail.com
- Gitee: [open an issue](https://gitee.com/ugly-xue/data_chunker/issues)

## Version history

- 1.0.0 (2025-09-18)
    - Initial release
    - Basic data-chunking functionality
    - Sequential and parallel processing
    - Streaming support
- 1.0.1 (2025-09-18)
    - Changed the module's download mechanism

            
