# MiniSpark
MiniSpark is a lightweight Python library for reading data from multiple data sources and processing it efficiently on the local machine, offering functionality similar to Apache Spark.
Author: 段福 (duanfu456@163.cm)
## Features
- Multiple data sources: MySQL, DuckDB, SQLite, CSV, Excel, and JSON
- Local processing engine (DuckDB/SQLite)
- Unified API
- Registration and reuse of query-result tables
- Custom functions for data processing (simplified API: the whole row is passed to the function)
- Splitting a field into multiple rows by one or more delimiters
- Custom functions that return multiple columns
- Inspection of registered tables
- Data produced by the DataProcessor can be registered with the local engine automatically
- Flexible configuration management with several configuration styles
- Configurable handling of duplicate column names
## Installation
```bash
pip install minispqrk
```
For specific data source support, install the optional extras:
```bash
# MySQL support
pip install minispqrk[mysql]
# DuckDB support
pip install minispqrk[duckdb]
# Excel support
pip install minispqrk[excel]
```
## CLI Tool
A command-line tool is available after installation:
```bash
# Show help
minispark --help
# Run the bundled example
minispark example
```
## Supported Data Sources
1. **Relational databases**:
   - MySQL
   - DuckDB
   - SQLite
2. **File formats**:
   - CSV
   - Excel (xlsx/xls)
   - JSON
## Usage Examples by Data Source
### CSV Connector
```python
from minispark import MiniSpark, CSVConnector
import pandas as pd
# Create sample data
data = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35]
})
data.to_csv('sample.csv', index=False)

# Initialize MiniSpark
spark = MiniSpark()

# Create a CSV connector
csv_connector = CSVConnector('csv_connector')
spark.add_connector('csv', csv_connector)

# Load data from the CSV file
df = spark.load_data('csv', 'sample.csv', 'sample_table')
print(df)
```
**Specifying a different delimiter**:
```python
# Semicolon delimiter
semicolon_connector = CSVConnector('semicolon_csv', delimiter=';')
# Tab delimiter
tab_connector = CSVConnector('tab_csv', delimiter='\t')
# Pipe delimiter
pipe_connector = CSVConnector('pipe_csv', delimiter='|')
```
### Excel Connector
```python
from minispark import MiniSpark, ExcelConnector
import pandas as pd
# Create sample data (with multiple sheets)
products_data = pd.DataFrame({
    'id': [1, 2, 3],
    'product': ['Laptop', 'Phone', 'Tablet'],
    'price': [1000, 500, 300]
})
orders_data = pd.DataFrame({
    'order_id': [101, 102],
    'product_id': [1, 2],
    'quantity': [2, 1]
})

# Save as an Excel file with multiple sheets
with pd.ExcelWriter('data.xlsx') as writer:
    products_data.to_excel(writer, sheet_name='Products', index=False)
    orders_data.to_excel(writer, sheet_name='Orders', index=False)

# Initialize MiniSpark
spark = MiniSpark()

# Option 1: a generic Excel connector (recommended)
excel_connector = ExcelConnector('excel_connector')
spark.add_connector('excel', excel_connector)

# Read different sheets through the same connector
products_df = spark.load_data('excel', 'data.xlsx', 'products_table', sheet_name='Products')
orders_df = spark.load_data('excel', 'data.xlsx', 'orders_table', sheet_name='Orders')

# Option 2: an Excel connector with a default sheet
default_excel_connector = ExcelConnector('default_excel', sheet_name='Products')
spark.add_connector('default_excel', default_excel_connector)

# Load data from the default sheet
products_df = spark.load_data('default_excel', 'data.xlsx', 'products_table')

# Override the default sheet
orders_df = spark.load_data('default_excel', 'data.xlsx', 'orders_table', sheet_name='Orders')
```
### JSON Connector
```python
from minispark import MiniSpark, JSONConnector
import json
# Create sample data
data = [
    {"id": 1, "name": "Alice", "skills": ["Python", "SQL"]},
    {"id": 2, "name": "Bob", "skills": ["Java", "Docker"]},
    {"id": 3, "name": "Charlie", "skills": ["Excel", "Communication"]}
]
with open('employees.json', 'w') as f:
    json.dump(data, f)

# Initialize MiniSpark
spark = MiniSpark()

# Create a JSON connector
json_connector = JSONConnector('json_connector')
spark.add_connector('json', json_connector)

# Load data from the JSON file
df = spark.load_data('json', 'employees.json', 'employees_table')
print(df)
```
### SQLite Connector
```python
from minispark import MiniSpark, SQLiteConnector
import sqlite3
# Create a sample database and data
conn = sqlite3.connect('sample.db')
cursor = conn.cursor()
cursor.execute('''
    CREATE TABLE IF NOT EXISTS users (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        email TEXT
    )
''')
cursor.execute("INSERT INTO users (name, email) VALUES ('Alice', 'alice@example.com')")
cursor.execute("INSERT INTO users (name, email) VALUES ('Bob', 'bob@example.com')")
conn.commit()
conn.close()

# Initialize MiniSpark
spark = MiniSpark()

# Create a SQLite connector
sqlite_connector = SQLiteConnector('sqlite_connector', 'sample.db')
spark.add_connector('sqlite', sqlite_connector)

# Query the SQLite database
df = spark.load_data('sqlite', 'SELECT * FROM users', 'users_table')
print(df)
```
### MySQL Connector
```python
from minispark import MiniSpark, MySQLConnector
# Initialize MiniSpark
spark = MiniSpark()

# Create a MySQL connector
mysql_connector = MySQLConnector(
    name='mysql_connector',
    host='localhost',
    port=3306,
    user='username',
    password='password',
    database='database_name'
)
spark.add_connector('mysql', mysql_connector)

# Query the MySQL database
df = spark.load_data('mysql', 'SELECT * FROM table_name LIMIT 10', 'mysql_table')
print(df)
```
### DuckDB Connector
```python
from minispark import MiniSpark, DuckDBConnector
# Initialize MiniSpark
spark = MiniSpark()

# Create a DuckDB connector (in-memory database)
duckdb_connector = DuckDBConnector('duckdb_connector')
spark.add_connector('duckdb', duckdb_connector)

# Execute a query
df = spark.load_data('duckdb', 'SELECT 1 as number', 'test_table')
print(df)
```
### Cross-Source Query Example
```python
from minispark import MiniSpark, CSVConnector, JSONConnector
# Initialize MiniSpark
spark = MiniSpark()

# Add multiple data sources
csv_connector = CSVConnector('csv_connector')
json_connector = JSONConnector('json_connector')
spark.add_connector('csv', csv_connector)
spark.add_connector('json', json_connector)

# Load data from different sources
employees_df = spark.load_data('csv', 'employees.csv', 'employees')
skills_df = spark.load_data('json', 'skills.json', 'skills')

# Run a cross-source query in the local engine
result = spark.execute_query("""
    SELECT e.name, e.department, e.salary
    FROM employees e
    WHERE e.salary > 7000
    ORDER BY e.salary DESC
""", 'high_salary_employees')
print(result)
```
### 2. Local Processing Engines
- SQLite engine: a lightweight local database engine
- DuckDB engine: a high-performance analytical database engine
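The engine type is chosen through configuration (see the Configuration section below). A minimal sketch, assuming the configuration-dictionary form described there:
```python
from minispark import MiniSpark

# Sketch: select the local engine via a config dictionary (see "Configuration" below).
# "duckdb" targets analytical workloads; "sqlite" keeps the footprint minimal.
config = {
    "engine": {
        "type": "duckdb",            # or "sqlite"
        "database_path": ":memory:"  # in-memory mode
    }
}
spark = MiniSpark(config=config)
print(spark.config.engine.type)
```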
### 3. Data Processing Features
- Register custom functions and apply them during data processing
- Apply anonymous functions directly for data transformations (see the sketch below)
- Accelerate pandas operations with swifter
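As an illustration of applying an anonymous function, here is a minimal sketch using the row-based `apply_custom_function` API shown later in this README; `spark` and a DataFrame `df` with a `salary` column are assumed to exist already:
```python
# Sketch: apply a lambda to every row through the DataProcessor.
# Assumes `df` was loaded earlier, e.g. via spark.load_data(...).
processor = spark.processor
df_with_tax = processor.apply_custom_function(
    df,
    'tax',                           # new column name
    lambda row: row['salary'] * 0.2  # anonymous function receiving the whole row
)
```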
### 4. Registering Query Results as Tables
The `execute_query` method can register a query result directly as a table, which makes subsequent joins easy:
```python
# Register the query result as a new table
spark.execute_query("""
    SELECT department, AVG(salary) as avg_salary
    FROM employees
    GROUP BY department
""", table_name="department_avg")

# The registered table can then be queried directly
result = spark.execute_query("SELECT * FROM department_avg WHERE avg_salary > 50000")
```
By providing the `table_name` argument, the query result is automatically registered as a reusable table. To run a query without registering the result, set `register=False`.
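For instance, a throwaway aggregation that should not be kept around (a sketch, assuming the `employees` table registered earlier and the `register` keyword described above):
```python
# Run a query without registering its result as a table
summary = spark.execute_query(
    "SELECT COUNT(*) AS n FROM employees",
    register=False
)
print(summary)
```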
## JSON Support
MiniSpark supports JSON data sources and can handle several JSON layouts:
1. An array of objects
2. A single object
3. Nested objects
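The three layouts correspond roughly to the following file contents (illustrative samples only):
```python
import json

# Illustrative samples of the three supported JSON layouts
samples = {
    'array.json':  [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}],      # 1. array of objects
    'single.json': {"id": 1, "name": "Alice"},                                  # 2. single object
    'nested.json': {"id": 1, "profile": {"name": "Alice", "skills": ["SQL"]}},  # 3. nested objects
}
for path, payload in samples.items():
    with open(path, 'w') as f:
        json.dump(payload, f)
```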
### JSON Usage Example
```python
from minispark import MiniSpark, JSONConnector
# Initialize MiniSpark
spark = MiniSpark()

# Add a JSON connector
json_connector = JSONConnector('json')
spark.add_connector('json', json_connector)

# Load data from a JSON file
df = spark.load_data('json', 'data.json', 'my_table')

# Complex values (arrays, nested objects) are automatically
# converted to strings on load so the SQL engine can handle them.
```
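Because array and nested-object fields are stored as strings, they can still be filtered with ordinary SQL string matching. A sketch, assuming the `employees_table` registered in the JSON connector example above (its `skills` lists are now plain strings):
```python
# Sketch: list fields were flattened to strings, so LIKE matching works
pythonistas = spark.execute_query(
    "SELECT id, name FROM employees_table WHERE skills LIKE '%Python%'"
)
print(pythonistas)
```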
## Running the Tests
The project ships with a set of test cases to make sure everything works. To run them all:
```bash
# From the project root
python -m unittest discover test
# Or use the test runner script
python test/run_tests.py
```
## Example Programs
The project provides complete example programs that demonstrate MiniSpark's features. They live in the [examples](examples/) directory.
To run the examples:
```bash
cd examples
python example_row_function.py
python example_multi_column.py
python comprehensive_example.py
```
## Configuration
MiniSpark offers a flexible configuration mechanism with several ways to supply settings:
### 1. Configuration File (default)
Use a `config.toml` file:
```toml
# Local processing engine
[engine]
# Engine type: duckdb or sqlite
type = "duckdb"
# Database path; ":memory:" means in-memory mode
database_path = ":memory:"

# Temporary data storage
[storage]
# Storage format: parquet or avro
format = "parquet"

# Duplicate column name handling: rename / error / keep_first
handle_duplicate_columns = "rename"
```
### 2. Configuration Dictionary
A configuration dictionary can be passed directly:
```python
from minispark import MiniSpark
config = {
    "engine": {
        "type": "sqlite",
        "database_path": ":memory:"
    },
    "storage": {
        "format": "parquet"
    }
}
spark = MiniSpark(config=config)
```
### 3. Configuration File Path
The path to a configuration file can be specified:
```python
from minispark import MiniSpark
spark = MiniSpark(config_path="/path/to/your/config.toml")
```
### 4. Dot-Notation Access and Modification
Configuration values can be read and changed with dot notation:
```python
from minispark import MiniSpark
spark = MiniSpark()
# Read configuration values
print(spark.config.engine.type)
print(spark.config.engine.database_path)
print(spark.config.storage.format)
# Modify configuration values
spark.config.engine.type = "sqlite"
spark.config.engine.database_path = ":memory:"
spark.config.storage.format = "parquet"
```
### 5. Attribute Access and Modification
Configuration values can also be read and changed as attributes:
```python
from minispark import MiniSpark
spark = MiniSpark()
# Read configuration values
print(spark.config.engine.type)
print(spark.config.engine.database_path)
print(spark.config.storage.format)
print(spark.config.handle_duplicate_columns)
# Modify configuration values
spark.config.engine.type = "sqlite"
spark.config.engine.database_path = ":memory:"
spark.config.storage.format = "parquet"
spark.config.handle_duplicate_columns = "error"
```
### 6. Setter Methods
Configuration can also be changed through setter methods:
```python
from minispark import MiniSpark
spark = MiniSpark()
# Set a new configuration dictionary
spark.set_config({
"engine": {"type": "sqlite"},
"storage": {"format": "parquet"},
"handle_duplicate_columns": "error"
})
# Set configuration from a file path
spark.set_config_path("/path/to/your/config.toml")
```
## Dependencies
- Python 3.9+
- pandas>=1.3.0
- sqlalchemy>=1.4.0
- toml>=0.10.2
- swifter>=1.0.0
Optional dependencies:
- pymysql>=1.0.0 (MySQL support)
- duckdb>=0.3.0 (DuckDB support)
- openpyxl>=3.0.0, xlrd>=2.0.0 (Excel support)
## Data Processing
MiniSpark provides a powerful data processor for a variety of operations; processed result tables can be registered with the local engine automatically, ready for further querying and analysis.
## Duplicate Column Name Strategies
MiniSpark supports three strategies for handling duplicate column names (a short configuration sketch follows the list):
1. **rename** (default): rename duplicates automatically by appending a suffix (`_2`, `_3`, ...)
2. **error**: raise an exception when duplicate column names are found
3. **keep_first**: keep only the first occurrence and drop the other duplicates
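A minimal sketch of switching strategies, using the configuration access described in the Configuration section:
```python
from minispark import MiniSpark

spark = MiniSpark()

# Default strategy: duplicates are renamed with numeric suffixes (_2, _3, ...)
print(spark.config.handle_duplicate_columns)

# Fail fast instead: raise an exception when duplicate column names appear
spark.config.handle_duplicate_columns = "error"

# Or keep only the first occurrence of each duplicated column
spark.config.handle_duplicate_columns = "keep_first"
```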
### 1. Applying Custom Functions
Python functions can be applied to the rows of a table; the function receives the whole row as its argument:
```python
from minispark import MiniSpark
# Initialize MiniSpark
spark = MiniSpark()

# Get the data processor
processor = spark.processor

# `df` is assumed to be a pandas DataFrame loaded earlier,
# e.g. via spark.load_data(...), with the columns used below.

# Define a function that works on a whole row
def calculate_employee_benefits(row):
    # Combine several fields to compute the employee's benefits
    base_benefits = row['salary'] * 0.1
    # Extra benefit for the IT department
    it_bonus = 5000 if row['department'] == 'IT' else 0
    # Extra benefit for more than 5 years of service
    experience_bonus = 2000 if row['years_of_service'] > 5 else 0
    return base_benefits + it_bonus + experience_bonus

# Apply the row-based function
df_with_benefits = processor.apply_custom_function(
    df,
    'benefits',                           # new column name
    calculate_employee_benefits,          # function
    table_name='employees_with_benefits'  # register the result as a new table
)

# Register a row-based function and apply it by name
def calculate_performance_score(row):
    # Compute a performance score from several factors
    base_score = row['salary'] / 1000
    bonus_factor = row['bonus'] / 100
    return base_score + bonus_factor

processor.register_function('performance_score', calculate_performance_score)
df_with_score = processor.apply_function(
    df,
    'performance_score',  # new column name
    'performance_score'   # name of the registered function
)

# Functions may also return multiple columns
def calculate_min_max_salary(row):
    # Return a tuple with the minimum and maximum salary
    return (row['salary'] * 0.8, row['salary'] * 1.2)

# Two new columns receive the returned values
df_with_ranges = processor.apply_custom_function(
    df,
    ['min_salary', 'max_salary'],       # multiple new column names
    calculate_min_max_salary,           # function returning multiple values
    table_name='employees_with_ranges'  # register the result as a new table
)
```
### 2. Listing Registered Tables
The `list_tables` method lists all registered tables and their information:
```python
from minispark import MiniSpark
import pandas as pd
# Initialize MiniSpark
spark = MiniSpark()

# Create some sample data
users_data = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35]
})
orders_data = pd.DataFrame({
    'order_id': [101, 102, 103],
    'user_id': [1, 2, 1],
    'amount': [100.0, 200.0, 150.0]
})

# Register the tables with the engine
spark.engine.register_table('users', users_data)
spark.engine.register_table('orders', orders_data)

# List all registered tables
table_info = spark.list_tables()
print(table_info)
```
### 3. Splitting a Field into Rows (single or multiple delimiters)
A field that contains delimiters can be split into multiple rows:
```python
from minispark import MiniSpark
# Initialize MiniSpark
spark = MiniSpark()

# Get the data processor
processor = spark.processor

# Suppose a DataFrame has a "tags" column with comma-separated tags,
# e.g. "python,spark,data" => split into 3 rows, one tag per row.
# (Assumes a CSV connector named 'csv' was added beforehand, as in the CSV example.)
df = spark.load_data('csv', 'data.csv', 'original_data')

# Split on a single delimiter
df_exploded = processor.explode_column(
    df,
    'tags',
    ',',
    table_name='exploded_data'  # register the result as a new table
)

# Split on several delimiters (semicolon, pipe, and hyphen)
df_multi_exploded = processor.explode_column(
    df,
    'description',
    [';', '|', '-'],
    table_name='multi_exploded_data'  # register the result as a new table
)

# The exploded data can be registered with the engine for SQL queries
spark.engine.register_table('exploded_data', df_exploded)
result = spark.execute_query("SELECT * FROM exploded_data WHERE tags = 'python'")

# Chained example: split several columns in sequence
df_step1 = processor.explode_column(df, 'tags', ',')
df_step2 = processor.explode_column(df_step1, 'description', [';', '|'])
df_step3 = processor.explode_column(df_step2, 'features', ['-', '#'])
```
# Examples Directory Structure
```
examples/
├── example_row_function.py   # Simplified row-based API example
├── example_multi_column.py   # Multi-column output example
├── comprehensive_example.py  # Comprehensive example
├── run_all_examples.py       # Script that runs all examples
├── csv/                      # CSV examples
│ ├── example.py
│ ├── delimiter_example.py
│ ├── double_pipe_example.py
│ ├── generate_data.py
│ ├── employees.csv
│ └── README.md
├── excel/                    # Excel examples
│ ├── example.py
│ ├── dynamic_sheet_example.py
│ ├── explode_example.py
│ ├── generate_data.py
│ ├── products.xlsx
│ ├── salaries.xlsx
│ └── README.md
├── json/                     # JSON examples
│ ├── example.py
│ ├── generate_data.py
│ ├── skills.json
│ └── README.md
├── mysql/                    # MySQL examples
│ ├── example.py
│ ├── generate_data.py
│ ├── create_join_data.py
│ ├── join_query_example.py
│ ├── join_query_test.py
│ └── test_mysql_example.py
├── sqlite/                   # SQLite examples
│ ├── example.py
│ ├── generate_data.py
│ └── company.db
└── duckdb/                   # DuckDB examples
├── example.py
└── generate_data.py
```