# DatasetPlus
[Chinese Version](README-CN.md)
An enhanced Hugging Face datasets wrapper designed for large model data processing, providing intelligent caching, data augmentation, and filtering capabilities.
## π Core Features
### 1. π Full Compatibility - Seamless datasets Replacement
- **100% Compatible**: Supports all datasets methods and attributes
- **Mutual Conversion, Plug and Play**: Wrap with `dsp = DatasetPlus(ds)`; convert back with `ds = dsp.ds`
- **Static Method Support**: All Dataset static methods and class methods can be called directly
### 2. π§ Intelligent Caching System - Zero Loss for Large Model Calls
- **Automatic Function-level Caching**: Cache keys are generated from the function's content, so if a run is interrupted by network instability, exhausted quota, etc., previously processed results are not lost
- **Jupyter Friendly**: Even if you forget to assign the result to a variable, it can be recovered from the cache
- **Resume from Breakpoint**: Processing can continue after an interruption; already-processed data is read from the cache automatically
### 3. π Data Augmentation - One Row Becomes Multiple Rows
- **Array Auto-expansion**: Automatically expands a list returned by the map function into multiple rows of data
- **LLM Result Parsing**: Easy data augmentation with MyLLMTool
### 4. π Intelligent Filtering - Auto-delete When Returning None
- **Conditional Filtering**: Automatically deletes a row when the map function returns None
- **LLM Intelligent Filtering**: Use large models for complex conditional filtering
### 5. π¨ Generate from Scratch - Direct Data Generation from Large Models
- **iter Method**: Supports generating datasets from scratch
- **Flexible Generation**: Generates data of any format and content
- **Batch Generation**: Supports large-scale data generation with automatic caching and parallel processing
## π¦ Installation
### Install from PyPI
```bash
pip install datasetplus
```
### Install from Source
```bash
git clone https://github.com/yourusername/datasetplus.git
cd datasetplus
pip install -e .
```
### Dependencies Installation
```bash
# Basic dependencies
pip install datasets pandas numpy
# Excel support
pip install openpyxl
# LLM support
pip install openai
```
## π― Quick Start
### Basic Usage
```python
from datasetplus import DatasetPlus, MyLLMTool
# Load dataset
dsp = DatasetPlus.load_dataset("data.jsonl")
# Fully compatible with datasets - plug and play
ds = dsp.ds  # Unwrap to the native datasets object; dsp itself offers all datasets functionality plus the DatasetPlus enhancements
```
### π Feature 1: Full datasets Compatibility
```python
# All datasets methods can be used directly
ds = dsp.ds # Get native dataset object
dsp_shuffled = dsp.shuffle(seed=42)
dsp_split = dsp.train_test_split(test_size=0.2)
dsp_filtered = dsp.filter(lambda x: len(x['text']) > 100)
# pandas round-trips are seamless as well
dsp_df = dsp.to_pandas()
dsp = DatasetPlus.from_pandas(dsp_df)
# Static methods are fully supported
dsp_from_dict = DatasetPlus.from_dict({"text": ["hello", "world"]})
dsp_from_hf = DatasetPlus.from_pretrained("squad")
# Seamless switching with native datasets
from datasets import Dataset
ds = Dataset.from_dict({"text": ["test"]})
dsp = DatasetPlus(ds) # Directly wrap existing dataset
# Jupyter-friendly display
dsp = DatasetPlus.from_dict({"text": ["a", "b", "c"]})
dsp
----------------
DatasetPlus({
features: ['text'],
num_rows: 3
})

## Intuitive slicing and display logic 1
dsp[0]   # Returns the first row as a dict (dsp.select(...) is also supported)
----------------
{'text': 'a'}

## Intuitive slicing and display logic 2
dsp[1:2]   # Equivalent to dsp.select(range(1, 2))
----------------
DatasetPlus({
    features: ['text'],
    num_rows: 1
})

## Intuitive slicing and display logic 3
dsp[1:]   # Equivalent to dsp.select(range(1, len(dsp)))
----------------
DatasetPlus({
    features: ['text'],
    num_rows: 2
})
```
### π§ Feature 2: Intelligent Caching - Zero Loss for Large Model Calls
```python
# Define processing function containing large model calls
def enhance_with_llm(example):
# Initialize LLM tool (needs to be instantiated internally for multiprocessing)
llm = MyLLMTool(
model_name="gpt-3.5-turbo",
base_url="https://api.openai.com/v1",
api_key="your-api-key"
)
# Call large model for data enhancement
prompt = f"Please generate a summary for the following text: {example['text']}"
summary = llm.getResult(prompt)
example['summary'] = summary
return example
# First run - will call large model
dsp_enhanced = dsp.map(enhance_with_llm, num_proc=4, cache=True)
# Forgot to assign in Jupyter? No problem!
# Even if you ran: dsp.map(enhance_with_llm, cache=True) # Forgot assignment
# You can still recover results with:
dsp_enhanced = dsp.map(enhance_with_llm, cache=True) # Auto-read from cache, won't call LLM again
# Continuing after interruption is also fine, processed data will be automatically skipped
```
### π Feature 3: Data Augmentation - One Row Becomes Multiple Rows
```python
import json

def expand_data_with_llm(example):
    # Instantiate the LLM tool inside the function (required for multiprocessing)
    llm = MyLLMTool(
        model_name="gpt-3.5-turbo",
        base_url="https://api.openai.com/v1",
        api_key="your-api-key"
    )
    # Use the LLM to generate multiple related questions
    prompt = f"Generate 3 related questions based on the following text, return a JSON array: {example['text']}"
    questions_json = llm.getResult(prompt)
    try:
        questions = json.loads(questions_json)
        # Return a list: DatasetPlus automatically expands it into multiple rows
        return [{
            'original_text': example['text'],
            'question': q,
            'source': 'llm_generated'
        } for q in questions]
    except (json.JSONDecodeError, TypeError):
        return example  # Keep the original row on parse failure (or return None to drop it)
# Original data: 100 rows
# After processing: might become 300 rows (3 questions generated per row)
dsp_expanded = dsp.map(expand_data_with_llm, cache=True)
print(f"Original data: {len(dsp)} rows")
print(f"After expansion: {len(dsp_expanded)} rows")
```
### π Feature 4: Intelligent Filtering - Auto-delete When Returning None
```python
import json

def filter_with_llm(example):
    # Instantiate the LLM tool inside the function (required for multiprocessing)
    llm = MyLLMTool(
        model_name="gpt-3.5-turbo",
        base_url="https://api.openai.com/v1",
        api_key="your-api-key"
    )
    # Use the LLM for quality assessment
    prompt = f"""Evaluate the quality of the following text, return JSON format: {{"quality": "high/mid/low"}}
    Text: {example['text']}"""
    result = llm.getResult(prompt)
    try:
        quality_data = json.loads(result)
        quality = quality_data.get('quality', 'low')
        # Keep only high-quality rows; returning None deletes the row automatically
        if quality == 'high':
            example['quality_score'] = quality
            return example
        else:
            return None  # Auto-delete low-quality data
    except (json.JSONDecodeError, TypeError):
        return None  # Also delete on parse failure
# Original data: 1000 rows
# After filtering: might only have 200 rows of high-quality data
dsp_filtered = dsp.map(filter_with_llm, cache=True)
print(f"Before filtering: {len(dsp)} rows")
print(f"After filtering: {len(dsp_filtered)} rows")
```
### π¨ Feature 5: Generate from Scratch - Direct Data Generation from Large Models
```python
# Use the iter method to generate data directly from a large model
import json

def generate_dialogues(example):
llm = MyLLMTool(
model_name="gpt-3.5-turbo",
base_url="https://api.openai.com/v1",
api_key="your-api-key"
)
# Prompt: Generate 10 customer service dialogues
prompt = """Please generate 10 different customer service dialogues, each containing questions and answers.
Requirements:
1. Each dialogue has users asking different specific questions
2. Customer service provides professional answers
3. Return JSON array format: [{"user": "User question 1", "assistant": "Service answer 1", "category": "Question category 1"}, ...]
4. Cover different types of questions (technical support, after-sales service, product consultation, etc.)
"""
try:
result = llm.getResult(prompt)
dialogues_data = json.loads(result)
# Return array, DatasetPlus will automatically expand into multiple rows
return [{
'batch_id': example['id'],
'dialogue_id': i,
'user': dialogue.get('user', ''),
'assistant': dialogue.get('assistant', ''),
'category': dialogue.get('category', ''),
'source': 'generated'
} for i, dialogue in enumerate(dialogues_data)]
except Exception as e:
print(f"Generation failed: {e}")
return None # Skip on generation failure
# Generate 10 batches of dialogue data, each batch containing 10 dialogues
dsp_generated = DatasetPlus.iter(
iterate_num=10, # Generate 10 batches of data
fn=generate_dialogues, # Generation function
num_proc=2, # 2 processes in parallel
cache=False # iter cache defaults to False
)
print(f"Generated {len(dsp_generated)} dialogue data") # Should be 100 (10 batches Γ 10 dialogues)
print(dsp_generated[0]) # View first generated data
```
## π Supported Data Formats
- **JSON/JSONL**: Standard JSON and JSON Lines format
- **CSV**: Comma-separated values files
- **Excel**: .xlsx and .xls files
- **Hugging Face Datasets**: Any dataset from the Hub
- **DataFrame / Dataset objects**: pandas DataFrames and Hugging Face Dataset objects
- **Directory Batch Loading**: Automatically merges multiple files in a directory (see the loading sketch below)
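A quick loading sketch covering several of these sources (the file and directory names below are placeholders):

```python
from datasetplus import DatasetPlus
import pandas as pd

# Single files (format inferred from the file extension)
dsp_jsonl = DatasetPlus.load_dataset("train.jsonl")
dsp_csv = DatasetPlus.load_dataset("train.csv")
dsp_xlsx = DatasetPlus.load_dataset("train.xlsx")

# A directory of mixed formats, merged into one dataset
dsp_dir = DatasetPlus.load_dataset_plus("./data_folder/")

# In-memory objects and the Hugging Face Hub
dsp_from_df = DatasetPlus.from_pandas(pd.DataFrame({"text": ["hello"]}))
dsp_from_hub = DatasetPlus.from_pretrained("squad")
```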
## π§ Advanced Features
### Intelligent Caching Mechanism
```python
# Cache based on function content hash, ensures recalculation when function changes
def process_v1(example):
return {"result": example["text"].upper()} # Version 1
def process_v2(example):
return {"result": example["text"].lower()} # Version 2
# Different functions generate different caches and do not interfere with each other
dsp1 = dsp.map(process_v1, cache=True)  # Cache A
dsp2 = dsp.map(process_v2, cache=True)  # Cache B
```
### Batch Processing and Multiprocessing
```python
# Efficient processing of large datasets
dsp = DatasetPlus.load_dataset("large_dataset.jsonl")
dsp_processed = dsp.map(
enhance_with_llm,
num_proc=8, # 8 processes in parallel
max_inner_num=1000, # Process 1000 items per batch
cache=True # Enable caching
)
```
### Directory Batch Loading
```python
# Automatically load and merge all supported files in directory
dsp = DatasetPlus.load_dataset_plus("./data_folder/")
# Supports mixed formats: data_folder/
# βββ file1.jsonl
# βββ file2.csv
# βββ file3.xlsx
```
### Professional Excel Processing
```python
from datasetplus import DatasetPlusExcels
# Professional Excel file processing
excel_dsp = DatasetPlusExcels("spreadsheet.xlsx")
# Support multi-sheet processing
sheet_names = excel_dsp.get_sheet_names()
for sheet in sheet_names:
sheet_data = excel_dsp.get_sheet_data(sheet)
dsp_processed = excel_dsp.map(lambda x: {'cleaned': x['column'].strip()})
```
## π API Reference
### DatasetPlus
Enhanced dataset processing class, fully compatible with Hugging Face datasets.
#### Core Methods
- `map(fn, num_proc=1, max_inner_num=1000, cache=True)`: Enhanced mapping function (see the sketch after this list)
- **fn**: Processing function, supports returning arrays (auto-expand) and None (auto-delete)
- **cache**: Intelligent caching, automatically generates cache keys based on function content
- **num_proc**: Multiprocess parallel processing
- **max_inner_num**: Batch processing size
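A minimal sketch of the three return conventions listed above, using a plain function so no LLM call is needed:

```python
from datasetplus import DatasetPlus

dsp = DatasetPlus.from_dict({"text": ["short", "a much longer example text"]})

def demo(example):
    text = example["text"]
    if len(text) < 6:
        return None                                 # None: the row is deleted
    if " " in text:
        return [{"text": w} for w in text.split()]  # list: one row expands into several rows
    return {"text": text.upper()}                   # dict: ordinary one-to-one mapping

dsp_out = dsp.map(demo, num_proc=1, cache=False)    # cache=False keeps this toy run cache-free
```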
#### Static Methods
- `load_dataset(file_name, output_file)`: Load single dataset file
- `load_dataset_plus(input_path, output_file)`: Load from file, directory, or Hub
- `from_pandas(df)`: Create from pandas DataFrame
- `from_dict(data)`: Create from dictionary
- `from_pretrained(path)`: Load from Hugging Face Hub
- `iter(iterate_num, fn, num_proc=1, max_inner_num=1000, cache=True)`: Generate a dataset from scratch by calling `fn` iteratively (see the sketch after this list)
  - **iterate_num**: Number of iterations to run; each call to `fn` can return a single row or a list of rows
- **fn**: Generation function, receives example with id, returns generated data
- **num_proc**: Multiprocess parallel processing
- **cache**: Enable caching, avoid duplicate generation
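A small sketch of `iter` with a synthetic generation function (no LLM involved); it assumes, as in the Feature 5 example, that `fn` receives an example carrying an `id` and may return a list of rows:

```python
from datasetplus import DatasetPlus

def make_rows(example):
    # example carries the iteration id; returning a list emits several rows per call
    return [{"batch_id": example["id"], "value": i} for i in range(3)]

dsp_generated = DatasetPlus.iter(
    iterate_num=5,   # call make_rows 5 times
    fn=make_rows,
    num_proc=1,
    cache=False
)
print(len(dsp_generated))  # Expected: 15 rows (5 calls x 3 rows each)
```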
#### Compatibility
```python
# All datasets methods can be called directly on a DatasetPlus instance
dsp.shuffle()           # Shuffle data
dsp.filter()            # Filter data
dsp.select()            # Select data
dsp.train_test_split()  # Split data
dsp.save_to_disk()      # Save to disk
# ... and all other datasets methods
```
### MyLLMTool
Large model calling tool, supports OpenAI-compatible APIs.
#### Initialization
```python
llm = MyLLMTool(
model_name="gpt-3.5-turbo", # Model name
base_url="https://api.openai.com/v1", # API base URL
api_key="your-api-key" # API key
)
```
#### Methods
- `getResult(query, sys_prompt=None, temperature=0.7, top_p=1, max_tokens=2048, model_name=None)`: Get LLM response (see the example after this list)
- **query**: User query
- **sys_prompt**: System prompt
- **temperature**: Temperature parameter
- **max_tokens**: Maximum tokens
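A short example of a per-request override based on the signature above; the model names, endpoint, and key are placeholders:

```python
llm = MyLLMTool(
    model_name="gpt-3.5-turbo",
    base_url="https://api.openai.com/v1",
    api_key="your-api-key"
)

label = llm.getResult(
    query="Classify the sentiment of: 'The product arrived late.'",
    sys_prompt="Answer with one word: positive, negative, or neutral.",
    temperature=0.0,      # low temperature for stable labeling output
    max_tokens=16,
    model_name="gpt-4o"   # override the default model for this call only
)
```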
### DatasetPlusExcels
Professional Excel file processing class.
#### Methods
- `__init__(file_path, output_file)`: Initialize Excel processor
- `get_sheet_names()`: Get all sheet names
- `get_sheet_data(sheet_name)`: Get data from specified sheet
## π― Real-world Use Cases
### Case 1: Large-scale Data Annotation
```python
# Use LLM for sentiment analysis annotation on large amounts of text
def sentiment_labeling(example):
    # Instantiate the LLM tool inside the function (required when using num_proc > 1)
    llm = MyLLMTool(
        model_name="gpt-3.5-turbo",
        base_url="https://api.openai.com/v1",
        api_key="your-api-key"
    )
    prompt = f"Analyze the sentiment of the following text, return positive/negative/neutral: {example['text']}"
    sentiment = llm.getResult(prompt)
    example['sentiment'] = sentiment.strip()
    return example
# Process 100,000 data points, supports resume from breakpoint
dsp_labeled = dsp.map(sentiment_labeling, cache=True, num_proc=4)
```
### Case 2: Data Quality Filtering
```python
# Use an LLM to keep only high-quality training data
llm = MyLLMTool(
    model_name="gpt-3.5-turbo",
    base_url="https://api.openai.com/v1",
    api_key="your-api-key"
)

def quality_filter(example):
    prompt = f"Rate text quality (1-5 points): {example['text']}"
    score = llm.getResult(prompt)
    try:
        if int(score.strip()) >= 4:
            return example
        else:
            return None  # Auto-delete low-quality data
    except (ValueError, TypeError):
        return None
dsp_filtered = dsp.map(quality_filter, cache=True)
```
### Case 3: Data Augmentation
```python
# Generate multiple variants for each data point
llm = MyLLMTool(
    model_name="gpt-3.5-turbo",
    base_url="https://api.openai.com/v1",
    api_key="your-api-key"
)

def data_augmentation(example):
    prompt = f"Generate 3 synonymous rewrites for the following text, one per line: {example['text']}"
    variants = llm.getResult(prompt).split('\n')
    # Return a list: each variant becomes its own row
    return [{
        'text': variant.strip(),
        'label': example['label'],
        'source': 'augmented'
    } for variant in variants if variant.strip()]
dsp_augmented = dsp.map(data_augmentation, cache=True)
```
### Case 4: Generate Training Data from Scratch
```python
# Use an LLM to generate training data from scratch
import json
from datetime import datetime

def generate_qa_pairs(example):
llm = MyLLMTool(
model_name="gpt-3.5-turbo",
base_url="https://api.openai.com/v1",
api_key="your-api-key"
)
# Prompt for generating Q&A pairs
prompt = """Generate a Q&A pair about Python programming.
Requirements:
1. Question should be specific and practical
2. Answer should be accurate and detailed
3. Return JSON format: {"question": "question", "answer": "answer", "difficulty": "easy/medium/hard"}
"""
try:
result = llm.getResult(prompt)
qa_data = json.loads(result)
return {
'id': example['id'],
'question': qa_data.get('question', ''),
'answer': qa_data.get('answer', ''),
'difficulty': qa_data.get('difficulty', 'medium'),
'domain': 'python_programming',
'generated_at': datetime.now().isoformat()
}
except Exception as e:
print(f"Generation failed: {e}")
return None
# Generate 1000 Python programming Q&A pairs
dsp_qa_dataset = DatasetPlus.iter(
iterate_num=1000,
fn=generate_qa_pairs,
num_proc=4,
cache=True
)
print(f"Successfully generated {len(dsp_qa_dataset)} Q&A pairs")
```
## π‘ Best Practices
### 1. Caching Strategy
- Always enable caching: `cache=True`
- Friendly to large model calls: if a run is interrupted by network instability, exhausted quota, etc., previously processed results are not lost
- The cache key is derived from the function's content, so a modified function is automatically recomputed
### 2. Performance Optimization
- Set `num_proc` sensibly (match the maximum concurrency your LLM endpoint can handle)
- Tune `max_inner_num` (the number of rows held in memory; results are persisted to disk every `max_inner_num` rows)
- Use batch processing for large datasets
### 3. Error Handling
```python
def robust_processing(example):
    try:
        # LLM call (llm is a MyLLMTool instance; the prompt is built from the example)
        prompt = f"Process the following text: {example['text']}"
        result = llm.getResult(prompt)
        return process_result(result)  # your own parsing/post-processing
    except Exception as e:
        print(f"Processing failed: {e}")
        return None  # Failed rows are deleted automatically
```
## π System Requirements
- **Python**: >= 3.7
- **datasets**: >= 2.0.0
- **pandas**: >= 1.3.0
- **numpy**: >= 1.21.0
- **openpyxl**: >= 3.0.0 (Excel support)
- **openai**: >= 1.0.0 (LLM support)
## π€ Contributing
Pull Requests are welcome! Please ensure:
- Code follows project standards
- Add appropriate tests
- Update relevant documentation
## π License
This project is licensed under the MIT License - see the LICENSE file for details.
## π Changelog
### v0.2.0
- β¨ Added intelligent caching system
- β¨ Support array auto-expansion
- β¨ Support None auto-filtering
- β¨ Full datasets API compatibility
- β¨ Added MyLLMTool large model tool
### v0.1.0
- π Initial release
- π Basic dataset loading functionality
- π Excel file support
- β‘ Caching and batch processing
- π Directory loading support
## π οΈ Auxiliary Tools
DatasetPlus provides two powerful auxiliary tools to enhance your data processing workflow:
### MyLLMTool
A comprehensive LLM (Large Language Model) calling tool that supports OpenAI-compatible APIs.
#### Key Features
- **Multi-model Support**: Compatible with OpenAI and other OpenAI-compatible APIs
- **Flexible Configuration**: Customizable parameters for different use cases
- **Easy Integration**: Seamlessly integrates with DatasetPlus processing workflows
#### Methods
- **`__init__(model_name, base_url, api_key)`**: Initialize the LLM tool
- `model_name`: The name of the model to use (e.g., "gpt-3.5-turbo")
- `base_url`: API base URL (e.g., "https://api.openai.com/v1")
- `api_key`: Your API key for authentication
- **`getResult(query, sys_prompt=None, temperature=0.7, top_p=1, max_tokens=2048, model_name=None)`**: Get LLM response
- `query`: User query or prompt
- `sys_prompt`: System prompt to set context (optional)
- `temperature`: Controls randomness (0.0-2.0)
- `top_p`: Controls diversity via nucleus sampling
- `max_tokens`: Maximum number of tokens to generate
- `model_name`: Override the default model for this request
#### Usage Example
```python
from datasetplus import MyLLMTool
llm = MyLLMTool(
model_name="gpt-3.5-turbo",
base_url="https://api.openai.com/v1",
api_key="your-api-key"
)
result = llm.getResult(
query="Summarize this text: ...",
sys_prompt="You are a helpful assistant.",
temperature=0.7
)
```
### DataTool
A utility class providing various data processing and parsing functions for common data manipulation tasks.
#### Key Features
- **JSON Parsing**: Safe JSON extraction and parsing from text
- **File Operations**: Read and sample data from files
- **Data Validation**: Check data structure compliance
- **Format Conversion**: Convert between different data formats
#### Methods
- **`parse_json_safe(text_str)`**: Extract and parse JSON objects/arrays from text
- `text_str`: Input string that may contain embedded JSON
- Returns: List of parsed Python objects (dicts or lists)
- **`get_prompt(file_path)`**: Read text file and return content as string
- `file_path`: Path to the text file
- Returns: File content as concatenated string
- **`check(row)`**: Validate data structure for message format
- `row`: Data row to validate
- Returns: Boolean indicating if structure is valid
- **`check_with_system(row)`**: Check if data has system message format
- `row`: Data row to validate
  - Returns: Boolean indicating whether the row contains a valid system message
- **`parse_messages(str_row)`**: Parse message format from string
- `str_row`: String containing message data
- Returns: Parsed message object or None
- **`parse_json(str, json_tag=False)`**: Parse JSON with error handling
- `str`: JSON string to parse
  - `json_tag`: Whether to extract the JSON from a fenced `` ```json `` code block
- Returns: Parsed object or None on failure
- **`sample_from_file(file_path, num=-1)`**: Sample lines from text file
- `file_path`: Path to the file
- `num`: Number of samples (-1 for all)
- Returns: List of sampled lines
- **`sample_from(path, num=-1, granularity="auto", exclude=[])`**: Sample data from files/directories
- `path`: File or directory path
- `num`: Number of samples (-1 for all)
- `granularity`: Sampling granularity ("auto", "file", "line")
- `exclude`: Patterns to exclude
- Returns: List of sampled content
- **`jsonl2json(source_path, des_path)`**: Convert JSONL to JSON format
- `source_path`: Source JSONL file path
- `des_path`: Destination JSON file path
#### Usage Example
```python
from datasetplus import DataTool
# Parse JSON from text
json_data = DataTool.parse_json_safe('Some text {"key": "value"} more text')
# Read prompt from file
prompt = DataTool.get_prompt('prompt.txt')
# Validate data structure
is_valid = DataTool.check(data_row)
# Sample from file
samples = DataTool.sample_from_file('data.txt', num=10)
```
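The remaining parsing and conversion helpers can be used the same way. A hedged sketch based on the signatures listed above; the input strings, paths, and exclude pattern are illustrative assumptions, and return shapes may differ slightly:

```python
from datasetplus import DataTool

# Parse JSON wrapped in a fenced ```json block (common in LLM output)
obj = DataTool.parse_json('```json\n{"quality": "high"}\n```', json_tag=True)

# Parse a chat-style message string; returns None if it cannot be parsed
messages = DataTool.parse_messages('[{"role": "user", "content": "hi"}]')

# Sample 5 items from a file or directory, excluding temporary files (pattern syntax assumed)
samples = DataTool.sample_from("./data_folder/", num=5, granularity="auto", exclude=["*.tmp"])

# Convert a JSONL file into a JSON array file
DataTool.jsonl2json("data.jsonl", "data.json")
```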
These auxiliary tools are designed to work seamlessly with DatasetPlus workflows, providing essential utilities for LLM integration and data processing tasks.