datasetplus

Name: datasetplus
Version: 0.4.3
Summary: An enhanced wrapper for Hugging Face datasets with additional functionality
Upload time: 2025-08-03 02:31:41
Requires Python: >=3.7
Keywords: datasets, data processing, nlp
Requirements: datasets, pandas, numpy, openpyxl, xlrd, pyarrow, tqdm, requests, openai, json5, json-repair
# DatasetPlus

[Chinese version](README-CN.md)

An enhanced Hugging Face datasets wrapper designed for large model data processing, providing intelligent caching, data augmentation, and filtering capabilities.


## 🚀 Core Features

### 1. 🔄 Full Compatibility - Seamless datasets Replacement
- **100% Compatible**: Supports all datasets methods and attributes
- **Mutual Conversion, Plug and Play**: Wrap an existing dataset with `dsp = DatasetPlus(ds)`; convert back with `ds = dsp.ds`
- **Static Method Support**: All Dataset static methods and class methods can be called directly

### 2. 🧠 Intelligent Caching System - Zero Loss for Large Model Calls
- **Automatic Function-level Caching**: Cache keys are generated from the function's content, so even if a run is interrupted by network instability, insufficient quota, etc., previous results are not lost
- **Jupyter Friendly**: Even if you forget to assign variables, results can be recovered from cache
- **Resume from Breakpoint**: Supports continuing processing after interruption, automatically reads cache for processed data

### 3. 📈 Data Augmentation - One Row Becomes Multiple Rows
- **Array Auto-expansion**: Automatically expands arrays returned by map function into multiple rows of data
- **LLM Result Parsing**: Easy data augmentation with MyLLMTool

### 4. 🔍 Intelligent Filtering - Auto-delete When Returning None
- **Conditional Filtering**: Automatically deletes rows when map function returns None
- **LLM Intelligent Filtering**: Use large models for complex conditional filtering

### 5. 🎨 Generate from Scratch - Direct Data Generation from Large Models
- **iter Method**: Supports generating datasets from scratch
- **Flexible Generation**: Can generate data in any format and content
- **Batch Generation**: Supports large-scale data generation with automatic caching and parallel processing

## 📦 Installation

### Install from PyPI

```bash
pip install datasetplus
```

### Install from Source

```bash
git clone https://github.com/yourusername/datasetplus.git
cd datasetplus
pip install -e .
```

### Dependencies Installation

```bash
# Basic dependencies
pip install datasets pandas numpy

# Excel support
pip install openpyxl

# LLM support
pip install openai
```

## 🎯 Quick Start

### Basic Usage
```python
from datasetplus import DatasetPlus, MyLLMTool

# Load dataset
dsp = DatasetPlus.load_dataset("data.jsonl")

# Fully compatible with datasets - plug and play
ds = dsp.ds  # ds is the native Hugging Face Dataset; dsp adds the DatasetPlus enhancements on top
```

### 🔄 Feature 1: Full datasets Compatibility

```python
# All datasets methods can be used directly
ds = dsp.ds  # Get native dataset object
dsp_shuffled = dsp.shuffle(seed=42)
dsp_split = dsp.train_test_split(test_size=0.2)
dsp_filtered = dsp.filter(lambda x: len(x['text']) > 100)

# pandas can also seamlessly connect
dsp_df = dsp.to_pandas()
dsp = DatasetPlus.from_pandas(dsp_df)

# Static methods are fully supported
dsp_from_dict = DatasetPlus.from_dict({"text": ["hello", "world"]})
dsp_from_hf = DatasetPlus.from_pretrained("squad")

# Seamless switching with native datasets
from datasets import Dataset
ds = Dataset.from_dict({"text": ["test"]})
dsp = DatasetPlus(ds)  # Directly wrap existing dataset

# Jupyter-friendly display
dsp = DatasetPlus.from_dict({"text": ["a", "b", "c"]})
dsp
----------------
DatasetPlus({
    features: ['text'],
    num_rows: 3
})

## Humanized slicing and display logic 1
dsp[0] # Returns the first row as a dict (like ds[0]); dsp also supports dsp.select(range(n))
----------------
{'text': 'a'}

## Humanized slicing and display logic 2
dsp[1:2] # Equivalent to ds.select(range(1,2))
----------------
DatasetPlus({
    features: ['text'],
    num_rows: 1
})

## Humanized slicing and display logic 3
dsp[1:]
----------------
DatasetPlus({
    features: ['text'],
    num_rows: 2
})
```

### 🧠 Feature 2: Intelligent Caching - Zero Loss for Large Model Calls

```python
# Define processing function containing large model calls
def enhance_with_llm(example):
    # Initialize LLM tool (needs to be instantiated internally for multiprocessing)
    llm = MyLLMTool(
        model_name="gpt-3.5-turbo",
        base_url="https://api.openai.com/v1",
        api_key="your-api-key"
    )
    # Call large model for data enhancement
    prompt = f"Please generate a summary for the following text: {example['text']}"
    summary = llm.getResult(prompt)
    example['summary'] = summary
    return example

# First run - will call large model
dsp_enhanced = dsp.map(enhance_with_llm, num_proc=4, cache=True)

# Forgot to assign in Jupyter? No problem!
# Even if you ran: dsp.map(enhance_with_llm, cache=True)  # Forgot assignment
# You can still recover results with:
dsp_enhanced = dsp.map(enhance_with_llm, cache=True)  # Auto-read from cache, won't call LLM again

# Continuing after interruption is also fine, processed data will be automatically skipped
```

### 📈 Feature 3: Data Augmentation - One Row Becomes Multiple Rows

```python
import json

# Assumes `llm` is a MyLLMTool instance, created as in the caching example above
def expand_data_with_llm(example):
    # Use LLM to generate multiple related questions
    prompt = f"Generate 3 related questions based on the following text, return in JSON array format: {example['text']}"
    questions_json = llm.getResult(prompt)
    
    try:
        questions = json.loads(questions_json)
        # Return array, DatasetPlus will automatically expand into multiple rows
        return [{
            'original_text': example['text'],
            'question': q,
            'source': 'llm_generated'
        } for q in questions]
    except Exception:
        return example  # Keep the original row on parsing failure, or drop it with: return None

# Original data: 100 rows
# After processing: might become 300 rows (3 questions generated per row)
dsp_expanded = dsp.map(expand_data_with_llm, cache=True)
print(f"Original data: {len(dsp)} rows")
print(f"After expansion: {len(dsp_expanded)} rows")
```

### 🔍 Feature 4: Intelligent Filtering - Auto-delete When Returning None

```python
import json

# Assumes `llm` is a MyLLMTool instance, created as in the caching example above
def filter_with_llm(example):
    # Use LLM for quality assessment
    prompt = f"""Evaluate the quality of the following text, return JSON format: {{"quality": "high/mid/low"}}
    Text: {example['text']}"""
    
    result = llm.getResult(prompt)
    try:
        quality_data = json.loads(result)
        quality = quality_data.get('quality', 'low')
        
        # Only keep high-quality data, others returning None will be auto-deleted
        if quality == 'high':
            example['quality_score'] = quality
            return example
        else:
            return None  # Auto-delete low-quality data
    except Exception:
        return None  # Also delete on parsing failure

# Original data: 1000 rows
# After filtering: might only have 200 rows of high-quality data
dsp_filtered = dsp.map(filter_with_llm, cache=True)
print(f"Before filtering: {len(dsp)} rows")
print(f"After filtering: {len(dsp_filtered)} rows")
```

### 🎨 Feature 5: Generate from Scratch - Direct Data Generation from Large Models

```python
# Use iter method to generate data directly from large models
def generate_dialogues(example):
    llm = MyLLMTool(
        model_name="gpt-3.5-turbo",
        base_url="https://api.openai.com/v1",
        api_key="your-api-key"
    )
    
    # Prompt: Generate 10 customer service dialogues
    prompt = """Please generate 10 different customer service dialogues, each containing questions and answers.
    Requirements:
    1. Each dialogue has users asking different specific questions
    2. Customer service provides professional answers
    3. Return JSON array format: [{"user": "User question 1", "assistant": "Service answer 1", "category": "Question category 1"}, ...]
    4. Cover different types of questions (technical support, after-sales service, product consultation, etc.)
    """
    
    try:
        result = llm.getResult(prompt)
        dialogues_data = json.loads(result)
        
        # Return array, DatasetPlus will automatically expand into multiple rows
        return [{
            'batch_id': example['id'],
            'dialogue_id': i,
            'user': dialogue.get('user', ''),
            'assistant': dialogue.get('assistant', ''),
            'category': dialogue.get('category', ''),
            'source': 'generated'
        } for i, dialogue in enumerate(dialogues_data)]
    except Exception as e:
        print(f"Generation failed: {e}")
        return None  # Skip on generation failure

# Generate 10 batches of dialogue data, each batch containing 10 dialogues
dsp_generated = DatasetPlus.iter(
    iterate_num=10,           # Generate 10 batches of data
    fn=generate_dialogues,    # Generation function
    num_proc=2,              # 2 processes in parallel
    cache=False              # iter cache defaults to False
)

print(f"Generated {len(dsp_generated)} dialogue data")  # Should be 100 (10 batches Γ— 10 dialogues)
print(dsp_generated[0])  # View first generated data
```

## 📁 Supported Data Formats

- **JSON/JSONL**: Standard JSON and JSON Lines format
- **CSV**: Comma-separated values files
- **Excel**: .xlsx and .xls files
- **Hugging Face Datasets**: Any dataset from the Hub
- **DataFrame / Dataset**: pandas DataFrames and native Hugging Face Dataset objects
- **Directory Batch Loading**: Automatically merges multiple files in a directory (see the loading sketch below)
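
A minimal loading sketch under the assumptions above; the file and directory paths are placeholders, and only loaders documented in this README (`load_dataset`, `load_dataset_plus`, `from_pandas`, `from_pretrained`) are used:

```python
import pandas as pd
from datasetplus import DatasetPlus

# Single files (placeholder paths)
dsp_jsonl = DatasetPlus.load_dataset("data.jsonl")
dsp_csv = DatasetPlus.load_dataset("data.csv")
dsp_xlsx = DatasetPlus.load_dataset("data.xlsx")

# A directory of mixed formats, merged into a single dataset
dsp_dir = DatasetPlus.load_dataset_plus("./data_folder/")

# In-memory pandas DataFrame
df = pd.DataFrame({"text": ["hello", "world"]})
dsp_df = DatasetPlus.from_pandas(df)

# Any dataset from the Hugging Face Hub
dsp_hub = DatasetPlus.from_pretrained("squad")
```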

## 🔧 Advanced Features

### Intelligent Caching Mechanism

```python
# Cache based on function content hash, ensures recalculation when function changes
def process_v1(example):
    return {"result": example["text"].upper()}  # Version 1

def process_v2(example):
    return {"result": example["text"].lower()}  # Version 2

# Different functions generate different caches, with no interference
dsp1 = dsp.map(process_v1, cache=True)  # Cache A
dsp2 = dsp.map(process_v2, cache=True)  # Cache B
```

### Batch Processing and Multiprocessing

```python
# Efficient processing of large datasets
dsp = DatasetPlus.load_dataset("large_dataset.jsonl")
dsp_processed = dsp.map(
    enhance_with_llm,
    num_proc=8,           # 8 processes in parallel
    max_inner_num=1000,   # Process 1000 items per batch
    cache=True            # Enable caching
)
```

### Directory Batch Loading

```python
# Automatically load and merge all supported files in directory
dsp = DatasetPlus.load_dataset_plus("./data_folder/")
# Supports mixed formats: data_folder/
#   ├── file1.jsonl
#   ├── file2.csv
#   └── file3.xlsx
```

### Professional Excel Processing

```python
from datasetplus import DatasetPlusExcels

# Professional Excel file processing
excel_dsp = DatasetPlusExcels("spreadsheet.xlsx")

# Support multi-sheet processing
sheet_names = excel_dsp.get_sheet_names()
for sheet in sheet_names:
    sheet_data = excel_dsp.get_sheet_data(sheet)
    dsp_processed = excel_dsp.map(lambda x: {'cleaned': x['column'].strip()})
```

## 📚 API Reference

### DatasetPlus

Enhanced dataset processing class, fully compatible with Hugging Face datasets.

#### Core Methods

- `map(fn, num_proc=1, max_inner_num=1000, cache=True)`: Enhanced mapping function
  - **fn**: Processing function, supports returning arrays (auto-expand) and None (auto-delete)
  - **cache**: Intelligent caching, automatically generates cache keys based on function content
  - **num_proc**: Multiprocess parallel processing
  - **max_inner_num**: Batch processing size
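
A compact sketch of the three return conventions described above: a dict keeps the row, a list of dicts expands it into multiple rows, and `None` drops it. The `text` and `part` fields are illustrative only:

```python
def triage(example):
    text = example["text"]
    if not text:
        return None  # None -> the row is deleted
    if len(text) > 200:
        # A list of dicts -> the row is expanded into multiple rows
        return [
            {"text": text[:200], "part": 0},
            {"text": text[200:], "part": 1},
        ]
    example["part"] = 0  # A single dict -> an ordinary one-to-one mapping
    return example

dsp_triaged = dsp.map(triage, num_proc=1, max_inner_num=1000, cache=True)
```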

#### Static Methods

- `load_dataset(file_name, output_file)`: Load single dataset file
- `load_dataset_plus(input_path, output_file)`: Load from file, directory, or Hub
- `from_pandas(df)`: Create from pandas DataFrame
- `from_dict(data)`: Create from dictionary
- `from_pretrained(path)`: Load from Hugging Face Hub
- `iter(iterate_num, fn, num_proc=1, max_inner_num=1000, cache=True)`: Generate from scratch, iteratively generate data
  - **iterate_num**: Number of iterations to run (each iteration calls `fn` once and may yield one or more rows)
  - **fn**: Generation function; receives an example containing an `id` and returns the generated data
  - **num_proc**: Multiprocess parallel processing
  - **cache**: Enable caching, avoid duplicate generation
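
A minimal `iter` sketch with no LLM involved, assuming (as in the examples above) that each generated example carries an `id` field with the iteration index:

```python
def make_row(example):
    # `example["id"]` is the iteration index; the text below is purely illustrative
    return {
        "id": example["id"],
        "text": f"synthetic sample #{example['id']}",
    }

dsp_synthetic = DatasetPlus.iter(iterate_num=5, fn=make_row, num_proc=1, cache=False)
print(len(dsp_synthetic))  # expected: 5
```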

#### Compatibility

```python
# All datasets methods can be used directly
ds.shuffle()          # Shuffle data
ds.filter()           # Filter data
ds.select()           # Select data
ds.train_test_split() # Split data
ds.save_to_disk()     # Save to disk
# ... and all other datasets methods
```

### MyLLMTool

Large model calling tool, supports OpenAI-compatible APIs.

#### Initialization

```python
llm = MyLLMTool(
    model_name="gpt-3.5-turbo",      # Model name
    base_url="https://api.openai.com/v1",  # API base URL
    api_key="your-api-key"           # API key
)
```

#### Methods

- `getResult(query, sys_prompt=None, temperature=0.7, top_p=1, max_tokens=2048, model_name=None)`: Get LLM response
  - **query**: User query
  - **sys_prompt**: System prompt
  - **temperature**: Temperature parameter
  - **max_tokens**: Maximum tokens
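
For reference, a single hedged call combining these parameters (the model, endpoint, and key are placeholders):

```python
llm = MyLLMTool(
    model_name="gpt-3.5-turbo",
    base_url="https://api.openai.com/v1",
    api_key="your-api-key"
)
answer = llm.getResult(
    query="Summarize the following text in one sentence: ...",
    sys_prompt="You are a concise assistant.",
    temperature=0.2,
    max_tokens=256
)
```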

### DatasetPlusExcels

Professional Excel file processing class.

#### Methods

- `__init__(file_path, output_file)`: Initialize Excel processor
- `get_sheet_names()`: Get all sheet names
- `get_sheet_data(sheet_name)`: Get data from specified sheet

## 🎯 Real-world Use Cases

### Case 1: Large-scale Data Annotation

```python
# Use LLM for sentiment analysis annotation on large amounts of text
# Assumes `llm` is a MyLLMTool instance created beforehand
def sentiment_labeling(example):
    prompt = f"Analyze the sentiment of the following text, return positive/negative/neutral: {example['text']}"
    sentiment = llm.getResult(prompt)
    example['sentiment'] = sentiment.strip()
    return example

# Process 100,000 data points, supports resume from breakpoint
dsp_labeled = dsp.map(sentiment_labeling, cache=True, num_proc=4)
```

### Case 2: Data Quality Filtering

```python
# Use LLM to filter high-quality training data
def quality_filter(example):
    prompt = f"Rate text quality (1-5 points): {example['text']}"
    score = llm.getResult(prompt)
    try:
        if int(score) >= 4:
            return example
        else:
            return None  # Auto-delete low-quality data
    except Exception:
        return None

dsp_filtered = dsp.map(quality_filter, cache=True)
```

### Case 3: Data Augmentation

```python
# Generate multiple variants for each data point
def data_augmentation(example):
    prompt = f"Generate 3 synonymous rewrites for the following text: {example['text']}"
    variants = llm.getResult(prompt).split('\n')
    
    # Return array, automatically expand into multiple rows
    return [{
        'text': variant.strip(),
        'label': example['label'],
        'source': 'augmented'
    } for variant in variants if variant.strip()]

dsp_augmented = dsp.map(data_augmentation, cache=True)
```

### Case 4: Generate Training Data from Scratch

```python
# Use LLM to generate training data from scratch
import json
from datetime import datetime

def generate_qa_pairs(example):
    llm = MyLLMTool(
        model_name="gpt-3.5-turbo",
        base_url="https://api.openai.com/v1",
        api_key="your-api-key"
    )
    
    # Prompt for generating Q&A pairs
    prompt = """Generate a Q&A pair about Python programming.
    Requirements:
    1. Question should be specific and practical
    2. Answer should be accurate and detailed
    3. Return JSON format: {"question": "question", "answer": "answer", "difficulty": "easy/medium/hard"}
    """
    
    try:
        result = llm.getResult(prompt)
        qa_data = json.loads(result)
        return {
            'id': example['id'],
            'question': qa_data.get('question', ''),
            'answer': qa_data.get('answer', ''),
            'difficulty': qa_data.get('difficulty', 'medium'),
            'domain': 'python_programming',
            'generated_at': datetime.now().isoformat()
        }
    except Exception as e:
        print(f"Generation failed: {e}")
        return None

# Generate 1000 Python programming Q&A pairs
dsp_qa_dataset = DatasetPlus.iter(
    iterate_num=1000,
    fn=generate_qa_pairs,
    num_proc=4,
    cache=True
)

print(f"Successfully generated {len(dsp_qa_dataset)} Q&A pairs")
```

## 💡 Best Practices

### 1. Caching Strategy
- Always enable caching: `cache=True`
- LLM-call friendly: even if a run is interrupted by network instability, insufficient quota, etc., previous results are not lost
- Results are automatically recomputed when the function is modified (the cache key depends on the function content)

### 2. Performance Optimization
- Set `num_proc` according to the maximum concurrency your LLM endpoint can handle
- Adjust `max_inner_num` (the number of results held in memory before being written to disk for persistence)
- Use batch processing for large datasets (see the tuning sketch below)
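
A tuning sketch based on the points above, assuming an LLM endpoint that tolerates roughly 4 concurrent requests; `enhance_with_llm` is the function from Feature 2:

```python
dsp_processed = dsp.map(
    enhance_with_llm,
    num_proc=4,          # <= the concurrency your LLM endpoint can handle (assumed 4 here)
    max_inner_num=500,   # persist results to disk every 500 items to bound memory use
    cache=True           # completed calls survive interruptions
)
```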

### 3. Error Handling
```python
def robust_processing(example):
    try:
        # LLM call
        result = llm.getResult(prompt)
        return process_result(result)
    except Exception as e:
        print(f"Processing failed: {e}")
        return None  # Failed data auto-deleted
```

## 📋 System Requirements

- **Python**: >= 3.7
- **datasets**: >= 2.0.0
- **pandas**: >= 1.3.0
- **numpy**: >= 1.21.0
- **openpyxl**: >= 3.0.0 (Excel support)
- **openai**: >= 1.0.0 (LLM support)

## 🤝 Contributing

Pull Requests are welcome! Please ensure:
- Code follows project standards
- Add appropriate tests
- Update relevant documentation

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 📝 Changelog

### v0.2.0 (Latest)
- ✨ Added intelligent caching system
- ✨ Support array auto-expansion
- ✨ Support None auto-filtering
- ✨ Full datasets API compatibility
- ✨ Added MyLLMTool large model tool

### v0.1.0
- 🎉 Initial release
- 📁 Basic dataset loading functionality
- 📊 Excel file support
- ⚡ Caching and batch processing
- 📂 Directory loading support

## 🛠️ Auxiliary Tools

DatasetPlus provides two powerful auxiliary tools to enhance your data processing workflow:

### MyLLMTool

A comprehensive LLM (Large Language Model) calling tool that supports OpenAI-compatible APIs.

#### Key Features
- **Multi-model Support**: Compatible with OpenAI and other OpenAI-compatible APIs
- **Flexible Configuration**: Customizable parameters for different use cases
- **Easy Integration**: Seamlessly integrates with DatasetPlus processing workflows

#### Methods

- **`__init__(model_name, base_url, api_key)`**: Initialize the LLM tool
  - `model_name`: The name of the model to use (e.g., "gpt-3.5-turbo")
  - `base_url`: API base URL (e.g., "https://api.openai.com/v1")
  - `api_key`: Your API key for authentication

- **`getResult(query, sys_prompt=None, temperature=0.7, top_p=1, max_tokens=2048, model_name=None)`**: Get LLM response
  - `query`: User query or prompt
  - `sys_prompt`: System prompt to set context (optional)
  - `temperature`: Controls randomness (0.0-2.0)
  - `top_p`: Controls diversity via nucleus sampling
  - `max_tokens`: Maximum number of tokens to generate
  - `model_name`: Override the default model for this request

#### Usage Example
```python
from datasetplus import MyLLMTool

llm = MyLLMTool(
    model_name="gpt-3.5-turbo",
    base_url="https://api.openai.com/v1",
    api_key="your-api-key"
)

result = llm.getResult(
    query="Summarize this text: ...",
    sys_prompt="You are a helpful assistant.",
    temperature=0.7
)
```

### DataTool

A utility class providing various data processing and parsing functions for common data manipulation tasks.

#### Key Features
- **JSON Parsing**: Safe JSON extraction and parsing from text
- **File Operations**: Read and sample data from files
- **Data Validation**: Check data structure compliance
- **Format Conversion**: Convert between different data formats

#### Methods

- **`parse_json_safe(text_str)`**: Extract and parse JSON objects/arrays from text
  - `text_str`: Input string that may contain embedded JSON
  - Returns: List of parsed Python objects (dicts or lists)

- **`get_prompt(file_path)`**: Read text file and return content as string
  - `file_path`: Path to the text file
  - Returns: File content as concatenated string

- **`check(row)`**: Validate data structure for message format
  - `row`: Data row to validate
  - Returns: Boolean indicating if structure is valid

- **`check_with_system(row)`**: Check if data has system message format
  - `row`: Data row to validate
  - Returns: Boolean indicating if has valid system message

- **`parse_messages(str_row)`**: Parse message format from string
  - `str_row`: String containing message data
  - Returns: Parsed message object or None

- **`parse_json(str, json_tag=False)`**: Parse JSON with error handling
  - `str`: JSON string to parse
  - `json_tag`: Whether to extract from ```json``` code blocks
  - Returns: Parsed object or None on failure

- **`sample_from_file(file_path, num=-1)`**: Sample lines from text file
  - `file_path`: Path to the file
  - `num`: Number of samples (-1 for all)
  - Returns: List of sampled lines

- **`sample_from(path, num=-1, granularity="auto", exclude=[])`**: Sample data from files/directories
  - `path`: File or directory path
  - `num`: Number of samples (-1 for all)
  - `granularity`: Sampling granularity ("auto", "file", "line")
  - `exclude`: Patterns to exclude
  - Returns: List of sampled content

- **`jsonl2json(source_path, des_path)`**: Convert JSONL to JSON format
  - `source_path`: Source JSONL file path
  - `des_path`: Destination JSON file path

#### Usage Example
```python
from datasetplus import DataTool

# Parse JSON from text
json_data = DataTool.parse_json_safe('Some text {"key": "value"} more text')

# Read prompt from file
prompt = DataTool.get_prompt('prompt.txt')

# Validate data structure
is_valid = DataTool.check(data_row)

# Sample from file
samples = DataTool.sample_from_file('data.txt', num=10)
```
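
A few additional hedged calls covering the remaining helpers; the paths are placeholders and the exact return shapes follow the descriptions above:

```python
from datasetplus import DataTool

# Parse JSON that an LLM wrapped in a fenced json code block
llm_reply = "```json\n{\"quality\": \"high\"}\n```"
quality = DataTool.parse_json(llm_reply, json_tag=True)

# Sample up to 100 lines from a directory, excluding anything matching "backup"
samples = DataTool.sample_from("./data_folder/", num=100, granularity="line", exclude=["backup"])

# Convert a JSONL file to a JSON file
DataTool.jsonl2json("data.jsonl", "data.json")
```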

These auxiliary tools are designed to work seamlessly with DatasetPlus workflows, providing essential utilities for LLM integration and data processing tasks.

            
