sortx-universal


| Field | Value |
|-------|-------|
| Name | sortx-universal |
| Version | 0.1.0 |
| Summary | Universal sorting tool for files, data structures, and large datasets |
| Upload time | 2025-08-18 14:25:44 |
| Requires Python | >=3.10 |
| License | MIT |
| Keywords | sorting, csv, json, external-sort, cli, data-processing |

# sortx-universal

[![Build Status](https://github.com/Okymi-X/sortx-universal/workflows/CI/badge.svg)](https://github.com/Okymi-X/sortx-universal/actions)
[![PyPI version](https://badge.fury.io/py/sortx-universal.svg)](https://badge.fury.io/py/sortx-universal)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)

**sortx-universal** is a sorting tool and Python library for in-memory data structures, CSV/JSONL files, plain text, and datasets too large to fit in memory, which it sorts with an external merge sort.

## ✨ Features

🚀 **Universal Sorting**: Sort any data format (CSV, JSONL, TXT, compressed files)  
📊 **Multi-key Sorting**: Sort by multiple columns with different data types and directions  
⚡ **External Sorting**: Handle massive files that don't fit in memory using external merge sort  
🌍 **Locale-aware**: International text sorting with locale support  
🔧 **Smart Detection**: Automatically detect file formats and separators  
📦 **Easy Installation**: Simple `pip install sortx-universal`  
🛠️ **CLI + Library**: Use as command-line tool or import as Python library  
🎯 **Type Support**: Numbers, strings, dates, natural sorting  
🔄 **Stable Sorting**: Preserves original order for equal elements  
🎛️ **Flexible Options**: Reverse, unique constraints, memory limits  

## 📦 Installation

### Basic Installation
```bash
pip install sortx-universal
```

### Full Installation (with CLI and enhanced features)
```bash
pip install "sortx-universal[full]"  # quotes keep shells such as zsh from expanding the brackets
```

The full installation includes:
- `typer` and `rich` for a polished CLI experience
- `python-dateutil` for advanced date parsing
- `natsort` for natural sorting
- `chardet` for encoding detection

## 🚀 Quick Start

### Command Line Interface

```bash
# Sort CSV by price (numeric), then name (alphabetic)
sortx-universal data.csv -o sorted.csv -k price:num -k name:str

# Sort large JSONL file by timestamp with memory limit
sortx-universal logs.jsonl.gz -o sorted.jsonl.gz -k timestamp:date --memory-limit=512M

# Natural sort of text file (file2 comes before file10)
sortx-universal filenames.txt -o sorted.txt -k 0:nat

# Sort with uniqueness constraint
sortx-universal users.jsonl -o unique_users.jsonl -k created_at:date --unique=id

# Show sorting statistics
sortx-universal large_data.csv -o sorted_data.csv -k score:num:desc=true --stats
```

### Python Library

```python
import sortx

# Sort in-memory data
data = [
    {"name": "Alice", "age": 30, "salary": 50000},
    {"name": "Bob", "age": 25, "salary": 45000},
    {"name": "Charlie", "age": 35, "salary": 60000}
]

# Single key sorting
sorted_by_age = list(sortx.sort_iter(
    data, 
    keys=[sortx.key("age", "num")]
))

# Multi-key sorting
sorted_multi = list(sortx.sort_iter(
    data,
    keys=[
        sortx.key("salary", "num", desc=True),  # Salary descending
        sortx.key("name", "str")                # Then name ascending
    ]
))

# Sort file to file
stats = sortx.sort_file(
    input_path="input.csv",
    output_path="output.csv", 
    keys=[sortx.key("created_at", "date", desc=True)],
    stats=True
)
print(f"Processed {stats.lines_processed} lines in {stats.processing_time:.2f}s")
```

## 📊 Data Types

sortx-universal supports multiple data types for sorting keys:

| Type | Description | Example |
|------|-------------|---------|
| **`num`** | Numeric sorting (integers, floats) | `42`, `3.14`, `-10` |
| **`str`** | String sorting with locale support | `"Alice"`, `"café"` |
| **`date`** | Date/time sorting (ISO 8601 + common formats) | `"2025-01-15"`, `"2025-01-15T10:30:00Z"` |
| **`nat`** | Natural sorting ("file2" < "file10") | `"file1.txt"`, `"file10.txt"` |
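
The difference between `str` and `nat` ordering can be sketched with the standard library alone. The `natural_key` helper below is hypothetical and is not part of sortx-universal, which lists `natsort` as the dependency for natural sorting:

```python
import re

def natural_key(text: str) -> list:
    """Split text into digit and non-digit runs so that numbers compare by value."""
    parts = re.split(r"(\d+)", text)
    # With a capturing group, odd indexes hold the digit runs and even indexes hold text.
    return [int(part) if index % 2 else part.lower() for index, part in enumerate(parts)]

names = ["file10.txt", "file2.txt", "file1.txt"]
plain_order = sorted(names)                      # character-by-character comparison
natural_order = sorted(names, key=natural_key)   # digit runs compared as integers

print(plain_order)
print(natural_order)
```

In plain string order `file10.txt` lands before `file2.txt` because `"1"` sorts before `"2"`; the natural key compares `10` and `2` as integers instead.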

### Date Format Support
- ISO 8601: `2025-01-15T10:30:00Z`
- Common formats: `2025-01-15`, `01/15/2025`, `Jan 15, 2025`
- Automatic parsing with `python-dateutil` (when installed)
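
As a rough illustration of how mixed formats like the ones above can be normalized before comparison, here is a standard-library sketch. `parse_date` and `FALLBACK_FORMATS` are hypothetical names, `01/15/2025` is assumed to be month-first, and values without a timezone are assumed to be UTC; the library's own parsing may behave differently:

```python
from datetime import datetime, timezone

# Formats tried after ISO 8601 fails; this list is an assumption made for the sketch.
FALLBACK_FORMATS = ("%m/%d/%Y", "%b %d, %Y")

def parse_date(text: str) -> datetime:
    """Parse a date string into a timezone-aware datetime (naive values become UTC)."""
    text = text.strip()
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 onward.
    iso_text = text[:-1] + "+00:00" if text.endswith("Z") else text
    try:
        parsed = datetime.fromisoformat(iso_text)
    except ValueError:
        for fmt in FALLBACK_FORMATS:
            try:
                parsed = datetime.strptime(text, fmt)
                break
            except ValueError:
                continue
        else:
            raise ValueError(f"unrecognized date: {text!r}")
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed

samples = ["Jan 15, 2025", "2025-01-15T10:30:00Z", "01/14/2025", "2025-01-16"]
ordered = sorted(samples, key=parse_date)
print(ordered)
```

Converting every value to a timezone-aware `datetime` matters because Python refuses to compare naive and aware datetimes with each other.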

## 📁 File Format Support

| Format | Extensions | Compression | Description |
|--------|------------|-------------|-------------|
| **CSV/TSV** | `.csv`, `.tsv` | ✅ | Automatic delimiter detection |
| **JSONL** | `.jsonl`, `.ndjson` | ✅ | One JSON object per line |
| **Plain Text** | `.txt`, any | ✅ | Line-by-line sorting |
| **Compressed** | `.gz`, `.zst` | - | Transparent compression support |
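
Delimiter detection of the kind listed for CSV/TSV can be approximated with `csv.Sniffer` from the standard library. This is only a sketch of the general idea; `detect_delimiter` is a hypothetical helper, and how sortx-universal detects separators internally is not documented here:

```python
import csv

def detect_delimiter(sample: str, candidates: str = ",\t;|") -> str:
    """Guess the field delimiter of a text sample, falling back to a comma."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # Sniffer raises csv.Error when it cannot decide on a delimiter.
        return ","

csv_sample = "region,product,revenue\nNorth,Widget A,1000\nSouth,Widget B,1500\n"
tsv_sample = "region\tproduct\trevenue\nNorth\tWidget A\t1000\nSouth\tWidget B\t1500\n"

print(repr(detect_delimiter(csv_sample)))
print(repr(detect_delimiter(tsv_sample)))
```

Restricting the candidates to a small set keeps characters that merely happen to repeat on every line, such as hyphens in dates, from being mistaken for a delimiter.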

### Large File Handling
- **External Sorting**: Automatically handles files larger than available RAM
- **Memory Limits**: Configurable memory usage (`--memory-limit=512M`)
- **Streaming**: Processes files line-by-line to minimize memory footprint
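
The external sorting idea can be sketched in a few lines: sort chunks that fit in memory, spill each sorted run to disk, then merge the runs lazily. The `external_sort` function below is a hypothetical illustration, not the library's implementation, and it bounds memory by line count rather than by bytes as `--memory-limit` does:

```python
import heapq
import os
import tempfile
from itertools import islice

def external_sort(lines, chunk_size=100_000, key=None):
    """Yield lines in sorted order, holding at most chunk_size lines while building runs."""
    source = (line.rstrip("\n") for line in lines)
    run_paths = []
    try:
        # Phase 1: sort fixed-size chunks and spill each one to a temporary run file.
        while True:
            chunk = sorted(islice(source, chunk_size), key=key)
            if not chunk:
                break
            with tempfile.NamedTemporaryFile(
                "w", delete=False, suffix=".run", encoding="utf-8"
            ) as run:
                run_paths.append(run.name)
                run.writelines(line + "\n" for line in chunk)

        # Phase 2: k-way merge; heapq.merge pulls one line at a time from each run.
        handles = [open(path, encoding="utf-8") for path in run_paths]
        try:
            streams = [(line.rstrip("\n") for line in handle) for handle in handles]
            yield from heapq.merge(*streams, key=key)
        finally:
            for handle in handles:
                handle.close()
    finally:
        # Remove the run files whether or not the merge finished.
        for path in run_paths:
            os.remove(path)

fruits = ["pear", "apple", "fig", "kiwi", "banana"]
print(list(external_sort(fruits, chunk_size=2)))
```

Because `heapq.merge` prefers earlier inputs when keys tie and the runs are written in input order, the merged result stays stable across chunk boundaries.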

## 🔧 Command Line Reference

```bash
sortx-universal [INPUT] [OPTIONS]
```

### Options

| Option | Short | Description |
|--------|-------|-------------|
| `--output FILE` | `-o` | Output file path |
| `--key KEY_SPEC` | `-k` | Sort key specification (can be used multiple times) |
| `--reverse` | | Reverse the entire sort order |
| `--stable` | | Use stable sorting (default) |
| `--unique COLUMN` | | Keep only unique values for specified column |
| `--memory-limit SIZE` | | Memory limit for external sorting (e.g., 512M, 2G) |
| `--stats` | | Show detailed sorting statistics |
| `--help` | `-h` | Show help message |

### Key Specification Format

Sort keys use the format: `column:type[:desc=true][:locale=name]`

**Examples:**
- `price:num` - Sort by price as number (ascending)
- `price:num:desc=true` - Sort by price as number (descending)
- `name:str:locale=fr_FR` - Sort by name with French locale
- `timestamp:date` - Sort by timestamp as date
- `0:nat` - Natural sort by first column (for text files)
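
To make the key format concrete, here is one way such a specification could be parsed. `KeySpec` and `parse_key_spec` are hypothetical names used for illustration only, and the sketch assumes column names contain no colon; the CLI's actual parser may accept more or behave differently:

```python
from dataclasses import dataclass
from typing import Optional

VALID_TYPES = {"str", "num", "date", "nat"}

@dataclass(frozen=True)
class KeySpec:
    column: str
    data_type: str
    desc: bool = False
    locale: Optional[str] = None

def parse_key_spec(spec: str) -> KeySpec:
    """Parse 'column:type[:desc=true][:locale=name]' into a KeySpec."""
    column, _, rest = spec.partition(":")
    data_type, *options = rest.split(":")
    if not column or data_type not in VALID_TYPES:
        raise ValueError(f"invalid key spec: {spec!r}")
    desc, locale = False, None
    for option in options:
        name, _, value = option.partition("=")
        if name == "desc":
            desc = value.lower() == "true"
        elif name == "locale":
            locale = value
        else:
            raise ValueError(f"unknown option {name!r} in {spec!r}")
    return KeySpec(column, data_type, desc, locale)

print(parse_key_spec("price:num:desc=true"))
print(parse_key_spec("name:str:locale=fr_FR"))
```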

## πŸ’‘ Examples

### Example 1: Sales Data Analysis

**Input (`sales.csv`):**
```csv
region,product,revenue,date
North,Widget A,1000,2025-01-15
South,Widget B,1500,2025-01-14
North,Widget C,800,2025-01-16
South,Widget A,1200,2025-01-13
```

**Command:**
```bash
sortx-universal sales.csv -o sorted_sales.csv -k region:str -k revenue:num:desc=true
```

**Output:**
```csv
region,product,revenue,date
North,Widget A,1000,2025-01-15
North,Widget C,800,2025-01-16
South,Widget B,1500,2025-01-14
South,Widget A,1200,2025-01-13
```
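
The same ordering can be reproduced with plain Python, which is a handy way to check what a multi-key sort with mixed directions should produce. This sketch uses only the standard library rather than sortx-universal itself:

```python
import csv
import io

SALES_CSV = """region,product,revenue,date
North,Widget A,1000,2025-01-15
South,Widget B,1500,2025-01-14
North,Widget C,800,2025-01-16
South,Widget A,1200,2025-01-13
"""

rows = list(csv.DictReader(io.StringIO(SALES_CSV)))

# Apply the keys from least to most significant. Python's sort is stable,
# so each pass keeps the order established by the previous one.
rows.sort(key=lambda row: float(row["revenue"]), reverse=True)  # revenue:num:desc=true
rows.sort(key=lambda row: row["region"])                        # region:str

for row in rows:
    print(row["region"], row["product"], row["revenue"], sep=",")
```

Sorting in several stable passes avoids having to negate values, which would not work for string keys sorted in descending order.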

### Example 2: Log File Processing

**Input (`server.jsonl`):**
```json
{"timestamp": "2025-01-15T10:30:00Z", "level": "ERROR", "message": "Connection failed"}
{"timestamp": "2025-01-15T10:25:00Z", "level": "INFO", "message": "Server started"}
{"timestamp": "2025-01-15T10:35:00Z", "level": "WARN", "message": "High memory usage"}
```

**Command:**
```bash
sortx-universal server.jsonl -o sorted_logs.jsonl -k timestamp:date --stats
```

**Output includes statistics:**
```
Sorting Statistics:
  Input file: server.jsonl
  Output file: sorted_logs.jsonl
  Lines processed: 3
  Processing time: 0.01s
  Input size: 312B
  Output size: 312B
  External sort: No
  Throughput: 300 lines/sec
```
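
The sorted file itself is not shown above. As a sketch of the order one would expect from a `timestamp:date` key, the following standard-library snippet sorts the same three records without using sortx-universal:

```python
import json
from datetime import datetime

LOG_LINES = [
    '{"timestamp": "2025-01-15T10:30:00Z", "level": "ERROR", "message": "Connection failed"}',
    '{"timestamp": "2025-01-15T10:25:00Z", "level": "INFO", "message": "Server started"}',
    '{"timestamp": "2025-01-15T10:35:00Z", "level": "WARN", "message": "High memory usage"}',
]

def timestamp_of(line: str) -> datetime:
    """Extract the record's timestamp as a timezone-aware datetime."""
    record = json.loads(line)
    # Replace the trailing "Z" so fromisoformat() also works on Python 3.10.
    return datetime.fromisoformat(record["timestamp"].replace("Z", "+00:00"))

sorted_lines = sorted(LOG_LINES, key=timestamp_of)
levels = [json.loads(line)["level"] for line in sorted_lines]
print(levels)
```

The records come out as INFO (10:25), ERROR (10:30), WARN (10:35).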

### Example 3: Large Dataset Processing

**Processing a 5GB file:**
```bash
sortx-universal huge_dataset.csv.gz -o sorted_huge.csv.gz \
  -k timestamp:date \
  -k user_id:num \
  --memory-limit=1G \
  --unique=transaction_id \
  --stats
```

This command:
- Sorts by timestamp, then user_id
- Uses at most 1 GB of RAM, switching to external sort when the data does not fit
- Removes duplicate transactions
- Shows detailed performance statistics
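
Duplicate removal by a column can be pictured as a streaming filter. The sketch below assumes the first record for each value is the one kept, which the documentation does not state explicitly; `unique_by` is a hypothetical helper:

```python
def unique_by(records, column):
    """Yield only the first record seen for each distinct value of `column`."""
    seen = set()
    for record in records:
        value = record[column]
        if value not in seen:
            seen.add(value)
            yield record

transactions = [
    {"transaction_id": "t1", "user_id": 7},
    {"transaction_id": "t2", "user_id": 3},
    {"transaction_id": "t1", "user_id": 7},
]
deduplicated = list(unique_by(transactions, "transaction_id"))
print(deduplicated)
```

Note that the `seen` set grows with the number of distinct values, so on very large inputs a filter like this has its own memory cost.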

## 🐍 Python API Reference

### Core Functions

#### `sortx.key(column, data_type, desc=False, locale_name=None, **options)`
Create a sort key specification.

**Parameters:**
- `column`: Column name (dict) or index (list/tuple)
- `data_type`: Data type (`'str'`, `'num'`, `'date'`, `'nat'`)
- `desc`: Sort in descending order if True
- `locale_name`: Locale for string sorting (e.g., `'fr_FR.UTF-8'`)
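
The idea that `column` may be a dict key or a sequence index can be illustrated with `operator.itemgetter`, which handles both. `make_key` and `CONVERTERS` are hypothetical names for this sketch and cover only the `num` and `str` types:

```python
from operator import itemgetter

# Converters for two of the documented types; the mapping is an assumption of this sketch.
CONVERTERS = {"num": float, "str": str}

def make_key(column, data_type):
    """Build a key function that extracts `column` and converts it to `data_type`."""
    get = itemgetter(column)          # works for dict keys and list/tuple indexes alike
    convert = CONVERTERS[data_type]
    return lambda item: convert(get(item))

people = [{"name": "Alice", "age": "30"}, {"name": "Bob", "age": "4"}]
rows = [("Alice", "30"), ("Bob", "4")]

by_age_num = sorted(people, key=make_key("age", "num"))   # 4 before 30
by_age_str = sorted(rows, key=make_key(1, "str"))         # "30" before "4"
print([p["name"] for p in by_age_num])
print([r[0] for r in by_age_str])
```

The two results differ because `"30"` precedes `"4"` as text while `4` precedes `30` as a number, which is why the type in a key specification matters.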

#### `sortx.sort_iter(data, keys, stable=True, reverse=False, unique=None)`
Sort an iterable of items in memory.

**Parameters:**
- `data`: Iterable of items to sort (for example, a list of dicts)
- `keys`: List of SortKey specifications
- `stable`: Use stable sorting algorithm
- `reverse`: Reverse the entire sort order
- `unique`: Column name for uniqueness constraint

#### `sortx.sort_file(input_path, output_path, keys, memory_limit=None, stats=False, **options)`
Sort a file and write results to another file.

**Parameters:**
- `input_path`: Path to input file
- `output_path`: Path to output file
- `keys`: List of SortKey specifications
- `memory_limit`: Memory limit string (e.g., `'512M'`, `'2G'`)
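- `stats`: If `True`, return sorting statistics (the examples read `lines_processed`, `processing_time`, and `throughput` from the result)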
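
Memory limit strings such as `'512M'` have to be turned into a byte count at some point. The sketch below shows one way to do that, assuming binary multiples (1M = 1024 × 1024 bytes); the documentation does not say whether the library uses binary or decimal units, and `parse_size` is a hypothetical helper:

```python
import re

# Binary multiples are an assumption of this sketch.
UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3}

def parse_size(text: str) -> int:
    """Convert a size string such as '512M' or '2G' into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMG]?)B?\s*", text.upper())
    if not match:
        raise ValueError(f"invalid size: {text!r}")
    number, unit = match.groups()
    return int(float(number) * UNITS[unit])

print(parse_size("512M"))  # 536870912
print(parse_size("2G"))    # 2147483648
```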

### Advanced Usage

```python
import sortx

# Complex multi-key sorting with different options per key
keys = [
    sortx.key("department", "str"),                    # Primary: department
    sortx.key("salary", "num", desc=True),            # Secondary: salary (desc)
    sortx.key("hire_date", "date"),                   # Tertiary: hire date
    sortx.key("name", "str", locale_name="en_US")     # Quaternary: name
]

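# Illustrative sample records
employee_data = [
    {"department": "Engineering", "salary": 95000, "hire_date": "2021-03-01", "name": "Dana"},
    {"department": "Engineering", "salary": 95000, "hire_date": "2019-07-15", "name": "Eli"},
    {"department": "Design", "salary": 70000, "hire_date": "2022-01-10", "name": "Farah"},
]

result = list(sortx.sort_iter(employee_data, keys=keys))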

# File sorting with memory management and statistics
stats = sortx.sort_file(
    input_path="employees.csv",
    output_path="sorted_employees.csv",
    keys=keys,
    memory_limit="256M",  # Use max 256MB RAM
    unique="employee_id", # Remove duplicates by employee ID
    stats=True           # Return detailed statistics
)

print(f"Sorted {stats.lines_processed} employees")
print(f"Processing time: {stats.processing_time:.2f} seconds")
print(f"Throughput: {stats.throughput:.0f} lines/second")
```

## ⚡ Performance

sortx-universal is optimized for performance across different scenarios:

### In-Memory Sorting
- **Fast**: Optimized Python sorting with custom key functions
- **Memory Efficient**: Streaming processing where possible
- **Stable**: Maintains relative order of equal elements

### External Sorting (Large Files)
- **Scalable**: Handles files larger than available RAM
- **Configurable**: Memory usage limits prevent system overload
- **Efficient**: Multi-way merge sort with optimized I/O

### Benchmarks (Approximate)

| File Size | Records | Memory Limit | Processing Time | Throughput |
|-----------|---------|-------------|----------------|------------|
| 100MB | 1M | 512MB | 5s | 200K lines/sec |
| 1GB | 10M | 512MB | 60s | 167K lines/sec |
| 10GB | 100M | 1GB | 15min | 111K lines/sec |

*Benchmarks run on modern hardware (SSD, 16GB RAM). Performance varies based on data complexity and system specifications.*
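
The throughput column follows directly from the record counts and processing times in the table, which the short calculation below reproduces. `throughput_k` is a helper defined only for this check:

```python
# (file size, records, processing time in seconds) as listed in the table
BENCHMARKS = [
    ("100MB", 1_000_000, 5),
    ("1GB", 10_000_000, 60),
    ("10GB", 100_000_000, 15 * 60),
]

def throughput_k(records: int, seconds: float) -> int:
    """Return throughput in thousands of lines per second, rounded to the nearest integer."""
    return round(records / seconds / 1000)

for size, records, seconds in BENCHMARKS:
    print(f"{size}: {throughput_k(records, seconds)}K lines/sec")
```

The figures show throughput dropping by a little under half between the 100MB and 10GB rows.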

## 🛠️ Development

### Setup Development Environment

```bash
# Clone the repository
git clone https://github.com/Okymi-X/sortx-universal.git
cd sortx-universal

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install in development mode with all dependencies
pip install -e ".[full,dev]"
```

### Run Tests

```bash
# Run all tests
pytest

# Run tests with coverage
pytest --cov=sortx --cov-report=html

# Run specific test file
pytest tests/test_core.py
```

### Code Quality

```bash
# Format code
black sortx tests

# Sort imports
isort sortx tests

# Lint code
flake8 sortx tests

# Type checking
mypy sortx
```

### Running Demo

```bash
# Quick demo
python demo.py

# Comprehensive tests
python main.py
```

## 🀝 Contributing

Contributions are welcome! Here's how to get started:

1. **Fork** the repository
2. **Create** your feature branch (`git checkout -b feature/amazing-feature`)
3. **Make** your changes and add tests
4. **Ensure** code quality (`black`, `isort`, `flake8`, `pytest`)
5. **Commit** your changes (`git commit -m 'Add amazing feature'`)
6. **Push** to the branch (`git push origin feature/amazing-feature`)
7. **Open** a Pull Request

### Areas for Contribution
- 🚀 Performance optimizations
- 📊 Additional file format support
- 🌍 Locale and internationalization improvements
- 📚 Documentation and examples
- 🧪 Test coverage expansion
- 🔧 CLI enhancements

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🗺️ Roadmap

### Version 0.2.0
- [ ] Rust core implementation for 10x performance boost
- [ ] Additional compression formats (bz2, xz, lz4)
- [ ] Memory-mapped file support for better performance
- [ ] Progress bars for long-running operations

### Version 0.3.0
- [ ] Additional file formats (Parquet, Avro, Excel)
- [ ] Database integration (PostgreSQL, SQLite)
- [ ] Parallel sorting with multiple CPU cores
- [ ] Advanced statistics and profiling

### Version 1.0.0
- [ ] Distributed sorting across multiple machines
- [ ] Web-based GUI interface
- [ ] Plugin system for custom data types
- [ ] Real-time streaming sort capabilities

## 🙏 Acknowledgments

- Inspired by **GNU sort** and other Unix sorting utilities
- Built with Python's robust ecosystem for data processing
- Uses **external sorting algorithms** from computer science literature
- Thanks to the open source community for excellent libraries:
  - `typer` and `rich` for the CLI
  - `python-dateutil` for date parsing
  - `natsort` for natural sorting

## 📞 Support

- 📖 **Documentation**: [GitHub README](https://github.com/Okymi-X/sortx-universal#readme)
- 🐛 **Bug Reports**: [GitHub Issues](https://github.com/Okymi-X/sortx-universal/issues)
- 💬 **Discussions**: [GitHub Discussions](https://github.com/Okymi-X/sortx-universal/discussions)
- 📧 **Email**: dev@sortx-universal.io

---

            
