# Mock Spark
<div align="center">
**🚀 Test PySpark code at lightning speed, no JVM required**
[Python 3.8+](https://www.python.org/downloads/)
[License: MIT](https://opensource.org/licenses/MIT)
[PyPI](https://badge.fury.io/py/mock-spark)
[GitHub](https://github.com/eddiethedean/mock-spark)
[Code style: black](https://github.com/psf/black)
*⚡ 10x faster tests • 🎯 Drop-in PySpark replacement • 📦 Zero JVM overhead*
</div>
---
## Why Mock Spark?
**Tired of waiting 30+ seconds for Spark to initialize in every test?**
Mock Spark is a lightweight PySpark replacement that runs your tests **10x faster** by eliminating JVM overhead. Your existing PySpark code works unchanged: just swap the import.
```python
# Before
from pyspark.sql import SparkSession
# After
from mock_spark import MockSparkSession as SparkSession
```
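The swap can live in a single module so the pipeline code itself never changes. A minimal sketch, assuming only the APIs shown later in this README (`createDataFrame`, `filter`, `F.col`); the module layout and function name are illustrative:

```python
# pipeline.py (hypothetical) - swap this single import to switch engines
from mock_spark import MockSparkSession as SparkSession, F

def high_earners(spark, rows):
    # The transformation code is identical under PySpark and Mock Spark
    df = spark.createDataFrame(rows)
    return df.filter(F.col("salary") > 50000).collect()

if __name__ == "__main__":
    spark = SparkSession("Demo")
    print(high_earners(spark, [{"name": "Ada", "salary": 60000}]))
```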
### Key Benefits
| Feature | Description |
|---------|-------------|
| ⚡ **10x Faster** | No JVM startup (30s → 0.1s) |
| 🎯 **Drop-in Replacement** | Use existing PySpark code unchanged |
| 📦 **Zero Java** | Pure Python with DuckDB backend |
| 🧪 **100% Compatible** | Full PySpark 3.2 API support |
| 🔄 **Lazy Evaluation** | Mirrors PySpark's execution model |
| 🏭 **Production Ready** | 388 passing tests, type-safe |
### Perfect For
- **Unit Testing** - Fast, isolated test execution
- **CI/CD Pipelines** - Reliable tests without infrastructure
- **Local Development** - Prototype without Spark cluster
- **Documentation** - Runnable examples without setup
- **Learning** - Understand PySpark without complexity
---
## Quick Start
### Installation
```bash
pip install mock-spark
```
### Basic Usage
```python
from mock_spark import MockSparkSession, F
# Create session
spark = MockSparkSession("MyApp")
# Your PySpark code works as-is
data = [{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]
df = spark.createDataFrame(data)
# All operations work
result = df.filter(F.col("age") > 25).select("name").collect()
print(result) # [Row(name='Bob')]
```
### Testing Example
```python
import pytest
from mock_spark import MockSparkSession, F
def test_data_pipeline():
    """Test PySpark logic without a Spark cluster."""
    spark = MockSparkSession("TestApp")

    # Test data
    data = [{"score": 95}, {"score": 87}, {"score": 92}]
    df = spark.createDataFrame(data)

    # Business logic
    high_scores = df.filter(F.col("score") > 90)

    # Assertions
    assert high_scores.count() == 2
    assert high_scores.agg(F.avg("score")).collect()[0][0] == 93.5
```
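For larger suites, the session can be provided by a pytest fixture so every test starts from a fresh, isolated session. A minimal sketch (the fixture name and scope are illustrative, not part of Mock Spark's API):

```python
import pytest
from mock_spark import MockSparkSession, F

@pytest.fixture
def spark():
    # Fresh in-memory session per test; no JVM, so no explicit teardown is shown
    return MockSparkSession("TestApp")

def test_high_scores(spark):
    df = spark.createDataFrame([{"score": 95}, {"score": 87}])
    assert df.filter(F.col("score") > 90).count() == 1
```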
---
## Core Features
### DataFrame Operations
- **Transformations**: `select`, `filter`, `withColumn`, `drop`, `distinct`, `orderBy`
- **Aggregations**: `groupBy`, `agg`, `count`, `sum`, `avg`, `min`, `max`
- **Joins**: `inner`, `left`, `right`, `outer`, `cross`
- **Advanced**: `union`, `pivot`, `unpivot`, `explode`
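A hedged sketch of how these compose, using only operations from the list above and assuming the standard PySpark signatures for `join` and `alias`; the data is illustrative:

```python
from mock_spark import MockSparkSession, F

spark = MockSparkSession("OpsDemo")

emp = spark.createDataFrame([
    {"name": "Alice", "dept": "eng", "salary": 100},
    {"name": "Bob", "dept": "eng", "salary": 80},
    {"name": "Cara", "dept": "ops", "salary": 90},
])
depts = spark.createDataFrame([{"dept": "eng", "site": "NYC"}, {"dept": "ops", "site": "SF"}])

# Aggregate per department, join in the site, and sort
summary = (
    emp.groupBy("dept")
       .agg(F.avg("salary").alias("avg_salary"))
       .join(depts, on="dept", how="inner")
       .orderBy("dept")
)
print(summary.collect())
```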
### Functions (50+)
- **String**: `upper`, `lower`, `concat`, `split`, `substring`, `trim`
- **Math**: `round`, `abs`, `sqrt`, `pow`, `ceil`, `floor`
- **Date/Time**: `current_date`, `date_add`, `date_sub`, `year`, `month`, `day`
- **Conditional**: `when`, `otherwise`, `coalesce`, `isnull`, `isnan`
- **Aggregate**: `sum`, `avg`, `count`, `min`, `max`, `first`, `last`
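A short sketch combining a few of these; the column names and thresholds are illustrative, and the calls assume the standard PySpark signatures:

```python
from mock_spark import MockSparkSession, F

spark = MockSparkSession("FuncsDemo")
df = spark.createDataFrame([{"name": "  alice  ", "score": 87.6}, {"name": "bob", "score": 92.3}])

result = (
    df.withColumn("name_clean", F.upper(F.trim(F.col("name"))))       # string functions
      .withColumn("score_rounded", F.round(F.col("score"), 1))        # math functions
      .withColumn("grade", F.when(F.col("score") > 90, "A").otherwise("B"))  # conditionals
)
print(result.collect())
```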
### Window Functions
```python
from mock_spark.window import MockWindow as Window
# Ranking and analytics
df.withColumn("rank", F.row_number().over(
Window.partitionBy("dept").orderBy(F.desc("salary"))
))
```
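A complete, runnable version of the snippet above (the data and resulting ranks are illustrative):

```python
from mock_spark import MockSparkSession, F
from mock_spark.window import MockWindow as Window

spark = MockSparkSession("WindowDemo")
df = spark.createDataFrame([
    {"name": "Alice", "dept": "eng", "salary": 100},
    {"name": "Bob", "dept": "eng", "salary": 80},
    {"name": "Cara", "dept": "ops", "salary": 90},
])

# Rank employees within each department by descending salary
ranked = df.withColumn(
    "rank",
    F.row_number().over(Window.partitionBy("dept").orderBy(F.desc("salary"))),
)
print(ranked.select("name", "dept", "rank").collect())
```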
### SQL Support
```python
df.createOrReplaceTempView("employees")
result = spark.sql("SELECT name, salary FROM employees WHERE salary > 50000")
```
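End to end, that looks like the following minimal sketch (the table contents are illustrative):

```python
from mock_spark import MockSparkSession

spark = MockSparkSession("SqlDemo")
df = spark.createDataFrame([
    {"name": "Alice", "salary": 60000},
    {"name": "Bob", "salary": 45000},
])

df.createOrReplaceTempView("employees")
rows = spark.sql("SELECT name, salary FROM employees WHERE salary > 50000").collect()
print(rows)  # expect only the Alice row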
### Lazy Evaluation
Mock Spark mirrors PySpark's lazy execution model:
```python
# Transformations are queued (not executed)
result = df.filter(F.col("age") > 25).select("name")
# Actions trigger execution
rows = result.collect()  # ← Execution happens here
count = result.count()   # ← Or here
```
**Control evaluation mode:**
```python
# Lazy (default, recommended)
spark = MockSparkSession("App", enable_lazy_evaluation=True)
# Eager (for legacy tests)
spark = MockSparkSession("App", enable_lazy_evaluation=False)
```
---
## Advanced Features
### Storage Backends
- **Memory** (default) - Fast, ephemeral
- **DuckDB** - In-memory SQL analytics
- **File System** - Persistent storage
### Testing Utilities (Optional)
Optional utilities to make testing easier:
```python
# Error simulation for testing error handling
from mock_spark.error_simulation import MockErrorSimulator
# Performance simulation for edge cases
from mock_spark.performance_simulation import MockPerformanceSimulator
# Test data generation
from mock_spark.data_generation import create_test_data
```
**📘 Full guide**: [Testing Utilities Documentation](docs/testing_utilities_guide.md)
---
## Performance Comparison
Real-world test suite improvements:
| Operation | PySpark | Mock Spark | Speedup |
|-----------|---------|------------|---------|
| Session Creation | 30-45s | 0.1s | **300x** |
| Simple Query | 2-5s | 0.01s | **200x** |
| Window Functions | 5-10s | 0.05s | **100x** |
| Full Test Suite | 5-10min | 30-60s | **10x** |
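These numbers vary by machine. A quick way to sanity-check the session-creation figure locally, using nothing beyond the constructor shown earlier:

```python
import time
from mock_spark import MockSparkSession

start = time.perf_counter()
spark = MockSparkSession("BenchApp")
elapsed = time.perf_counter() - start
print(f"Session creation took {elapsed:.3f}s")  # expected to be a small fraction of a second
```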
---
## Documentation
### Getting Started
- 📖 [Installation & Setup](docs/getting_started.md)
- 🎯 [Quick Start Guide](docs/getting_started.md#quick-start)
- 🔄 [Migration from PySpark](docs/guides/migration.md)
### Core Concepts
- 📊 [API Reference](docs/api_reference.md)
- 🔄 [Lazy Evaluation](docs/guides/lazy_evaluation.md)
- 🗄️ [SQL Operations](docs/sql_operations_guide.md)
- 💾 [Storage & Persistence](docs/storage_serialization_guide.md)
### Advanced Topics
- 🧪 [Testing Utilities](docs/testing_utilities_guide.md)
- ⚙️ [Configuration](docs/guides/configuration.md)
- 📈 [Benchmarking](docs/guides/benchmarking.md)
- 🔌 [Plugins & Hooks](docs/guides/plugins.md)
- 🐍 [Pytest Integration](docs/guides/pytest_integration.md)
---
## What's New in 1.0.0
### Major Improvements
- ✨ **DuckDB Integration** - Replaced SQLite for 30% faster operations
- 🧹 **Code Consolidation** - Removed 1,300+ lines of duplicate code
- 📦 **Optional Pandas** - Pandas is now optional, reducing core dependencies
- ⚡ **Performance** - Sub-4s aggregations on large datasets
- 🧪 **Test Coverage** - 388 passing tests with 100% compatibility
### Architecture
- In-memory DuckDB by default for faster testing
- Simplified DataFrame/GroupedData implementation
- Enhanced type safety and error handling
- Improved lazy evaluation with schema inference
---
## Known Limitations & Future Features
While Mock Spark provides comprehensive PySpark compatibility, some advanced features are planned for future releases:
**Type System**: Strict runtime type validation, custom validators
**Error Handling**: Enhanced error messages with recovery strategies
**Functions**: Extended date/time, math, and null handling
**Performance**: Query optimization, parallel execution, intelligent caching
**Enterprise**: Schema evolution, data lineage, audit logging
**Compatibility**: PySpark 3.3+, Delta Lake, Iceberg support
**Want to contribute?** These are great opportunities for community contributions! See [Contributing](#contributing) below.
---
## Contributing
We welcome contributions! Areas of interest:
- ⚡ **Performance** - Further DuckDB optimizations
- 📚 **Documentation** - Examples, guides, tutorials
- 🐛 **Bug Fixes** - Edge cases and compatibility issues
- 🧪 **PySpark API Coverage** - Additional functions and methods
- 🧪 **Tests** - Additional test coverage and scenarios
---
## Development Setup
```bash
# Install for development
git clone https://github.com/eddiethedean/mock-spark.git
cd mock-spark
pip install -e ".[dev]"
# Run tests
pytest tests/
# Format code
black mock_spark tests
# Type checking
mypy mock_spark --ignore-missing-imports
```
---
## License
MIT License - see [LICENSE](LICENSE) file for details.
---
## Links
- **GitHub**: [github.com/eddiethedean/mock-spark](https://github.com/eddiethedean/mock-spark)
- **PyPI**: [pypi.org/project/mock-spark](https://pypi.org/project/mock-spark/)
- **Issues**: [github.com/eddiethedean/mock-spark/issues](https://github.com/eddiethedean/mock-spark/issues)
- **Documentation**: [Full documentation](docs/)
---
<div align="center">
**Built with ❤️ for the PySpark community**
*Star ⭐ this repo if Mock Spark helps speed up your tests!*
</div>