# Mock Spark
<div align="center">
**🚀 Test PySpark code at lightning speed, no JVM required**
[Python 3.8+](https://www.python.org/downloads/)
[License: MIT](https://opensource.org/licenses/MIT)
[PyPI](https://badge.fury.io/py/mock-spark)
[GitHub](https://github.com/eddiethedean/mock-spark)
[Code style: black](https://github.com/psf/black)
*⚡ 10x faster tests • 🎯 Drop-in PySpark replacement • 📦 Zero JVM overhead*
</div>
---
## Why Mock Spark?
**Tired of waiting 30+ seconds for Spark to initialize in every test?**
Mock Spark is a lightweight PySpark replacement that runs your tests **10x faster** by eliminating JVM overhead. Your existing PySpark code works unchanged: just swap the import.
```python
# Before
from pyspark.sql import SparkSession
# After
from mock_spark import MockSparkSession as SparkSession
```
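The swap can live in a single module so the pipeline code itself never changes. A minimal sketch, assuming only the APIs shown later in this README (`createDataFrame`, `filter`, `F.col`); the module layout and function name are illustrative:

```python
# pipeline.py (hypothetical) - swap this single import to switch engines
from mock_spark import MockSparkSession as SparkSession, F

def high_earners(spark, rows):
    # The transformation code is identical under PySpark and Mock Spark
    df = spark.createDataFrame(rows)
    return df.filter(F.col("salary") > 50000).collect()

if __name__ == "__main__":
    spark = SparkSession("Demo")
    print(high_earners(spark, [{"name": "Ada", "salary": 60000}]))
```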
### Key Benefits
| Feature | Description |
|---------|-------------|
| ⚡ **10x Faster** | No JVM startup (30s → 0.1s) |
| 🎯 **Drop-in Replacement** | Use existing PySpark code unchanged |
| 📦 **Zero Java** | Pure Python with DuckDB backend |
| 🧪 **100% Compatible** | Full PySpark 3.2 API support |
| 🔄 **Lazy Evaluation** | Mirrors PySpark's execution model |
| 🏭 **Production Ready** | 388 passing tests, type-safe |
### Perfect For
- **Unit Testing** - Fast, isolated test execution
- **CI/CD Pipelines** - Reliable tests without infrastructure
- **Local Development** - Prototype without Spark cluster
- **Documentation** - Runnable examples without setup
- **Learning** - Understand PySpark without complexity
---
## Quick Start
### Installation
```bash
pip install mock-spark
```
### Basic Usage
```python
from mock_spark import MockSparkSession, F
# Create session
spark = MockSparkSession("MyApp")
# Your PySpark code works as-is
data = [{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]
df = spark.createDataFrame(data)
# All operations work
result = df.filter(F.col("age") > 25).select("name").collect()
print(result) # [Row(name='Bob')]
```
### Testing Example
```python
import pytest
from mock_spark import MockSparkSession, F
def test_data_pipeline():
    """Test PySpark logic without a Spark cluster."""
    spark = MockSparkSession("TestApp")

    # Test data
    data = [{"score": 95}, {"score": 87}, {"score": 92}]
    df = spark.createDataFrame(data)

    # Business logic
    high_scores = df.filter(F.col("score") > 90)

    # Assertions
    assert high_scores.count() == 2
    assert high_scores.agg(F.avg("score")).collect()[0][0] == 93.5
```
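For larger suites, the session can be provided by a pytest fixture so every test starts from a fresh, isolated session. A minimal sketch (the fixture name and scope are illustrative, not part of Mock Spark's API):

```python
import pytest
from mock_spark import MockSparkSession, F

@pytest.fixture
def spark():
    # Fresh in-memory session per test; no JVM, so no explicit teardown is shown
    return MockSparkSession("TestApp")

def test_high_scores(spark):
    df = spark.createDataFrame([{"score": 95}, {"score": 87}])
    assert df.filter(F.col("score") > 90).count() == 1
```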
---
## Core Features
### DataFrame Operations
- **Transformations**: `select`, `filter`, `withColumn`, `drop`, `distinct`, `orderBy`
- **Aggregations**: `groupBy`, `agg`, `count`, `sum`, `avg`, `min`, `max`
- **Joins**: `inner`, `left`, `right`, `outer`, `cross`
- **Advanced**: `union`, `pivot`, `unpivot`, `explode`
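A hedged sketch of how these compose, using only operations from the list above and assuming the standard PySpark signatures for `join` and `alias`; the data is illustrative:

```python
from mock_spark import MockSparkSession, F

spark = MockSparkSession("OpsDemo")

emp = spark.createDataFrame([
    {"name": "Alice", "dept": "eng", "salary": 100},
    {"name": "Bob", "dept": "eng", "salary": 80},
    {"name": "Cara", "dept": "ops", "salary": 90},
])
depts = spark.createDataFrame([{"dept": "eng", "site": "NYC"}, {"dept": "ops", "site": "SF"}])

# Aggregate per department, join in the site, and sort
summary = (
    emp.groupBy("dept")
       .agg(F.avg("salary").alias("avg_salary"))
       .join(depts, on="dept", how="inner")
       .orderBy("dept")
)
print(summary.collect())
```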
### Functions (50+)
- **String**: `upper`, `lower`, `concat`, `split`, `substring`, `trim`
- **Math**: `round`, `abs`, `sqrt`, `pow`, `ceil`, `floor`
- **Date/Time**: `current_date`, `date_add`, `date_sub`, `year`, `month`, `day`
- **Conditional**: `when`, `otherwise`, `coalesce`, `isnull`, `isnan`
- **Aggregate**: `sum`, `avg`, `count`, `min`, `max`, `first`, `last`
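A short sketch combining a few of these; the column names and thresholds are illustrative, and the calls assume the standard PySpark signatures:

```python
from mock_spark import MockSparkSession, F

spark = MockSparkSession("FuncsDemo")
df = spark.createDataFrame([{"name": "  alice  ", "score": 87.6}, {"name": "bob", "score": 92.3}])

result = (
    df.withColumn("name_clean", F.upper(F.trim(F.col("name"))))       # string functions
      .withColumn("score_rounded", F.round(F.col("score"), 1))        # math functions
      .withColumn("grade", F.when(F.col("score") > 90, "A").otherwise("B"))  # conditionals
)
print(result.collect())
```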
### Window Functions
```python
from mock_spark.window import MockWindow as Window
# Ranking and analytics
df.withColumn("rank", F.row_number().over(
Window.partitionBy("dept").orderBy(F.desc("salary"))
))
```
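A complete, runnable version of the snippet above (the data and resulting ranks are illustrative):

```python
from mock_spark import MockSparkSession, F
from mock_spark.window import MockWindow as Window

spark = MockSparkSession("WindowDemo")
df = spark.createDataFrame([
    {"name": "Alice", "dept": "eng", "salary": 100},
    {"name": "Bob", "dept": "eng", "salary": 80},
    {"name": "Cara", "dept": "ops", "salary": 90},
])

# Rank employees within each department by descending salary
ranked = df.withColumn(
    "rank",
    F.row_number().over(Window.partitionBy("dept").orderBy(F.desc("salary"))),
)
print(ranked.select("name", "dept", "rank").collect())
```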
### SQL Support
```python
df.createOrReplaceTempView("employees")
result = spark.sql("SELECT name, salary FROM employees WHERE salary > 50000")
```
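End to end, that looks like the following minimal sketch (the table contents are illustrative):

```python
from mock_spark import MockSparkSession

spark = MockSparkSession("SqlDemo")
df = spark.createDataFrame([
    {"name": "Alice", "salary": 60000},
    {"name": "Bob", "salary": 45000},
])

df.createOrReplaceTempView("employees")
rows = spark.sql("SELECT name, salary FROM employees WHERE salary > 50000").collect()
print(rows)  # expect only the Alice row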
### Lazy Evaluation
Mock Spark mirrors PySpark's lazy execution model:
```python
# Transformations are queued (not executed)
result = df.filter(F.col("age") > 25).select("name")
# Actions trigger execution
rows = result.collect()  # ← Execution happens here
count = result.count()   # ← Or here
```
**Control evaluation mode:**
```python
# Lazy (default, recommended)
spark = MockSparkSession("App", enable_lazy_evaluation=True)
# Eager (for legacy tests)
spark = MockSparkSession("App", enable_lazy_evaluation=False)
```
---
## Advanced Features
### Storage Backends
- **Memory** (default) - Fast, ephemeral
- **DuckDB** - In-memory SQL analytics
- **File System** - Persistent storage
### Testing Utilities (Optional)
Optional utilities to make testing easier:
```python
# Error simulation for testing error handling
from mock_spark.error_simulation import MockErrorSimulator
# Performance simulation for edge cases
from mock_spark.performance_simulation import MockPerformanceSimulator
# Test data generation
from mock_spark.data_generation import create_test_data
```
**📘 Full guide**: [Testing Utilities Documentation](docs/testing_utilities_guide.md)
---
## Performance Comparison
Real-world test suite improvements:
| Operation | PySpark | Mock Spark | Speedup |
|-----------|---------|------------|---------|
| Session Creation | 30-45s | 0.1s | **300x** |
| Simple Query | 2-5s | 0.01s | **200x** |
| Window Functions | 5-10s | 0.05s | **100x** |
| Full Test Suite | 5-10min | 30-60s | **10x** |
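These numbers vary by machine. A quick way to sanity-check the session-creation figure locally, using nothing beyond the constructor shown earlier:

```python
import time
from mock_spark import MockSparkSession

start = time.perf_counter()
spark = MockSparkSession("BenchApp")
elapsed = time.perf_counter() - start
print(f"Session creation took {elapsed:.3f}s")  # expected to be a small fraction of a second
```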
---
## Documentation
### Getting Started
- 📖 [Installation & Setup](docs/getting_started.md)
- 🎯 [Quick Start Guide](docs/getting_started.md#quick-start)
- 🔄 [Migration from PySpark](docs/guides/migration.md)
### Core Concepts
- 📊 [API Reference](docs/api_reference.md)
- 🔄 [Lazy Evaluation](docs/guides/lazy_evaluation.md)
- 🗄️ [SQL Operations](docs/sql_operations_guide.md)
- 💾 [Storage & Persistence](docs/storage_serialization_guide.md)
### Advanced Topics
- 🧪 [Testing Utilities](docs/testing_utilities_guide.md)
- ⚙️ [Configuration](docs/guides/configuration.md)
- 📈 [Benchmarking](docs/guides/benchmarking.md)
- 🔌 [Plugins & Hooks](docs/guides/plugins.md)
- 🐍 [Pytest Integration](docs/guides/pytest_integration.md)
---
## What's New in 1.0.0
### Major Improvements
- ✨ **DuckDB Integration** - Replaced SQLite for 30% faster operations
- 🧹 **Code Consolidation** - Removed 1,300+ lines of duplicate code
- 📦 **Optional Pandas** - Pandas is now optional, reducing core dependencies
- ⚡ **Performance** - Sub-4s aggregations on large datasets
- 🧪 **Test Coverage** - 388 passing tests with 100% compatibility
### Architecture
- In-memory DuckDB by default for faster testing
- Simplified DataFrame/GroupedData implementation
- Enhanced type safety and error handling
- Improved lazy evaluation with schema inference
---
## Known Limitations & Future Features
While Mock Spark provides comprehensive PySpark compatibility, some advanced features are planned for future releases:
**Type System**: Strict runtime type validation, custom validators
**Error Handling**: Enhanced error messages with recovery strategies
**Functions**: Extended date/time, math, and null handling
**Performance**: Query optimization, parallel execution, intelligent caching
**Enterprise**: Schema evolution, data lineage, audit logging
**Compatibility**: PySpark 3.3+, Delta Lake, Iceberg support
**Want to contribute?** These are great opportunities for community contributions! See [Contributing](#contributing) below.
---
## Contributing
We welcome contributions! Areas of interest:
- ⚡ **Performance** - Further DuckDB optimizations
- 📚 **Documentation** - Examples, guides, tutorials
- 🐛 **Bug Fixes** - Edge cases and compatibility issues
- 🧪 **PySpark API Coverage** - Additional functions and methods
- 🧪 **Tests** - Additional test coverage and scenarios
---
## Development Setup
```bash
# Install for development
git clone https://github.com/eddiethedean/mock-spark.git
cd mock-spark
pip install -e ".[dev]"
# Run tests
pytest tests/
# Format code
black mock_spark tests
# Type checking
mypy mock_spark --ignore-missing-imports
```
---
## License
MIT License - see [LICENSE](LICENSE) file for details.
---
## Links
- **GitHub**: [github.com/eddiethedean/mock-spark](https://github.com/eddiethedean/mock-spark)
- **PyPI**: [pypi.org/project/mock-spark](https://pypi.org/project/mock-spark/)
- **Issues**: [github.com/eddiethedean/mock-spark/issues](https://github.com/eddiethedean/mock-spark/issues)
- **Documentation**: [Full documentation](docs/)
---
<div align="center">
**Built with ❤️ for the PySpark community**
*Star ⭐ this repo if Mock Spark helps speed up your tests!*
</div>