# Spark Simplicity 🚀
[Python 3.8+](https://www.python.org/downloads/)
[Apache Spark 3.5+](https://spark.apache.org/)
[Code style: black](https://github.com/psf/black)
[License: MIT](https://opensource.org/licenses/MIT)
**Transform complex PySpark operations into simple, readable code.**
Spark Simplicity is a production-ready Python package that simplifies Apache Spark workflows with an intuitive API. Whether you're building ETL pipelines, analyzing big data, or processing streams, focus on your data logic instead of Spark boilerplate.
## ✨ Key Features
- **🎯 Intuitive API**: Simple, readable functions like `load_csv()`, `write_parquet()`
- **⚡ Optimized Performance**: Built-in broadcast joins, partitioning, and caching strategies
- **🏭 Production-Ready**: Environment-specific configurations for dev, test, and production
- **📊 Rich I/O Support**: CSV, JSON, Parquet, Excel, and fixed-width files with intelligent defaults
- **🔧 Advanced Connections**: Database (JDBC), SFTP, REST API, and email integrations
- **🎚️ Session Management**: Optimized Spark sessions with automatic resource management
- **🛡️ Enterprise Security**: Comprehensive validation, error handling, and logging
- **💻 Windows Compatible**: Automatic Hadoop workarounds for seamless Windows development
## 🚀 Quick Start
### Installation
```bash
pip install spark-simplicity
```
### Basic Usage
```python
from spark_simplicity import get_spark_session, load_csv, write_parquet
# Create optimized Spark session
spark = get_spark_session("my_app")
# Load data with intelligent defaults
customers = load_csv(spark, "customers.csv")
orders = load_csv(spark, "orders.csv")
# Simple DataFrame operations
result = customers.join(orders, "customer_id", "left")
# Write optimized output
write_parquet(result, "customer_orders.parquet")
```
That's it! No complex configurations, no boilerplate code.
## 📚 Core Modules
### 🎛️ Session Management
Create optimized Spark sessions for different environments:
```python
from spark_simplicity import get_spark_session
# Local development (default)
spark = get_spark_session("my_app")
# Production with optimizations
spark = get_spark_session("prod_app", environment="production")
# Testing with minimal resources
spark = get_spark_session("test_app", environment="testing")
# Custom configuration
spark = get_spark_session(
"custom_app",
config_overrides={
"spark.executor.memory": "8g",
"spark.executor.cores": "4"
}
)
```
### 📁 I/O Operations
Load and save data with automatic optimizations:
```python
from spark_simplicity import (
load_csv, load_excel, load_json, load_parquet, load_positional,
write_csv, write_excel, write_json, write_parquet
)
# Reading data
df = load_csv(spark, "data.csv") # Intelligent CSV parsing
df = load_excel(spark, "data.xlsx", sheet_name="Sales") # Excel support
df = load_json(spark, "data.json") # JSON with schema inference
df = load_parquet(spark, "data.parquet", columns=["id", "name"]) # Column pruning
# Fixed-width files
column_specs = [
("id", 0, 10),
("name", 10, 50),
("amount", 50, 65)
]
df = load_positional(spark, "fixed_width.txt", column_specs)
# Writing data
write_csv(df, "output.csv", single_file=True)
write_parquet(df, "output.parquet", partition_by=["year", "month"])
write_json(df, "output.json", pretty_print=True)
write_excel(df, "output.xlsx", sheet_name="Results")
```
### 🔗 Enterprise Connections
Robust connection handling for enterprise environments:
```python
from spark_simplicity import (
JdbcSqlServerConnection, SftpConnection,
RestApiConnection, EmailSender
)
# Database connections
db = JdbcSqlServerConnection(
server="sql-server.company.com",
database="datawarehouse",
username="user",
password="password"
)
df = db.read_table(spark, "sales_data")
# SFTP file operations
sftp = SftpConnection(
hostname="sftp.company.com",
username="user",
private_key_path="/path/to/key"
)
sftp.download_file("/remote/data.csv", "/local/data.csv")
# REST API integration
api = RestApiConnection(base_url="https://api.company.com")
response = api.get("/data/endpoint", headers={"API-Key": "secret"})
# Email notifications
email = EmailSender(
smtp_server="smtp.company.com",
smtp_port=587,
username="notifications@company.com",
password="password"
)
email.send_email(
to=["team@company.com"],
subject="ETL Pipeline Completed",
body="Your daily ETL pipeline has finished successfully."
)
```
## 🏗️ Advanced Features
### Environment-Specific Configurations
Spark Simplicity provides optimized configurations for different environments:
| Environment | Memory | Cores | Use Case |
|------------|--------|-------|----------|
| **Development** | 2GB | 2 | Interactive development, debugging |
| **Testing** | 512MB | 1 | CI/CD pipelines, unit tests |
| **Production** | 8GB | 4 | Production workloads, batch processing |
| **Local** | Auto-detect | Auto-detect | Single-machine processing |
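In deployment scripts it is common to derive the environment from a variable rather than hard-coding it. A minimal sketch, assuming the `environment` values accepted by `get_spark_session()` mirror the names in the table above and that `config_overrides` can be combined with a named environment (the `APP_ENV` variable is purely illustrative):
```python
import os
from spark_simplicity import get_spark_session

# APP_ENV is a hypothetical deployment variable; "development" is assumed
# here as the fallback profile, mirroring the table above.
env = os.environ.get("APP_ENV", "development")

spark = get_spark_session(
    "etl_pipeline",
    environment=env,
    # Overrides are assumed to apply on top of the per-environment defaults,
    # as in the custom-configuration example earlier in this README.
    config_overrides={"spark.sql.shuffle.partitions": "200"},
)
```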
### Windows Compatibility
Built-in Windows support with automatic Hadoop workarounds:
- ✅ Automatic Hadoop configuration bypass
- ✅ Windows-safe file system operations
- ✅ Suppressed Hadoop native library warnings
- ✅ Python executable path configuration
- ✅ In-memory catalog by default
### Comprehensive Logging
Integrated logging system with multiple levels:
```python
from spark_simplicity import get_logger
# Get specialized logger
logger = get_logger("my_application")
# Different log levels
logger.info("Processing started")
logger.warning("Data quality issue detected")
logger.error("Processing failed")
```
## 📖 Architecture
```
spark-simplicity/
├── session.py                 # Spark session management
├── io/                        # I/O operations
│   ├── readers/               # CSV, JSON, Parquet, Excel readers
│   ├── writers/               # Optimized writers with compression
│   ├── utils/                 # File utilities and format detection
│   └── validation/            # Path validation and mount checking
├── connections/               # Enterprise integrations
│   ├── database_connection.py # JDBC SQL Server
│   ├── sftp_connection.py     # SFTP with retry logic
│   ├── rest_api_connection.py # REST API client
│   └── email_connection.py    # SMTP email sender
├── logger.py                  # Centralized logging
├── utils.py                   # DataFrame utilities
├── exceptions.py              # Custom exceptions
└── notification_service.py    # Notification management
```
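Everything above is consumed through top-level imports, as in the earlier examples, so a typical job combines session management, I/O, connections, and logging without touching the internal layout. A condensed sketch (server name, credentials, file paths, and the `sales_data` table and `store_id` join key are placeholders):
```python
from spark_simplicity import (
    JdbcSqlServerConnection,
    get_logger,
    get_spark_session,
    load_csv,
    write_parquet,
)

# logger.py: centralized logging
logger = get_logger("daily_sales_etl")

# session.py: environment-specific session
spark = get_spark_session("daily_sales_etl", environment="production")

# io/ readers: load reference data from a CSV file
stores = load_csv(spark, "stores.csv")

# connections/: pull the fact table over JDBC (placeholder credentials)
db = JdbcSqlServerConnection(
    server="sql-server.company.com",
    database="datawarehouse",
    username="user",
    password="password",
)
sales = db.read_table(spark, "sales_data")

# Plain PySpark DataFrame operations work as usual
enriched = sales.join(stores, "store_id", "left")

# io/ writers: partitioned Parquet output
write_parquet(enriched, "daily_sales.parquet", partition_by=["year", "month"])
logger.info("Daily sales ETL finished")
```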
## 🛠️ Development
### Prerequisites
- Python 3.8+
- Java 8, 11, or 17 (for Spark)
- Apache Spark 3.5+
### Development Setup
```bash
# Clone the repository
git clone https://github.com/FabienBarrios/spark-simplicity.git
cd spark-simplicity
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# Run code quality checks
black spark_simplicity/
isort spark_simplicity/
flake8 spark_simplicity/
mypy spark_simplicity/
```
### Testing
```bash
# Run all tests
pytest tests/ -v
# Run specific test categories
pytest -m unit # Unit tests only
pytest -m integration # Integration tests only
pytest -m "not slow" # Skip slow tests
# Run with coverage
pytest --cov=spark_simplicity --cov-report=html
```
### Code Quality
The project maintains high code quality standards:
- **Black**: Code formatting (88 character line length)
- **isort**: Import sorting
- **Flake8**: Linting with additional plugins
- **Mypy**: Type checking
- **Bandit**: Security analysis
- **Pre-commit**: Automated quality checks
## 📊 Performance & Best Practices
### Optimized Session Management
```python
# Use environment-specific configurations
spark = get_spark_session("app", environment="production")
# Monitor session resources
from spark_simplicity import get_session_info, print_session_summary
info = get_session_info(spark)
print(f"Available executors: {info['executor_count']}")
print_session_summary(spark)
```
### Efficient I/O Operations
```python
# Leverage column pruning
df = load_parquet(spark, "large_file.parquet", columns=["id", "name"])
# Use partitioning for large datasets
write_parquet(df, "output.parquet", partition_by=["year", "month"])
# Optimize file output
write_csv(df, "output.csv", single_file=True, compression="gzip")
```
### Connection Pooling
```python
# Reuse connections efficiently
db = JdbcSqlServerConnection(server="...", database="...")
# Read multiple tables with same connection
customers = db.read_table(spark, "customers")
orders = db.read_table(spark, "orders")
products = db.read_table(spark, "products")
```
## 🧪 Testing Strategy
Comprehensive testing at multiple levels:
- **Unit Tests**: Fast, isolated component testing
- **Integration Tests**: Real Spark cluster testing
- **Performance Tests**: Benchmark and profiling
- **Security Tests**: Vulnerability and penetration testing
- **Property-Based Tests**: Hypothesis-driven testing
Coverage targets:
- **Minimum**: 90% overall coverage
- **Target**: 95%+ for core modules
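These levels map onto pytest markers. A minimal sketch, assuming the `unit`, `integration`, and `slow` markers shown in the Testing commands above are registered in the project's pytest configuration:
```python
import pytest
from spark_simplicity import get_spark_session


@pytest.mark.unit
def test_testing_profile_runs_a_trivial_job():
    # Fast, isolated check: the minimal "testing" profile can still run actions.
    spark = get_spark_session("unit_test_app", environment="testing")
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
    assert df.count() == 2


@pytest.mark.integration
@pytest.mark.slow
def test_end_to_end_pipeline():
    # Would exercise a real cluster; excluded by `pytest -m "not slow"`.
    pytest.skip("requires a provisioned Spark cluster")
```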
## 🔒 Security
Security-first design with comprehensive protections:
- **Input Validation**: All user inputs validated and sanitized
- **SQL Injection Protection**: Parameterized queries and prepared statements
- **Path Traversal Prevention**: Secure file path validation
- **Credential Management**: Secure storage and transmission
- **Audit Logging**: Comprehensive activity logging
- **Error Handling**: Secure error messages without sensitive data exposure
## 🌟 Contributing
We welcome contributions! Please see our contributing guidelines:
1. **Fork** the repository
2. **Create** a feature branch: `git checkout -b feature-name`
3. **Write** tests for your changes
4. **Ensure** all tests pass: `pytest`
5. **Follow** code style: `black` and `isort`
6. **Add** documentation for new features
7. **Submit** a pull request
### Areas for Contribution
- 🧪 **More test cases** - Help us achieve 100% coverage across all modules
- 📚 **Documentation** - Examples, tutorials, API documentation
- ⚡ **Performance optimizations** - New caching strategies, join algorithms
- 🔌 **Integrations** - Support for more databases, cloud storage, file formats
- 🐛 **Bug fixes** - Report and fix issues
- 💡 **Feature requests** - Suggest new functionality
## 🗺️ Roadmap & Future Development
Spark Simplicity is actively evolving to meet the growing needs of the data engineering community. We're committed to continuous improvement and regularly add new features based on user feedback and industry best practices.
### 🚀 Upcoming Features (v1.1.x)
- **🔄 Advanced Join Operations**
  - Window joins for time-series data
  - Fuzzy matching joins
  - Multi-table join optimization
  - Join performance analysis tools
- **📊 Enhanced DataFrame Utilities**
  - Data profiling and quality metrics
  - Automated schema validation
  - Smart partitioning recommendations
  - Performance bottleneck detection
- **🌊 Streaming Support**
  - Simplified Kafka integration
  - Real-time data processing utilities
  - Stream-to-batch conversion helpers
  - Monitoring and alerting for streams
### 🎯 Future Versions (v1.2.x+)
- **🤖 Machine Learning Integration**
  - MLlib workflow simplification
  - Feature engineering utilities
  - Model deployment helpers
  - Pipeline automation tools
- **☁️ Cloud Platform Support**
  - AWS S3/EMR optimizations
  - Azure Data Lake integration
  - Google Cloud Platform support
  - Multi-cloud deployment tools
- **📈 Advanced Analytics**
  - SQL query builder with type safety
  - Data lineage tracking
  - Performance benchmarking suite
  - Cost optimization recommendations
### 🌟 Long-term Vision (v2.0+)
- **🏗️ Next-Generation Architecture**
  - Spark 4.0 compatibility
  - Async operations support
  - Plugin architecture for extensibility
  - Advanced monitoring dashboard
- **🔗 Extended Ecosystem**
  - Delta Lake deep integration
  - Apache Iceberg support
  - Kubernetes-native operations
  - GraphQL API for metadata
### 🤝 Community-Driven Development
We actively listen to our community and prioritize features based on:
- **User feedback** and feature requests
- **Industry trends** and emerging technologies
- **Performance improvements** and optimization opportunities
- **Security enhancements** and compliance requirements
**Want to influence our roadmap?**
- 💡 Submit feature requests in [GitHub Issues](https://github.com/FabienBarrios/spark-simplicity/issues)
- 🗣️ Join discussions in [GitHub Discussions](https://github.com/FabienBarrios/spark-simplicity/discussions)
- 🤝 Contribute code and become a collaborator
---
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🙏 Acknowledgments
- **Apache Spark** community for the powerful distributed computing framework
- **PySpark** developers for the Python API
- **Contributors** who help make this package better
## 📞 Support
- 🐛 **Bug Reports**: [GitHub Issues](https://github.com/FabienBarrios/spark-simplicity/issues)
- 💬 **Discussions**: [GitHub Discussions](https://github.com/FabienBarrios/spark-simplicity/discussions)
- 📧 **Contact**: fabienbarrios@gmail.com
- 📖 **Documentation**: [Read the Docs](https://spark-simplicity.readthedocs.io)
---
**Made with ❤️ for the Spark community**
*Spark Simplicity - Because data engineering should be simple, not complicated.*