# SparkGrep 🚀
[CI](https://github.com/leandroasaservice/sparkgrep/actions/workflows/ci.yml)
[SonarCloud](https://sonarcloud.io/summary/new_code?id=leandroasaservice_sparkgrep)
[Python](https://www.python.org/downloads/)
[Ruff](https://github.com/astral-sh/ruff)
[Bandit](https://github.com/PyCQA/bandit)
[License](LICENSE)
Pre-commit hook that detects debugging leftovers and anti-patterns in Apache Spark applications.
## 🎯 Purpose
SparkGrep helps maintain clean Apache Spark codebases by detecting common debugging leftovers and performance anti-patterns that developers often forget to remove before committing code.
## 🔍 What it Detects
- **`display()` calls** - Jupyter/Databricks debugging function
- **`.show()` methods** - DataFrame inspection calls
- **`.collect()` without assignment** - Potential performance issues
- **`.count()` without assignment** - Unnecessary computations
- **Custom patterns** - User-defined patterns via configuration
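For illustration, a file like the following would trigger findings when scanned. This is a hypothetical notebook-style PySpark snippet (`display` is the Databricks/Jupyter helper and the path is a placeholder); SparkGrep scans the text, so the code is never executed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://bucket/events")  # placeholder path

display(df)   # flagged: Databricks/Jupyter debugging call
df.show()     # flagged: DataFrame inspection left in committed code
df.collect()  # flagged: result discarded, yet the full dataset is pulled to the driver
df.count()    # flagged: triggers a full Spark job whose result is never used

row_count = df.count()  # not flagged: the result is assigned
```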
## 🚀 Installation
```bash
pip install sparkgrep
```
## 📋 Usage
### As a Pre-commit Hook
Add to your `.pre-commit-config.yaml`:
```yaml
repos:
  - repo: https://github.com/leandroasaservice/sparkgrep
    rev: v1.0.0  # Use the latest version
    hooks:
      - id: sparkgrep
```
### Command Line
```bash
# Check specific files
sparkgrep src/my_script.py notebook.ipynb
# Check with additional patterns
sparkgrep --additional-patterns "debug_print:Debug print statement" src/
# Disable default patterns and use only custom ones
sparkgrep --disable-default-patterns --additional-patterns "my_pattern:My description" src/
```
### Configuration
Create a `.sparkgrep.json` file in your project root:
```json
{
  "additional_patterns": [
    "logger\\.debug\\(.*\\):Debug logging statement",
    "print\\(.*\\):Print statement"
  ],
  "disable_default_patterns": false
}
```
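Pattern entries use a `regex:description` format, matching the `--additional-patterns` flag shown above. As a rough mental model (a minimal sketch, not SparkGrep's actual implementation, and assuming the regex contains no literal colon), such entries could be split and applied line by line like this:

```python
import re

def check_lines(lines, patterns):
    """Return (line_number, description) for every line matching a pattern."""
    compiled = []
    for entry in patterns:
        # Split "regex:description" on the first colon (assumes no colon in the regex).
        regex, _, description = entry.partition(":")
        compiled.append((re.compile(regex), description))
    findings = []
    for lineno, line in enumerate(lines, start=1):
        for pattern, description in compiled:
            if pattern.search(line):
                findings.append((lineno, description))
    return findings

source = [
    "df = spark.read.parquet('events')",
    "logger.debug('rows: %s', df.count())",
]
print(check_lines(source, [r"logger\.debug\(.*\):Debug logging statement"]))
# -> [(2, 'Debug logging statement')]
```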
## 🛡️ Security & Quality
This project maintains high security and code quality standards:
### 🔒 Security Measures
- **Daily security scans** with Bandit, Safety, and GitGuardian
- **Automated vulnerability detection** and issue creation
- **Admin-protected CI/CD** pipelines
- **Dependency vulnerability monitoring**
### 📊 Code Quality
- **80% minimum code coverage** enforced in CI
- **SonarCloud integration** for continuous code quality analysis
- **Automated testing** on every PR
- **Code formatting** with Ruff
### 🚦 CI/CD Pipeline
The CI pipeline runs automatically on:
- **Pull requests to main** (requires admin approval)
- **Manual dispatch** (admin-only)
Pipeline includes:
- Comprehensive test suite with 80% coverage requirement
- Security scans (Bandit, Safety, GitGuardian)
- Code quality analysis (SonarCloud)
- Linting and formatting checks
**Quality Gates:**
- ❌ **Pipeline fails** if coverage < 80%
- ❌ **Pipeline fails** if critical vulnerabilities found
- ✅ **Pipeline passes** only when all checks succeed
## 🧪 Development
### Setup
```bash
# Clone the repository
git clone https://github.com/leandroasaservice/sparkgrep.git
cd sparkgrep
# Install in development mode
pip install -e .
pip install -r requirements.txt
# Install pre-commit hooks
pre-commit install
```
### Running Tests
```bash
# Run all tests with coverage
task test
# Run specific test categories
task test:unit
task test:integration
# Generate coverage report
task test:cov
```
### Security Scanning
```bash
# Run security scans locally
bandit -r src/
safety check
ggshield secret scan ci # Requires GitGuardian API key
```
### Code Quality
```bash
# Format code
ruff format .
# Lint code
ruff check .
# Type checking (if using mypy)
mypy src/
```
## 📁 Project Structure
```
sparkgrep/
├── src/sparkgrep/           # Main package
│   ├── cli.py               # Command-line interface
│   ├── patterns.py          # Pattern definitions
│   ├── file_processors.py   # File processing logic
│   └── utils.py             # Utility functions
├── tests/                   # Test suite
│   ├── unit/                # Unit tests
│   └── integration/         # Integration tests
├── .github/                 # GitHub configuration
│   ├── workflows/           # CI/CD pipelines
│   └── ISSUE_TEMPLATE/      # Issue templates
└── docs/                    # Documentation
```
## 🤝 Contributing
1. **Fork** the repository
2. **Create** a feature branch (`git checkout -b feature/amazing-feature`)
3. **Make** your changes with tests
4. **Ensure** all checks pass (`task test`, security scans)
5. **Submit** a pull request
### Contribution Guidelines
- **Tests required** for all new features
- **Security scans** must pass
- **Code coverage** must remain ≥ 80%
- **Admin approval** required for all PRs to main
- **Follow** existing code style and patterns
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🔒 Security
For security vulnerabilities, please:
1. **Create a security issue** using our [security template](.github/ISSUE_TEMPLATE/security_report.md)
2. **Contact maintainers** directly for critical issues
3. **Follow responsible disclosure** practices
Our security measures include automated daily scans and continuous monitoring.
## 📞 Support
- **Issues**: [GitHub Issues](https://github.com/leandroasaservice/sparkgrep/issues)
- **Discussions**: [GitHub Discussions](https://github.com/leandroasaservice/sparkgrep/discussions)
- **Documentation**: [Project Docs](docs/)
---
**Made with ❤️ for the Apache Spark community**