# SparkGrep

[SonarCloud](https://sonarcloud.io/summary/new_code?id=sparkgrep) · [Python 3.8+](https://www.python.org/downloads/) · [Ruff](https://github.com/astral-sh/ruff) · [Bandit](https://github.com/PyCQA/bandit) · [MIT License](LICENSE)

Pre-commit hook that detects debugging leftovers in Apache Spark applications.
## 🎯 Purpose
SparkGrep helps maintain clean Apache Spark codebases by detecting common debugging leftovers and performance anti-patterns that developers often forget to remove before committing code.
### 🔍 What it Detects
- **`display()` calls** - Jupyter/Databricks debugging function
- **`.show()` methods** - DataFrame inspection calls
- **`.collect()` without assignment** - Potential performance issues
- **`.count()` without assignment** - Unnecessary computations
- **Custom patterns** - User-defined patterns via configuration
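
For illustration, here is the kind of PySpark file the default rules above would flag (a hypothetical example; `display()` is the Databricks/Jupyter helper, not a plain-Python builtin, and the input path is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.read.parquet("/data/events")  # hypothetical input path

display(df)   # flagged: display() debugging call (Databricks/Jupyter only)
df.show()     # flagged: .show() inspection call
df.collect()  # flagged: .collect() whose result is discarded
df.count()    # flagged: .count() whose result is discarded

row_count = df.count()  # not flagged: the result is assigned and used
print(f"Processing {row_count} rows")
```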
## 🚀 Installation
```bash
pip install sparkgrep
```
## 📋 Usage
### As a Pre-commit Hook
Add to your `.pre-commit-config.yaml`:
```yaml
repos:
  - repo: https://github.com/leandroasaservice/sparkgrep
    rev: v0.1.1a1  # Use this preview version.
    hooks:
      - id: sparkgrep
```
### Command Line
```bash
# Check specific files
sparkgrep src/my_script.py notebook.ipynb

# Check with additional patterns
sparkgrep --additional-patterns "debug_print:Debug print statement" src/

# Disable default patterns and use only custom ones
sparkgrep --disable-default-patterns --additional-patterns "my_pattern:My description" src/
```
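
Custom patterns can also be used in the hook itself by forwarding the same CLI flags through pre-commit's standard `args` field (a sketch, assuming the hook entry point accepts the same flags as the command line above):

```yaml
repos:
  - repo: https://github.com/leandroasaservice/sparkgrep
    rev: v0.1.1a1
    hooks:
      - id: sparkgrep
        args: ["--additional-patterns", "debug_print:Debug print statement"]
```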
----
## 🛡️ Security & Quality
This project maintains high security and code quality standards:
### 🔒 Security Measures
- **Automated vulnerability detection** and issue creation
- **Admin-protected CI/CD** pipelines
- **Dependency vulnerability monitoring**
### 📊 Code Quality
- **80% minimum code coverage** enforced in CI
- **SonarCloud integration** for continuous code quality analysis
- **Automated testing** on every PR
- **Code formatting** with Ruff
----
## 📁 Project Structure
```sh
sparkgrep/
├── src/sparkgrep/           # Main package
│   ├── cli.py               # Command-line interface
│   ├── patterns.py          # Pattern definitions
│   ├── file_processors.py   # File processing logic
│   └── utils.py             # Utility functions
├── tests/                   # Test suite
│   ├── unit/                # Unit tests
│   └── integration/         # Integration tests
├── .github/                 # GitHub configuration
│   ├── workflows/           # CI/CD pipelines
│   └── ISSUE_TEMPLATE/      # Issue templates
└── doc/                     # Documentation
```
## 🤝 Contributing
1. **Fork** the repository
2. **Create** a feature branch (`git checkout -b feature/amazing-feature`)
3. **Make** your changes with tests
4. **Ensure** all checks pass (`task quality`, `task test`)
5. **Submit** a pull request
### Contribution Guidelines
- **Tests required** for all new features
- **Security scans** must pass
- **Code coverage** must remain ≥ 80%
- **Admin approval** required for all PRs to main
- **Follow** existing code style and patterns

See [CONTRIBUTING.md](doc/CONTRIBUTING.md) for details.
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 📞 Support
- **Issues**: [GitHub Issues](https://github.com/leandroasaservice/sparkgrep/issues)
- **Discussions**: [GitHub Discussions](https://github.com/leandroasaservice/sparkgrep/discussions)
- **Documentation**: [Project Docs](doc/)
----
## Made with ❤️ for the Apache Spark community