sparkgrep

Name: sparkgrep
Version: 0.1.0a0
Summary: Pre-commit hooks for Apache Spark development (Databricks, EMR, Dataproc, and more)
Upload time: 2025-08-03 14:39:58
Requires Python: >=3.8
License: MIT
Keywords: spark, databricks, pre-commit, code-quality, linting
Requirements: pre-commit, black, flake8, nbformat
# SparkGrep 🚀

[![CI Status](https://github.com/leandroasaservice/sparkgrep/workflows/CI%20Pipeline/badge.svg)](https://github.com/leandroasaservice/sparkgrep/actions/workflows/ci.yml)
[![Quality Gate Status](https://sonarcloud.io/api/project_badges/measure?project=leandroasaservice_sparkgrep&metric=alert_status)](https://sonarcloud.io/summary/new_code?id=leandroasaservice_sparkgrep)
[![Security Rating](https://sonarcloud.io/api/project_badges/measure?project=leandroasaservice_sparkgrep&metric=security_rating)](https://sonarcloud.io/summary/new_code?id=leandroasaservice_sparkgrep)
[![Coverage](https://sonarcloud.io/api/project_badges/measure?project=leandroasaservice_sparkgrep&metric=coverage)](https://sonarcloud.io/summary/new_code?id=leandroasaservice_sparkgrep)
[![Maintainability Rating](https://sonarcloud.io/api/project_badges/measure?project=leandroasaservice_sparkgrep&metric=maintainability_rating)](https://sonarcloud.io/summary/new_code?id=leandroasaservice_sparkgrep)
[![Reliability Rating](https://sonarcloud.io/api/project_badges/measure?project=leandroasaservice_sparkgrep&metric=reliability_rating)](https://sonarcloud.io/summary/new_code?id=leandroasaservice_sparkgrep)
[![Vulnerabilities](https://sonarcloud.io/api/project_badges/measure?project=leandroasaservice_sparkgrep&metric=vulnerabilities)](https://sonarcloud.io/summary/new_code?id=leandroasaservice_sparkgrep)
[![Lines of Code](https://sonarcloud.io/api/project_badges/measure?project=leandroasaservice_sparkgrep&metric=ncloc)](https://sonarcloud.io/summary/new_code?id=leandroasaservice_sparkgrep)

[![Python Version](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
[![Code style: Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![Security: Bandit](https://img.shields.io/badge/security-bandit-yellow.svg)](https://github.com/PyCQA/bandit)
[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)

Pre-commit hook that detects debugging leftovers and anti-patterns in Apache Spark applications.

## 🎯 Purpose

SparkGrep helps maintain clean Apache Spark codebases by detecting common debugging leftovers and performance anti-patterns that developers often forget to remove before committing code.

## 🔍 What it Detects

- **`display()` calls** - Jupyter/Databricks debugging function
- **`.show()` methods** - DataFrame inspection calls
- **`.collect()` without assignment** - Potential performance issues
- **`.count()` without assignment** - Unnecessary computations
- **Custom patterns** - User-defined patterns via configuration
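
To illustrate the kind of check involved, here is a minimal, hypothetical sketch of regex-based detection over source lines. The patterns below are illustrative; the actual patterns shipped with SparkGrep may differ.

```python
import re

# Illustrative patterns, roughly matching the list above (not SparkGrep's actual rules).
DEFAULT_PATTERNS = [
    (re.compile(r"\bdisplay\s*\("), "display() call"),
    (re.compile(r"\.show\s*\("), ".show() method"),
    (re.compile(r"^\s*\w[\w.]*\.collect\s*\(\s*\)\s*$"), ".collect() without assignment"),
    (re.compile(r"^\s*\w[\w.]*\.count\s*\(\s*\)\s*$"), ".count() without assignment"),
]

def find_issues(source: str):
    """Return (line_number, description) pairs for lines matching a pattern."""
    issues = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern, description in DEFAULT_PATTERNS:
            if pattern.search(line):
                issues.append((lineno, description))
    return issues

sample = "df = spark.read.parquet('data')\ndf.show()\ndf.count()\n"
print(find_issues(sample))  # flags the .show() call and the bare .count()
```

The "without assignment" checks only fire on lines that consist solely of the call, so `n = df.count()` passes while a bare `df.count()` is flagged.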

## 🚀 Installation

```bash
pip install sparkgrep
```

## 📋 Usage

### As a Pre-commit Hook

Add to your `.pre-commit-config.yaml`:

```yaml
repos:
  - repo: https://github.com/leandroasaservice/sparkgrep
    rev: v1.0.0  # Use the latest version
    hooks:
      - id: sparkgrep
```

### Command Line

```bash
# Check specific files
sparkgrep src/my_script.py notebook.ipynb

# Check with additional patterns
sparkgrep --additional-patterns "debug_print:Debug print statement" src/

# Disable default patterns and use only custom ones
sparkgrep --disable-default-patterns --additional-patterns "my_pattern:My description" src/
```

### Configuration

Create a `.sparkgrep.json` file in your project root:

```json
{
  "additional_patterns": [
    "logger\\.debug\\(.*\\):Debug logging statement",
    "print\\(.*\\):Print statement"
  ],
  "disable_default_patterns": false
}
```
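
Each entry packs a regex and a human-readable description into one string. The real parser may differ; a minimal sketch, assuming entries split on their last colon (so regexes containing `:` still parse):

```python
import json
import re

def parse_pattern(entry: str):
    """Split a 'regex:description' entry on its last colon.
    (Assumed convention; SparkGrep's actual parsing may differ.)"""
    pattern, _, description = entry.rpartition(":")
    return re.compile(pattern), description

# The same config as above, embedded as a JSON string for a self-contained demo.
config = json.loads(r"""
{
  "additional_patterns": [
    "logger\\.debug\\(.*\\):Debug logging statement",
    "print\\(.*\\):Print statement"
  ],
  "disable_default_patterns": false
}
""")

patterns = [parse_pattern(p) for p in config["additional_patterns"]]
regex, desc = patterns[1]
print(desc)                                # Print statement
print(bool(regex.search("print('hi')")))   # True
```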

## 🛡️ Security & Quality

This project maintains high security and code quality standards:

### 🔒 Security Measures
- **Daily security scans** with Bandit, Safety, and GitGuardian
- **Automated vulnerability detection** and issue creation
- **Admin-protected CI/CD** pipelines
- **Dependency vulnerability monitoring**

### 📊 Code Quality
- **80% minimum code coverage** enforced in CI
- **SonarCloud integration** for continuous code quality analysis
- **Automated testing** on every PR
- **Code formatting** with Ruff

### 🚦 CI/CD Pipeline

The CI pipeline runs automatically on:
- **Pull requests to main** (requires admin approval)
- **Manual dispatch** (admin-only)

Pipeline includes:
- Comprehensive test suite with 80% coverage requirement
- Security scans (Bandit, Safety, GitGuardian)
- Code quality analysis (SonarCloud)
- Linting and formatting checks

**Quality Gates:**
- โŒ **Pipeline fails** if coverage < 80%
- โŒ **Pipeline fails** if critical vulnerabilities found
- โœ… **Pipeline passes** only when all checks succeed

## 🧪 Development

### Setup

```bash
# Clone the repository
git clone https://github.com/leandroasaservice/sparkgrep.git
cd sparkgrep

# Install in development mode
pip install -e .
pip install -r requirements.txt

# Install pre-commit hooks
pre-commit install
```

### Running Tests

```bash
# Run all tests with coverage
task test

# Run specific test categories
task test:unit
task test:integration

# Generate coverage report
task test:cov
```

### Security Scanning

```bash
# Run security scans locally
bandit -r src/
safety check
ggshield secret scan ci  # Requires GitGuardian API key
```

### Code Quality

```bash
# Format code
ruff format .

# Lint code
ruff check .

# Type checking (if using mypy)
mypy src/
```

## ๐Ÿ“ Project Structure

```
sparkgrep/
├── src/sparkgrep/          # Main package
│   ├── cli.py              # Command-line interface
│   ├── patterns.py         # Pattern definitions
│   ├── file_processors.py  # File processing logic
│   └── utils.py            # Utility functions
├── tests/                  # Test suite
│   ├── unit/               # Unit tests
│   └── integration/        # Integration tests
├── .github/                # GitHub configuration
│   ├── workflows/          # CI/CD pipelines
│   └── ISSUE_TEMPLATE/     # Issue templates
└── docs/                   # Documentation
```

## 🤝 Contributing

1. **Fork** the repository
2. **Create** a feature branch (`git checkout -b feature/amazing-feature`)
3. **Make** your changes with tests
4. **Ensure** all checks pass (`task test`, security scans)
5. **Submit** a pull request

### Contribution Guidelines

- **Tests required** for all new features
- **Security scans** must pass
- **Code coverage** must remain ≥ 80%
- **Admin approval** required for all PRs to main
- **Follow** existing code style and patterns

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🔒 Security

For security vulnerabilities, please:

1. **Create a security issue** using our [security template](.github/ISSUE_TEMPLATE/security_report.md)
2. **Contact maintainers** directly for critical issues
3. **Follow responsible disclosure** practices

Our security measures include automated daily scans and continuous monitoring.

## 📞 Support

- **Issues**: [GitHub Issues](https://github.com/leandroasaservice/sparkgrep/issues)
- **Discussions**: [GitHub Discussions](https://github.com/leandroasaservice/sparkgrep/discussions)
- **Documentation**: [Project Docs](doc/)

---

**Made with ❤️ for the Apache Spark community**

            
