parql

- Name: parql
- Version: 1.0.1 (PyPI)
- Summary: A command-line tool for querying and manipulating Parquet datasets
- Upload time: 2025-08-21 21:34:19
- Author / maintainer email: Abdul Rafey <abdulrafey38@gmail.com>
- Requires Python: >=3.8
- License: not specified in package metadata (README states MIT)
- Keywords: parquet, sql, data, cli, duckdb, pandas
- Project URL: https://github.com/abdulrafey38/parql
- Requirements: duckdb>=0.9.0, click>=8.0.0, rich>=13.0.0, tabulate>=0.9.0, pandas>=1.5.0, pyarrow>=10.0.0, boto3>=1.26.0, google-cloud-storage>=2.5.0, azure-storage-blob>=12.14.0, hdfs>=2.6.0, pytest>=7.0.0, pytest-cov>=4.0.0, black>=22.0.0, flake8>=5.0.0, mypy>=0.991, sphinx>=5.0.0, sphinx-rtd-theme>=1.2.0
# ParQL 🦆

**A powerful command-line tool for querying and manipulating Parquet datasets directly from the terminal.**

ParQL brings pandas-like operations and SQL capabilities to the command line, powered by DuckDB. Query, analyze, visualize, and transform Parquet data instantly, without writing scripts or loading entire datasets into memory. Perfect for data exploration, ETL pipelines, and data quality checks.

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)

## 🚀 Key Features

- **25+ Commands** - Complete data analysis toolkit from the CLI
- **Interactive Shell** - REPL mode for exploratory data analysis  
- **Built-in Visualizations** - ASCII charts and plots in your terminal
- **Advanced Analytics** - Correlations, profiling, percentiles, outliers
- **String Processing** - Text manipulation and pattern matching
- **Cloud Storage** - Native S3, GCS, Azure, and HTTP support
- **Smart Caching** - Automatic query result caching for performance
- **Data Quality** - Validation, assertions, and schema comparison
- **Multiple Formats** - Output to CSV, JSON, Parquet, Markdown

## 🚀 Quick Start

### Installation

```bash
# Install from PyPI
pip install parql

# Or install from source
git clone https://github.com/abdulrafey38/parql.git
cd parql
pip install -e .
```

### Basic Usage

```bash
# Preview data
parql head data/sales.parquet -n 10

# Data analysis
parql profile data/sales.parquet
parql corr data/sales.parquet -c "quantity,price,revenue"

# Filtering and aggregation  
parql select data/sales.parquet -w "revenue > 1000" -c "country,revenue"
parql agg data/sales.parquet -g "country" -a "sum(revenue):total,count():orders"

# Visualizations
parql plot data/sales.parquet -c revenue --chart-type hist --bins 20

# Interactive exploration
parql shell
parql> \l data/sales.parquet sales
parql> SELECT country, SUM(revenue) FROM sales GROUP BY country;

# Export results
parql write data/sales.parquet output.csv --format csv -w "country='US'"
```

### Complete Documentation

📖 **[View Live Documentation](https://abdulrafey38.github.io/parql/)** - Beautiful, interactive documentation with examples

📖 **[Commands Reference](https://abdulrafey38.github.io/parql/commands.html)** - Complete command reference with examples

📖 **[DOCUMENTATION.md](DOCUMENTATION.md)** - Markdown documentation for offline reference

## 📊 Command Categories

### 🔍 **Data Exploration**
- `head`, `tail`, `schema`, `sample` - Quick data inspection
- `profile` - Comprehensive data quality reports  
- `corr` - Correlation analysis between columns
- `percentiles` - Detailed percentile statistics

### 📈 **Analytics & Aggregation**
- `agg` - Group by and aggregate operations
- `window` - Window functions (ranking, moving averages)
- `pivot` - Pivot tables and data reshaping
- `sql` - Custom SQL queries with full DuckDB support

### 🔧 **Data Processing**  
- `select` - Filter rows and select columns
- `join` - Multi-table joins with various strategies
- `str` - String manipulation and text processing
- `pattern` - Advanced pattern matching with regex

### 📊 **Visualization & Quality**
- `plot` - ASCII charts (histograms, bar charts, scatter plots)
- `assert` - Data validation and quality checks
- `outliers` - Statistical outlier detection
- `nulls` - Missing value analysis

### 🖥️ **System & Productivity**
- `shell` - Interactive REPL mode for exploration
- `config` - Profile and settings management
- `cache` - Query result caching for performance
- `write` - Export to multiple formats
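
Every command listed above is a Click subcommand (see Built With below), so `--help` on any of them prints its options; a quick way to explore commands not covered by the examples in this README:

```bash
# List every command
parql --help

# Show the options of a single command, e.g. agg or plot
parql agg --help
parql plot --help
```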

## 💡 Quick Examples

### Data Exploration
```bash
# Get a quick overview
parql head data/sales.parquet -n 5
parql schema data/sales.parquet
parql profile data/sales.parquet

# Statistical analysis
parql corr data/sales.parquet -c "quantity,price,revenue"
parql percentiles data/sales.parquet -c "revenue"
```

### Data Analysis
```bash
# Aggregations and grouping
parql agg data/sales.parquet -g "country" -a "sum(revenue):total,count():orders"

# Window functions
parql window data/sales.parquet --partition "user_id" --order "timestamp" --expr "row_number() as rank"

# SQL queries
parql sql "SELECT country, SUM(revenue) FROM t GROUP BY country ORDER BY 2 DESC" -p t=data/sales.parquet
```

### Visualizations
```bash
# Charts in your terminal
parql plot data/sales.parquet -c revenue --chart-type hist --bins 20
parql plot data/sales.parquet -c country --chart-type bar
```

### Interactive Mode
```bash
parql shell
parql> \l data/sales.parquet sales
parql> \l data/users.parquet users  
parql> SELECT u.country, AVG(s.revenue) FROM users u JOIN sales s ON u.user_id = s.user_id GROUP BY u.country;
```

## 🌐 Remote Data Sources

ParQL works with data anywhere:

```bash
# AWS S3
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret  
parql head s3://bucket/path/data.parquet

# Google Cloud Storage
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
parql agg gs://bucket/data/*.parquet -g region -a "count():total"

# Public GCS Datasets
parql head gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet
parql agg gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet -g color -a "avg(price):avg_price"

# Azure Blob Storage
export AZURE_STORAGE_ACCOUNT=your_account
export AZURE_STORAGE_KEY=your_key

# Azure Data Lake Storage (Gen2)
parql head abfs://container@account.dfs.core.windows.net/path/data.parquet

# Azure Blob Storage (Hadoop-style)
parql head wasbs://container@account.blob.core.windows.net/path/data.parquet

# Public Azure files via HTTPS
parql head https://account.blob.core.windows.net/container/path/data.parquet

# HDFS (Hadoop Distributed File System)
export HDFS_NAMENODE=localhost
export HDFS_PORT=9000
parql head hdfs://localhost/tmp/save/part-r-00000-6a3ccfae-5eb9-4a88-8ce8-b11b2644d5de.gz.parquet

# HTTP/HTTPS
parql head https://example.com/data.parquet

# Multiple files and glob patterns
parql head "data/2024/*.parquet" -n 10
parql agg "data/sales/year=*/month=*/*.parquet" -g year,month
```
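
ParQL is powered by DuckDB, so the same remote paths can also be queried directly through DuckDB's `httpfs` extension. A minimal sketch of the S3 case is below; the bucket, path, and credentials are placeholders, and this shows DuckDB's own API rather than ParQL's internals:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")  # adds s3:// and http(s):// readers
con.execute("LOAD httpfs;")

# S3 credentials (placeholders)
con.execute("SET s3_access_key_id = 'your_key';")
con.execute("SET s3_secret_access_key = 'your_secret';")

rows = con.sql(
    "SELECT * FROM read_parquet('s3://bucket/path/data.parquet') LIMIT 5"
).df()
print(rows)
```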

## 🎯 Why ParQL?

### Before ParQL
```python
# Traditional approach - slow, memory intensive
import pandas as pd
df = pd.read_parquet("large_file.parquet")  # Load entire file
result = df[df['revenue'] > 1000].groupby('country')['revenue'].sum()
print(result)
```

### With ParQL  
```bash
# Fast, memory efficient, one command
parql agg data.parquet -g country -a "sum(revenue):total" -w "revenue > 1000"
```

## 📈 Performance

- **Columnar Processing** - Only reads necessary columns
- **Parallel Execution** - Multi-threaded operations  
- **Memory Efficient** - Streams large datasets
- **Cloud Optimized** - Predicate pushdown for remote data
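
Most of this behavior comes from DuckDB's Parquet reader. As a rough illustration (not ParQL's internals; the path is a placeholder), the direct DuckDB equivalent of the aggregation above reads only the `country` and `revenue` columns and can skip row groups whose min/max statistics rule out `revenue > 1000`:

```python
import duckdb

# Projection pushdown: only `country` and `revenue` are read from the file.
# Filter pushdown: row groups whose statistics exclude revenue > 1000 are skipped.
totals = duckdb.sql("""
    SELECT country, SUM(revenue) AS total
    FROM read_parquet('data/sales.parquet')
    WHERE revenue > 1000
    GROUP BY country
""").df()
print(totals)
```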

## 🛠️ Development

```bash
# Clone and setup
git clone https://github.com/abdulrafey38/parql.git
cd parql
python -m venv .env
source .env/bin/activate
pip install -e .

# Run tests
pytest tests/

# Check all features
parql --help
```

## 📝 License

MIT License - see [LICENSE](LICENSE) file for details.

## 🙏 Built With

- **[DuckDB](https://duckdb.org/)** - High-performance analytical database
- **[Rich](https://github.com/willmcgugan/rich)** - Beautiful terminal output
- **[Click](https://click.palletsprojects.com/)** - Command-line interface framework

---

⭐ **If ParQL helps you, please star this repo!** ⭐

            
