datasets-sql


Name: datasets-sql
Version: 0.4.0
Home page: https://github.com/mariosasko/datasets_sql
Summary: datasets_sql is an extension package of 🤗 Datasets package that provides support for executing arbitrary SQL queries on datasets.
Upload time: 2024-01-24 16:37:14
Author: Mario Šaško
Requires Python: >=3.7.0
License: Apache
Keywords: datasets

# datasets_sql

A 🤗 Datasets extension package that provides support for executing arbitrary SQL queries on HF datasets. It uses [DuckDB](https://duckdb.org/) as a SQL engine and follows its [query syntax](https://duckdb.org/docs/sql/introduction#querying-a-table).

## Installation

```bash
pip install datasets_sql
```

## Quick Start

```python
from datasets import load_dataset, Dataset
from datasets_sql import query

imdb_dset = load_dataset("imdb", split="train")

# Keep only the rows where the `text` field has more than 1000 characters
imdb_query_dset1 = query("SELECT text FROM imdb_dset WHERE length(text) > 1000")

# Count the number of rows per label
imdb_query_dset2 = query("SELECT label, COUNT(*) as num_rows FROM imdb_dset GROUP BY label")

# Remove duplicate `text` values
imdb_query_dset3 = query("SELECT DISTINCT text FROM imdb_dset")

# Get the average length of the `text` field
imdb_query_dset4 = query("SELECT AVG(length(text)) as avg_text_length FROM imdb_dset")

order_customer_dset = Dataset.from_dict({
    "order_id": [10001, 10002, 10003],
    "customer_id": [3, 1, 2],
})

customer_dset = Dataset.from_dict({
    "customer_id": [1, 2, 3],
    "name": ["John", "Jane", "Mary"],
})

# Join two tables
join_query_dset = query(
    "SELECT order_id, name FROM order_customer_dset INNER JOIN customer_dset ON order_customer_dset.customer_id = customer_dset.customer_id"
)
```
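
To see what the join query is expected to return without downloading anything, the same SQL can be run against a tiny in-memory database. The sketch below is only a sanity check under stated assumptions: it swaps in Python's standard-library `sqlite3` for DuckDB, builds the two tables by hand instead of going through `Dataset.from_dict`, and adds an `ORDER BY` clause (not in the original query) so the row order is deterministic. It does not show how `datasets_sql` itself resolves dataset names or executes queries.

```python
import sqlite3

# Stand-in engine: an in-memory SQLite database instead of DuckDB
# (an assumption made to keep the example dependency-free).
con = sqlite3.connect(":memory:")

# Recreate the two example tables with the same names and values as above.
con.execute("CREATE TABLE order_customer_dset (order_id INTEGER, customer_id INTEGER)")
con.execute("CREATE TABLE customer_dset (customer_id INTEGER, name TEXT)")
con.executemany(
    "INSERT INTO order_customer_dset VALUES (?, ?)",
    [(10001, 3), (10002, 1), (10003, 2)],
)
con.executemany(
    "INSERT INTO customer_dset VALUES (?, ?)",
    [(1, "John"), (2, "Jane"), (3, "Mary")],
)

# Same join as in the Quick Start, plus ORDER BY for a stable row order.
join_rows = con.execute(
    "SELECT order_id, name FROM order_customer_dset "
    "INNER JOIN customer_dset "
    "ON order_customer_dset.customer_id = customer_dset.customer_id "
    "ORDER BY order_id"
).fetchall()

print(join_rows)  # [(10001, 'Mary'), (10002, 'John'), (10003, 'Jane')]
con.close()
```

Each order is matched to the customer with the same `customer_id`, so the result pairs every `order_id` with a `name`.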

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/mariosasko/datasets_sql",
    "name": "datasets-sql",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7.0",
    "maintainer_email": "",
    "keywords": "datasets",
    "author": "Mario \u0160a\u0161ko",
    "author_email": "mariosasko777@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/77/60/62f680f0b4aad9ccbbd08f0885d55eb1d214b21e86fd08d319ad76b36246/datasets_sql-0.4.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache",
    "summary": "datasets_sql is an extension package of \ud83e\udd17 Datasets package that provides support for executing arbitrary SQL queries on datasets.",
    "version": "0.4.0",
    "project_urls": {
        "Homepage": "https://github.com/mariosasko/datasets_sql"
    },
    "split_keywords": [
        "datasets"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "2f79faea9bde92a8ec329934998a56f6a68d25b60da39ef9af0a1d38619fe772",
                "md5": "436d534eafd5d45877dc8f793e9a5602",
                "sha256": "e039965fa6519cab5d8e93e685f081ee27f62b68c634cf92d9a5ebb13163cc30"
            },
            "downloads": -1,
            "filename": "datasets_sql-0.4.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "436d534eafd5d45877dc8f793e9a5602",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7.0",
            "size": 9447,
            "upload_time": "2024-01-24T16:37:12",
            "upload_time_iso_8601": "2024-01-24T16:37:12.593863Z",
            "url": "https://files.pythonhosted.org/packages/2f/79/faea9bde92a8ec329934998a56f6a68d25b60da39ef9af0a1d38619fe772/datasets_sql-0.4.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "776062f680f0b4aad9ccbbd08f0885d55eb1d214b21e86fd08d319ad76b36246",
                "md5": "a99d793bccf514e7618f82b7c1f9a852",
                "sha256": "d68fb0f5718d66ce6c9a5249400cefa69be0567ce13ce1514b10f8857036943f"
            },
            "downloads": -1,
            "filename": "datasets_sql-0.4.0.tar.gz",
            "has_sig": false,
            "md5_digest": "a99d793bccf514e7618f82b7c1f9a852",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7.0",
            "size": 9810,
            "upload_time": "2024-01-24T16:37:14",
            "upload_time_iso_8601": "2024-01-24T16:37:14.219817Z",
            "url": "https://files.pythonhosted.org/packages/77/60/62f680f0b4aad9ccbbd08f0885d55eb1d214b21e86fd08d319ad76b36246/datasets_sql-0.4.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-01-24 16:37:14",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "mariosasko",
    "github_project": "datasets_sql",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "datasets-sql"
}
        