# datasets_sql
A 🤗 Datasets extension package that provides support for executing arbitrary SQL queries on HF datasets. It uses [DuckDB](https://duckdb.org/) as a SQL engine and follows its [query syntax](https://duckdb.org/docs/sql/introduction#querying-a-table).
## Installation
```bash
pip install datasets_sql
```
## Quick Start
```python
from datasets import load_dataset, Dataset
from datasets_sql import query

imdb_dset = load_dataset("imdb", split="train")

# Keep only the rows where the `text` field has more than 1000 characters
imdb_query_dset1 = query("SELECT text FROM imdb_dset WHERE length(text) > 1000")

# Count the number of rows per label
imdb_query_dset2 = query("SELECT label, COUNT(*) as num_rows FROM imdb_dset GROUP BY label")

# Remove duplicated rows
imdb_query_dset3 = query("SELECT DISTINCT text FROM imdb_dset")

# Get the average length of the `text` field
imdb_query_dset4 = query("SELECT AVG(length(text)) as avg_text_length FROM imdb_dset")

order_customer_dset = Dataset.from_dict({
    "order_id": [10001, 10002, 10003],
    "customer_id": [3, 1, 2],
})

customer_dset = Dataset.from_dict({
    "customer_id": [1, 2, 3],
    "name": ["John", "Jane", "Mary"],
})

# Join the two tables on `customer_id`
join_query_dset = query(
    "SELECT order_id, name FROM order_customer_dset INNER JOIN customer_dset ON order_customer_dset.customer_id = customer_dset.customer_id"
)
```
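
The results of `query` are themselves datasets (as the variable names in the Quick Start suggest), so they can be inspected and exported like any other 🤗 `Dataset`. A minimal sketch, assuming the Quick Start code above has already been run:

```python
# Inspect the joined result like any other 🤗 Dataset
print(join_query_dset)     # column names and number of rows
print(join_query_dset[0])  # first row as a dict (row order after a join is not guaranteed)

# Export to a pandas DataFrame for further analysis
df = join_query_dset.to_pandas()
print(df.head())
```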