fastparquet
===========
.. image:: https://github.com/dask/fastparquet/actions/workflows/main.yaml/badge.svg
:target: https://github.com/dask/fastparquet/actions/workflows/main.yaml
.. image:: https://readthedocs.org/projects/fastparquet/badge/?version=latest
:target: https://fastparquet.readthedocs.io/en/latest/
fastparquet is a Python implementation of the `parquet
format <https://github.com/apache/parquet-format>`_, aiming to integrate
into Python-based big data workflows. It is used implicitly by
Dask, Pandas and intake-parquet.
We offer a high degree of support for the features of the parquet format and
very competitive performance, all in a small install size and codebase.
Details of this project, how to use it and comparisons to other work can be found in the documentation_.
.. _documentation: https://fastparquet.readthedocs.io
Requirements
------------
(All development is against recent versions in the default Anaconda channels
and/or conda-forge.)
Required:
- numpy
- pandas
- cython >= 0.29.23 (if building from pyx files)
- cramjam
- fsspec
Supported compression algorithms:
- Available by default:
- gzip
- snappy
- brotli
- lz4
- zstandard
- Optionally supported:
- `lzo <https://github.com/jd-boyd/python-lzo>`_
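
A codec is selected by name at write time; per the ``write`` API, the
``compression`` argument may also be a dict mapping column names to codecs,
with ``"_default"`` covering any column not named. A minimal sketch
(file names and columns are illustrative):

.. code-block:: python

    import pandas as pd
    from fastparquet import write

    df = pd.DataFrame({'a': range(1000), 'b': ['x'] * 1000})

    # One codec for every column
    write('one_codec.parq', df, compression='GZIP')

    # Per-column codecs; "_default" applies to any column not listed
    write('per_column.parq', df, compression={'a': 'SNAPPY', '_default': 'GZIP'})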
Installation
------------
Install using conda to get the latest compiled version::
conda install -c conda-forge fastparquet
or install from PyPI::
pip install fastparquet
You may wish to install numpy first, to help pip's resolver.
Pip may install an appropriate wheel or compile from source; for the latter,
you will need a suitable C compiler toolchain on your system.
You can also install the latest version from GitHub::
pip install git+https://github.com/dask/fastparquet
in which case you will also need ``cython`` installed, to be able to rebuild the C files.
Usage
-----
Please refer to the documentation_.
*Reading*
.. code-block:: python
from fastparquet import ParquetFile
pf = ParquetFile('myfile.parq')
df = pf.to_pandas()
df2 = pf.to_pandas(['col1', 'col2'], categories=['col1'])
You may specify which columns to load and which of those to keep as categoricals
(if the data uses dictionary encoding). The file path can be a single file,
a metadata file pointing to other data files, or a directory (tree) containing
data files. The latter is what is typically output by Hive/Spark.
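
For larger datasets you can push column selection and row-group filtering into
the read, or stream one row group at a time to bound memory use, via the
``filters=`` argument of ``to_pandas`` and the ``iter_row_groups`` method.
A minimal sketch (the path and column names are illustrative):

.. code-block:: python

    from fastparquet import ParquetFile

    pf = ParquetFile('mydata_dir/')   # a directory of data files, as written by Hive/Spark
    print(pf.columns)                 # column names taken from the footer metadata

    # Row-group filtering: skip segments whose statistics cannot match;
    # filters are (column, op, value) tuples
    df = pf.to_pandas(['col1', 'col2'], filters=[('col1', '>', 0)])

    # Or stream, one pandas DataFrame per row group
    for chunk in pf.iter_row_groups(columns=['col1']):
        print(len(chunk))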
*Writing*
.. code-block:: python
from fastparquet import write
write('outfile.parq', df)
write('outfile2.parq', df, row_group_offsets=[0, 10000, 20000],
compression='GZIP', file_scheme='hive')
The default is to produce a single output file with a single row-group
(i.e., logical segment) and no compression. At the moment, only simple
data-types and plain encoding are supported, so expect performance to be
similar to *numpy.savez*.
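
``write`` can also produce a hive-style partitioned directory and append to an
existing dataset, using its documented ``partition_on`` and ``append`` options.
A sketch (the output path and 'country' column are illustrative):

.. code-block:: python

    from fastparquet import write

    # Hive-style directory output, partitioned by the values of 'country':
    # out_dir/country=<value>/part.0.parquet, plus a _metadata summary file
    write('out_dir', df, file_scheme='hive', partition_on=['country'],
          compression='SNAPPY')

    # Later calls with append=True add further row groups to the same dataset
    write('out_dir', df2, file_scheme='hive', partition_on=['country'],
          compression='SNAPPY', append=True)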
History
-------
This project forked in October 2016 from `parquet-python`_, which was not designed
for vectorised loading of big data or parallel access.
.. _parquet-python: https://github.com/jcrobak/parquet-python