fastparquet
===========

:Name: fastparquet
:Version: 2023.10.1
:Home page: https://github.com/dask/fastparquet/
:Summary: Python support for Parquet file format
:Author: Martin Durant
:Requires Python: >=3.8
:License: Apache License 2.0
:Upload time: 2023-10-26 18:44:09

.. image:: https://github.com/dask/fastparquet/actions/workflows/main.yaml/badge.svg
    :target: https://github.com/dask/fastparquet/actions/workflows/main.yaml

.. image:: https://readthedocs.org/projects/fastparquet/badge/?version=latest
    :target: https://fastparquet.readthedocs.io/en/latest/

fastparquet is a Python implementation of the `parquet
format <https://github.com/apache/parquet-format>`_, aiming to integrate
into Python-based big-data workflows. It is used implicitly by
Dask, pandas and intake-parquet.

We offer a high degree of support for the features of the parquet format, and
very competitive performance, in a small install size and codebase.

Details of this project, how to use it and comparisons to other work can be found in the documentation_.

.. _documentation: https://fastparquet.readthedocs.io

Requirements
------------

(all development is against recent versions in the default anaconda channels
and/or conda-forge)

Required:

- numpy
- pandas
- cython >= 0.29.23 (if building from pyx files)
- cramjam
- fsspec

Supported compression algorithms:

- Available by default:

  - gzip
  - snappy
  - brotli
  - lz4
  - zstandard

- Optionally supported:

  - `lzo <https://github.com/jd-boyd/python-lzo>`_
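
Per-column compression is also possible: as I read the ``write`` API, ``compression`` accepts a dict mapping column names to codecs, with a ``"_default"`` key covering the remaining columns. A self-contained sketch (the frame, path, and codec choices here are illustrative):

.. code-block:: python

    import os
    import tempfile

    import pandas as pd
    from fastparquet import ParquetFile, write

    df = pd.DataFrame({'ints': list(range(1000)), 'text': ['spam'] * 1000})
    path = os.path.join(tempfile.mkdtemp(), 'compressed.parq')

    # A dict maps column names to codecs; "_default" covers the rest
    write(path, df, compression={'text': 'gzip', '_default': 'snappy'})

    round_trip = ParquetFile(path).to_pandas()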


Installation
------------

Install using conda, to get the latest compiled version::

   conda install -c conda-forge fastparquet

or install from PyPI::

   pip install fastparquet

You may wish to install numpy first, to help pip's resolver.
This may install an appropriate wheel, or compile from source. For the latter,
you will need a suitable C compiler toolchain on your system.

You can also install the latest version from GitHub::

   pip install git+https://github.com/dask/fastparquet

in which case you will also need ``cython`` in order to rebuild the C files.

Usage
-----

Please refer to the documentation_.

*Reading*

.. code-block:: python

    from fastparquet import ParquetFile
    pf = ParquetFile('myfile.parq')
    df = pf.to_pandas()
    df2 = pf.to_pandas(['col1', 'col2'], categories=['col1'])

You may specify which columns to load and which of those to keep as
categoricals (if the data uses dictionary encoding). The file path can be a
single file, a metadata file pointing to other data files, or a directory
(tree) containing data files. The latter is what is typically output by
Hive/Spark.
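
For data that does not fit in memory, row-groups can be processed one at a time with ``iter_row_groups``, which yields one pandas DataFrame per row-group. A self-contained sketch (the frame, path, and row-group split are illustrative):

.. code-block:: python

    import os
    import tempfile

    import pandas as pd
    from fastparquet import ParquetFile, write

    # Build a small frame and write it so the example is self-contained
    df = pd.DataFrame({'col1': ['a', 'b', 'a', 'c'] * 25, 'col2': range(100)})
    path = os.path.join(tempfile.mkdtemp(), 'example.parq')
    write(path, df, row_group_offsets=[0, 50])  # two row-groups of 50 rows

    pf = ParquetFile(path)
    # Stream one DataFrame per row-group instead of loading the whole file
    sizes = [len(chunk) for chunk in pf.iter_row_groups(columns=['col1'])]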

*Writing*

.. code-block:: python

    from fastparquet import write
    write('outfile.parq', df)
    write('outfile2.parq', df, row_group_offsets=[0, 10000, 20000],
          compression='GZIP', file_scheme='hive')

The default is to produce a single output file with a single row-group
(i.e., logical segment) and no compression. At the moment, only simple
data-types and plain encoding are supported, so expect performance to be
similar to *numpy.savez*.
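
With ``file_scheme='hive'``, later calls can grow the same dataset: ``partition_on`` splits the data into per-value directories, and ``append=True`` adds new row-groups. A self-contained sketch (the frame and directory name are illustrative):

.. code-block:: python

    import os
    import tempfile

    import pandas as pd
    from fastparquet import ParquetFile, write

    outdir = os.path.join(tempfile.mkdtemp(), 'dataset.parq')
    df = pd.DataFrame({'key': ['x', 'y', 'x', 'y'], 'value': [1, 2, 3, 4]})

    # hive scheme: a directory with a _metadata file plus per-partition data files
    write(outdir, df, file_scheme='hive', partition_on=['key'])

    # a second call with append=True adds new row-groups to the same dataset
    write(outdir, df, file_scheme='hive', partition_on=['key'], append=True)

    combined = ParquetFile(outdir).to_pandas()

Note that partition columns are read back as categoricals, since each data file stores only one value of them.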

History
-------

This project forked in October 2016 from `parquet-python`_, which was not designed
for vectorised loading of big data or parallel access.

.. _parquet-python: https://github.com/jcrobak/parquet-python


            
