:Name: pytd
:Version: 1.5.1
:Home page: https://github.com/treasure-data/pytd
:Summary: Treasure Data Driver for Python
:Upload time: 2022-12-08 15:45:14
:Author / Maintainer: Treasure Data
:Requires Python: >=3.7, <3.11
:License: Apache License 2.0
pytd
====

|Build status| |PyPI version| |docs status|

**pytd** provides user-friendly interfaces to Treasure Data’s `REST
APIs <https://github.com/treasure-data/td-client-python>`__, `Presto
query
engine <https://docs.treasuredata.com/display/public/PD/About+Presto+Distributed+Query+Engine>`__,
and `Plazma primary
storage <https://www.slideshare.net/treasure-data/td-techplazma>`__.

This seamless connection allows your Python code to efficiently
read/write large volumes of data from/to Treasure Data. Ultimately,
pytd makes your day-to-day data analytics work more productive.

Installation
------------

.. code:: sh

   pip install pytd

Usage
-----

-  `Documentation <https://pytd-doc.readthedocs.io/>`__
-  `Sample usage on Google
   Colaboratory <https://colab.research.google.com/drive/1ps_ChU-H2FvkeNlj1e1fcOebCt4ryN11>`__

Set your `API
key <https://docs.treasuredata.com/display/public/PD/Getting+Your+API+Keys>`__
and
`endpoint <https://docs.treasuredata.com/display/public/PD/Sites+and+Endpoints>`__
in the environment variables ``TD_API_KEY`` and ``TD_API_SERVER``,
respectively, and create a client instance:

.. code:: py

   import pytd

   client = pytd.Client(database='sample_datasets')
   # or, hard-code your API key, endpoint, and/or query engine:
   # >>> pytd.Client(apikey='1/XXX', endpoint='https://api.treasuredata.com/', database='sample_datasets', default_engine='presto')
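
If you prefer to keep everything in code, the environment variables can
also be set programmatically before the client is created. A minimal
sketch, assuming placeholder credential values:

.. code:: py

   import os
   import pytd

   # Placeholder values; substitute your own API key and endpoint.
   os.environ['TD_API_KEY'] = '1/XXX'
   os.environ['TD_API_SERVER'] = 'https://api.treasuredata.com/'

   client = pytd.Client(database='sample_datasets')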

Query in Treasure Data
~~~~~~~~~~~~~~~~~~~~~~

Issue a Presto query and retrieve the result:

.. code:: py

   client.query('select symbol, count(1) as cnt from nasdaq group by 1 order by 1')
   # {'columns': ['symbol', 'cnt'], 'data': [['AAIT', 590], ['AAL', 82], ['AAME', 9252], ..., ['ZUMZ', 2364]]}
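
Since the result is a plain dictionary holding column names and rows, it
can be converted into a ``pandas.DataFrame`` in one line. A minimal
sketch:

.. code:: py

   import pandas as pd

   res = client.query('select symbol, count(1) as cnt from nasdaq group by 1 order by 1')
   # Build a DataFrame from the 'columns' and 'data' entries of the result.
   df = pd.DataFrame(res['data'], columns=res['columns'])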

To query via Hive instead:

.. code:: py

   client.query('select hivemall_version()', engine='hive')
   # {'columns': ['_c0'], 'data': [['0.6.0-SNAPSHOT-201901-r01']]} (as of Feb, 2019)

It is also possible to explicitly initialize ``pytd.Client`` for Hive:

.. code:: py

   client_hive = pytd.Client(database='sample_datasets', default_engine='hive')
   client_hive.query('select hivemall_version()')

Write data to Treasure Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Data represented as ``pandas.DataFrame`` can be written to Treasure Data
as follows:

.. code:: py

   import pandas as pd

   df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 10]})
   client.load_table_from_dataframe(df, 'takuti.foo', writer='bulk_import', if_exists='overwrite')

For the ``writer`` option, pytd supports three different ways to ingest
data into Treasure Data:

1. **Bulk Import API**: ``bulk_import`` (default)

   -  Converts data into a CSV file and uploads it in batch fashion.

2. **Presto INSERT INTO query**: ``insert_into``

   -  Inserts the rows of the ``DataFrame`` one by one by issuing INSERT
      INTO queries through the Presto query engine.
   -  Recommended only for small volumes of data.

3. `td-spark <https://treasure-data.github.io/td-spark/>`__:
   ``spark``

   -  A locally customized Spark instance writes the ``DataFrame``
      directly to Treasure Data’s primary storage system.

Characteristics of each of these methods can be summarized as follows:

+-----------------------------------+------------------+------------------+-----------+
|                                   | ``bulk_import``  | ``insert_into``  | ``spark`` |
+===================================+==================+==================+===========+
| Scalable against data volume      |        ✓         |                  |     ✓     |
+-----------------------------------+------------------+------------------+-----------+
| Write performance for larger data |                  |                  |     ✓     |
+-----------------------------------+------------------+------------------+-----------+
| Memory efficient                  |        ✓         |        ✓         |           |
+-----------------------------------+------------------+------------------+-----------+
| Disk efficient                    |                  |        ✓         |           |
+-----------------------------------+------------------+------------------+-----------+
| Minimal package dependency        |        ✓         |        ✓         |           |
+-----------------------------------+------------------+------------------+-----------+
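
For instance, the table suggests ``insert_into`` as a dependency-free
choice for a handful of rows. A minimal sketch, assuming a placeholder
table name:

.. code:: py

   # Each row is sent as a Presto INSERT INTO query, so keep the DataFrame small.
   client.load_table_from_dataframe(df, 'mydb.small_table', writer='insert_into', if_exists='append')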

Enabling Spark Writer
^^^^^^^^^^^^^^^^^^^^^

Since td-spark gives special access to the primary storage system via
`PySpark <https://spark.apache.org/docs/latest/api/python/index.html>`__,
two extra steps are required:

1. Contact support@treasuredata.com to activate the permission for your
   Treasure Data account. Note that the underlying component, Plazma Public
   API, limits its free tier to 100 GB of reads and 100 TB of writes.
2. Install pytd with the ``[spark]`` extra:
   ``pip install pytd[spark]``
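
Once both steps are done, the Spark-based writer can be selected by
name, just like the other writers. A minimal sketch, assuming a
placeholder table name:

.. code:: py

   # Requires `pip install pytd[spark]` and td-spark access on the account.
   client.load_table_from_dataframe(df, 'mydb.bar', writer='spark', if_exists='overwrite')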

If you want to use an existing td-spark JAR file, create a
``SparkWriter`` with the ``td_spark_path`` option:

.. code:: py

   from pytd.writer import SparkWriter

   writer = SparkWriter(td_spark_path='/path/to/td-spark-assembly.jar')
   client.load_table_from_dataframe(df, 'mydb.bar', writer=writer, if_exists='overwrite')

Comparison between pytd, td-client-python, and pandas-td
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Treasure Data offers three different Python clients on GitHub, and the following list summarizes their characteristics.

1. `td-client-python <https://github.com/treasure-data/td-client-python>`__

   - Basic REST API wrapper.
   - Similar functionalities to td-client-{`ruby <https://github.com/treasure-data/td-client-ruby>`__, `java <https://github.com/treasure-data/td-client-java>`__, `node <https://github.com/treasure-data/td-client-node>`__, `go <https://github.com/treasure-data/td-client-go>`__}.
   - Its capability is limited to `what the Treasure Data REST API can do <https://docs.treasuredata.com/display/public/PD/REST+APIs+in+Treasure+Data>`__.

2. **pytd**

   - Access to Plazma via td-spark as introduced above.
   - Efficient connection to Presto based on `presto-python-client <https://github.com/prestodb/presto-python-client>`__.
   - Multiple data ingestion methods and a variety of utility functions.

3. `pandas-td <https://github.com/treasure-data/pandas-td>`__ *(deprecated)*

   - Old tool optimized for `pandas <https://pandas.pydata.org>`__ and `Jupyter Notebook <https://jupyter.org>`__.
   - **pytd** offers a compatible function set (see below for details).

The optimal choice of package depends on your specific use case, but some common guidelines are:

- Use td-client-python if you want to execute *basic CRUD operations* from Python applications.
- Use **pytd** for (1) *analytical purposes* relying on pandas and Jupyter Notebook, and (2) achieving *more efficient data access* with ease.
- Do not use pandas-td. If you are still using pandas-td, migrate to pytd following the guidance below as soon as possible.

How to replace pandas-td
^^^^^^^^^^^^^^^^^^^^^^^^

**pytd** offers
`pandas-td <https://github.com/treasure-data/pandas-td>`__-compatible
functions that provide the same functionality more efficiently. If you
are still using pandas-td, we recommend switching to **pytd** as
follows.

First, install the package from PyPI:

.. code:: sh

   pip install pytd
   # or, `pip install pytd[spark]` if you wish to use `to_td`

Next, make the following modifications to the import statements.

*Before:*

.. code:: python

   import pandas_td as td

.. code:: python

   In [1]: %load_ext pandas_td.ipython

*After:*

.. code:: python

   import pytd.pandas_td as td

.. code:: python

   In [1]: %load_ext pytd.pandas_td.ipython

Consequently, all ``pandas_td`` code should keep running correctly with
``pytd``. Report an issue
`here <https://github.com/treasure-data/pytd/issues/new>`__ if you
notice any incompatible behavior.
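
For reference, here is a minimal sketch of the compatible interface
after the switch, assuming ``TD_API_KEY`` and ``TD_API_SERVER`` are set
in the environment (the engine URL and query are illustrative):

.. code:: python

   import pytd.pandas_td as td

   # Build a connection from the environment variables.
   con = td.connect()
   # Point an engine at the Presto query engine and a target database.
   engine = td.create_engine('presto:sample_datasets', con=con)
   # Run a query and receive the result as a pandas.DataFrame.
   df = td.read_td('select symbol, count(1) as cnt from nasdaq group by 1', engine)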

.. |Build status| image:: https://github.com/treasure-data/pytd/workflows/Build/badge.svg
   :target: https://github.com/treasure-data/pytd/actions/
.. |PyPI version| image:: https://badge.fury.io/py/pytd.svg
   :target: https://badge.fury.io/py/pytd
.. |docs status| image:: https://readthedocs.org/projects/pytd-doc/badge/?version=latest
   :target: https://pytd-doc.readthedocs.io/en/latest/?badge=latest

            
