# Databricks SDK for Python (Beta)
[![PyPI - Downloads](https://img.shields.io/pypi/dw/databricks-sdk)](https://pypistats.org/packages/databricks-sdk)
[![PyPI - License](https://img.shields.io/pypi/l/databricks-sdk)](https://github.com/databricks/databricks-sdk-py/blob/main/LICENSE)
[![databricks-sdk](https://snyk.io/advisor/python/databricks-sdk/badge.svg)](https://snyk.io/advisor/python/databricks-sdk)
![PyPI](https://img.shields.io/pypi/v/databricks-sdk)
[![codecov](https://codecov.io/gh/databricks/databricks-sdk-py/branch/main/graph/badge.svg?token=GU63K7WDBE)](https://codecov.io/gh/databricks/databricks-sdk-py)
[Beta](https://docs.databricks.com/release-notes/release-types.html): This SDK is supported for production use cases,
but we do expect future releases to have some interface changes; see [Interface stability](#interface-stability).
We are keen to hear feedback from you on these SDKs. Please [file issues](https://github.com/databricks/databricks-sdk-py/issues), and we will address them.
| See also the [SDK for Java](https://github.com/databricks/databricks-sdk-java)
| See also the [SDK for Go](https://github.com/databricks/databricks-sdk-go)
| See also the [Terraform Provider](https://github.com/databricks/terraform-provider-databricks)
| See also cloud-specific docs ([AWS](https://docs.databricks.com/dev-tools/sdk-python.html),
[Azure](https://learn.microsoft.com/en-us/azure/databricks/dev-tools/sdk-python),
[GCP](https://docs.gcp.databricks.com/dev-tools/sdk-python.html))
| See also the [API reference on readthedocs](https://databricks-sdk-py.readthedocs.io/en/latest/)
The Databricks SDK for Python includes functionality to accelerate development with [Python](https://www.python.org/) for the Databricks Lakehouse.
It covers all public [Databricks REST API](https://docs.databricks.com/dev-tools/api/index.html) operations.
The SDK's internal HTTP client is robust and handles failures on different levels by performing intelligent retries.
## Contents
- [Getting started](#getting-started)
- [Code examples](#code-examples)
- [Authentication](#authentication)
- [Long-running operations](#long-running-operations)
- [Paginated responses](#paginated-responses)
- [Single-sign-on with OAuth](#single-sign-on-sso-with-oauth)
- [Logging](#logging)
- [Interaction with `dbutils`](#interaction-with-dbutils)
- [Interface stability](#interface-stability)
## Getting started<a id="getting-started"></a>
1. Please install Databricks SDK for Python via `pip install databricks-sdk` and instantiate `WorkspaceClient`:
```python
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
for c in w.clusters.list():
    print(c.cluster_name)
```
Databricks SDK for Python is compatible with Python 3.7 _(until [June 2023](https://devguide.python.org/versions/))_, 3.8, 3.9, 3.10, and 3.11.
**Note:** Databricks Runtime starting from version 13.1 includes a bundled version of the Python SDK.
It is highly recommended to upgrade to the latest version which you can do by running the following in a notebook cell:
```python
%pip install --upgrade databricks-sdk
```
followed by
```python
dbutils.library.restartPython()
```
## Code examples<a id="code-examples"></a>
The Databricks SDK for Python comes with a number of examples demonstrating how to use the library for various common use-cases, including
* [Using the SDK with OAuth from a webserver](https://github.com/databricks/databricks-sdk-py/blob/main/examples/flask_app_with_oauth.py)
* [Using long-running operations](https://github.com/databricks/databricks-sdk-py/blob/main/examples/starting_job_and_waiting.py)
* [Authenticating a client app using OAuth](https://github.com/databricks/databricks-sdk-py/blob/main/examples/local_browser_oauth.py)
These examples and more are located in the [`examples/` directory of the GitHub repository](https://github.com/databricks/databricks-sdk-py/tree/main/examples).
Some other examples of using the SDK include:
* [Unity Catalog Automated Migration](https://github.com/databricks/ucx) heavily relies on Python SDK for working with Databricks APIs.
* [ip-access-list-analyzer](https://github.com/alexott/databricks-playground/tree/main/ip-access-list-analyzer) checks & prunes invalid entries from IP Access Lists.
## Authentication<a id="authentication"></a>
If you use Databricks [configuration profiles](https://docs.databricks.com/dev-tools/auth.html#configuration-profiles)
or Databricks-specific [environment variables](https://docs.databricks.com/dev-tools/auth.html#environment-variables)
for [Databricks authentication](https://docs.databricks.com/dev-tools/auth.html), the only code required to start
working with a Databricks workspace is the following code snippet, which instructs the Databricks SDK for Python to use
its [default authentication flow](#default-authentication-flow):
```python
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
w. # press <TAB> for autocompletion
```
The conventional name for the variable that holds the workspace-level client of the Databricks SDK for Python is `w`, which is shorthand for `workspace`.
### In this section
- [Default authentication flow](#default-authentication-flow)
- [Databricks native authentication](#databricks-native-authentication)
- [Azure native authentication](#azure-native-authentication)
- [Google Cloud Platform native authentication](#google-cloud-platform-native-authentication)
- [Overriding .databrickscfg](#overriding-databrickscfg)
- [Additional authentication configuration options](#additional-authentication-configuration-options)
### Default authentication flow
If you run the [Databricks Terraform Provider](https://registry.terraform.io/providers/databrickslabs/databricks/latest),
the [Databricks SDK for Go](https://github.com/databricks/databricks-sdk-go), the [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html),
or applications that target the Databricks SDKs for other languages, most likely they will all interoperate nicely together.
By default, the Databricks SDK for Python tries the following [authentication](https://docs.databricks.com/dev-tools/auth.html) methods,
in the following order, until it succeeds:
1. [Databricks native authentication](#databricks-native-authentication)
2. [Azure native authentication](#azure-native-authentication)
3. [Google Cloud Platform native authentication](#google-cloud-platform-native-authentication)
4. If the SDK is unsuccessful at this point, it returns an authentication error and stops running.
You can instruct the Databricks SDK for Python to use a specific authentication method by setting the `auth_type` argument
as described in the following sections.
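The "try each method in order until one succeeds" behavior can be illustrated with a small, self-contained sketch. Everything here (`pat_auth`, `basic_auth`, `CHAIN`, `authenticate`) is a hypothetical stand-in written for this README, not the SDK's internal implementation; it only assumes that each method either produces request headers or declines.

```python
import base64
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types for illustration; these are not SDK classes.
Config = Dict[str, str]
Headers = Dict[str, str]
Provider = Callable[[Config], Optional[Headers]]


def pat_auth(cfg: Config) -> Optional[Headers]:
    """Token authentication: needs both host and token, otherwise declines."""
    if cfg.get('host') and cfg.get('token'):
        return {'Authorization': f"Bearer {cfg['token']}"}
    return None


def basic_auth(cfg: Config) -> Optional[Headers]:
    """Basic authentication: needs host, username, and password."""
    if cfg.get('host') and cfg.get('username') and cfg.get('password'):
        raw = f"{cfg['username']}:{cfg['password']}".encode('utf-8')
        return {'Authorization': 'Basic ' + base64.b64encode(raw).decode('ascii')}
    return None


# Methods are tried in this order, mirroring "token first, then basic".
CHAIN: List[Tuple[str, Provider]] = [('pat', pat_auth), ('basic', basic_auth)]


def authenticate(cfg: Config, auth_type: Optional[str] = None) -> Tuple[str, Headers]:
    """Return (method name, headers) from the first method that succeeds.

    When auth_type is given, every other method is skipped.
    """
    for name, provider in CHAIN:
        if auth_type is not None and name != auth_type:
            continue
        headers = provider(cfg)
        if headers is not None:
            return name, headers
    raise ValueError('cannot configure default credentials')
```

With both a token and a username/password pair configured, `authenticate(cfg)` picks `'pat'` because it comes first in the chain, while `authenticate(cfg, auth_type='basic')` forces basic authentication.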
For each authentication method, the SDK searches for compatible authentication credentials in the following locations,
in the following order. Once the SDK finds a compatible set of credentials that it can use, it stops searching:
1. Credentials that are hard-coded into configuration arguments.
:warning: **Caution**: Databricks does not recommend hard-coding credentials into arguments, as they can be exposed in plain text in version control systems. Use environment variables or configuration profiles instead.
2. Credentials in Databricks-specific [environment variables](https://docs.databricks.com/dev-tools/auth.html#environment-variables).
3. For Databricks native authentication, credentials in the `.databrickscfg` file's `DEFAULT` [configuration profile](https://docs.databricks.com/dev-tools/auth.html#configuration-profiles) from its default file location (`~` for Linux or macOS, and `%USERPROFILE%` for Windows).
4. For Azure native authentication, the SDK searches for credentials through the Azure CLI as needed.
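The lookup order above (arguments, then environment variables, then the configuration profile) can be sketched as follows. This is a simplified illustration under stated assumptions: `resolve_attribute` and `load_profile` are hypothetical helpers, and the `DATABRICKS_<NAME>` naming rule is a simplification that holds for fields such as `host` and `token` but not for the Azure `ARM_*` variables.

```python
import configparser
from typing import Dict, Mapping, Optional


def load_profile(text: str, profile: str = 'DEFAULT') -> Dict[str, str]:
    """Parse .databrickscfg-style INI text and return the fields of one profile."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser[profile])


def resolve_attribute(name: str,
                      explicit: Mapping[str, str],
                      environ: Mapping[str, str],
                      profile: Mapping[str, str]) -> Optional[str]:
    """Return the first value found; the search stops at the first hit."""
    # 1. Value passed directly as a client argument.
    if explicit.get(name):
        return explicit[name]
    # 2. Databricks-specific environment variable (simplified naming rule).
    env_name = f'DATABRICKS_{name.upper()}'
    if environ.get(env_name):
        return environ[env_name]
    # 3. Field from the selected configuration profile, if present.
    return profile.get(name)


cfg_text = """
[DEFAULT]
host = https://profile.example
token = profile-token
"""
profile = load_profile(cfg_text)
```

In real code you would pass `os.environ` as the `environ` argument; the sketch takes a plain mapping so that the precedence is easy to test in isolation.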
The following sections list, for each authentication method, the `WorkspaceClient` and `AccountClient` arguments (which have corresponding `.databrickscfg` file fields), their descriptions, and any corresponding environment variables.
### Databricks native authentication
By default, the Databricks SDK for Python initially tries [Databricks token authentication](https://docs.databricks.com/dev-tools/api/latest/authentication.html) (`auth_type='pat'` argument). If the SDK is unsuccessful, it then tries Databricks basic (username/password) authentication (`auth_type='basic'` argument).
- For Databricks token authentication, you must provide `host` and `token`; or their environment variable or `.databrickscfg` file field equivalents.
- For Databricks basic authentication, you must provide `host`, `username`, and `password` _(for AWS workspace-level operations)_; or `host`, `account_id`, `username`, and `password` _(for AWS, Azure, or GCP account-level operations)_; or their environment variable or `.databrickscfg` file field equivalents.
| Argument | Description | Environment variable |
|--------------|-------------|-------------------|
| `host` | _(String)_ The Databricks host URL for either the Databricks workspace endpoint or the Databricks accounts endpoint. | `DATABRICKS_HOST` |
| `account_id` | _(String)_ The Databricks account ID for the Databricks accounts endpoint. Only has effect when `host` is either `https://accounts.cloud.databricks.com/` _(AWS)_, `https://accounts.azuredatabricks.net/` _(Azure)_, or `https://accounts.gcp.databricks.com/` _(GCP)_. | `DATABRICKS_ACCOUNT_ID` |
| `token` | _(String)_ The Databricks personal access token (PAT) _(AWS, Azure, and GCP)_ or Azure Active Directory (Azure AD) token _(Azure)_. | `DATABRICKS_TOKEN` |
| `username` | _(String)_ The Databricks username part of basic authentication. Only possible when `host` is `*.cloud.databricks.com` _(AWS)_. | `DATABRICKS_USERNAME` |
| `password` | _(String)_ The Databricks password part of basic authentication. Only possible when `host` is `*.cloud.databricks.com` _(AWS)_. | `DATABRICKS_PASSWORD` |
For example, to use Databricks token authentication:
```python
from databricks.sdk import WorkspaceClient
w = WorkspaceClient(host=input('Databricks Workspace URL: '), token=input('Token: '))
```
### Azure native authentication
By default, the Databricks SDK for Python first tries Azure client secret authentication (`auth_type='azure-client-secret'` argument). If the SDK is unsuccessful, it then tries Azure CLI authentication (`auth_type='azure-cli'` argument). See [Manage service principals](https://learn.microsoft.com/azure/databricks/administration-guide/users-groups/service-principals).
The Databricks SDK for Python picks up an Azure CLI token, if you've previously authenticated as an Azure user by running `az login` on your machine. See [Get Azure AD tokens for users by using the Azure CLI](https://learn.microsoft.com/azure/databricks/dev-tools/api/latest/aad/user-aad-token).
To authenticate as an Azure Active Directory (Azure AD) service principal, you must provide one of the following. See also [Add a service principal to your Azure Databricks account](https://learn.microsoft.com/azure/databricks/administration-guide/users-groups/service-principals#add-sp-account):
- `azure_workspace_resource_id`, `azure_client_secret`, `azure_client_id`, and `azure_tenant_id`; or their environment variable or `.databrickscfg` file field equivalents.
- `azure_workspace_resource_id` and `azure_use_msi`; or their environment variable or `.databrickscfg` file field equivalents.
| Argument | Description | Environment variable |
|-----------------------|-------------|----------------------|
| `azure_workspace_resource_id` | _(String)_ The Azure Resource Manager ID for the Azure Databricks workspace, which is exchanged for a Databricks host URL. | `DATABRICKS_AZURE_RESOURCE_ID` |
| `azure_use_msi` | _(Boolean)_ `true` to use Azure Managed Service Identity passwordless authentication flow for service principals. _This feature is not yet implemented in the Databricks SDK for Python._ | `ARM_USE_MSI` |
| `azure_client_secret` | _(String)_ The Azure AD service principal's client secret. | `ARM_CLIENT_SECRET` |
| `azure_client_id` | _(String)_ The Azure AD service principal's application ID. | `ARM_CLIENT_ID` |
| `azure_tenant_id` | _(String)_ The Azure AD service principal's tenant ID. | `ARM_TENANT_ID` |
| `azure_environment` | _(String)_ The Azure environment type (such as Public, UsGov, China, and Germany) for a specific set of API endpoints. Defaults to `PUBLIC`. | `ARM_ENVIRONMENT` |
For example, to use Azure client secret authentication:
```python
from databricks.sdk import WorkspaceClient
w = WorkspaceClient(host=input('Databricks Workspace URL: '),
                    azure_workspace_resource_id=input('Azure Resource ID: '),
                    azure_tenant_id=input('AAD Tenant ID: '),
                    azure_client_id=input('AAD Client ID: '),
                    azure_client_secret=input('AAD Client Secret: '))
```
Please see more examples in [this document](./docs/azure-ad.md).
### Google Cloud Platform native authentication
By default, the Databricks SDK for Python first tries GCP credentials authentication (`auth_type='google-credentials'` argument). If the SDK is unsuccessful, it then tries Google Cloud Platform (GCP) ID authentication (`auth_type='google-id'` argument).
The Databricks SDK for Python picks up an OAuth token in the scope of the Google Default Application Credentials (DAC) flow. This means that authentication should work out of the box if you have run `gcloud auth application-default login` on your development machine, or if you launch the application on compute that is allowed to impersonate the Google Cloud service account specified in `google_service_account`. See [Creating and managing service accounts](https://cloud.google.com/iam/docs/creating-managing-service-accounts).
To authenticate as a Google Cloud service account, you must provide one of the following:
- `host` and `google_credentials`; or their environment variable or `.databrickscfg` file field equivalents.
- `host` and `google_service_account`; or their environment variable or `.databrickscfg` file field equivalents.
| Argument | Description | Environment variable |
|--------------------------|-------------|--------------------------------------------------|
| `google_credentials` | _(String)_ GCP Service Account Credentials JSON or the location of these credentials on the local filesystem. | `GOOGLE_CREDENTIALS` |
| `google_service_account` | _(String)_ The Google Cloud Platform (GCP) service account e-mail used for impersonation in the Default Application Credentials Flow that does not require a password. | `DATABRICKS_GOOGLE_SERVICE_ACCOUNT` |
For example, to use Google ID authentication:
```python
from databricks.sdk import WorkspaceClient
w = WorkspaceClient(host=input('Databricks Workspace URL: '),
                    google_service_account=input('Google Service Account: '))
```
### Overriding `.databrickscfg`
For [Databricks native authentication](#databricks-native-authentication), you can override the default behavior for using `.databrickscfg` as follows:
| Argument | Description | Environment variable |
|---------------|-------------|----------------------|
| `profile` | _(String)_ A connection profile specified within `.databrickscfg` to use instead of `DEFAULT`. | `DATABRICKS_CONFIG_PROFILE` |
| `config_file` | _(String)_ A non-default location of the Databricks CLI credentials file. | `DATABRICKS_CONFIG_FILE` |
For example, to use a profile named `MYPROFILE` instead of `DEFAULT`:
```python
from databricks.sdk import WorkspaceClient
w = WorkspaceClient(profile='MYPROFILE')
# Now call the Databricks workspace APIs as desired...
```
### Additional authentication configuration options
For all authentication methods, you can override the default behavior in client arguments as follows:
| Argument | Description | Environment variable |
|-------------------------|-------------|------------------------|
| `auth_type` | _(String)_ When multiple auth attributes are available in the environment, use the auth type specified by this argument. This argument also holds the currently selected auth. | `DATABRICKS_AUTH_TYPE` |
| `http_timeout_seconds` | _(Integer)_ Number of seconds for HTTP timeout. Default is _60_. | _(None)_ |
| `retry_timeout_seconds` | _(Integer)_ Number of seconds to keep retrying HTTP requests. Default is _300 (5 minutes)_. | _(None)_ |
| `debug_truncate_bytes` | _(Integer)_ Truncate JSON fields in debug logs above this limit. Default is 96. | `DATABRICKS_DEBUG_TRUNCATE_BYTES` |
| `debug_headers` | _(Boolean)_ `true` to debug HTTP headers of requests made by the application. Default is `false`, as headers contain sensitive data, such as access tokens. | `DATABRICKS_DEBUG_HEADERS` |
| `rate_limit` | _(Integer)_ Maximum number of requests per second made to Databricks REST API. | `DATABRICKS_RATE_LIMIT` |
For example, to turn on debug HTTP headers:
```python
from databricks.sdk import WorkspaceClient
w = WorkspaceClient(debug_headers=True)
# Now call the Databricks workspace APIs as desired...
```
## Long-running operations<a id="long-running-operations"></a>
When you invoke a long-running operation, the SDK provides a high-level API to _trigger_ the operation and _wait_ for the related entity
to reach the correct state, or to return the error message in case of failure. All long-running operations return a generic `Wait` instance with a `result()`
method that returns the result of the operation once it has finished. The Databricks SDK for Python picks reasonable default timeouts for
every method, but you can override them by passing a `datetime.timedelta()` as the `timeout` argument of `result()`.
There are a number of long-running operations in Databricks APIs such as managing:
* Clusters
* Command execution
* Jobs
* Libraries
* Delta Live Tables pipelines
* Databricks SQL warehouses

For example, in the Clusters API, once you create a cluster, you receive a cluster ID, and the cluster is in the `PENDING` state. Meanwhile,
Databricks takes care of provisioning virtual machines from the cloud provider in the background. The cluster is
only usable in the `RUNNING` state and so you have to wait for that state to be reached.
Another example is the API for running a job or repairing the run: right after
the run starts, the run is in the `PENDING` state. The job is only considered to be finished when it is in either
the `TERMINATED` or `SKIPPED` state. Also you would likely need the error message if the long-running
operation times out and fails with an error code. Other times you may want to configure a custom timeout other than
the default of 20 minutes.
In the following example, `w.clusters.create_and_wait` returns `ClusterInfo` only once the cluster is in the `RUNNING` state;
otherwise it times out after 10 minutes:
```python
import datetime
import logging
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
info = w.clusters.create_and_wait(cluster_name='Created cluster',
                                  spark_version='12.0.x-scala2.12',
                                  node_type_id='m5d.large',
                                  autotermination_minutes=10,
                                  num_workers=1,
                                  timeout=datetime.timedelta(minutes=10))
logging.info(f'Created: {info}')
```
Please look at the `examples/starting_job_and_waiting.py` for a more advanced usage:
```python
import datetime
import logging
import time
from databricks.sdk import WorkspaceClient
import databricks.sdk.service.jobs as j
w = WorkspaceClient()
# create a dummy file on DBFS that just sleeps for 10 seconds
py_on_dbfs = f'/home/{w.current_user.me().user_name}/sample.py'
with w.dbfs.open(py_on_dbfs, write=True, overwrite=True) as f:
    f.write(b'import time; time.sleep(10); print("Hello, World!")')

# trigger one-time-run job and get waiter object
waiter = w.jobs.submit(run_name=f'py-sdk-run-{time.time()}', tasks=[
    j.RunSubmitTaskSettings(
        task_key='hello_world',
        new_cluster=j.BaseClusterInfo(
            spark_version=w.clusters.select_spark_version(long_term_support=True),
            node_type_id=w.clusters.select_node_type(local_disk=True),
            num_workers=1
        ),
        spark_python_task=j.SparkPythonTask(
            python_file=f'dbfs:{py_on_dbfs}'
        ),
    )
])
logging.info(f'starting to poll: {waiter.run_id}')

# callback, that receives a polled entity between state updates
def print_status(run: j.Run):
    statuses = [f'{t.task_key}: {t.state.life_cycle_state}' for t in run.tasks]
    logging.info(f'workflow intermediate status: {", ".join(statuses)}')

# If you want to perform polling in a separate thread, process, or service,
# you can use w.jobs.wait_get_run_job_terminated_or_skipped(
#   run_id=waiter.run_id,
#   timeout=datetime.timedelta(minutes=15),
#   callback=print_status) to achieve the same results.
#
# Waiter interface allows for `w.jobs.submit(..).result()` simplicity in
# the scenarios, where you need to block the calling thread for the job to finish.
run = waiter.result(timeout=datetime.timedelta(minutes=15),
                    callback=print_status)
logging.info(f'job finished: {run.run_page_url}')
```
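To make the trigger-then-wait pattern more concrete, here is a minimal, self-contained sketch of a waiter that polls until a target state is reached, reports intermediate states through a callback, and gives up after a timeout. `SketchWait` and its constructor arguments are hypothetical names invented for this illustration; the SDK's actual `Wait` class may be implemented differently.

```python
import datetime
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar('T')


class SketchWait(Generic[T]):
    """Hypothetical stand-in for a waiter object; for illustration only."""

    def __init__(self,
                 poll: Callable[[], T],
                 is_done: Callable[[T], bool],
                 is_failed: Callable[[T], bool]):
        self._poll = poll
        self._is_done = is_done
        self._is_failed = is_failed

    def result(self,
               timeout: datetime.timedelta = datetime.timedelta(minutes=20),
               callback: Optional[Callable[[T], None]] = None,
               interval: float = 0.01) -> T:
        """Poll until done, failed, or the timeout elapses."""
        deadline = time.monotonic() + timeout.total_seconds()
        while time.monotonic() < deadline:
            entity = self._poll()
            if self._is_done(entity):
                return entity
            if self._is_failed(entity):
                raise RuntimeError(f'operation failed: {entity}')
            if callback is not None:
                callback(entity)  # report the intermediate state
            time.sleep(interval)
        raise TimeoutError(f'timed out after {timeout}')


# Simulated backend that reports PENDING twice and then RUNNING.
states = iter(['PENDING', 'PENDING', 'RUNNING'])
seen = []
waiter = SketchWait(poll=lambda: next(states),
                    is_done=lambda s: s == 'RUNNING',
                    is_failed=lambda s: s == 'ERROR')
final = waiter.result(timeout=datetime.timedelta(seconds=5), callback=seen.append)
```

The callback only receives intermediate states, so the terminal state is delivered exclusively through the return value of `result()`.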
## Paginated responses<a id="paginated-responses"></a>
On the platform side, the Databricks APIs have different ways to deal with pagination:
* Some APIs follow the offset-plus-limit pagination
* Some start their offsets from 0 and some from 1
* Some use the cursor-based iteration
* Others just return all results in a single response
The Databricks SDK for Python hides this complexity
under the `Iterator[T]` abstraction, where multi-page results `yield` items one at a time. Python typing helps to auto-complete
the individual item fields.
```python
import logging
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
for repo in w.repos.list():
    logging.info(f'Found repo: {repo.path}')
```
Please look at the `examples/last_job_runs.py` for a more advanced usage:
```python
import logging
from collections import defaultdict
from datetime import datetime, timezone
from databricks.sdk import WorkspaceClient
latest_state = {}
all_jobs = {}
durations = defaultdict(list)
w = WorkspaceClient()
for job in w.jobs.list():
    all_jobs[job.job_id] = job
    for run in w.jobs.list_runs(job_id=job.job_id, expand_tasks=False):
        durations[job.job_id].append(run.run_duration)
        if job.job_id not in latest_state:
            latest_state[job.job_id] = run
            continue
        if run.end_time < latest_state[job.job_id].end_time:
            continue
        latest_state[job.job_id] = run

summary = []
for job_id, run in latest_state.items():
    summary.append({
        'job_name': all_jobs[job_id].settings.name,
        'last_status': run.state.result_state,
        'last_finished': datetime.fromtimestamp(run.end_time/1000, timezone.utc),
        'average_duration': sum(durations[job_id]) / len(durations[job_id])
    })

for line in sorted(summary, key=lambda s: s['last_finished'], reverse=True):
    logging.info(f'Latest: {line}')
```
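As a rough illustration of how different pagination styles can be hidden behind one iterator interface, the sketch below wraps an offset-plus-limit backend and a cursor-based backend in generators. The response shape (`items`, `next_page_token`) and the `fake_*` backends are assumptions made for this example and do not correspond to any specific Databricks endpoint.

```python
from typing import Callable, Dict, Iterator, List, Optional

DATA: List[Dict[str, int]] = [{'id': i} for i in range(5)]


def fake_offset_api(offset: int, limit: int) -> dict:
    """Hypothetical backend using offset-plus-limit pagination."""
    return {'items': DATA[offset:offset + limit]}


def fake_cursor_api(token: Optional[str]) -> dict:
    """Hypothetical backend using cursor-based pagination, two items per page."""
    start = int(token or 0)
    end = start + 2
    return {'items': DATA[start:end],
            'next_page_token': str(end) if end < len(DATA) else None}


def list_offset_limit(fetch: Callable[[int, int], dict], limit: int = 2) -> Iterator[dict]:
    """Yield items page by page until a short (or empty) page is returned."""
    offset = 0
    while True:
        items = fetch(offset, limit).get('items', [])
        yield from items
        if len(items) < limit:
            return
        offset += len(items)


def list_cursor(fetch: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield items and follow next_page_token until it is absent."""
    token = None
    while True:
        page = fetch(token)
        yield from page.get('items', [])
        token = page.get('next_page_token')
        if not token:
            return


offset_ids = [item['id'] for item in list_offset_limit(fake_offset_api)]
cursor_ids = [item['id'] for item in list_cursor(fake_cursor_api)]
```

Callers iterate over both generators with the same `for item in ...` loop, and pages are fetched lazily, only as iteration reaches them.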
## Single-Sign-On (SSO) with OAuth<a id="single-sign-on-sso-with-oauth"></a>
### Authorization Code flow with PKCE
For a regular web app running on a server, it's recommended to use the Authorization Code Flow to obtain an Access Token
and a Refresh Token. This method is considered safe because the Access Token is transmitted directly to the server
hosting the app, without passing through the user's web browser and risking exposure.
To enhance the security of the Authorization Code Flow, the PKCE (Proof Key for Code Exchange) mechanism can be
employed. With PKCE, the calling application generates a secret called the Code Verifier, which is verified by
the authorization server. The app also creates a transform value of the Code Verifier, called the Code Challenge,
and sends it over HTTPS to obtain an Authorization Code. A malicious attacker who intercepts the Authorization Code
cannot exchange it for a token without possessing the Code Verifier.
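The relationship between the Code Verifier and the Code Challenge can be sketched with the standard library alone. This example assumes the `S256` challenge method from RFC 7636 (an unpadded, URL-safe Base64 encoding of the SHA-256 digest of the verifier); the function names are hypothetical, and whether the SDK generates its values in exactly this way is not confirmed by this README.

```python
import base64
import hashlib
import secrets


def make_code_verifier(n_bytes: int = 32) -> str:
    """Create a random, URL-safe Code Verifier (43 characters for 32 bytes)."""
    raw = secrets.token_bytes(n_bytes)
    return base64.urlsafe_b64encode(raw).rstrip(b'=').decode('ascii')


def make_code_challenge(verifier: str) -> str:
    """Derive the S256 Code Challenge from the verifier."""
    digest = hashlib.sha256(verifier.encode('ascii')).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b'=').decode('ascii')


def server_accepts(verifier: str, stored_challenge: str) -> bool:
    """What an authorization server checks when the code is exchanged."""
    return secrets.compare_digest(make_code_challenge(verifier), stored_challenge)


verifier = make_code_verifier()          # kept secret by the application
challenge = make_code_challenge(verifier)  # sent with the authorization request
```

Only the challenge travels with the initial request, so an attacker who captures the Authorization Code still lacks the verifier needed to pass the `server_accepts` check.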
The [presented sample](https://github.com/databricks/databricks-sdk-py/blob/main/examples/flask_app_with_oauth.py)
is a Python3 script that uses the Flask web framework along with Databricks SDK for Python to demonstrate how to
implement the OAuth Authorization Code flow with PKCE security. It can be used to build an app where each user uses
their identity to access Databricks resources. The script can be executed with or without client and secret credentials
for a custom OAuth app.
The Databricks SDK for Python exposes the `oauth_client.initiate_consent()` helper to acquire the user redirect URL and initiate
PKCE state verification. Application developers are expected to persist `SessionCredentials` in the webapp session
and restore it via the `SessionCredentials.from_dict(oauth_client, session['creds'])` helper.
This flow works for both AWS and Azure. It is not supported for GCP at the moment.
```python
from databricks.sdk.oauth import OAuthClient
oauth_client = OAuthClient(host='<workspace-url>',
                           client_id='<oauth client ID>',
                           redirect_url='http://host.domain/callback',
                           scopes=['clusters'])

import secrets
from flask import Flask, render_template_string, request, redirect, url_for, session

APP_NAME = 'flask-demo'
app = Flask(APP_NAME)
app.secret_key = secrets.token_urlsafe(32)


@app.route('/callback')
def callback():
    from databricks.sdk.oauth import Consent
    consent = Consent.from_dict(oauth_client, session['consent'])
    session['creds'] = consent.exchange_callback_parameters(request.args).as_dict()
    return redirect(url_for('index'))


@app.route('/')
def index():
    if 'creds' not in session:
        consent = oauth_client.initiate_consent()
        session['consent'] = consent.as_dict()
        return redirect(consent.auth_url)

    from databricks.sdk import WorkspaceClient
    from databricks.sdk.oauth import SessionCredentials

    credentials_provider = SessionCredentials.from_dict(oauth_client, session['creds'])
    workspace_client = WorkspaceClient(host=oauth_client.host,
                                       product=APP_NAME,
                                       credentials_provider=credentials_provider)
    return render_template_string('...', w=workspace_client)
```
### SSO for local scripts on development machines
For applications that run on developer workstations, the Databricks SDK for Python provides the `auth_type='external-browser'`
utility, which opens a browser for the user to go through the SSO flow. Azure support is still in the early experimental
stage.
```python
from databricks.sdk import WorkspaceClient
host = input('Enter Databricks host: ')
w = WorkspaceClient(host=host, auth_type='external-browser')
clusters = w.clusters.list()
for cl in clusters:
    print(f' - {cl.cluster_name} is {cl.state}')
```
### Creating custom OAuth applications
To use OAuth with the Databricks SDK for Python, register a custom OAuth application with the `account_client.custom_app_integration.create` API.
```python
import logging, getpass
from databricks.sdk import AccountClient
account_client = AccountClient(host='https://accounts.cloud.databricks.com',
                               account_id=input('Databricks Account ID: '),
                               username=input('Username: '),
                               password=getpass.getpass('Password: '))

logging.info('Enrolling all published apps...')
account_client.o_auth_enrollment.create(enable_all_published_apps=True)
status = account_client.o_auth_enrollment.get()
logging.info(f'Enrolled all published apps: {status}')

custom_app = account_client.custom_app_integration.create(
    name='awesome-app',
    redirect_urls=['https://host.domain/path/to/callback'],
    confidential=True)
logging.info(f'Created new custom app: '
             f'--client_id {custom_app.client_id} '
             f'--client_secret {custom_app.client_secret}')
```
## Logging<a id="logging"></a>
The Databricks SDK for Python seamlessly integrates with the standard [Logging facility for Python](https://docs.python.org/3/library/logging.html).
This allows developers to easily enable and customize logging for their Databricks Python projects.
To enable debug logging in your Databricks Python project, you can follow the example below:
```python
import logging, sys
logging.basicConfig(stream=sys.stderr,
                    level=logging.INFO,
                    format='%(asctime)s [%(name)s][%(levelname)s] %(message)s')
logging.getLogger('databricks.sdk').setLevel(logging.DEBUG)

from databricks.sdk import WorkspaceClient
w = WorkspaceClient(debug_truncate_bytes=1024, debug_headers=False)
for cluster in w.clusters.list():
    logging.info(f'Found cluster: {cluster.cluster_name}')
```
In the above code snippet, `basicConfig()` sets the root logging level to `INFO`, and the `databricks.sdk` logger is then set to `DEBUG`.
This enables debug-level output from the SDK while keeping other loggers at `INFO`. Developers can adjust the logging levels as needed to control the verbosity of the logging output.
The SDK will log all requests and responses to standard error output, using the format `> ` for requests and `< ` for responses.
In some cases, requests or responses may be truncated due to size considerations. If this occurs, the log message will include
the text `... (XXX additional elements)` to indicate that the request or response has been truncated. To increase the truncation limits,
developers can set the `debug_truncate_bytes` configuration property or the `DATABRICKS_DEBUG_TRUNCATE_BYTES` environment variable.
To protect sensitive data, such as authentication tokens, passwords, or any HTTP headers, the SDK will automatically replace these
values with `**REDACTED**` in the log output. Developers can disable this redaction by setting the `debug_headers` configuration property to `True`.
```text
2023-03-22 21:19:21,702 [databricks.sdk][DEBUG] GET /api/2.0/clusters/list
< 200 OK
< {
< "clusters": [
< {
< "autotermination_minutes": 60,
< "cluster_id": "1109-115255-s1w13zjj",
< "cluster_name": "DEFAULT Test Cluster",
< ... truncated for brevity
< },
< "... (47 additional elements)"
< ]
< }
```
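The truncation and redaction behavior described above can be approximated with a short sketch. The helper names, the one-element list limit, and the `... (N more bytes)` suffix for long strings are assumptions made for this illustration; only the `**REDACTED**` marker and the `... (XXX additional elements)` text come from the description above, and the SDK's actual logic may differ.

```python
from typing import Any, Dict


def redact_headers(headers: Dict[str, str], debug_headers: bool = False) -> Dict[str, str]:
    """Hide every header value unless header debugging is explicitly enabled."""
    if debug_headers:
        return dict(headers)
    return {name: '**REDACTED**' for name in headers}


def truncate(value: Any, max_bytes: int = 96, max_items: int = 1) -> Any:
    """Recursively shorten long strings and long lists for log output."""
    if isinstance(value, str):
        if len(value) > max_bytes:
            # The suffix format for strings is an assumption of this sketch.
            return value[:max_bytes] + f'... ({len(value) - max_bytes} more bytes)'
        return value
    if isinstance(value, list):
        head = [truncate(v, max_bytes, max_items) for v in value[:max_items]]
        extra = len(value) - max_items
        if extra > 0:
            head.append(f'... ({extra} additional elements)')
        return head
    if isinstance(value, dict):
        return {k: truncate(v, max_bytes, max_items) for k, v in value.items()}
    return value


response = {'clusters': [{'cluster_name': 'DEFAULT Test Cluster'}] + [{}] * 47}
shortened = truncate(response)
safe_headers = redact_headers({'Authorization': 'Bearer secret-token'})
```

Raising `max_bytes` in this sketch plays the same role as raising `debug_truncate_bytes` in the SDK configuration: more of each value survives into the log.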
Overall, the logging capabilities provided by the Databricks SDK for Python can be a powerful tool for monitoring and troubleshooting your
Databricks Python projects. Developers can use the various logging methods and configuration options provided by the SDK to customize
the logging output to their specific needs.
## Interaction with `dbutils`<a id="interaction-with-dbutils"></a>
You can use the client-side implementation of [`dbutils`](https://docs.databricks.com/dev-tools/databricks-utils.html) by accessing the `dbutils` property on the `WorkspaceClient`.
Most of the `dbutils.fs` operations and `dbutils.secrets` are implemented natively in Python within the Databricks SDK. Operations that are not implemented natively still require a Databricks cluster,
which you specify through the `cluster_id` configuration attribute or the `DATABRICKS_CLUSTER_ID` environment variable. The cluster does not need to be running: internally,
the Databricks SDK for Python calls `w.clusters.ensure_cluster_is_running()`.
```python
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
dbutils = w.dbutils
files_in_root = dbutils.fs.ls('/')
print(f'number of files in root: {len(files_in_root)}')
```
Alternatively, you can import `dbutils` from `databricks.sdk.runtime` module, but you have to make sure that all configuration is already [present in the environment variables](#default-authentication-flow):
```python
from databricks.sdk.runtime import dbutils
for secret_scope in dbutils.secrets.listScopes():
    for secret_metadata in dbutils.secrets.list(secret_scope.name):
        print(f'found {secret_metadata.key} secret in {secret_scope.name} scope')
```
## Interface stability<a id="interface-stability"></a>
Databricks is actively working on stabilizing the Databricks SDK for Python's interfaces.
API clients for all services are generated from specification files that are synchronized from the main platform.
You are highly encouraged to pin the exact dependency version and read the [changelog](https://github.com/databricks/databricks-sdk-py/blob/main/CHANGELOG.md),
where Databricks documents the changes. Releases may include minor [documented](https://github.com/databricks/databricks-sdk-py/blob/main/CHANGELOG.md)
backward-incompatible changes, such as renaming some type names for consistency.
Raw data
{
"_id": null,
"home_page": "https://databricks-sdk-py.readthedocs.io",
"name": "databricks-sdk-secure",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": "",
"keywords": "databricks sdk",
"author": "Michael Spece",
"author_email": "Michael@Spece.AI",
"download_url": "https://files.pythonhosted.org/packages/e9/9a/f8f22900f7d5ac3f1612b75363d9c39f59ab96a2b8c054019b0879d4c8dc/databricks-sdk-secure-0.19.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "",
"summary": "Databricks SDK for Python (Beta)",
"version": "0.19.0",
"project_urls": {
"Homepage": "https://databricks-sdk-py.readthedocs.io"
},
"split_keywords": [
"databricks",
"sdk"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "c4688b8adbcb9e9addf727dda4111325beba5487251d48333ba481de66bac299",
"md5": "b242540888084cae6d3de8cec72d3672",
"sha256": "cdc4f62ad7b6e585f7dc2d2d9aeeb607265a04cdfec9a5a765c6166b9fede299"
},
"downloads": -1,
"filename": "databricks_sdk_secure-0.19.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "b242540888084cae6d3de8cec72d3672",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 447895,
"upload_time": "2024-02-11T02:19:00",
"upload_time_iso_8601": "2024-02-11T02:19:00.030269Z",
"url": "https://files.pythonhosted.org/packages/c4/68/8b8adbcb9e9addf727dda4111325beba5487251d48333ba481de66bac299/databricks_sdk_secure-0.19.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "e99af8f22900f7d5ac3f1612b75363d9c39f59ab96a2b8c054019b0879d4c8dc",
"md5": "ac8cc1801929a693b53bbab2990f301b",
"sha256": "42bfe589b6cd8565783affe2f21dd70488963cca61361fb6badbc00524caed69"
},
"downloads": -1,
"filename": "databricks-sdk-secure-0.19.0.tar.gz",
"has_sig": false,
"md5_digest": "ac8cc1801929a693b53bbab2990f301b",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 461029,
"upload_time": "2024-02-11T02:19:02",
"upload_time_iso_8601": "2024-02-11T02:19:02.980260Z",
"url": "https://files.pythonhosted.org/packages/e9/9a/f8f22900f7d5ac3f1612b75363d9c39f59ab96a2b8c054019b0879d4c8dc/databricks-sdk-secure-0.19.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-02-11 02:19:02",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "databricks-sdk-secure"
}