aws-cdk.aws-glue-alpha

Name	aws-cdk.aws-glue-alpha
Version	2.179.0a0
home_page	https://github.com/aws/aws-cdk
Summary	The CDK Construct Library for AWS::Glue
upload_time	2025-02-18 00:35:09
maintainer	None
docs_url	None
author	Amazon Web Services
requires_python	~=3.8
license	Apache-2.0
requirements	No requirements were recorded.

# AWS Glue Construct Library

<!--BEGIN STABILITY BANNER-->---


![cdk-constructs: Experimental](https://img.shields.io/badge/cdk--constructs-experimental-important.svg?style=for-the-badge)

> The APIs of higher level constructs in this module are experimental and under active development.
> They are subject to non-backward compatible changes or removal in any future version. These are
> not subject to the [Semantic Versioning](https://semver.org/) model and breaking changes will be
> announced in the release notes. This means that while you may use them, you may need to update
> your source code when upgrading to a newer version of this package.

---
<!--END STABILITY BANNER-->

This module is part of the [AWS Cloud Development Kit](https://github.com/aws/aws-cdk) project.

## README

[AWS Glue](https://aws.amazon.com/glue/) is a serverless data integration
service that makes it easier to discover, prepare, move, and integrate data
from multiple sources for analytics, machine learning (ML), and application
development.

The Glue L2 construct has convenience methods working backwards from common
use cases and sets required parameters to defaults that align with recommended
best practices for each job type. It also balances flexibility, via optional
parameter overrides, with opinionated interfaces that discourage anti-patterns,
resulting in reduced time to develop and deploy new resources.

### References

* [Glue Launch Announcement](https://aws.amazon.com/blogs/aws/launch-aws-glue-now-generally-available/)
* [Glue Documentation](https://docs.aws.amazon.com/glue/index.html)
* [Glue L1 (CloudFormation) Constructs](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/AWS_Glue.html)
* Prior version of the [@aws-cdk/aws-glue-alpha module](https://github.com/aws/aws-cdk/blob/v2.51.1/packages/%40aws-cdk/aws-glue/README.md)

## Create a Glue Job

A Job encapsulates a script that connects to data sources, processes
them, and then writes output to a data target. There are four types of Glue
Jobs: Spark (ETL and Streaming), Python Shell, Ray, and Flex Jobs. Most
of the required parameters for these jobs are common across all types,
but there are a few differences depending on the languages supported
and features provided by each type. For all job types, the L2 defaults
to AWS best practice recommendations, such as:

* Use of Secrets Manager for Connection JDBC strings
* Glue job autoscaling
* Default parameter values for Glue job creation

This iteration of the L2 construct introduces breaking changes to
the existing glue-alpha module, but these changes streamline the developer
experience, introduce new constants for defaults, and replace synth-time
validations with interface contracts that enforce the parameter combinations
that Glue supports. As an opinionated construct, the Glue L2 construct does
not allow developers to create resources that use non-current versions
of Glue or deprecated language dependencies (e.g. deprecated versions of Python).
As always, L1s allow you to specify a wider range of parameters if you need
or want to use alternative configurations.

Optional and required parameters for each job are enforced via interface
rather than validation; see [Glue's public documentation](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api.html)
for more granular details.

### Spark Jobs

**ETL Jobs**

ETL jobs support the pySpark and Scala languages, for which there are separate but
similar constructors. ETL jobs default to the G2 worker type, but you can
override this default with other supported worker type values (G1, G2, G4
and G8). ETL jobs default to Glue version 4.0, which you can override to 3.0.
The following ETL features are enabled by default:
`--enable-metrics`, `--enable-spark-ui`, `--enable-continuous-cloudwatch-log`.
You can find more details about versions, worker types, and other features in
[Glue's public documentation](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html).
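
The interplay between best-practice defaults and optional overrides can be sketched in plain Python. This is a minimal model rather than the construct's implementation: `SPARK_ETL_DEFAULTS`, `resolve_job_settings`, the dictionary layout, and the argument values are hypothetical. Only the flag names, the default worker type, and the default Glue version are taken from the paragraph above.

```python
# Hypothetical model of "defaults plus optional overrides" for a Spark ETL job.
# The names and argument values below are assumptions for illustration.
SPARK_ETL_DEFAULTS = {
    "glue_version": "4.0",
    "worker_type": "G_2X",
    "default_arguments": {
        "--enable-metrics": "",
        "--enable-spark-ui": "true",
        "--enable-continuous-cloudwatch-log": "true",
    },
}


def resolve_job_settings(overrides=None):
    """Return the defaults with caller-supplied overrides applied on top."""
    overrides = overrides or {}
    # Start from the scalar defaults, then apply scalar overrides.
    settings = {
        key: value
        for key, value in SPARK_ETL_DEFAULTS.items()
        if key != "default_arguments"
    }
    settings.update(
        {key: value for key, value in overrides.items() if key != "default_arguments"}
    )
    # Merge arguments separately so overriding one flag keeps the others.
    settings["default_arguments"] = {
        **SPARK_ETL_DEFAULTS["default_arguments"],
        **overrides.get("default_arguments", {}),
    }
    return settings


required_only = resolve_job_settings()
customized = resolve_job_settings({"glue_version": "3.0"})
```

In this sketch, `customized` changes only the Glue version and still carries the default worker type and all three default flags.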

Reference the pyspark-etl-jobs.test.ts and scalaspark-etl-jobs.test.ts unit tests
for examples of required-only and optional job parameters when creating these
types of jobs.

For the sake of brevity, examples are shown using the pySpark job variety.

Example with only required parameters:

```python
import aws_cdk as cdk
import aws_cdk.aws_iam as iam
import aws_cdk.aws_glue_alpha as glue
# stack: cdk.Stack
# role: iam.IRole
# script: glue.Code

glue.PySparkEtlJob(stack, "PySparkETLJob",
    role=role,
    script=script,
    job_name="PySparkETLJob"
)
```

Example with optional override parameters:

```python
import aws_cdk as cdk
import aws_cdk.aws_iam as iam
# stack: cdk.Stack
# role: iam.IRole
# script: glue.Code

glue.PySparkEtlJob(stack, "PySparkETLJob",
    job_name="PySparkETLJobCustomName",
    description="This is a description",
    role=role,
    script=script,
    glue_version=glue.GlueVersion.V3_0,
    continuous_logging=glue.ContinuousLoggingProps(enabled=False),
    worker_type=glue.WorkerType.G_2X,
    max_concurrent_runs=100,
    timeout=cdk.Duration.hours(2),
    connections=[glue.Connection.from_connection_name(stack, "Connection", "connectionName")],
    security_configuration=glue.SecurityConfiguration.from_security_configuration_name(stack, "SecurityConfig", "securityConfigName"),
    tags={
        "FirstTagName": "FirstTagValue",
        "SecondTagName": "SecondTagValue",
        "XTagName": "XTagValue"
    },
    number_of_workers=2,
    max_retries=2
)
```

**Streaming Jobs**

Streaming jobs are similar to ETL jobs, except that they perform ETL on data
streams using the Apache Spark Structured Streaming framework. Some Spark
job features are not available to Streaming ETL jobs. They support the Scala
and pySpark languages. PySpark streaming jobs default to Python 3.9,
which you can override with any non-deprecated version of Python. Streaming
jobs default to the G2 worker type and Glue 4.0, both of which you can override.
The following best practice features are enabled by default:
`--enable-metrics`, `--enable-spark-ui`, `--enable-continuous-cloudwatch-log`.

Reference the pyspark-streaming-jobs.test.ts and scalaspark-streaming-jobs.test.ts
unit tests for examples of required-only and optional job parameters when creating
these types of jobs.

Example with only required parameters:

```python
import aws_cdk as cdk
import aws_cdk.aws_iam as iam
# stack: cdk.Stack
# role: iam.IRole
# script: glue.Code

glue.PySparkStreamingJob(stack, "ImportedJob", role=role, script=script)
```

Example with optional override parameters:

```python
import aws_cdk as cdk
import aws_cdk.aws_iam as iam
# stack: cdk.Stack
# role: iam.IRole
# script: glue.Code

glue.PySparkStreamingJob(stack, "PySparkStreamingJob",
    job_name="PySparkStreamingJobCustomName",
    description="This is a description",
    role=role,
    script=script,
    glue_version=glue.GlueVersion.V3_0,
    continuous_logging=glue.ContinuousLoggingProps(enabled=False),
    worker_type=glue.WorkerType.G_2X,
    max_concurrent_runs=100,
    timeout=cdk.Duration.hours(2),
    connections=[glue.Connection.from_connection_name(stack, "Connection", "connectionName")],
    security_configuration=glue.SecurityConfiguration.from_security_configuration_name(stack, "SecurityConfig", "securityConfigName"),
    tags={
        "FirstTagName": "FirstTagValue",
        "SecondTagName": "SecondTagValue",
        "XTagName": "XTagValue"
    },
    number_of_workers=2,
    max_retries=2
)
```

**Flex Jobs**

The flexible execution class is appropriate for non-urgent jobs such as
pre-production jobs, testing, and one-time data loads. Flexible jobs default
to Glue version 3.0 and worker type `G_2X`. The following best practice
features are enabled by default:
`--enable-metrics`, `--enable-spark-ui`, `--enable-continuous-cloudwatch-log`.

Reference the pyspark-flex-etl-jobs.test.ts and scalaspark-flex-etl-jobs.test.ts
unit tests for examples of required-only and optional job parameters when creating
these types of jobs.

Example with only required parameters:

```python
import aws_cdk as cdk
import aws_cdk.aws_iam as iam
# stack: cdk.Stack
# role: iam.IRole
# script: glue.Code

glue.PySparkFlexEtlJob(stack, "ImportedJob", role=role, script=script)
```

Example with optional override parameters:

```python
import aws_cdk as cdk
import aws_cdk.aws_iam as iam
# stack: cdk.Stack
# role: iam.IRole
# script: glue.Code

glue.PySparkFlexEtlJob(stack, "PySparkFlexEtlJob",
    job_name="PySparkFlexEtlJob",
    description="This is a description",
    role=role,
    script=script,
    glue_version=glue.GlueVersion.V3_0,
    continuous_logging=glue.ContinuousLoggingProps(enabled=False),
    worker_type=glue.WorkerType.G_2X,
    max_concurrent_runs=100,
    timeout=cdk.Duration.hours(2),
    connections=[glue.Connection.from_connection_name(stack, "Connection", "connectionName")],
    security_configuration=glue.SecurityConfiguration.from_security_configuration_name(stack, "SecurityConfig", "securityConfigName"),
    tags={
        "FirstTagName": "FirstTagValue",
        "SecondTagName": "SecondTagValue",
        "XTagName": "XTagValue"
    },
    number_of_workers=2,
    max_retries=2
)
```

### Python Shell Jobs

Python shell jobs support a Python version that depends on the AWS Glue
version you use. These can be used to schedule and run tasks that don't
require an Apache Spark environment. Python shell jobs default to
Python 3.9 and a MaxCapacity of `0.0625`. Python 3.9 supports pre-loaded
analytics libraries using the `library-set=analytics` flag, which is
enabled by default.

Reference the pyspark-shell-job.test.ts unit tests for examples of
required-only and optional job parameters when creating these types of jobs.

Example with only required parameters:

```python
import aws_cdk as cdk
import aws_cdk.aws_iam as iam
# stack: cdk.Stack
# role: iam.IRole
# script: glue.Code

glue.PythonShellJob(stack, "ImportedJob", role=role, script=script)
```

Example with optional override parameters:

```python
import aws_cdk as cdk
import aws_cdk.aws_iam as iam
# stack: cdk.Stack
# role: iam.IRole
# script: glue.Code

glue.PythonShellJob(stack, "PythonShellJob",
    job_name="PythonShellJobCustomName",
    description="This is a description",
    python_version=glue.PythonVersion.TWO,
    max_capacity=glue.MaxCapacity.DPU_1,
    role=role,
    script=script,
    glue_version=glue.GlueVersion.V2_0,
    continuous_logging=glue.ContinuousLoggingProps(enabled=False),
    worker_type=glue.WorkerType.G_2X,
    max_concurrent_runs=100,
    timeout=cdk.Duration.hours(2),
    connections=[glue.Connection.from_connection_name(stack, "Connection", "connectionName")],
    security_configuration=glue.SecurityConfiguration.from_security_configuration_name(stack, "SecurityConfig", "securityConfigName"),
    tags={
        "FirstTagName": "FirstTagValue",
        "SecondTagName": "SecondTagValue",
        "XTagName": "XTagValue"
    },
    number_of_workers=2,
    max_retries=2
)
```

### Ray Jobs

Glue Ray jobs use worker type Z.2X and Glue version 4.0. These cannot be
overridden, since this is the only configuration that Glue Ray jobs
currently support. The runtime defaults to Ray 2.4 and the minimum number of
workers defaults to 3.

Reference the ray-job.test.ts unit tests for examples of required-only and
optional job parameters when creating these types of jobs.

Example with only required parameters:

```python
import aws_cdk as cdk
import aws_cdk.aws_iam as iam
# stack: cdk.Stack
# role: iam.IRole
# script: glue.Code

glue.RayJob(stack, "ImportedJob", role=role, script=script)
```

Example with optional override parameters:

```python
import aws_cdk as cdk
import aws_cdk.aws_iam as iam
# stack: cdk.Stack
# role: iam.IRole
# script: glue.Code

glue.RayJob(stack, "ImportedJob",
    role=role,
    script=script,
    job_name="RayCustomJobName",
    description="This is a description",
    worker_type=glue.WorkerType.Z_2X,
    number_of_workers=5,
    runtime=glue.Runtime.RAY_TWO_FOUR,
    max_retries=3,
    max_concurrent_runs=100,
    timeout=cdk.Duration.hours(2),
    connections=[glue.Connection.from_connection_name(stack, "Connection", "connectionName")],
    security_configuration=glue.SecurityConfiguration.from_security_configuration_name(stack, "SecurityConfig", "securityConfigName"),
    tags={
        "FirstTagName": "FirstTagValue",
        "SecondTagName": "SecondTagValue",
        "XTagName": "XTagValue"
    }
)
```

### Enable Job Run Queuing

AWS Glue job queuing monitors your account level quotas and limits. If quotas or limits are insufficient to start a Glue job run, AWS Glue will automatically queue the job and wait for limits to free up. Once limits become available, AWS Glue will retry the job run. Glue jobs will queue for limits like max concurrent job runs per account, max concurrent Data Processing Units (DPU), and resource unavailable due to IP address exhaustion in Amazon Virtual Private Cloud (Amazon VPC).

Enable job run queuing by setting the `job_run_queuing_enabled` property to `True`.

```python
import aws_cdk as cdk
import aws_cdk.aws_iam as iam
# stack: cdk.Stack
# role: iam.IRole
# script: glue.Code

glue.PySparkEtlJob(stack, "PySparkETLJob",
    role=role,
    script=script,
    job_name="PySparkETLJob",
    job_run_queuing_enabled=True
)
```

### Uploading scripts from the CDK app repository to S3

Similar to other L2 constructs, the Glue L2 automates uploading and updating
scripts in S3 via an optional fromAsset parameter pointing to a script
in the local file structure. You provide the existing S3 bucket and
path to which you'd like the script to be uploaded.

Reference the unit tests for examples of both repository and S3 code targets.

### Workflow Triggers

You can use Glue workflows to create and visualize complex
extract, transform, and load (ETL) activities involving multiple crawlers,
jobs, and triggers. Standalone triggers are an anti-pattern, so you must
create triggers from within a workflow using the L2 construct.

Within a workflow object, there are functions to create different
types of triggers with actions and predicates. You then add those triggers
to jobs.

StartOnCreation defaults to true for all trigger types, but you can
override it if you prefer for your trigger not to start on creation.

Reference the workflow-triggers.test.ts unit tests for examples of creating
workflows and triggers.

#### **1. On-Demand Triggers**

On-demand triggers can start Glue jobs or crawlers. This construct provides
convenience functions to create on-demand crawler or job triggers. The constructor
takes an optional description parameter and uses conditional types to build the
required actions list from the job or crawler objects you pass in.

#### **2. Scheduled Triggers**

You can create scheduled triggers using cron expressions. This construct
provides daily, weekly, and monthly convenience functions,
as well as a custom function that allows you to create your own
custom timing using the [existing event Schedule class](https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_events.Schedule.html)
without having to build your own cron expressions. The L2 extracts
the expression that Glue requires from the Schedule object. The constructor
takes an optional description and a list of jobs or crawlers as actions.
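
To make the schedule expressions concrete, the sketch below builds six-field cron strings of the form `cron(minute hour day-of-month month day-of-week year)`. The helper names and the chosen defaults (midnight, Sunday, first of the month) are assumptions for illustration. They are not the L2's convenience functions, and the L2's own default times are not stated here.

```python
# Hypothetical helpers that build six-field cron strings for scheduled triggers.
# Field order: minute hour day-of-month month day-of-week year.


def daily(hour=0, minute=0):
    """Run once a day at the given time."""
    return f"cron({minute} {hour} * * ? *)"


def weekly(day_of_week="SUN", hour=0, minute=0):
    """Run once a week; day-of-month is '?' because day-of-week is set."""
    return f"cron({minute} {hour} ? * {day_of_week} *)"


def monthly(day_of_month=1, hour=0, minute=0):
    """Run once a month; day-of-week is '?' because day-of-month is set."""
    return f"cron({minute} {hour} {day_of_month} * ? *)"


print(daily())               # cron(0 0 * * ? *)
print(weekly("MON", 6, 30))  # cron(30 6 ? * MON *)
print(monthly(15))           # cron(0 0 15 * ? *)
```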

#### **3. Notify Event Triggers**

There are two types of notify event triggers: batching and non-batching.
For batching triggers, you must specify `BatchSize`. For non-batching
triggers, `BatchSize` defaults to 1. For both triggers, `BatchWindow`
defaults to 900 seconds, but you can override the window to align with
your workload's requirements.
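
The defaulting rules above can be modelled in a few lines. This is a sketch under stated assumptions: `BatchCondition` and `resolve_batch_condition` are hypothetical names, not part of the construct's API. Only the default values (1 and 900 seconds) and the rule that batching triggers must specify a batch size come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchCondition:
    """Hypothetical container for the resolved batching settings."""
    batch_size: int
    batch_window_seconds: int


def resolve_batch_condition(batching, batch_size=None, batch_window_seconds=900):
    """Apply the defaults described above for notify event triggers."""
    if batching:
        # Batching triggers have no default batch size.
        if batch_size is None:
            raise ValueError("batch_size is required for batching triggers")
    elif batch_size is None:
        # Non-batching triggers fire on every event.
        batch_size = 1
    return BatchCondition(batch_size, batch_window_seconds)


non_batching = resolve_batch_condition(batching=False)
batching = resolve_batch_condition(batching=True, batch_size=10, batch_window_seconds=60)
```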

#### **4. Conditional Triggers**

Conditional triggers have a predicate and actions associated with them.
The trigger actions are executed when the predicateCondition is true.
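
As a rough illustration of how a predicate gates a trigger's actions, the sketch below evaluates a list of job-state conditions. It is a simplified model, not the Glue service logic or the L2 API: `Condition` and `predicate_is_true` are hypothetical names, and the `AND` / `ANY` combination modes and state strings are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """A watched job and the state it must reach (hypothetical model)."""
    job_name: str
    state: str  # for example "SUCCEEDED" or "FAILED"


def predicate_is_true(conditions, job_states, logical="AND"):
    """Return True when the observed job states satisfy the conditions."""
    results = [job_states.get(c.job_name) == c.state for c in conditions]
    if logical == "AND":
        return all(results)
    if logical == "ANY":
        return any(results)
    raise ValueError(f"unsupported logical operator: {logical}")


conditions = [Condition("extract", "SUCCEEDED"), Condition("validate", "SUCCEEDED")]
observed = {"extract": "SUCCEEDED", "validate": "FAILED"}

fire_on_all = predicate_is_true(conditions, observed, logical="AND")
fire_on_any = predicate_is_true(conditions, observed, logical="ANY")
```

With the observed states above, the `AND` predicate is not satisfied, while the `ANY` predicate is.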

### Connection Properties

A `Connection` allows Glue jobs, crawlers and development endpoints to access
certain types of data stores.

* **Secrets Management**
  You must specify JDBC connection credentials in Secrets Manager and
  provide the Secrets Manager Key name as a property to the job connection.

* **Networking - the CDK determines the best fit subnet for Glue connection
  configuration**
  The prior version of the glue-alpha-module requires the developer to
  specify the subnet of the Connection when it’s defined. Now, you can still
  specify the specific subnet you want to use, but are no longer required
  to. You are only required to provide a VPC and either a public or private
  subnet selection. Without a specific subnet provided, the L2 leverages the
  existing [EC2 Subnet Selection](https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk.aws_ec2/SubnetSelection.html)
  library to make the best choice selection for the subnet.

```python
# security_group: ec2.SecurityGroup
# subnet: ec2.Subnet

glue.Connection(self, "MyConnection",
    type=glue.ConnectionType.NETWORK,
    # The security groups granting AWS Glue inbound access to the data source within the VPC
    security_groups=[security_group],
    # The VPC subnet which contains the data source
    subnet=subnet
)
```

For an RDS `Connection` over JDBC, it is recommended to manage credentials using AWS Secrets Manager. To use a secret, specify `SECRET_ID` in `properties` as in the following code. Note that in this case, the subnet must have a route to the AWS Secrets Manager VPC endpoint or to the AWS Secrets Manager endpoint through a NAT gateway.

```python
# security_group: ec2.SecurityGroup
# subnet: ec2.Subnet
# db: rds.DatabaseCluster

glue.Connection(self, "RdsConnection",
    type=glue.ConnectionType.JDBC,
    security_groups=[security_group],
    subnet=subnet,
    properties={
        "JDBC_CONNECTION_URL": f"jdbc:mysql://{db.cluster_endpoint.socket_address}/databasename",
        "JDBC_ENFORCE_SSL": "false",
        "SECRET_ID": db.secret.secret_name
    }
)
```

If you need to use a connection type that doesn't exist as a static member on `ConnectionType`, you can instantiate a `ConnectionType` object, e.g. `glue.ConnectionType("NEW_TYPE")`.

See [Adding a Connection to Your Data Store](https://docs.aws.amazon.com/glue/latest/dg/populate-add-connection.html) and [Connection Structure](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-connections.html#aws-glue-api-catalog-connections-Connection) documentation for more information on the supported data stores and their configurations.

## SecurityConfiguration

A `SecurityConfiguration` is a set of security properties that can be used by AWS Glue to encrypt data at rest.

```python
glue.SecurityConfiguration(self, "MySecurityConfiguration",
    cloud_watch_encryption=glue.CloudWatchEncryption(
        mode=glue.CloudWatchEncryptionMode.KMS
    ),
    job_bookmarks_encryption=glue.JobBookmarksEncryption(
        mode=glue.JobBookmarksEncryptionMode.CLIENT_SIDE_KMS
    ),
    s3_encryption=glue.S3Encryption(
        mode=glue.S3EncryptionMode.KMS
    )
)
```

By default, a shared KMS key is created for use with the encryption configurations that require one. You can also supply your own key for each encryption config, for example, for CloudWatch encryption:

```python
# key: kms.Key

glue.SecurityConfiguration(self, "MySecurityConfiguration",
    cloud_watch_encryption=glue.CloudWatchEncryption(
        mode=glue.CloudWatchEncryptionMode.KMS,
        kms_key=key
    )
)
```

See the [documentation](https://docs.aws.amazon.com/glue/latest/dg/encryption-security-configuration.html) for more information on how Glue encrypts data written by Crawlers, Jobs, and Development Endpoints.

## Database

A `Database` is a logical grouping of `Tables` in the Glue Catalog.

```python
glue.Database(self, "MyDatabase",
    database_name="my_database",
    description="my_database_description"
)
```

## Table

A Glue table describes a table of data in S3: its structure (column names and types), location of data (S3 objects with a common prefix in an S3 bucket), and format for the files (JSON, Avro, Parquet, etc.):

```python
# my_database: glue.Database

glue.S3Table(self, "MyTable",
    database=my_database,
    columns=[glue.Column(
        name="col1",
        type=glue.Schema.STRING
    ), glue.Column(
        name="col2",
        type=glue.Schema.array(glue.Schema.STRING),
        comment="col2 is an array of strings"
    )],
    data_format=glue.DataFormat.JSON
)
```

By default, an S3 bucket will be created to store the table's data, but you can manually pass the `bucket` and `s3_prefix`:

```python
# my_bucket: s3.Bucket
# my_database: glue.Database

glue.S3Table(self, "MyTable",
    bucket=my_bucket,
    s3_prefix="my-table/",
    # ...
    database=my_database,
    columns=[glue.Column(
        name="col1",
        type=glue.Schema.STRING
    )],
    data_format=glue.DataFormat.JSON
)
```

Glue tables can be configured with user-defined properties that describe the physical storage of table data, through the `storage_parameters` property:

```python
# my_database: glue.Database

glue.S3Table(self, "MyTable",
    storage_parameters=[
        glue.StorageParameter.skip_header_line_count(1),
        glue.StorageParameter.compression_type(glue.CompressionType.GZIP),
        glue.StorageParameter.custom("separatorChar", ",")
    ],
    # ...
    database=my_database,
    columns=[glue.Column(
        name="col1",
        type=glue.Schema.STRING
    )],
    data_format=glue.DataFormat.JSON
)
```

Glue tables can also be configured to contain user-defined table properties through the [`parameters`](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-glue-table-tableinput.html#cfn-glue-table-tableinput-parameters) property:

```python
# my_database: glue.Database

glue.S3Table(self, "MyTable",
    parameters={
        "key1": "val1",
        "key2": "val2"
    },
    database=my_database,
    columns=[glue.Column(
        name="col1",
        type=glue.Schema.STRING
    )],
    data_format=glue.DataFormat.JSON
)
```

### Partition Keys

To improve query performance, a table can specify `partition_keys` on which data is stored and queried separately. For example, you might partition a table by `year` and `month` to optimize queries based on a time window:

```python
# my_database: glue.Database

glue.S3Table(self, "MyTable",
    database=my_database,
    columns=[glue.Column(
        name="col1",
        type=glue.Schema.STRING
    )],
    partition_keys=[glue.Column(
        name="year",
        type=glue.Schema.SMALL_INT
    ), glue.Column(
        name="month",
        type=glue.Schema.SMALL_INT
    )],
    data_format=glue.DataFormat.JSON
)
```
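
To picture what "stored separately" can look like, the sketch below builds an object key prefix for one partition. It assumes the common Hive-style `key=value` directory layout, which is an assumption about how the data is written and not something the construct enforces. `partition_prefix` is a hypothetical helper.

```python
def partition_prefix(table_prefix, partition_values):
    """Build a Hive-style prefix such as 'my-table/year=2024/month=2/'.

    partition_values must list keys in the same order as the table's
    partition keys (dicts preserve insertion order).
    """
    parts = [f"{key}={value}" for key, value in partition_values.items()]
    return table_prefix.rstrip("/") + "/" + "/".join(parts) + "/"


print(partition_prefix("my-table/", {"year": 2024, "month": 2}))
# my-table/year=2024/month=2/
```

A query that filters on `year` and `month` then only needs to read objects under the matching prefixes.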

### Partition Indexes

Another way to improve query performance is to specify partition indexes. If no partition indexes are
present on the table, AWS Glue loads all partitions of the table and filters the loaded partitions using
the query expression. The query takes more time to run as the number of partitions increases. With an
index, the query will try to fetch a subset of the partitions instead of loading all partitions of the
table.

The keys of a partition index must be a subset of the partition keys of the table. You can have a
maximum of 3 partition indexes per table. To specify a partition index, you can use the `partition_indexes`
property:

```python
# my_database: glue.Database

glue.S3Table(self, "MyTable",
    database=my_database,
    columns=[glue.Column(
        name="col1",
        type=glue.Schema.STRING
    )],
    partition_keys=[glue.Column(
        name="year",
        type=glue.Schema.SMALL_INT
    ), glue.Column(
        name="month",
        type=glue.Schema.SMALL_INT
    )],
    partition_indexes=[glue.PartitionIndex(
        index_name="my-index",  # optional
        key_names=["year"]
    )],  # supply up to 3 indexes
    data_format=glue.DataFormat.JSON
)
```

Alternatively, you can call the `add_partition_index()` method on a table:

```python
# my_table: glue.Table

my_table.add_partition_index(
    index_name="my-index",
    key_names=["year"]
)
```
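
The two partition index rules stated above (index keys must be a subset of the table's partition keys, and at most 3 indexes per table) can be checked with a small standalone function. This is a sketch of the rules only. `validate_partition_indexes` is a hypothetical helper, and the construct's own validation and error messages may differ.

```python
MAX_PARTITION_INDEXES = 3  # limit stated in the text above


def validate_partition_indexes(partition_keys, indexes):
    """Raise ValueError if any index breaks the rules described above.

    partition_keys: names of the table's partition keys, e.g. ["year", "month"]
    indexes: one list of key names per partition index
    """
    if len(indexes) > MAX_PARTITION_INDEXES:
        raise ValueError(
            f"at most {MAX_PARTITION_INDEXES} partition indexes are allowed, "
            f"got {len(indexes)}"
        )
    known = set(partition_keys)
    for key_names in indexes:
        if not key_names:
            raise ValueError("a partition index needs at least one key")
        unknown = [name for name in key_names if name not in known]
        if unknown:
            raise ValueError(f"index keys are not partition keys: {unknown}")


# Passes: "year" is one of the table's partition keys.
validate_partition_indexes(["year", "month"], [["year"]])
```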

### Partition Filtering

If you have a table with a large number of partitions that grows over time, consider using AWS Glue partition indexing and filtering.

```python
# my_database: glue.Database

glue.S3Table(self, "MyTable",
    database=my_database,
    columns=[glue.Column(
        name="col1",
        type=glue.Schema.STRING
    )],
    partition_keys=[glue.Column(
        name="year",
        type=glue.Schema.SMALL_INT
    ), glue.Column(
        name="month",
        type=glue.Schema.SMALL_INT
    )],
    data_format=glue.DataFormat.JSON,
    enable_partition_filtering=True
)
```

### Glue Connections

Glue connections allow external data connections to third party databases and data warehouses. However, these connections can also be assigned to Glue Tables, allowing you to query external data sources using the Glue Data Catalog.

Whereas `S3Table` will point to (and if needed, create) a bucket to store the tables' data, `ExternalTable` will point to an existing table in a data source. For example, to create a table in Glue that points to a table in Redshift:

```python
# my_connection: glue.Connection
# my_database: glue.Database

glue.ExternalTable(self, "MyTable",
    connection=my_connection,
    external_data_location="default_db_public_example",  # A table in Redshift
    # ...
    database=my_database,
    columns=[glue.Column(
        name="col1",
        type=glue.Schema.STRING
    )],
    data_format=glue.DataFormat.JSON
)
```

## [Encryption](https://docs.aws.amazon.com/athena/latest/ug/encryption.html)

You can enable encryption on a Table's data:

* [S3Managed](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingServerSideEncryption.html) - (default) Server side encryption (`SSE-S3`) with an Amazon S3-managed key.

```python
# my_database: glue.Database

glue.S3Table(self, "MyTable",
    encryption=glue.TableEncryption.S3_MANAGED,
    # ...
    database=my_database,
    columns=[glue.Column(
        name="col1",
        type=glue.Schema.STRING
    )],
    data_format=glue.DataFormat.JSON
)
```

* [Kms](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingKMSEncryption.html) - Server-side encryption (`SSE-KMS`) with an AWS KMS Key managed by the account owner.

```python
# my_database: glue.Database

# KMS key is created automatically
glue.S3Table(self, "MyTable",
    encryption=glue.TableEncryption.KMS,
    # ...
    database=my_database,
    columns=[glue.Column(
        name="col1",
        type=glue.Schema.STRING
    )],
    data_format=glue.DataFormat.JSON
)

# with an explicit KMS key
glue.S3Table(self, "MyTable",
    encryption=glue.TableEncryption.KMS,
    encryption_key=kms.Key(self, "MyKey"),
    # ...
    database=my_database,
    columns=[glue.Column(
        name="col1",
        type=glue.Schema.STRING
    )],
    data_format=glue.DataFormat.JSON
)
```

* [KmsManaged](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingKMSEncryption.html) - Server-side encryption (`SSE-KMS`), like `Kms`, except with an AWS KMS Key managed by the AWS Key Management Service.

```python
# my_database: glue.Database

glue.S3Table(self, "MyTable",
    encryption=glue.TableEncryption.KMS_MANAGED,
    # ...
    database=my_database,
    columns=[glue.Column(
        name="col1",
        type=glue.Schema.STRING
    )],
    data_format=glue.DataFormat.JSON
)
```

* [ClientSideKms](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingClientSideEncryption.html#client-side-encryption-kms-managed-master-key-intro) - Client-side encryption (`CSE-KMS`) with an AWS KMS Key managed by the account owner.

```python
# my_database: glue.Database

# KMS key is created automatically
glue.S3Table(self, "MyTable",
    encryption=glue.TableEncryption.CLIENT_SIDE_KMS,
    # ...
    database=my_database,
    columns=[glue.Column(
        name="col1",
        type=glue.Schema.STRING
    )],
    data_format=glue.DataFormat.JSON
)

# with an explicit KMS key
glue.S3Table(self, "MyTable",
    encryption=glue.TableEncryption.CLIENT_SIDE_KMS,
    encryption_key=kms.Key(self, "MyKey"),
    # ...
    database=my_database,
    columns=[glue.Column(
        name="col1",
        type=glue.Schema.STRING
    )],
    data_format=glue.DataFormat.JSON
)
```

*Note: you cannot provide a `Bucket` when creating the `S3Table` if you wish to use server-side encryption (`KMS`, `KMS_MANAGED` or `S3_MANAGED`)*.

## Types

A table's schema is a collection of columns, each of which has a `name` and a `type`. Types are recursive structures, consisting of primitive and complex types:

```python
# my_database: glue.Database

glue.S3Table(self, "MyTable",
    columns=[glue.Column(
        name="primitive_column",
        type=glue.Schema.STRING
    ), glue.Column(
        name="array_column",
        type=glue.Schema.array(glue.Schema.INTEGER),
        comment="array<integer>"
    ), glue.Column(
        name="map_column",
        type=glue.Schema.map(glue.Schema.STRING, glue.Schema.TIMESTAMP),
        comment="map<string,timestamp>"
    ), glue.Column(
        name="struct_column",
        type=glue.Schema.struct([glue.Column(
            name="nested_column",
            type=glue.Schema.DATE,
            comment="nested comment"
        )]),
        comment="struct<nested_column:date COMMENT 'nested comment'>"
    )],
    # ...
    database=my_database,
    data_format=glue.DataFormat.JSON
)
```
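
The recursive nature of these types shows up in the type strings quoted in the comments above. The sketch below renders such strings from plain Python values. The function names are hypothetical and unrelated to `glue.Schema`, and primitive type names are passed in as-is.

```python
def array(item_type):
    """Render an array type, e.g. array<string>."""
    return f"array<{item_type}>"


def map_of(key_type, value_type):
    """Render a map type, e.g. map<string,timestamp>."""
    return f"map<{key_type},{value_type}>"


def struct(fields):
    """Render a struct type from (name, type, comment) tuples.

    Pass None as the comment to omit the COMMENT clause.
    """
    rendered = []
    for name, field_type, comment in fields:
        text = f"{name}:{field_type}"
        if comment:
            text += f" COMMENT '{comment}'"
        rendered.append(text)
    return "struct<" + ",".join(rendered) + ">"


# Complex types nest because each builder accepts any type string.
nested = array(map_of("string", "timestamp"))
struct_type = struct([("nested_column", "date", "nested comment")])
```

Here `struct_type` matches the string shown in the `struct_column` comment above.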

## Public FAQ

### What are we launching today?

We’re launching new features to an AWS CDK Glue L2 Construct to provide
best-practice defaults and convenience methods to create Glue Jobs, Connections,
Triggers, Workflows, and the underlying permissions and configuration.

### Why should I use this Construct?

Developers should use this Construct to reduce the amount of boilerplate
code and complexity each individual has to navigate, and make it easier to
create best-practice Glue resources.

### What’s not in scope?

Glue Crawlers and other resources that are now managed by the AWS LakeFormation
team are not in scope for this effort. Developers should use existing methods
to create these resources, and the new Glue L2 construct assumes they already
exist as inputs. While best practice is for application and infrastructure code
to be as close as possible for teams using fully-implemented DevOps mechanisms,
in practice these ETL scripts are likely managed by a data science team who
know Python or Scala and don’t necessarily own or manage their own
infrastructure deployments. We want to meet developers where they are, and not
assume that all of the code resides in the same repository. Developers who
do own both can, however, automate this themselves via the CDK.

Validating Glue version and feature use per AWS region at synth time is also
not in scope. AWS’ intention is for all features to eventually be propagated to
all global regions, so the complexity of creating and updating region-specific
configuration to match shifting feature sets is not justified. It is unlikely
that a developer will use this construct to deploy a particular new feature to
a region that doesn’t yet support it without first researching or manually
trying that feature before developing it via IaC. The developer will, of
course, still get feedback from the underlying Glue APIs as CloudFormation
deploys the resources, similar to the current CDK L1 Glue experience.

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/aws/aws-cdk",
    "name": "aws-cdk.aws-glue-alpha",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "~=3.8",
    "maintainer_email": null,
    "keywords": null,
    "author": "Amazon Web Services",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/d9/da/76e92e05c07f61f7acdac81d96ca0a542b4b5bd32d7a1a31d482b66e30dd/aws_cdk_aws_glue_alpha-2.179.0a0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "The CDK Construct Library for AWS::Glue",
    "version": "2.179.0a0",
    "project_urls": {
        "Homepage": "https://github.com/aws/aws-cdk",
        "Source": "https://github.com/aws/aws-cdk.git"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "a1cf110c77168e2322477eceadcf62c50a0121d4affbe1cfc8d5608eac5e431e",
                "md5": "47aaba23ea2da1889825201de97d4f31",
                "sha256": "f794c7dccb482e2c759c7d12409da13842e2c86bbe22c76a0cffc522da92bca1"
            },
            "downloads": -1,
            "filename": "aws_cdk.aws_glue_alpha-2.179.0a0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "47aaba23ea2da1889825201de97d4f31",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "~=3.8",
            "size": 417640,
            "upload_time": "2025-02-18T00:34:04",
            "upload_time_iso_8601": "2025-02-18T00:34:04.337553Z",
            "url": "https://files.pythonhosted.org/packages/a1/cf/110c77168e2322477eceadcf62c50a0121d4affbe1cfc8d5608eac5e431e/aws_cdk.aws_glue_alpha-2.179.0a0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "d9da76e92e05c07f61f7acdac81d96ca0a542b4b5bd32d7a1a31d482b66e30dd",
                "md5": "9510db0bab30e21266d531af0a2d99bd",
                "sha256": "a3aec6529d28072f98cf5d67649bb9157251a609d3332b51cea05373e85c2c05"
            },
            "downloads": -1,
            "filename": "aws_cdk_aws_glue_alpha-2.179.0a0.tar.gz",
            "has_sig": false,
            "md5_digest": "9510db0bab30e21266d531af0a2d99bd",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "~=3.8",
            "size": 433542,
            "upload_time": "2025-02-18T00:35:09",
            "upload_time_iso_8601": "2025-02-18T00:35:09.531108Z",
            "url": "https://files.pythonhosted.org/packages/d9/da/76e92e05c07f61f7acdac81d96ca0a542b4b5bd32d7a1a31d482b66e30dd/aws_cdk_aws_glue_alpha-2.179.0a0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-02-18 00:35:09",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "aws",
    "github_project": "aws-cdk",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "aws-cdk.aws-glue-alpha"
}

Amazon Web Services