# SageMakerStudioDataEngineeringSessions
SageMaker Unified Studio Data Engineering Sessions
This package depends on a SageMaker Unified Studio environment. If you are using SageMaker Unified Studio, see the [AWS documentation](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/what-is-sagemaker-unified-studio.html) for guidance.
This package lets SageMaker Unified Studio connect to various AWS compute services, including EMR, EMR Serverless, Glue, and Redshift.
It uses [IPython magics](https://ipython.readthedocs.io/en/stable/interactive/magics.html) and [AWS DataZone Connections](https://docs.aws.amazon.com/datazone/latest/APIReference/API_ListConnections.html) to provide the following features.
## Features
- Connect to remote compute
- Execute Spark code in remote compute in Python/Scala
- Execute SQL queries in remote compute
- Send local variables to remote compute
## How to set up
If you are using SageMaker Unified Studio, you can skip this part; SageMaker Unified Studio already sets up the package for you.
This package exposes its functionality through a set of Jupyter magics.
To load these magics, make sure an IPython config file exists. If not, run `ipython profile create`, which generates `~/.ipython/profile_default/ipython_config.py`.
Then add the following line at the end of that config file:
```
c.InteractiveShellApp.extensions.extend(['sagemaker_studio_dataengineering_sessions.sagemaker_connection_magic'])
```
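If you prefer to script this step, the line can also be appended programmatically; a minimal sketch in Python, assuming the default profile location (run `ipython profile create` first so the profile directory exists):
```
from pathlib import Path

config = Path.home() / ".ipython" / "profile_default" / "ipython_config.py"
line = "c.InteractiveShellApp.extensions.extend(['sagemaker_studio_dataengineering_sessions.sagemaker_connection_magic'])"

# Append the extension line only if it is not already present.
text = config.read_text() if config.exists() else ""
if line not in text:
    config.parent.mkdir(parents=True, exist_ok=True)
    config.write_text(text + "\n" + line + "\n")
```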
Once that is done, restart the IPython kernel and run `%help` to see the list of supported magics.
## Examples
To connect to remote compute, a DataZone Connection is required; you can create one via the [CreateConnection API](https://docs.aws.amazon.com/datazone/latest/APIReference/API_CreateConnection.html). The examples below assume an existing connection called `project.spark`.
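To check that the connection exists before using it, the [ListConnections API](https://docs.aws.amazon.com/datazone/latest/APIReference/API_ListConnections.html) linked above can be called through boto3. A minimal sketch; the domain and project identifiers below are placeholders you would replace with your own:
```
import boto3

datazone = boto3.client("datazone")

# Placeholder identifiers -- replace with your DataZone domain and project IDs.
response = datazone.list_connections(
    domainIdentifier="dzd_exampledomain",
    projectIdentifier="exampleprojectid",
    name="project.spark",
)
for connection in response["items"]:
    print(connection["name"], connection["type"])
```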
### Supported Connection Types
- IAM
- SPARK
- REDSHIFT
- ATHENA
### Connect to remote compute and execute Spark code in Python
The following example connects to an AWS Glue interactive session and runs Spark code in Glue.
```
%%pyspark project.spark
import sys

import boto3
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import SparkSession

# Resolve arguments passed to the session; these names must be supplied to the
# session, otherwise getResolvedOptions raises an error.
args = getResolvedOptions(sys.argv, ["redshift_url", "redshift_iam_role", "redshift_tempdir", "redshift_jdbc_iam_url"])
print(f"{args}")

# Reuse the Spark context provided by the Glue interactive session.
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

# Read a public sample dataset from S3 in the current region.
df = spark.read.csv(f"s3://sagemaker-example-files-prod-{boto3.session.Session().region_name}/datasets/tabular/dirty-titanic/", header=True)
df.show(5, truncate=False)
df.printSchema()

# Register a temp view so later cells (Scala and SQL) can query it.
df.createOrReplaceTempView("df_sql_tempview")
```
### Execute Spark code in Scala
The following example connects to the same AWS Glue interactive session and runs Spark code in Scala.
```
%%scalaspark project.spark
// Query the temp view registered by the Python cell above
val dfScala = spark.sql("SELECT count(0) FROM df_sql_tempview")
dfScala.show()
```
### Execute SQL query in remote compute
The following example connects to Amazon Redshift through the `project.redshift` connection and runs a SQL query.
```
%%sql project.redshift
select current_user()
```
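ATHENA connections are used the same way with `%%sql`. A hypothetical example, assuming a connection named `project.athena` has been created in your project:
```
%%sql project.athena
select 1
```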
### Some other helpful magics
```
%help - list available magics and related information
%send_to_remote - send a local variable to the remote compute
%%configure - configure the Spark application config in the remote compute
```