# CloudTik: Cloud Scale Platform for Distributed Analytics and AI
## Introduction
### The Problem
Building and operating a fully distributed, high-performance data analytics and AI platform is complex and time-consuming.
This is especially hard for small or mid-sized enterprises, not to mention individuals.
Existing solutions for distributed analytics and AI on the cloud face major challenges across several aspects that users care about:
high cost of software services, suboptimal performance on the underlying hardware,
the complexity of operating and running such a platform, and lack of transparency.
CloudTik enables researchers, data scientists, and enterprises to easily create and manage analytics and AI platforms on public clouds,
with out-of-the-box optimized functionality and performance, and to quickly focus on running their business workloads
in hours or even minutes instead of spending months constructing and optimizing the platform.
### CloudTik Solution
CloudTik is designed to solve the above challenges by providing a platform that helps users
focus on business development and achieve "Develop once, run everywhere", with the following core capabilities:
- Scalable, robust, and unified control plane and runtimes for all environments:
- Public cloud providers and Kubernetes
- Single node virtual clustering
- Local or on-premise clusters
- Out-of-the-box optimized runtimes for storage, database, analytics and AI
- Optimized Spark runtime with CloudTik optimizations
- Optimized AI runtime with Intel oneAPI
- Infrastructure and runtimes to support microservices orchestration with:
- Service discovery - service registry, service discovery, service DNS naming
- Load balancing - Layer 4 or Layer 7 load balancer working with built-in service discovery
- Support of major public cloud providers:
- AWS - Amazon Elastic Compute Cloud (EC2) or Amazon Elastic Kubernetes Service (EKS)
- Azure - Azure Virtual Machines or Azure Kubernetes Service (AKS)
- GCP - Google Compute Engine (GCE) or Google Kubernetes Engine (GKE)
- Alibaba Cloud - Elastic Compute Service (ECS)
- Kubernetes and more
- A fully open architecture and open-sourced platform
## High Level Concepts
### Workspace
A workspace is the CloudTik concept that acts as the container for a set of clusters and the cloud
resources shared among these clusters.
When a workspace is created for a specific cloud provider, all the shared resources implementing the unified
design are created. These include network resources (like VPC, subnets, NAT gateways, firewall rules),
instance profiles, cloud storage and so on. Although the actual resources vary between cloud providers,
the design they achieve is consistent.
### Cluster
Within a workspace, one or more clusters with the needed services (runtimes) can be started.
These clusters share common configuration
such as the network (they are in the same VPC) but vary in other aspects including instance types, cluster scale,
and the services running. The services provided by one cluster can be discovered
and consumed by other clusters.
### Providers
A CloudTik provider abstracts the hardware infrastructure layer so that CloudTik's common facilities and runtimes
run consistently in every provider environment. Support for the different public clouds is implemented as providers
(such as the AWS, Azure and GCP providers). Beyond the public cloud environments, virtual single-node clustering
and local or on-premise clusters are also supported and implemented as providers
(the virtual, local and on-premise providers).
### Runtimes
For each cluster started, the user can easily configure which runtimes
(such as the Spark runtime or the Machine Learning/Deep Learning runtime) are needed.
CloudTik ships these runtimes with optimized configurations and libraries,
and when the cluster is running, the runtimes are properly configured and ready to run your workloads.
## Getting Started with CloudTik
### 1. Preparing Python environment
CloudTik requires a Python environment on Linux. We recommend using Conda to manage Python environments and packages.
If you don't have Conda installed, please refer to `dev/install-conda.sh` to install Conda on Linux.
```
git clone https://github.com/cloudtik/cloudtik.git && cd cloudtik
bash dev/install-conda.sh
```
Once Conda is installed, create an environment with a specific Python version as below.
CloudTik currently supports Python 3.8 or above. Take Python 3.9 as an example:
```
conda create -n cloudtik -y python=3.9
conda activate cloudtik
```
### 2. Installing CloudTik
Execute the following `pip` command to install CloudTik on your working machine for a specific cloud provider.
Take AWS for example:
```
pip install cloudtik[aws]
```
Replace `cloudtik[aws]` with `cloudtik[azure]`, `cloudtik[gcp]`, or `cloudtik[aliyun]`
if you want to create clusters on Azure, GCP, or Alibaba Cloud respectively.
If you want to run on Kubernetes, install `cloudtik[kubernetes]`,
or `cloudtik[eks]` or `cloudtik[gke]` if you are running on an AWS EKS or GCP GKE cluster.
Use `cloudtik[all]` if you want to manage clusters with all supported Cloud providers.
If you don't have a public cloud account, you can also try CloudTik
locally with the same clustering experience using the virtual, on-premise or local providers.
In this case, simply install the CloudTik core package:
```
pip install cloudtik
```
Please refer to [User Guide: Running Clusters Locally](https://cloudtik.readthedocs.io/en/latest/UserGuide/running-locally.html)
for a detailed guide on this case.
### 3. Authentication to Cloud Providers API
After CloudTik is installed on your working machine, you need to configure or log into your Cloud account to
authenticate the cloud provider CLI on this machine.
#### AWS
First, install AWS CLI (command line interface) on your working machine. Please refer to
[Installing AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html)
for detailed instructions.
After the AWS CLI is installed, you need to configure it with your credentials. The quickest way
is to run the `aws configure` command, and you can refer to
[Managing access keys](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html#Using_CreateAccessKey)
to get *AWS Access Key ID* and *AWS Secret Access Key*.
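For reference, `aws configure` stores these values in files under `~/.aws` on your working machine. A sketch of the resulting credentials file, with placeholder values:

```
# ~/.aws/credentials (written by `aws configure`; values are placeholders)
[default]
aws_access_key_id = <your AWS Access Key ID>
aws_secret_access_key = <your AWS Secret Access Key>
```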
More details for AWS CLI can be found in [AWS CLI Getting Started](https://github.com/aws/aws-cli/tree/v2#getting-started).
#### Azure
After CloudTik is installed on your working machine, login to Azure using `az login`.
Refer to [Sign in with Azure CLI](https://docs.microsoft.com/en-us/cli/azure/authenticate-azure-cli) for more details.
#### GCP
If you use service account authentication, follow [Creating a service account](https://cloud.google.com/docs/authentication/getting-started#creating_a_service_account)
to create a service account on Google Cloud.
Safely download the JSON key file to your working machine, then set the `GOOGLE_APPLICATION_CREDENTIALS` environment
variable as described in [Setting the environment variable](https://cloud.google.com/docs/authentication/getting-started#setting_the_environment_variable).
If you are using user account authentication, refer to [User Guide: Login to Cloud](https://cloudtik.readthedocs.io/en/latest/UserGuide/login-to-cloud.html#gcp) for details.
#### Alibaba Cloud
The simplest way to set up Alibaba Cloud credentials for CloudTik is
to export the access key ID and access key secret of your cloud account:
```
export ALIBABA_CLOUD_ACCESS_KEY_ID=xxxxxxxxxxxxxxxxxxxxxxxx
export ALIBABA_CLOUD_ACCESS_KEY_SECRET=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```
For more options of Alibaba Cloud credentials configuration in CloudTik,
refer to [User Guide: Login to Cloud](https://cloudtik.readthedocs.io/en/latest/UserGuide/login-to-cloud.html#alibaba-cloud).
Note: please activate OSS (Object Storage Service) through the Alibaba Cloud Console before going to the next step.
#### Kubernetes
If you are running CloudTik on a generic Kubernetes cluster, the authentication setup is simple:
you just need to authenticate kubectl on your working machine so it can access the Kubernetes cluster.
If you are using a cloud Kubernetes engine (such as AWS EKS, GCP GKE or Azure AKS)
with cloud integrations that access cloud resources such as cloud storage,
you need both the kubectl authentication to the cloud Kubernetes cluster and the cloud API credentials configuration above.
For detailed information on how to configure Kubernetes with cloud integrations,
refer to [User Guide: Login to Cloud - Kubernetes](https://cloudtik.readthedocs.io/en/latest/UserGuide/login-to-cloud.html#kubernetes).
### 4. Creating a Workspace for Clusters
Once you have authenticated with your cloud provider, you can create a Workspace.
CloudTik uses the **Workspace** concept to easily manage shared cloud resources such as VPC network resources,
identity and role resources, firewalls or security groups, and cloud storage.
By default, CloudTik will create workspace-managed cloud storage
(S3 for AWS, Data Lake Storage Gen2 for Azure, GCS for GCP) for use without any user configuration.
**Note: Some resources like NAT gateway or elastic IP resources in Workspace cost money.
The price policy may vary among cloud providers.
Please check the price policy of the specific cloud provider to avoid undesired cost.**
Within a workspace, you can start one or more clusters with different combinations of runtime services.
Create a workspace configuration YAML file specifying a unique workspace name, the cloud provider type and a few cloud
provider properties.
Take AWS for example:
```
# A unique identifier for the workspace.
workspace_name: example-workspace

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-2
    # Use allowed_ssh_sources to allow SSH access from your client machine
    allowed_ssh_sources:
      - 0.0.0.0/0
```
*NOTE:* `0.0.0.0/0` in `allowed_ssh_sources` will allow any IP address to connect to your cluster as long as it has the cluster private key.
For better security, change `0.0.0.0/0` to CIDR ranges restricted to your situation.
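For example, the provider section could restrict SSH access to a single corporate network; the CIDR below is a placeholder to substitute with your own range:

```
provider:
    type: aws
    region: us-west-2
    allowed_ssh_sources:
      # Placeholder corporate network range; replace with your own CIDR
      - 203.0.113.0/24
```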
Use the following command to create and provision a Workspace:
```
cloudtik workspace create /path/to/your-workspace-config.yaml
```
Check [Configuration Examples](https://github.com/cloudtik/cloudtik/tree/main/examples/cluster) folder for more Workspace configuration file examples
for AWS, Azure, GCP, Kubernetes (AWS EKS or GCP GKE).
If you encounter problems creating a Workspace, a common cause is that your current cloud login account
doesn't have enough privileges to create some resources such as VPCs, storage, public IPs and so on.
Make sure your current account has enough privileges; an admin or owner role is the most likely to have
all of them.
### 5. Starting a cluster with runtimes
Now you can start a cluster running Spark by default:
```
cloudtik start /path/to/your-cluster-config.yaml
```
A typical cluster configuration file is usually very simple thanks to the design of CloudTik's templates with inheritance.
Take AWS for example,
```
# An example of standard 1 + 3 nodes cluster with standard instance type
from: aws/standard

# Workspace into which to launch the cluster
workspace_name: example-workspace

# A unique identifier for the cluster.
cluster_name: example

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-2

auth:
    ssh_user: ubuntu
    # Set proxy if you are in corporation network. For example,
    # ssh_proxy_command: "ncat --proxy-type socks5 --proxy your_proxy_host:your_proxy_port %h %p"

available_node_types:
    worker.default:
        # The minimum number of worker nodes to launch.
        min_workers: 3
```
This example can be found in CloudTik source code folder `examples/cluster/aws/example-standard.yaml`.
You need only a few key settings in the configuration file to launch a Spark cluster.
As for `auth` above, set a proxy if your working node is on a corporate network:
```
auth:
    ssh_user: ubuntu
    ssh_proxy_command: "ncat --proxy-type socks5 --proxy <your_proxy_host>:<your_proxy_port> %h %p"
```
The cluster key will be created automatically for AWS and GCP if not specified;
the created private key file can be found in the `.ssh` folder of your home directory.
For Azure, you need to generate an RSA key pair manually (use `ssh-keygen -t rsa -b 4096` to generate a new SSH key pair)
and configure the public and private keys as follows:
```
auth:
    ssh_private_key: ~/.ssh/my_cluster_rsa_key
    ssh_public_key: ~/.ssh/my_cluster_rsa_key.pub
```
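As a sketch, such a key pair can be generated non-interactively; the file name here matches the `auth` config above, and the empty passphrase is just an illustrative choice:

```shell
# Generate a 4096-bit RSA key pair with no passphrase;
# produces ~/.ssh/my_cluster_rsa_key and ~/.ssh/my_cluster_rsa_key.pub
mkdir -p ~/.ssh
ssh-keygen -t rsa -b 4096 -f ~/.ssh/my_cluster_rsa_key -N "" -q
```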
If you need different runtime components in the cluster,
you can set the runtime types in the cluster configuration file. For example,
```
runtime:
    types: [spark, ai]
```
This will run a cluster with the Spark and AI runtimes.
Refer to the `examples/cluster` directory for more cluster configuration examples.
### 6. Running analytics and AI workloads
Once the cluster is started, you can run Spark analytics and AI workloads,
which are distributed and large-scale by design.
Below are some basic examples to start with.
For running optimized Spark and AI, refer to [Running Optimized Analytics with Spark](https://cloudtik.readthedocs.io/en/latest/UserGuide/running-optimized-ai.html)
and [Running Optimized AI](https://cloudtik.readthedocs.io/en/latest/UserGuide/running-optimized-ai.html) for more information.
#### Running the Spark Pi example
Running a Spark job is very straightforward. Take the Spark Pi job for example:
```
cloudtik exec ./your-cluster-config.yaml "spark-submit --master yarn --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.yarn.submit.waitAppCompletion=false \$SPARK_HOME/examples/jars/spark-examples.jar 12345" --job-waiter=spark
```
Refer to [Run Spark PI Example](examples/spark) for more details.
#### Running analytics benchmarks
CloudTik provides ready-to-use tools for running the TPC-DS benchmark
on a CloudTik Spark runtime cluster.
Refer to [Run TPC-DS performance benchmark for Spark](tools/benchmarks/spark)
for a detailed step-by-step guide.
#### Running machine learning and deep learning examples
CloudTik provides ready-to-run examples demonstrating
how distributed AI jobs can be implemented on a CloudTik Spark and AI runtime cluster.
Refer to [Distributed AI Examples](examples/ai)
for a detailed step-by-step guide.
#### Workflow examples
Users can integrate CloudTik with external workflows using bash scripts or Python
to run on-demand clusters and jobs.
Refer to [Workflow Integration Examples](examples/workflows) for example scripts.
### 7. Managing clusters
CloudTik provides powerful capabilities to monitor and manage the cluster.
#### Cluster status and information
Use the following commands to show various cluster information.
```
# Check cluster status with:
cloudtik status /path/to/your-cluster-config.yaml
# Show cluster summary information and useful links to connect to cluster web UI.
cloudtik info /path/to/your-cluster-config.yaml
cloudtik head-ip /path/to/your-cluster-config.yaml
cloudtik worker-ips /path/to/your-cluster-config.yaml
```
#### Attach to the cluster head (or specific node)
Connect to a terminal on the cluster head node.
```
cloudtik attach /path/to/your-cluster-config.yaml
```
#### Execute and Submit Jobs
Execute a command via SSH on the cluster head node or a specified node.
```
cloudtik exec /path/to/your-cluster-config.yaml [command]
```
#### Manage Files
Upload files or directories to the cluster.
```
cloudtik rsync-up /path/to/your-cluster-config.yaml [source] [target]
```
Download files or directories from the cluster.
```
cloudtik rsync-down /path/to/your-cluster-config.yaml [source] [target]
```
### 8. Tearing Down
#### Terminate a Cluster
Stop and delete the cluster.
```
cloudtik stop /path/to/your-cluster-config.yaml
```
#### Delete the Workspace
Delete the workspace and all the network resources within it.
```
cloudtik workspace delete /path/to/your-workspace-config.yaml
```
By default, the managed cloud storage will not be deleted.
Add the `--delete-managed-storage` option to force deletion of the managed cloud storage.
For more information on these commands, use `cloudtik --help` or `cloudtik [command] --help` to get detailed instructions.
Raw data
{
"_id": null,
"home_page": "https://github.com/cloudtik/cloudtik.git",
"name": "cloudtik",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "",
"keywords": "Distributed Cloud Data Analytic AI Spark",
"author": "Chen Haifeng",
"author_email": "",
"download_url": "",
"platform": null,
"description": "# CloudTik: Cloud Scale Platform for Distributed Analytics and AI\n\n## Introduction\n\n### The Problem\nBuilding and operating fully distributed and high performance data analytics and AI platform are complex and time-consuming.\nThis is usually hard for small or middle enterprises not saying individuals.\n\nWhile the existing solutions for solving distributed analytics and AI problems on cloud\nhave major challenges on a combination of various aspects cared by users.\nThese include high cost for software services, non-optimal performance on the corresponding hardware,\nthe complexity of operating and running such a platform and lack of transparency.\n\nCloudTik enables researchers, data scientists, and enterprises to easily create and manage analytics and AI platform on public clouds,\nwith out-of-box optimized functionalities and performance, and to go quickly to focus on running the business workloads\nin hours or in even minutes instead of spending months to construct and optimize the platform.\n\n### CloudTik Solution\nCloudTik is designed for solving the above challenges by providing a platform to help user\nfocuses on business development and achieve \"Develop once, run everywhere\" with the following core capabilities:\n- Scalable, robust, and unified control plane and runtimes for all environments:\n - Public cloud providers and Kubernetes\n - Single node virtual clustering\n - Local or on-premise clusters\n- Out of box optimized runtimes for storage, database, analytics and AI\n - Optimized Spark runtime with CloudTik optimizations\n - Optimized AI runtime with Intel oneAPI\n- Infrastructure and runtimes to support microservices orchestration with:\n - Service discovery - service registry, service discover, service DNS naming\n - Load balancing - Layer 4 or Layer 7 load balancer working with built-in service discovery\n- Support of major public cloud providers:\n - AWS - Amazon Elastic Compute Cloud (EC2) or Amazon Elastic Kubernetes 
Service (EKS)\n - Azure - Azure Virtual Machines or Azure Kubernetes Service (AKS)\n - GCP - Google Compute Engine (GCE) or Google Kubernetes Engine (GKE)\n - Alibaba Cloud - Elastic Compute Service (ECS)\n - Kubernetes and more\n- A fully open architecture and open-sourced platform\n\n## High Level Concepts\n### Workspace\nWorkspace is the CloudTik concept to act as the container of a set of clusters and the shared Cloud\nresources among these clusters.\n\nWhen a workspace for specific cloud provider is created, all the shared resources for implementing the unified\ndesign are created. These include network resources (like VPC, subnets, NAT gateways, firewall rules),\ninstance profiles, cloud storage and so on. Although the actual resources varies between cloud providers while\nthe design the resources achieved is consistent.\n\n### Cluster\nWithin a workspace, one or more clusters with needed services(runtimes) can be started.\nThese clusters will share a lot of common configurations\nsuch as network (they are in the same VPC) but vary on other aspects including instance types, scale of the cluster,\nservices running and so on. The services provided by one cluster can be discovered by other clusters\nand be consumed.\n\n### Providers\nCloudTik provider abstracts the hardware infrastructure layer so that CloudTik common facilities and runtimes\ncan consistently run on every provider environments. The support of different public cloud are implemented as providers\n(such as AWS, Azure, GCP providers). 
Beyond the public cloud environments, we also support\nvirtual single node clustering, local or on-premise clusters which are also implemented as providers\n(for example, virtual, local and on-premise providers)\n\n### Runtimes\nFor each cluster started, user can configure very easily which runtimes\n(such as Spark runtime or Machine Learning/Deep Learning runtime) are needed.\nCloudTik has designed the runtimes with the optimized configurations and libraries.\nAnd when the cluster is running, the runtimes are properly configured and ready for running your workload.\n\n## Getting Started with CloudTik\n\n### 1. Preparing Python environment\n\nCloudTik requires a Python environment on Linux. We recommend using Conda to manage Python environments and packages.\n\nIf you don't have Conda installed, please refer to `dev/install-conda.sh` to install Conda on Linux.\n\n```\ngit clone https://github.com/cloudtik/cloudtik.git && cd cloudtik\nbash dev/install-conda.sh\n```\n\nOnce Conda is installed, create an environment with a specific Python version as below.\nCloudTik currently supports Python 3.8 or above. Take Python 3.9 as an example,\n\n```\nconda create -n cloudtik -y python=3.9\nconda activate cloudtik\n```\n\n### 2. Installing CloudTik\n\nExecute the following `pip` commands to install CloudTik on your working machine for specific cloud providers. 
\n\nTake AWS for example,\n\n```\npip install cloudtik[aws]\n```\n\nReplace `cloudtik[aws]` with `clouditk[azure]`, `cloudtik[gcp]`, `cloudtik[aliyun]`\nif you want to create clusters on Azure, GCP, Alibaba Cloud respectively.\n\nIf you want to run on Kubernetes, install `cloudtik[kubernetes]`.\nOr `clouditk[eks]` or `cloudtik[gke]` if you are running on AWS EKS or GCP GKE cluster.\nUse `cloudtik[all]` if you want to manage clusters with all supported Cloud providers.\n\nIf you don't have a public cloud account, you can also play with CloudTik\neasily locally with the same clustering experiences using virtual, on-premise or local providers.\nFor this case, simply install cloudtik core as following command,\n```\npip install cloudtik\n```\nPlease refer to [User Guide: Running Clusters Locally](https://cloudtik.readthedocs.io/en/latest/UserGuide/running-locally.html)\nfor detailed guide for this case.\n\n\n### 3. Authentication to Cloud Providers API\n\nAfter CloudTik is installed on your working machine, you need to configure or log into your Cloud account to \nauthenticate the cloud provider CLI on this machine.\n\n#### AWS\n\nFirst, install AWS CLI (command line interface) on your working machine. Please refer to\n[Installing AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html)\nfor detailed instructions.\n\nAfter AWS CLI is installed, you need to configure AWS CLI about credentials. 
The quickest way to configure it \nis to run `aws configure` command, and you can refer to\n[Managing access keys](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html#Using_CreateAccessKey)\nto get *AWS Access Key ID* and *AWS Secret Access Key*.\n\nMore details for AWS CLI can be found in [AWS CLI Getting Started](https://github.com/aws/aws-cli/tree/v2#getting-started).\n\n#### Azure\n\nAfter CloudTik is installed on your working machine, login to Azure using `az login`.\nRefer to [Sign in with Azure CLI](https://docs.microsoft.com/en-us/cli/azure/authenticate-azure-cli) for more details.\n\n#### GCP\n\nIf you use service account authentication, follow [Creating a service account](https://cloud.google.com/docs/authentication/getting-started#creating_a_service_account)\nto create a service account on Google Cloud. \n\nA JSON file should be safely downloaded to your local computer, and then set the `GOOGLE_APPLICATION_CREDENTIALS` environment\nvariable as described in the [Setting the environment variable](https://cloud.google.com/docs/authentication/getting-started#setting_the_environment_variable)\non your working machine.\n\nIf you are using user account authentication, refer to [User Guide: Login to Cloud](https://cloudtik.readthedocs.io/en/latest/UserGuide/login-to-cloud.html#gcp) for details.\n\n#### Alibaba Cloud\nThe simple way to set up Alibaba Cloud credentials for CloudTik use is\nto export the access key ID and access key secret of your cloud account:\n\n```\nexport ALIBABA_CLOUD_ACCESS_KEY_ID=xxxxxxxxxxxxxxxxxxxxxxxx\nexport ALIBABA_CLOUD_ACCESS_KEY_SECRET=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n```\n\nFor more options of Alibaba Cloud credentials configuration in CloudTik,\nrefer to [User Guide: Login to Cloud](https://cloudtik.readthedocs.io/en/latest/UserGuide/login-to-cloud.html#alibaba-cloud).\n\nNote: please activate OSS through Alibaba Cloud Console before going to the next step.\n\n#### Kubernetes\nIf you are running 
CloudTik on a generic Kubernetes cluster, the authentication setup is simple.\nYou just need to authenticate your kubectl at your working machine to be able to access the Kubernetes cluster.\n\nIf you are running cloud Kubernetes engine (such as AWS EKS, GCP GKE or Azure AKE)\nwith cloud integrations with access to cloud resources such as cloud storage,\nyou need both kubectl authentication to cloud Kubernetes cluster and cloud API credentials configuration above.\n\nFor detailed information of how configure Kubernetes with cloud integrations,\nrefer to [User Guide: Login to Cloud - Kubernetes](https://cloudtik.readthedocs.io/en/latest/UserGuide/login-to-cloud.html#kubernetes)\n\n### 4. Creating a Workspace for Clusters.\nOnce you authenticated with your cloud provider, you can start to create a Workspace.\n\nCloudTik uses **Workspace** concept to easily manage shared Cloud resources such as VPC network resources,\nidentity and role resources, firewall or security groups, and cloud storage resources.\nBy default, CloudTik will create a workspace managed cloud storage\n(S3 for AWS, Data Lake Storage Gen 2 for Azure, GCS for GCP) for use without any user configurations.\n\n**Note: Some resources like NAT gateway or elastic IP resources in Workspace cost money.\nThe price policy may vary among cloud providers.\nPlease check the price policy of the specific cloud provider to avoid undesired cost.**\n\nWithin a workspace, you can start one or more clusters with different combination of runtime services.\n\nCreate a configuration workspace yaml file to specify the unique workspace name, cloud provider type and a few cloud \nprovider properties. 
\n\nTake AWS for example,\n\n```\n# A unique identifier for the workspace.\nworkspace_name: example-workspace\n\n# Cloud-provider specific configuration.\nprovider:\n type: aws\n region: us-west-2\n # Use allowed_ssh_sources to allow SSH access from your client machine\n allowed_ssh_sources:\n - 0.0.0.0/0\n```\n*NOTE:* `0.0.0.0/0` in `allowed_ssh_sources` will allow any IP addresses to connect to your cluster as long as it has the cluster private key.\nFor more security, you need to change from `0.0.0.0/0` to restricted CIDR ranges for your case.\n\nUse the following command to create and provision a Workspace:\n\n```\ncloudtik workspace create /path/to/your-workspace-config.yaml\n```\n\nCheck [Configuration Examples](https://github.com/cloudtik/cloudtik/tree/main/examples/cluster) folder for more Workspace configuration file examples\nfor AWS, Azure, GCP, Kubernetes (AWS EKS or GCP GKE).\n\nIf you encounter problems on creating a Workspace, a common cause is that your current login account\nfor the cloud doesn't have enough privileges to create some resources such as VPC, storages, public ip and so on.\nMake sure your current account have enough privileges. An admin or owner role will give the latest chance to have\nall these privileges.\n\n### 5. Starting a cluster with runtimes\n\nNow you can start a cluster running Spark by default:\n\n```\ncloudtik start /path/to/your-cluster-config.yaml\n```\n\nA typical cluster configuration file is usually very simple thanks to design of CloudTik's templates with inheritance.\n\nTake AWS for example,\n\n```\n# An example of standard 1 + 3 nodes cluster with standard instance type\nfrom: aws/standard\n\n# Workspace into which to launch the cluster\nworkspace_name: example-workspace\n\n# A unique identifier for the cluster.\ncluster_name: example\n\n# Cloud-provider specific configuration.\nprovider:\n type: aws\n region: us-west-2\n\nauth:\n ssh_user: ubuntu\n # Set proxy if you are in corporation network. 
For example,\n # ssh_proxy_command: \"ncat --proxy-type socks5 --proxy your_proxy_host:your_proxy_port %h %p\"\n\navailable_node_types:\n worker.default:\n # The minimum number of worker nodes to launch.\n min_workers: 3\n```\nThis example can be found in CloudTik source code folder `examples/cluster/aws/example-standard.yaml`.\n\nYou need only a few key settings in the configuration file to launch a Spark cluster.\n\nAs for `auth` above, please set proxy if your working node is using corporation network.\n\n```\nauth:\n ssh_user: ubuntu\n ssh_proxy_command: \"ncat --proxy-type socks5 --proxy <your_proxy_host>:<your_proxy_port> %h %p\"\n```\n\nThe cluster key will be created automatically for AWS and GCP if not specified.\nThe created private key file can be found in .ssh folder of your home folder.\nFor Azure, you need to generate an RSA key pair manually (use `ssh-keygen -t rsa -b 4096` to generate a new ssh key pair).\nand configure the public and private key as following,\n\n```\nauth:\n ssh_private_key: ~/.ssh/my_cluster_rsa_key\n ssh_public_key: ~/.ssh/my_cluster_rsa_key.pub\n```\n\nIf you need different runtime components in the cluster,\nin the cluster configuration file, you can set the runtime types. For example,\n```\nruntime:\n types: [spark, ai]\n```\nIt will run a cluster with spark and AI runtimes.\n\nRefer to `examples/cluster` directory for more cluster configurations examples.\n\n### 6. 
Running analytics and AI workloads\n\nOnce the cluster is started, you can run Spark analytics and AI workloads\nwhich are designed to be distributed and large scale in nature.\n\nBelow provides the information of some basic examples to start with.\nAs to running optimized Spark and AI, you can refer to [Running Optimized Analytics with Spark](https://cloudtik.readthedocs.io/en/latest/UserGuide/running-optimized-ai.html)\nand [Running Optimized AI](https://cloudtik.readthedocs.io/en/latest/UserGuide/running-optimized-ai.html) for more information.\n\n#### Running spark PI example\n\nRunning a Spark job is very straight forward. Spark PI job for example,\n\n```\ncloudtik exec ./your-cluster-config.yaml \"spark-submit --master yarn --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.yarn.submit.waitAppCompletion=false \\$SPARK_HOME/examples/jars/spark-examples.jar 12345\" --job-waiter=spark\n```\n\nRefer to [Run Spark PI Example](examples/spark) for more details.\n\n#### Running analytics benchmarks\n\nCloudTik provides ready to use tools for running TPC-DS benchmark\non a CloudTik spark runtime cluster.\n\nRefer to [Run TPC-DS performance benchmark for Spark](tools/benchmarks/spark)\nfor a detailed step-by-step guide.\n\n#### Running machine learning and deep learning examples\n\nCloudTik provides ready to run examples for demonstrating\nhow distributed AI jobs can be implemented in CloudTik Spark and AI runtime cluster.\n\nRefer to [Distributed AI Examples](examples/ai)\nfor a detailed step-by-step guide.\n\n#### Workflow examples\nUser can integrate CloudTik with external workflows using bash scripts or python\nfor running on-demand cluster and jobs.\n\nRefer to [Workflow Integration Examples](examples/workflows) for example scripts.\n\n### 7. 
Managing clusters\n\nCloudTik provides very powerful capability to monitor and manage the cluster.\n\n#### Cluster status and information\n\nUse the following commands to show various cluster information.\n\n```\n# Check cluster status with:\ncloudtik status /path/to/your-cluster-config.yaml\n\n# Show cluster summary information and useful links to connect to cluster web UI.\ncloudtik info /path/to/your-cluster-config.yaml\ncloudtik head-ip /path/to/your-cluster-config.yaml\ncloudtik worker-ips /path/to/your-cluster-config.yaml\n```\n#### Attach to the cluster head (or specific node)\n\nConnect to a terminal of cluster head node.\n\n```\ncloudtik attach /path/to/your-cluster-config.yaml\n```\n\n#### Execute and Submit Jobs\n\nExecute a command via SSH on cluster head node or a specified node.\n\n```\ncloudtik exec /path/to/your-cluster-config.yaml [command]\n```\n\n#### Manage Files\n\nUpload files or directories to cluster.\n\n``` \ncloudtik rsync-up /path/to/your-cluster-config.yaml [source] [target]\n```\n \nDownload files or directories from cluster.\n\n```\ncloudtik rsync-down /path/to/your-cluster-config.yaml [source] [target]\n```\n\n### 8. Tearing Down\n\n#### Terminate a Cluster\n\nStop and delete the cluster.\n\n```\ncloudtik stop /path/to/your-cluster-config.yaml\n```\n\n#### Delete the Workspace\n\nDelete the workspace and all the network resources within it.\n\n```\ncloudtik workspace delete /path/to/your-workspace-config.yaml\n```\nBe default, the managed cloud storage will not be deleted.\nAdd --delete-managed-storage option to force deletion of manged cloud storage.\n\nFor more information as to the commands, you can use `cloudtik --help` or `cloudtik [command] --help` to get detailed instructions.\n",
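The on-demand workflow pattern described under "Workflow examples" (start a cluster, run a job, tear the cluster down) can be sketched as a bash script. This is a minimal illustrative sketch, not one of the bundled examples: the `-y` (skip confirmation) flag and the `DRY_RUN` switch are assumptions added here for safe demonstration.

```shell
#!/usr/bin/env bash
# Minimal on-demand workflow sketch: start a cluster, run a job, tear it down.
set -euo pipefail

CLUSTER_CONFIG="${CLUSTER_CONFIG:-./your-cluster-config.yaml}"

# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0
# to execute them against a real cluster.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# 1. Start the cluster (-y is assumed to skip the interactive confirmation).
run cloudtik start "$CLUSTER_CONFIG" -y

# 2. Run the job; --job-waiter=spark blocks until the Spark application completes.
run cloudtik exec "$CLUSTER_CONFIG" \
    "spark-submit --master yarn --deploy-mode cluster --name spark-pi \
     --class org.apache.spark.examples.SparkPi \
     \$SPARK_HOME/examples/jars/spark-examples.jar 1000" \
    --job-waiter=spark

# 3. Tear the cluster down once the job is done.
run cloudtik stop "$CLUSTER_CONFIG" -y
```

Defaulting to dry-run keeps the sketch safe to copy; flip `DRY_RUN` to 0 once the cluster config points at a real workspace.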