Google Cloud Platform (GCP) has two important platforms for AI jobs. One is ai-platform, the traditional training service; the other is the newer Vertex AI, a managed platform where AutoML can complete training jobs for you automatically.

If you want more control over the training process, such as the algorithm and the cost, you would choose ai-platform. Vertex AI is smart, but it's expensive.

First, let's start up a GCP VM instance with Terraform. The job needs a fair amount of memory, so I chose the "e2-medium" instance type. Here is my main.tf for Terraform.

 terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "3.5.0"
    }
  }
}

provider "google" {
  credentials = file("/home/xxx/terraform/gcp/hostcache.json")

  project = "blissful-canyon-xxx"
  region  = "us-west1"
  zone    = "us-west1-b"
}

resource "google_compute_instance" "vm_instance" {
  name         = "terraform-instance"
  machine_type = "e2-medium"

  boot_disk {
    initialize_params {
      image = "ubuntu-os-cloud/ubuntu-2004-lts"
    }
  }

  network_interface {
    network = "default"
    access_config {
    }
  }
}

Next, run the following commands to initialize the working directory (which downloads the Google provider plugin), validate the configuration, start up the instance, and check the results.

 $ terraform init
$ terraform validate
$ terraform apply
$ terraform show

After that, you can log into the instance via the gcloud CLI already set up on your own PC.

 $ gcloud compute ssh terraform-instance
Welcome to Ubuntu 20.04.5 LTS (GNU/Linux 5.15.0-1025-gcp x86_64)

I got an instance with 2 AMD cores and 4 GB of memory, as shown in the following info.

 Basic System Information:
---------------------------------
Uptime : 0 days, 0 hours, 20 minutes
Processor : AMD EPYC 7B12
CPU cores : 2 @ 2249.998 MHz
AES-NI : ✔ Enabled
VM-x/AMD-V : ❌ Disabled
RAM : 3.8 GiB
Swap : 0.0 KiB
Disk : 9.6 GiB
Distro : Ubuntu 20.04.5 LTS
Kernel : 5.15.0-1025-gcp

After logging in, run the following commands to update the system.

 $ sudo apt update
$ sudo apt upgrade

We are using the Ubuntu 20.04 OS, which has Python 3 installed by default. We want to install pip3 as well.

 $ sudo apt install python3-pip

Now, let's install tensorflow and other packages for machine learning.

 $ sudo pip3 install tensorflow
$ sudo pip3 install pandas
$ sudo pip3 install scikit-learn
$ sudo pip3 install google-cloud-storage
$ sudo apt install graphviz
$ sudo pip3 install pydot
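
Before moving on, you can run a quick sanity check (optional, not part of the original steps) to confirm the main package imports cleanly; it prints the installed TensorFlow version:

 $ python3 -c 'import tensorflow as tf; print(tf.__version__)'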

In order to submit training jobs to ai-platform, we need to install the gcloud CLI on this newly created instance. Please see the following doc for how to install it.

Install the gcloud CLI

After the gcloud CLI is installed, run the following command to authorize it with GCP.

 $ gcloud init
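
To double-check that the CLI is now pointing at the right account and project, you can print the active configuration (an optional verification step):

 $ gcloud config list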

Then, run the following command to authorize the client libraries with GCP as well.

 $ gcloud auth application-default login

You will see output like this:

 Credentials saved to file: [/home/xxx/.config/gcloud/application_default_credentials.json]

These credentials will be used by any library that requests Application Default Credentials (ADC).

Quota project "blissful-canyon-xxx" was added to ADC which can be used by Google client libraries for billing and quota. Note that some services may still bill the project owning the resource.
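
To verify that ADC is in place, this optional check prints an access token when the stored credentials are valid:

 $ gcloud auth application-default print-access-token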

You also have to create a storage bucket in GCP Cloud Storage. I created one in the web console with the name "mljobs".
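
If you'd rather stay on the command line, the bucket can also be created with gsutil. This is a minimal sketch assuming the us-west1 region from the Terraform config; note that bucket names are globally unique, so "mljobs" may be taken and you might need a name of your own:

 $ gsutil mb -l us-west1 gs://mljobs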

Next, download the dataset and upload it to Cloud Storage, as follows.

 $ wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
$ mv iris.data iris.csv
$ gsutil cp iris.csv gs://mljobs
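
Before moving on, it's worth a quick look at the data. The file has no header row, which is why the training script later passes explicit column names to read_csv; the first few lines should look like this:

 $ head -3 iris.csv
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa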

Create the project dir "iris" and put a source file in the subdir "src".

 $ mkdir -p iris/src
$ cd iris/src
$ vi train.py

What's the code in train.py? It's just a simple implementation of the SVM algorithm.

 import pandas as pd
from sklearn import svm
import joblib
from google.cloud import storage
import sklearn

print('sklearn: {}'.format(sklearn.__version__))

# Create a Cloud Storage client to download the data and upload the model
storage_client = storage.Client()

# Download the training data from the bucket
public_bucket = storage_client.bucket('mljobs')
blob = public_bucket.blob('iris.csv')
blob.download_to_filename('iris.csv')

# Read the training data from the file; the CSV has no header row
iris_data = pd.read_csv('./iris.csv', sep=',',
                        names=["sepal_length", "sepal_width",
                               "petal_length", "petal_width", "species"])

# Separate the target variable (the class labels) from the features
iris_label = iris_data.pop('species')

# We're using SVC (support vector classifier), an SVM implementation
classifier = svm.SVC(gamma='auto')

# Train the model
classifier.fit(iris_data, iris_label)

# Save the model locally
model_filename = 'model.joblib'
joblib.dump(classifier, model_filename)

# Upload the model to the Cloud Storage bucket
bucket = storage_client.bucket('mljobs')
blob = bucket.blob(model_filename)
blob.upload_from_filename(model_filename)

Now run the script train.py locally to check whether it has any errors.

 $ python3 train.py 
sklearn: 1.2.0

This script generates a model file named "model.joblib" and uploads it to the Cloud Storage bucket.
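
You can confirm the upload with a quick listing (an optional check); both iris.csv and model.joblib should now appear in the bucket:

 $ gsutil ls gs://mljobs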

In the source dir, remove the newly generated files and create an empty file "__init__.py", which makes "src" a Python package so that ai-platform can import it as a module.

 $ rm -f iris.csv model.joblib 
$ touch __init__.py

Next, go back to the project dir "iris" and create the bash script "submit.sh", whose content is as follows.

 #!/bin/bash

gcloud ai-platform jobs submit training iris_0001 \
    --module-name=src.train \
    --package-path=./src \
    --staging-bucket=gs://mljobs \
    --region=$(gcloud config get compute/region) \
    --scale-tier=CUSTOM \
    --master-machine-type=n1-standard-8 \
    --python-version=3.7 \
    --runtime-version=2.8

Please note that you have to use runtime version 2.8 (TensorFlow 2.8) and Python 3.7 here, because ai-platform doesn't support higher versions of them yet.
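
One caveat: job IDs must be unique within a project, so resubmitting with the same name will fail. A common pattern, shown here as an optional variant of submit.sh rather than part of the original setup, is to derive the ID from a timestamp. For the rest of this walkthrough I stick with the fixed name iris_0001.

 #!/bin/bash

# Derive a unique job ID from the current timestamp, e.g. iris_20221219_030144
JOB_NAME="iris_$(date +%Y%m%d_%H%M%S)"

gcloud ai-platform jobs submit training $JOB_NAME \
    --module-name=src.train \
    --package-path=./src \
    --staging-bucket=gs://mljobs \
    --region=$(gcloud config get compute/region) \
    --scale-tier=CUSTOM \
    --master-machine-type=n1-standard-8 \
    --python-version=3.7 \
    --runtime-version=2.8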

Run this script and you will see the output as follows.

 $ ./submit.sh 
Job [iris_0001] submitted successfully.
Your job is still active. You may view the status of your job with the command

$ gcloud ai-platform jobs describe iris_0001

or continue streaming the logs with the command

$ gcloud ai-platform jobs stream-logs iris_0001
jobId: iris_0001
state: QUEUED

It says that you have submitted the training job successfully. You can run the commands shown in the output to check the job's status and logs. For instance, streaming the logs gives me:

 $ gcloud ai-platform jobs stream-logs iris_0001

INFO 2022-12-19 03:01:44 +0000 service Validating job requirements...
INFO 2022-12-19 03:01:44 +0000 service Job creation request has been successfully validated.
INFO 2022-12-19 03:01:44 +0000 service Waiting for job to be provisioned.
INFO 2022-12-19 03:01:44 +0000 service Job iris_0001 is queued.
INFO 2022-12-19 03:01:48 +0000 service Waiting for training program to start.
NOTICE 2022-12-19 03:02:19 +0000 master-replica-0.gcsfuse Opening GCS connection...
NOTICE 2022-12-19 03:02:19 +0000 master-replica-0.gcsfuse Mounting file system "gcsfuse"...
NOTICE 2022-12-19 03:02:19 +0000 master-replica-0.gcsfuse File system has been successfully mounted.
INFO 2022-12-19 03:03:56 +0000 master-replica-0 Running task with arguments: --cluster={"chief": ["127.0.0.1:2222"]} --task={"type": "chief", "index": 0} --job={ "scale_tier": "CUSTOM", "master_type": "n1-standard-8", "package_uris": ["gs://mljobs/iris_0001/4bb82b2c99386ecd36cb90e0281f24d6d419843e5fb3cb98c5b00ac0d92dfcb5/src-0.0.0.tar.gz"], "python_module": "src.train", "region": "us-west1", "runtime_version": "2.8", "run_on_raw_vm": true, "python_version": "3.7"}
INFO 2022-12-19 03:04:05 +0000 master-replica-0 Running module src.train.
INFO 2022-12-19 03:04:05 +0000 master-replica-0 Downloading the package: gs://mljobs/iris_0001/4bb82b2c99386ecd36cb90e0281f24d6d419843e5fb3cb98c5b00ac0d92dfcb5/src-0.0.0.tar.gz
INFO 2022-12-19 03:04:05 +0000 master-replica-0 Running command: gsutil -q cp gs://mljobs/iris_0001/4bb82b2c99386ecd36cb90e0281f24d6d419843e5fb3cb98c5b00ac0d92dfcb5/src-0.0.0.tar.gz src-0.0.0.tar.gz

...

INFO 2022-12-19 03:04:12 +0000 master-replica-0 Running command: python3 -m src.train
INFO 2022-12-19 03:04:14 +0000 master-replica-0 sklearn: 1.0.2
INFO 2022-12-19 03:04:14 +0000 master-replica-0 Module completed; cleaning up.
INFO 2022-12-19 03:04:14 +0000 master-replica-0 Clean up finished.
INFO 2022-12-19 03:04:14 +0000 master-replica-0 Task completed successfully.

And check the job's status,

 $ gcloud ai-platform jobs describe iris_0001
createTime: '2022-12-19T03:01:44Z'
endTime: '2022-12-19T03:10:03Z'
etag: 4gDNLPs20qk=
jobId: iris_0001
jobPosition: '0'
startTime: '2022-12-19T03:04:31Z'
state: SUCCEEDED
trainingInput:
  masterType: n1-standard-8
  packageUris:
  - gs://mljobs/iris_0001/4bb82b2c99386ecd36cb90e0281f24d6d419843e5fb3cb98c5b00ac0d92dfcb5/src-0.0.0.tar.gz
  pythonModule: src.train
  pythonVersion: '3.7'
  region: us-west1
  runtimeVersion: '2.8'
  scaleTier: CUSTOM
trainingOutput:
  consumedMLUnits: 0.13

And you can check the job's results in the web console.

AI Platform Jobs

You now have a training job running well on GCP ai-platform.

Alternatively, you can use GCP Vertex AI, which makes training an AI job much simpler. With its AutoML capability, you don't even need to write a single line of code. But Vertex AI tends to be more expensive than ai-platform, since you can't customize and optimize your code, and you can't control the training time, hardware specs, etc.
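
Finally, once you're done experimenting, remember to tear down the VM so it doesn't keep accruing charges. Assuming you're back in the Terraform working directory from the beginning, one command is enough:

 $ terraform destroy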