This sample demonstrates how to start training jobs with the Amazon SageMaker Operator for Kubernetes, using your own training script packaged in a SageMaker-compatible container.
This sample assumes that you have already configured an EKS cluster with the operator. It also assumes that you have installed kubectl; you can find a link on our installation page.
To follow this sample, you must first create a training script and package it, using a Dockerfile, into a container image that is compatible with Amazon SageMaker. The Distributed Mask R-CNN sample, published by the SageMaker team, contains a predefined training script and helper bash scripts for reference.
All SageMaker training jobs run inside a container that has all necessary dependencies and modules pre-installed, and whose training scripts reference the accepted input and output directories. This container image should be uploaded to an ECR repository accessible from within your AWS account. Once it is uploaded correctly, you should have a repository URL and tag associated with the container image. You will need both in the next step.
A container image URL and tag have the following structure:
<account number>.dkr.ecr.<region>.amazonaws.com/<image name>:<tag>
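As a minimal sketch of the structure above, the following shell helpers compose an image URL from its parts and show one way to build and push the image. The function names ecr_image_uri and push_training_image are hypothetical, the account number, region, image name, and tag are placeholder values, and the sketch assumes that the AWS CLI and Docker are installed and that the ECR repository already exists.

```shell
# Hypothetical helper: compose an ECR image URL from its parts.
# $1 = AWS account number, $2 = region, $3 = image name, $4 = tag
ecr_image_uri() {
  printf '%s.dkr.ecr.%s.amazonaws.com/%s:%s\n' "$1" "$2" "$3" "$4"
}

# Hypothetical helper: log in to ECR, then build and push the image.
# $5 = Docker build context (defaults to the current directory).
# Assumes the ECR repository named by $3 already exists.
push_training_image() {
  registry="$1.dkr.ecr.$2.amazonaws.com"
  uri="$(ecr_image_uri "$1" "$2" "$3" "$4")"
  aws ecr get-login-password --region "$2" |
    docker login --username AWS --password-stdin "$registry" &&
    docker build -t "$uri" "${5:-.}" &&
    docker push "$uri"
}

# Placeholder values for illustration only; substitute your own.
IMAGE_URI="$(ecr_image_uri 123456789012 us-west-2 my-training-image latest)"
echo "$IMAGE_URI"

# Example invocation (not run here):
#   push_training_image 123456789012 us-west-2 my-training-image latest .
```

The composed value is what the training job specification expects as its training image.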
In the my-training-job.yaml file, replace the placeholder values with those associated with your account and training job. The spec.algorithmSpecification.trainingImage field should be the container image from the previous step. The spec.roleArn field should be the ARN of an IAM role which has permissions to access your S3 resources. If you have not yet created a role with these permissions, you can find an example policy at Amazon SageMaker Roles.
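As a quick sanity check before submitting, a sketch like the following prints the two fields discussed above so you can confirm that the placeholders were replaced. The helper name show_job_fields is hypothetical, and the sketch assumes each field appears in the file as a plain `key: value` line.

```shell
# Hypothetical helper: print the trainingImage and roleArn lines
# (with line numbers) from a training job specification.
# $1 = path to the specification (defaults to my-training-job.yaml)
show_job_fields() {
  grep -nE 'trainingImage:|roleArn:' "${1:-my-training-job.yaml}"
}

# Example invocation (not run here):
#   show_job_fields my-training-job.yaml
```

If either line is missing from the output, or still shows a placeholder, edit the file again before applying it.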
To submit your prepared training job specification, apply it to your EKS cluster as follows:
$ kubectl apply -f my-training-job.yaml
trainingjob.sagemaker.aws.amazon.com/my-training-job created
To monitor the training job once it has started, you can view its full status and any errors with the following command:
$ kubectl describe trainingjob my-training-job
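If you would rather wait for the job to finish than re-run the command by hand, a polling loop is one option. The sketch below is not part of the operator. The function name wait_for_training_job is hypothetical, and it assumes the operator exposes the SageMaker job status in the resource's status.trainingJobStatus field (shown as "Training Job Status" in the describe output), so check that against your own cluster first.

```shell
# Allow the kubectl binary to be overridden; defaults to kubectl on PATH.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: poll a TrainingJob until it reaches a terminal state.
# $1 = training job name, $2 = seconds between polls (default 30)
# Returns 0 on Completed, 1 on Failed or Stopped.
wait_for_training_job() {
  job_name="$1"
  interval="${2:-30}"
  while true; do
    # Assumed field path; adjust if your operator version reports it elsewhere.
    status="$("$KUBECTL" get trainingjob "$job_name" \
      -o jsonpath='{.status.trainingJobStatus}')"
    echo "trainingjob/$job_name: ${status:-<no status yet>}"
    case "$status" in
      Completed) return 0 ;;
      Failed|Stopped) return 1 ;;
    esac
    sleep "$interval"
  done
}

# Example invocation (not run here):
#   wait_for_training_job my-training-job 60
```

When the loop exits with a failure, kubectl describe trainingjob remains the place to look for the detailed error.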