Model Training and Inference on SageMaker
GraphStorm can run on Amazon SageMaker to leverage SageMaker's ML DevOps capabilities.
Prerequisites
In order to use GraphStorm on Amazon SageMaker, users need access to the following AWS services.
SageMaker service. Please refer to Amazon SageMaker service for how to get access to Amazon SageMaker.
Amazon ECR. Please refer to Amazon Elastic Container Registry service for how to get access to Amazon ECR.
S3 service. Please refer to Amazon S3 service for how to get access to Amazon S3.
SageMaker Framework Containers. Please follow AWS Deep Learning Containers guideline to get access to the image.
Amazon EC2 (optional). Please refer to Amazon EC2 service for how to get access to Amazon EC2.
Setup GraphStorm SageMaker Docker Image
GraphStorm uses SageMaker's BYOC (Bring Your Own Container) mode. Therefore, before launching GraphStorm on SageMaker, there are two steps required to set up a GraphStorm SageMaker Docker image.
Building and pushing a SageMaker image uses the same scripts as building a local image, described in the GraphStorm Docker build instructions.
Your executing role should have full ECR access to be able to pull the base image from ECR during the build, create an ECR repository if one doesn't exist, and push the GraphStorm image to the repository. See the [official ECR docs](https://docs.aws.amazon.com/AmazonECR/latest/userguide/image-push-iam.html) for details.
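For example, assuming your executing role is named GraphStormBuildRole (a hypothetical name; a narrower custom policy also works), one way to grant this access is attaching AWS's managed ECR full-access policy:

aws iam attach-role-policy \
    --role-name GraphStormBuildRole \
    --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess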
In short you can run the following:
cd graphstorm/
bash docker/build_graphstorm_image.sh --environment sagemaker
bash docker/push_graphstorm_image.sh --environment sagemaker --region "us-east-1" --account "123456789012"
# Will push an image to '123456789012.dkr.ecr.us-east-1.amazonaws.com/graphstorm:sagemaker-gpu'
See bash docker/build_graphstorm_image.sh --help and bash docker/push_graphstorm_image.sh --help for more build and push options.
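After the push completes, you can optionally confirm the image landed in ECR; the repository name and region below match the example above:

aws ecr describe-images \
    --repository-name graphstorm \
    --region us-east-1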
Run GraphStorm on SageMaker
To run GraphStorm with the Amazon SageMaker service, users should set up a local Python environment with the SageMaker library installed, along with GraphStorm's SageMaker helper scripts.
Use the below command to install SageMaker.
pip install --upgrade sagemaker
Clone the GraphStorm repository using the following command:
git clone https://github.com/awslabs/graphstorm.git
# Change to the GraphStorm directory
cd graphstorm
For the remainder of this guide we assume the starting working directory is the root of the GraphStorm repository.
Prepare graph data
Unlike GraphStorm's Standalone and Distributed modes, which rely on a local disk or shared file system to store the partitioned graph, SageMaker uses Amazon S3 as the shared data storage for distributing the partitioned graph and the configuration YAML file.
This tutorial uses the same three-partition OGB-MAG graph and Link Prediction task as introduced in the Partition a Graph section of the Use GraphStorm in a Distributed Cluster tutorial. After generating the partitioned OGB-MAG graph, use the following commands to upload it and the configuration YAML file to an S3 bucket.
aws s3 cp --recursive /data/ogbn_mag_lp_3p s3://<PATH_TO_DATA>/ogbn_mag_lp_3p
aws s3 cp /graphstorm/training_scripts/gsgnn_lp/mag_lp.yaml s3://<PATH_TO_TRAINING_CONFIG>/mag_lp.yaml
Please replace <PATH_TO_DATA> and <PATH_TO_TRAINING_CONFIG> with your own S3 bucket URIs.
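To verify the uploads, you can list the destination paths:

aws s3 ls s3://<PATH_TO_DATA>/ogbn_mag_lp_3p/
aws s3 ls s3://<PATH_TO_TRAINING_CONFIG>/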
Launch training
Launching GraphStorm training on SageMaker is similar to launching it in the Standalone mode and the Distributed mode, except for three differences:

The launch commands are located in the graphstorm/sagemaker folder.

Users need to provide AWS service-related information in the command.

All paths for saving models, embeddings, and prediction results should be specified as S3 locations using the S3-related arguments.
Users can use the following command to launch a GraphStorm Link Prediction training job with the OGB-MAG graph.
cd /path-to-graphstorm/sagemaker/
python3 launch/launch_train.py \
--image-url <AMAZON_ECR_IMAGE_URI> \
--region <REGION> \
--entry-point run/train_entry.py \
--role <ROLE_ARN> \
--instance-count 3 \
--graph-data-s3 s3://<PATH_TO_DATA>/ogbn_mag_lp_3p \
--yaml-s3 s3://<PATH_TO_TRAINING_CONFIG>/mag_lp.yaml \
--model-artifact-s3 s3://<PATH_TO_SAVE_TRAINED_MODEL>/ \
--graph-name ogbn-mag \
--task-type link_prediction \
--lp-decoder-type dot_product \
--num-layers 1 \
--fanout 10 \
--hidden-size 128 \
--backend gloo \
--batch-size 128
Please replace <AMAZON_ECR_IMAGE_URI> with the <IMAGE_NAME>:<IMAGE_TAG> you pushed to ECR during setup, e.g., 888888888888.dkr.ecr.us-east-1.amazonaws.com/graphstorm:sagemaker-gpu; replace <REGION> with the region where the ECR image repository is located, e.g., us-east-1; and replace <ROLE_ARN> with the ARN of an IAM role in your AWS account that has SageMaker execution permissions, e.g., "arn:aws:iam::<ACCOUNT_ID>:role/service-role/AmazonSageMaker-ExecutionRole-20220627T143571".

Because we are using a three-partition OGB-MAG graph, we need to set --instance-count to 3 in this command.
The trained model artifacts will be stored in the S3 location provided through the --model-artifact-s3 argument. You can use the following command to check the model artifacts after the training completes:

aws s3 ls s3://<PATH_TO_SAVE_TRAINED_MODEL>/

If you want to resume from a saved model checkpoint for fine-tuning, you can pass the S3 address of the checkpoint through the --model-checkpoint-to-load argument. For example, by passing --model-checkpoint-to-load s3://mag-model/epoch-2/, GraphStorm will initialize the model parameters from the checkpoint stored in s3://mag-model/epoch-2/.
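For example, a fine-tuning run that resumes from the illustrative checkpoint above would reuse the training command and add one flag:

python3 launch/launch_train.py \
    <other arguments as above> \
    --model-checkpoint-to-load s3://mag-model/epoch-2/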
Launch inference
Users can use the following command to launch a GraphStorm Link Prediction inference job on the OGB-MAG graph.
python3 launch/launch_infer.py \
--image-url <AMAZON_ECR_IMAGE_URI> \
--region <REGION> \
--entry-point run/infer_entry.py \
--role <ROLE_ARN> \
--instance-count 3 \
--graph-data-s3 s3://<PATH_TO_DATA>/ogbn_mag_lp_3p \
--yaml-s3 s3://<PATH_TO_TRAINING_CONFIG>/mag_lp.yaml \
--model-artifact-s3 s3://<PATH_TO_SAVE_TRAINED_MODEL>/ \
--raw-node-mappings-s3 s3://<PATH_TO_DATA>/ogbn_mag_lp_3p/raw_id_mappings \
--output-emb-s3 s3://<PATH_TO_SAVE_GENERATED_NODE_EMBEDDING>/ \
--output-prediction-s3 s3://<PATH_TO_SAVE_PREDICTION_RESULTS> \
--graph-name ogbn-mag \
--task-type link_prediction \
--num-layers 1 \
--fanout 10 \
--hidden-size 128 \
--backend gloo \
--batch-size 128
Note
Different from the training command's arguments, in the inference command the value of the --model-artifact-s3 argument needs to be the path to a saved model. By default, trained model artifacts are saved under an S3 path that includes a specific training epoch, or epoch plus iteration number, e.g., s3://models/epoch-0-iter-999.

If --raw-node-mappings-s3 is not provided, it defaults to {graph-data-s3}/raw_id_mappings. The expected graph mapping files are node_mapping.pt, edge_mapping.pt, and Parquet files under raw_id_mappings. They record the mapping between the original node and edge IDs in the raw data files and the IDs of nodes and edges in the graph node ID space. These files are created during graph construction by either GConstruct or GSProcessing.
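For example, you can list the checkpoints that training produced and pass a specific one to the inference command; the epoch names below are illustrative:

aws s3 ls s3://<PATH_TO_SAVE_TRAINED_MODEL>/
#    PRE epoch-0/
#    PRE epoch-1/
#    PRE epoch-2/
# Then pass, e.g., --model-artifact-s3 s3://<PATH_TO_SAVE_TRAINED_MODEL>/epoch-2/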
As outputs of the inference command, the generated node embeddings will be uploaded to s3://<PATH_TO_SAVE_GENERATED_NODE_EMBEDDING>/. For node classification/regression or edge classification/regression tasks, users can use --output-prediction-s3 to specify the save locations of prediction results.
Users can use the following commands to check the corresponding outputs:
aws s3 ls s3://<PATH_TO_SAVE_GENERATED_NODE_EMBEDDING>/
aws s3 ls s3://<PATH_TO_SAVE_PREDICTION_RESULTS>/
Launch embedding generation task
Users can use the following example command to launch a GraphStorm embedding generation job on the ogbn-mag data without generating predictions.
python3 launch/launch_infer.py \
--image-url <AMAZON_ECR_IMAGE_URI> \
--region <REGION> \
--entry-point run/infer_entry.py \
--role <ROLE_ARN> \
--instance-count 3 \
--graph-data-s3 s3://<PATH_TO_DATA>/ogbn_mag_lp_3p \
--yaml-s3 s3://<PATH_TO_TRAINING_CONFIG>/mag_lp.yaml \
--model-artifact-s3 s3://<PATH_TO_SAVE_TRAINED_MODEL>/ \
--raw-node-mappings-s3 s3://<PATH_TO_DATA>/ogbn_mag_lp_3p/raw_id_mappings \
--task-type compute_emb \
--output-emb-s3 s3://<PATH_TO_SAVE_GENERATED_NODE_EMBEDDING>/ \
--graph-name ogbn-mag \
--restore-model-layers embed,gnn
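As with the inference job, you can verify the generated embeddings once the job completes:

aws s3 ls s3://<PATH_TO_SAVE_GENERATED_NODE_EMBEDDING>/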
Launch graph partitioning task
If your data are in the DGL chunked format, you can perform distributed partitioning using SageMaker to prepare your data for distributed training.
python launch/launch_partition.py \
--graph-data-s3 ${DATASET_S3_PATH} \
--num-parts ${NUM_PARTITIONS} \
--instance-count ${NUM_PARTITIONS} \
--output-data-s3 ${OUTPUT_PATH} \
--instance-type ${INSTANCE_TYPE} \
--image-url ${IMAGE_URI} \
--region ${REGION} \
--role ${ROLE} \
--entry-point "run/partition_entry.py" \
--metadata-filename ${METADATA_FILE} \
--log-level INFO \
--partition-algorithm ${ALGORITHM}
Running the above will take the dataset in chunked format from ${DATASET_S3_PATH} as input and create a DistDGL graph with ${NUM_PARTITIONS} partitions under the output path ${OUTPUT_PATH}. Currently we only support random as the partitioning algorithm.
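For illustration, the shell variables used above might be set as follows; all values are placeholders to adapt to your own account and data:

DATASET_S3_PATH="s3://my-bucket/chunked-graph/"
NUM_PARTITIONS=4
OUTPUT_PATH="s3://my-bucket/partitioned-graph/"
INSTANCE_TYPE="ml.r5.4xlarge"
IMAGE_URI="123456789012.dkr.ecr.us-east-1.amazonaws.com/graphstorm:sagemaker-cpu"
REGION="us-east-1"
ROLE="arn:aws:iam::123456789012:role/SageMakerRole"
METADATA_FILE="metadata.json"
ALGORITHM="random"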
Launch hyper-parameter optimization task
GraphStorm supports automatic model tuning with SageMaker AI, which allows you to optimize the hyper-parameters of your model with an easy-to-use interface.
The sagemaker/launch/launch_hyperparameter_tuning.py script can act as a thin wrapper for SageMaker's HyperparameterTuner. You define the hyper-parameters of interest by passing a filepath to a JSON file, or a Python dictionary as a string, where the structure of the dictionary is the same as for SageMaker's [Dynamic hyperparameters](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-define-ranges.html#automatic-model-tuning-define-ranges-dynamic). For example, your JSON file can look like:
# Content of my_param_ranges.json
{
"ParameterRanges": {
"CategoricalParameterRanges": [
{
"Name": "model_encoder_type",
"Values": ["rgcn", "hgt"]
}
],
"ContinuousParameterRanges": [
{
"Name": "lr",
"MinValue": "1e-5",
"MaxValue" : "1e-2",
"ScalingType": "Auto"
}
],
"IntegerParameterRanges": [
{
"Name": "hidden_size",
"MinValue": "64",
"MaxValue": "256",
"ScalingType": "Auto"
}
]
}
}
You can then use this file to launch an HPO job:
# Example hyper-parameter ranges
python launch/launch_hyperparameter_tuning.py \
--hyperparameter-ranges my_param_ranges.json
# Other launch parameters...
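Alternatively, you can pass the same ranges inline as a Python dictionary string instead of a file path; the ranges below are shortened for illustration:

python launch/launch_hyperparameter_tuning.py \
    --hyperparameter-ranges '{"ParameterRanges": {"ContinuousParameterRanges": [{"Name": "lr", "MinValue": "1e-5", "MaxValue": "1e-2"}]}}'
    # Other launch parameters...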
For continuous and integer parameters you can provide a ScalingType string that directly corresponds to one of SageMaker's scaling types. By default, the scaling type is 'Auto'.
Use --metric-name to provide the name of a GraphStorm metric to use as the tuning objective, e.g., "accuracy". See the entry for eval_metric in Evaluation Metrics for a full list of supported metrics. --eval-mask defines which dataset to collect metrics from; it can be either "test" or "val" to collect metrics from the test or validation set, respectively. Finally, use --objective-type to set the type of the objective, which can be either "Maximize" or "Minimize". See the SageMaker documentation for more details.
You can also use --strategy to select the optimization strategy from one of "Bayesian", "Random", "Hyperband", or "Grid". See the SageMaker documentation for more details on each strategy.
To use the Hyperband strategy, you should provide --hb-max-epochs and --hb-min-epochs to the launch script to determine the maximum and minimum resource allocation (in number of epochs) per job. See the SageMaker HPO user guide and the Hyperband configuration docs for details.
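For example, a Hyperband launch might look like the following; the epoch bounds are illustrative:

python launch/launch_hyperparameter_tuning.py \
    --strategy Hyperband \
    --hb-max-epochs 30 \
    --hb-min-epochs 1
    # Other launch parameters...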
Example HPO call:
python launch/launch_hyperparameter_tuning.py \
--task-name my-gnn-hpo-job \
--role arn:aws:iam::123456789012:role/SageMakerRole \
--region us-west-2 \
--image-url 123456789012.dkr.ecr.us-west-2.amazonaws.com/graphstorm:sagemaker-gpu \
--graph-name my-graph \
--task-type node_classification \
--graph-data-s3 s3://my-bucket/graph-data/ \
--yaml-s3 s3://my-bucket/train.yaml \
--model-artifact-s3 s3://my-bucket/model-artifacts/ \
--max-jobs 20 \
--max-parallel-jobs 4 \
--hyperparameter-ranges my_param_ranges.json \
--metric-name "accuracy" \
--eval-mask "val" \
--objective-type "Maximize" \
--strategy "Bayesian"
Passing additional arguments to the SageMaker Estimator
Sometimes you might want to pass additional arguments to the constructor of the SageMaker Estimator/Processor object that we use to launch SageMaker tasks, e.g., to set a maximum runtime or a VPC configuration. Our launch scripts support forwarding arguments to the base class object through a kwargs dictionary.

To pass additional kwargs directly to the Estimator/Processor constructor, you can use the --sm-estimator-parameters argument, providing a string of space-separated arguments (enclosed in double quotes " to ensure correct parsing) in the format <argname>=<value> for each argument. <argname> needs to be a valid SageMaker Estimator/Processor argument name, and <value> is a value that can be parsed as a Python literal, without spaces.
For example, to set a specific max runtime and subnet list, and to enable inter-container traffic encryption for a train, inference, or partition job, you'd use:
python3 launch/launch_[infer|train|partition] \
<other arguments> \
--sm-estimator-parameters "max_run=3600 volume_size=100 encrypt_inter_container_traffic=True subnets=['subnet-1234','subnet-4567']"
Notice how we don't include any spaces in ['subnet-1234','subnet-4567'] to ensure correct parsing of the list.
The train, inference, and partition scripts launch SageMaker Training jobs that rely on the Estimator base class; for a full list of Estimator parameters, see the SageMaker Estimator documentation.
The GConstruct job will launch a SageMaker Processing job that relies on the Processor base class, so its arguments differ, e.g., volume_size_in_gb for the Processor vs. volume_size for the Estimator. For a full list of Processor parameters, see the SageMaker Processor documentation.

Using Processor arguments, the above example would become:
python3 launch/launch_gconstruct \
<other arguments> \
--sm-estimator-parameters "max_runtime_in_seconds=3600 volume_size_in_gb=100"
Run GraphStorm SageMaker jobs locally
You can use SageMaker’s local mode to test your SageMaker jobs locally before launching large-scale jobs, to ensure your configuration or other changes are correct.
First, you need to ensure your SageMaker installation has the necessary dependencies. SageMaker Local Mode requires Docker Compose and a SageMaker Python SDK installation with local extras:
pip install 'sagemaker[local]' --upgrade
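You can verify that Docker Compose is available on your machine with:

docker compose version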
When launching your SageMaker job, use local as the instance type:
python3 launch/launch_[infer|train|partition] \
<other arguments> \
--instance-type "local"
This command will launch the GraphStorm job locally by spinning up local Docker containers using Docker Compose. See the SageMaker local mode documentation for more information, or the local pipeline examples in the SageMaker SDK.
Note
If you encounter a bus error during training, try increasing the shared memory size (shm_size) assigned to your local container. See the SageMaker Local mode configuration docs for instructions on how to do so.

If running on EC2 and you would like to use the session credentials instead of EC2 Metadata Service credentials, set the environment variable USE_SHORT_LIVED_CREDENTIALS=1.

To run GPU jobs locally, your host instance will need to have a GPU available and CUDA installed.
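As a sketch: the config file location and keys below follow the SageMaker Local mode configuration docs, and local_gpu is the SageMaker SDK's local GPU instance type, assuming the launch scripts forward it to the Estimator as-is:

# Increase shared memory for local containers; file location and keys follow
# the SageMaker local mode configuration docs, adjust the size as needed
cat > ~/.sagemaker/config.yaml <<'EOF'
local:
  container_config:
    shm_size: "2G"
EOF

# Local GPU run (requires a GPU and CUDA on the host)
python3 launch/launch_train.py \
    <other arguments> \
    --instance-type "local_gpu"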
Legacy image building instructions
Since GraphStorm 0.4.0, we provide new build scripts to facilitate easier image building and pushing to ECR. In this section, we provide instructions for the legacy scripts, which will be deprecated in version 0.5 and removed in a future version of GraphStorm.
Step 1: Build a SageMaker-compatible Docker image
Note
Please make sure your account has an access key (AK) and secret access key (SK) configured to authenticate access to AWS services.
For more details on Amazon ECR operations via the CLI, users can refer to the Using Amazon ECR with the AWS CLI document.
First, on a Linux machine, configure a Docker environment by following the Docker documentation suggestions.
In order to use the SageMaker base Docker image, users need to run the following command to authenticate before pulling SageMaker images.
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
Then, clone the GraphStorm source code and build a SageMaker-compatible GraphStorm Docker image from source with these commands:
git clone https://github.com/awslabs/graphstorm.git
cd /path-to-graphstorm/docker/
bash /path-to-graphstorm/docker/build_docker_sagemaker.sh /path-to-graphstorm/ <DEVICE_TYPE> <IMAGE_NAME> <IMAGE_TAG>
The build_docker_sagemaker.sh script takes four arguments:

path-to-graphstorm (required), is the absolute path of the graphstorm folder where you cloned the GraphStorm source code. For example, the path could be /code/graphstorm.

DEVICE_TYPE (optional), is the intended device type of the to-be-built Docker image. There are two options: cpu for building CPU-compatible images, and gpu for building NVIDIA GPU-compatible images. Default is gpu.

IMAGE_NAME (optional), is the assigned name of the to-be-built Docker image. Default is graphstorm.
Warning
In order to upload the GraphStorm SageMaker Docker image to Amazon ECR, users need to define the <IMAGE_NAME> to include the ECR URI string, <AWS_ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/, e.g., 888888888888.dkr.ecr.us-east-1.amazonaws.com/graphstorm.
IMAGE_TAG (optional), is the assigned tag name of the to-be-built Docker image. Default is sm-<DEVICE_TYPE>, that is, sm-gpu for GPU images and sm-cpu for CPU images.
Once the build_docker_sagemaker.sh command completes successfully, there will be a Docker image named <IMAGE_NAME>:<IMAGE_TAG>, such as 888888888888.dkr.ecr.us-east-1.amazonaws.com/graphstorm:sm-gpu, in the local repository, which can be listed by running:
docker image ls
Step 2: Upload Docker Images to Amazon ECR Repository
Because SageMaker relies on Amazon ECR to access customers' own Docker images, users need to upload the Docker image built in Step 1 to their own ECR repository.
The following command authenticates your account to access your ECR repository via the AWS CLI.
aws ecr get-login-password --region <REGION> | docker login --username AWS --password-stdin <AWS_ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com
Please replace <REGION> and <AWS_ACCOUNT_ID> with your own account information, consistent with the values used in Step 1. In addition, users need to create an ECR repository in the specified <REGION> with the name <IMAGE_NAME> WITHOUT the ECR URI string, e.g., graphstorm.
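For example, the repository can be created with the AWS CLI:

aws ecr create-repository \
    --repository-name graphstorm \
    --region <REGION>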
Then use the command below to push the built GraphStorm Docker image to your ECR repository.
docker push <IMAGE_NAME>:<IMAGE_TAG>
Please replace <IMAGE_NAME> and <IMAGE_TAG> with the actual Docker image name and tag, e.g., 888888888888.dkr.ecr.us-east-1.amazonaws.com/graphstorm:sm-gpu.