Running partition jobs on Amazon SageMaker

Once distributed processing is complete, you can use the Amazon SageMaker launch scripts to run distributed partition jobs on AWS resources.

Build the Docker Image for GSPartition Jobs on Amazon SageMaker

GSPartition jobs on Amazon SageMaker run in SageMaker’s BYOC (Bring Your Own Container) mode.

To build and push the GraphStorm SageMaker image, follow the instructions in Setup GraphStorm SageMaker Docker Image.

Launch the GSPartition Job on Amazon SageMaker

For this example, we’ll use an Amazon SageMaker cluster with 2 ml.t3.xlarge instances and assume the data is already in an Amazon S3 bucket. For larger graphs, users can choose larger instances or more instances.

Install dependencies

To run GraphStorm with the Amazon SageMaker service, users should install the SageMaker Python SDK (the sagemaker package) and copy GraphStorm’s SageMaker tools.

Use the following command to install the SageMaker Python SDK.

pip install sagemaker

Next, get GraphStorm’s SageMaker tools. Users can either clone the GraphStorm repository with the following command or copy its sagemaker folder to the instance.

git clone https://github.com/awslabs/graphstorm.git

Launch GSPartition task

Users can use the following command to launch partition jobs.

python launch/launch_partition.py \
    --graph-data-s3 ${DATASET_S3_PATH} \
    --num-parts 2 \
    --instance-count 2 \
    --output-data-s3 ${OUTPUT_PATH} \
    --instance-type ml.t3.xlarge \
    --image-url ${IMAGE_URI} \
    --region ${REGION} \
    --role ${ROLE} \
    --entry-point "run/partition_entry.py" \
    --metadata-filename ${METADATA_FILE} \
    --log-level INFO \
    --partition-algorithm random

Warning

The value of --num-parts must equal the value of --instance-count.
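The constraint in the warning can be checked before submitting a job. The following is a minimal sketch, not part of GraphStorm: the variable names NUM_PARTS and INSTANCE_COUNT and the helper check_parts_match are hypothetical and only mirror the values passed to --num-parts and --instance-count.

```shell
# Hypothetical variables mirroring --num-parts and --instance-count.
NUM_PARTS=2
INSTANCE_COUNT=2

# Hypothetical helper: returns 0 only when both counts are equal.
check_parts_match() {
    if [ "$1" -ne "$2" ]; then
        echo "--num-parts ($1) must equal --instance-count ($2)" >&2
        return 1
    fi
    return 0
}

# Print a confirmation only when the two values agree.
check_parts_match "${NUM_PARTS}" "${INSTANCE_COUNT}" && echo "counts match, ok to launch"
```

The same two variables could then be passed to the launch command as --num-parts ${NUM_PARTS} and --instance-count ${INSTANCE_COUNT}, so the values cannot drift apart.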

Running the above command takes the GSProcessing output under ${DATASET_S3_PATH} as input and creates a DistDGL graph with the number of partitions given by --num-parts (2 in this example) under the output path, ${OUTPUT_PATH}. Currently, random is the only partitioning algorithm supported on SageMaker.
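The launch command refers to several shell variables that must be set beforehand. The sketch below shows one way to define them. Every value is a placeholder or an assumption: the region is only an example, each <...> item must be replaced with your own resource names, and the metadata file name depends on your GSProcessing output.

```shell
# Placeholder values only: replace each <...> with your own resources.
REGION="us-east-1"   # example region, an assumption
DATASET_S3_PATH="s3://<your-bucket>/<gsprocessing-output-prefix>"
OUTPUT_PATH="s3://<your-bucket>/<partition-output-prefix>"
# ECR image URIs have the form <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>
IMAGE_URI="<account-id>.dkr.ecr.${REGION}.amazonaws.com/<repository>:<tag>"
# IAM role ARNs have the form arn:aws:iam::<account>:role/<name>
ROLE="arn:aws:iam::<account-id>:role/<sagemaker-execution-role>"
METADATA_FILE="<metadata-file-name>.json"

# Stop with an error message if any variable is unset or empty.
: "${REGION:?REGION is not set}"
: "${DATASET_S3_PATH:?DATASET_S3_PATH is not set}"
: "${OUTPUT_PATH:?OUTPUT_PATH is not set}"
: "${IMAGE_URI:?IMAGE_URI is not set}"
: "${ROLE:?ROLE is not set}"
: "${METADATA_FILE:?METADATA_FILE is not set}"

echo "input:  ${DATASET_S3_PATH}"
echo "output: ${OUTPUT_PATH}"
```

Defining the region first lets the image URI reuse it, which keeps the --region and --image-url arguments consistent.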