GraphStorm Processing Distributed Setup
In this guide we’ll demonstrate how to prepare your environment to run GraphStorm Processing (GSProcessing) jobs on Amazon SageMaker.
We’re assuming a Linux host environment used throughout this tutorial, but other OS should work fine as well.
The steps required are:
Clone the GraphStorm repository.
Install Docker.
Install Poetry.
Set up AWS access.
Build the GraphStorm Processing image using Docker.
Push the image to the Amazon Elastic Container Registry (ECR).
Launch a SageMaker Processing or EMR Serverless job using the example scripts.
Clone the GraphStorm repository
You can clone the GraphStorm repository using
git clone https://github.com/awslabs/graphstorm.git
You can then navigate to the graphstorm-processing/docker directory
that contains the relevant code:
cd ./graphstorm/graphstorm-processing/docker
Install Docker
To get started with building the GraphStorm Processing image you’ll need to have the Docker engine installed.
To install Docker follow the instructions at the official site.
Install Poetry
We use Poetry as our build tool and for dependency management, so we need to install it to facilitate building the library.
You can install Poetry using:
curl -sSL https://install.python-poetry.org | python3 -
For detailed installation instructions the Poetry docs.
Set up AWS access
To build and push the image to ECR we’ll make use of the
aws-cli and we’ll need valid AWS credentials as well.
To install the AWS CLI you can use:
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
To set up credentials for use with aws-cli see the
AWS docs.
Your role should have full ECR access to be able to pull from ECR to build the image, create an ECR repository if it doesn’t exist, and push the GSProcessing image to the repository.
Building the GraphStorm Processing image using Docker
Once Docker and Poetry are installed, and your AWS credentials are set up,
we can use the provided scripts
in the graphstorm-processing/docker directory to build the image.
GSProcessing supports Amazon SageMaker and EMR Serverless as execution environments, so we need to choose which image we want to build first.
The build_gsprocessing_image.sh script can build the image
locally and tag it, provided the intended execution environment,
using the -e/--environment argument. The supported environments
are sagemaker and emr-serverless.
For example, assuming our current directory is where
we cloned graphstorm/graphstorm-processing, we can use
the following to build the SageMaker image:
bash docker/build_gsprocessing_image.sh --environment sagemaker
The above will use the SageMaker-specific Dockerfile of the latest available GSProcessing version,
build an image and tag it as graphstorm-processing-sagemaker:${VERSION}-x86_64 where
${VERSION} will take be the latest available GSProcessing version.
The script also supports other arguments to customize the image name,
tag and other aspects of the build. See bash docker/build_gsprocessing_image.sh --help
for more information.
Packaging Huggingface models into the image
If you plan to use text transformations (see Supported transformations)
that utilize Huggingface models, you can opt to include the Huggingface model cache directly in your Docker image.
The build_gsprocessing_image.sh script provides an option to embed the Huggingface model cache within the Docker image, using the --hf-model argument.
You can do this for both the SageMaker and EMR Serverless images. It is a good way to save cost as it avoids downloading models after launching the job.
If you’d rather download the Huggingface models at runtime, for EMR Serverless images, setting up a VPC and NAT route is a necessary.
You can find detailed instructions on creating a VPC for EMR Serverless in the AWS documentation: Create a VPC on emr-serverless.
bash docker/build_gsprocessing_image.sh --environment sagemaker --model-name bert-base-uncased
bash docker/build_gsprocessing_image.sh --environment emr-serverless --model-name bert-base-uncased
Support for arm64 architecture
For EMR Serverless images, it is possible to build images for the arm64 architecture,
which can lead to improved runtime and cost compared to x86_64. For more details
on EMR Serverless architecture options see the
official docs.
You can build an arm64
image natively by installing Docker and following the above process on an ARM instance such
as M6G or M7G. See the AWS documentation
for instances powered by the Graviton processor.
To build arm64 images
on an x86_64 host you need to enable multi-platform builds for Docker. The easiest way
to do so is to use QEMU emulation. To install the QEMU related libraries you can run
On Ubuntu
sudo apt install -y qemu binfmt-support qemu-user-static
On Amazon Linux/CentOS:
sudo yum instal -y qemu-system-arm qemu qemu-user qemu-kvm qemu-kvm-tools \
libvirt virt-install libvirt-python libguestfs-tools-c
Finally you’d need to ensure binfmt_misc is configured for different platforms by running
docker run --privileged --rm tonistiigi/binfmt --install all
To verify your Docker installation is ready for multi-platform builds you can run:
docker buildx ls
NAME/NODE DRIVER/ENDPOINT STATUS BUILDKIT PLATFORMS
default * docker
default default running v0.8+unknown linux/amd64, linux/arm64
Finally, to build an EMR Serverless GSProcessing image for the arm64 architecture you can run:
bash docker/build_gsprocessing_image.sh --environment emr-serverless --architecture arm64
Note
Building images for the first time under emulation using QEMU
can be significantly slower than native builds
(more than 20 minutes to build the GSProcessing arm64 image).
After the first build, follow up builds that only change the GSProcessing code
will be less than a minute thanks to Docker’s caching.
To speed up the build process you can build on an ARM-native instance,
look into using buildx with multiple native nodes, or use cross-compilation.
See the official Docker documentation
for details.
Push the image to the Amazon Elastic Container Registry (ECR)
Once the image is built we can use the push_gsprocessing_image.sh script
that will create an ECR repository if needed and push the image we just built.
The script again requires us to provide the intended execution environment using
the -e/--environment argument,
and by default will create a repository named graphstorm-processing-<environment> in the us-west-2 region,
on the default AWS account aws-cli is configured for,
and push the image tagged with the latest version of GSProcessing.
The script supports 4 optional arguments:
Image name/repository. (
-i/--image) Default:graphstorm-processing-<environment>Image tag. (
-v/--version) Default:<latest_library_version>e.g.0.2.2.ECR region. (
-r/--region) Default:us-west-2.AWS Account ID. (
-a/--account) Default: Uses the account ID detected by theaws-cli.
Example:
bash docker/push_gsprocessing_image.sh -e sagemaker -r "us-west-2" -a "1234567890"
To push an EMR Serverless arm64 image you’d similarly run:
bash docker/push_gsprocessing_image.sh -e emr-serverless --architecture arm64 \
-r "us-west-2" -a "1234567890"
Upload data to S3
For distributed jobs we use S3 as our storage source and target, so before running any example we’ll need to upload our data to S3. To do so you will need to have read/write access to an S3 bucket, and the requisite AWS credentials and permissions.
We will use the AWS CLI to upload data so make sure it is installed and configured in you local environment.
Assuming graphstorm/graphstorm-processing is our current working
directory we can upload the data to S3 using:
MY_BUCKET="enter-your-bucket-name-here"
REGION="bucket-region" # e.g. us-west-2
aws --region ${REGION} s3 sync ./tests/resources/small_heterogeneous_graph/ \
"s3://${MY_BUCKET}/gsprocessing-input"
Note
Make sure you are uploading your data to a bucket that was created in the same region as the ECR image you pushed.
Launch a SageMaker Processing job using the example scripts.
Once the setup is complete, you can follow the SageMaker Processing job guide to launch your distributed processing job using Amazon SageMaker resources.
Launch an EMR Serverless job using the example scripts.
In addition to Amazon SageMaker you can also use EMR Serverless as an execution environment to allow you to scale to even larger datasets (recommended when your graph has 30B+ edges). Its setup is more involved than Amazon SageMaker, so we only recommend it for experienced AWS users. Follow the EMR Serverless job guide to launch your distributed processing job using EMR Serverless resources.