.. _gsprocessing_distributed_setup:

GSProcessing Distributed Setup
=======================================

In this guide we'll demonstrate how to prepare your environment to run
GraphStorm Processing (GSProcessing) jobs on Amazon SageMaker or EMR Serverless.

We assume a Linux host environment throughout this tutorial,
but other operating systems should work as well.

The steps required are:

- Clone the GraphStorm repository.
- Install Docker.
- Set up AWS access.
- Build the GraphStorm Processing image using Docker.
- Push the image to the Amazon Elastic Container Registry (ECR).
- Launch a SageMaker Processing or EMR Serverless job using the example scripts.

Clone the GraphStorm repository
-------------------------------

You can clone the GraphStorm repository using

.. code-block:: bash

    git clone https://github.com/awslabs/graphstorm.git

You can then navigate to the ``graphstorm-processing/`` directory
that contains the relevant code:

.. code-block:: bash

    cd ./graphstorm/graphstorm-processing/

Install Docker
--------------

To build the GraphStorm Processing image you'll need to have the Docker engine installed.

To install Docker follow the instructions at the
`official site `_.

Set up AWS access
-----------------

To build and push the image to ECR we'll use the ``aws-cli``,
which needs valid AWS credentials.

To install the AWS CLI you can use:

.. code-block:: bash

    curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
    unzip awscliv2.zip
    sudo ./aws/install

To set up credentials for use with ``aws-cli`` see the
`AWS docs `_.

Your role should have full ECR access so that it can pull from ECR to build the image,
create an ECR repository if it doesn't exist, and push the GSProcessing image to the repository.

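
Before moving on to the build, it can help to confirm that the tools installed so far
are available on your ``PATH``. The following is a minimal sketch and is not part of the
GraphStorm scripts: the ``require_tools`` helper name is hypothetical, and it only checks
that each command can be found, not that your AWS credentials are valid.

.. code-block:: bash

    #!/usr/bin/env bash
    # Hypothetical helper (not part of GraphStorm): report any missing commands.
    # Returns 0 when every command is found on PATH, 1 otherwise.
    require_tools() {
        local missing=0
        local tool
        for tool in "$@"; do
            if ! command -v "$tool" >/dev/null 2>&1; then
                echo "Missing required tool: $tool" >&2
                missing=1
            fi
        done
        return "$missing"
    }

    # Check the tools this guide relies on.
    if require_tools git docker aws; then
        echo "All required tools were found."
    fi

To additionally confirm that the AWS CLI can see your credentials, you can run
``aws sts get-caller-identity``, which prints the account and identity in use.
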
Building the GraphStorm Processing image using Docker
-------------------------------------------------------

Once Docker and the AWS CLI are installed, and your AWS credentials are set up,
we can use the provided scripts in the ``graphstorm-processing/docker`` directory
to build the image.

GSProcessing supports Amazon SageMaker, EMR on EC2, and EMR Serverless as
execution environments, so we first need to choose which image we want to build.

The ``build_gsprocessing_image.sh`` script builds the image locally and tags it
for the intended execution environment, which you select using the
``-e/--environment`` argument. The supported environments
are ``sagemaker``, ``emr``, and ``emr-serverless``.
For example, assuming our current directory is where
we cloned ``graphstorm/graphstorm-processing``, we can use
the following to build the SageMaker image:

.. code-block:: bash

    bash docker/build_gsprocessing_image.sh --environment sagemaker

The above will use the SageMaker-specific Dockerfile of the latest available GSProcessing version,
build an image and tag it as ``graphstorm-processing-sagemaker:${VERSION}-x86_64`` where
``${VERSION}`` will be the latest available GSProcessing version.

The script also supports other arguments to customize the image name,
tag and other aspects of the build. See ``bash docker/build_gsprocessing_image.sh --help``
for more information.

Packaging Huggingface models into the image
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you plan to use text transformations (see :ref:`gsp-supported-transformations-ref`)
that utilize Huggingface models, you can opt to include the Huggingface model cache directly in your Docker image.
The ``build_gsprocessing_image.sh`` script provides an option to embed the Huggingface model cache within the Docker image,
using the ``--hf-model`` argument. You can do this for both the SageMaker and EMR Serverless images.
This can reduce cost, because the job does not need to download the models after it launches.

If you'd rather download the Huggingface models at runtime with an EMR Serverless image,
you will need to set up a VPC and a NAT route.
You can find detailed instructions on creating a VPC for EMR Serverless in the AWS documentation:
`Create a VPC on emr-serverless `_.

For example, to embed ``bert-base-uncased`` in the image at build time:

.. code-block:: bash

    bash docker/build_gsprocessing_image.sh --environment sagemaker --hf-model bert-base-uncased
    bash docker/build_gsprocessing_image.sh --environment emr-serverless --hf-model bert-base-uncased

Support for arm64 architecture
------------------------------

For EMR and EMR Serverless images, it is possible to build images for the ``arm64`` architecture,
which can lead to improved runtime and cost compared to ``x86_64``. For more details
on EMR Serverless architecture options see the
`official docs `_.

You can build an ``arm64`` image natively by installing Docker and following the above
process on an ARM instance such as ``M6G`` or ``M7G``. See the
`AWS documentation `_
for instances powered by the Graviton processor.

To build ``arm64`` images on an ``x86_64`` host you need to enable multi-platform builds for Docker.
The easiest way to do so is to use QEMU emulation. To install the QEMU-related libraries you can run the following.

On Ubuntu:

.. code-block:: bash

    sudo apt install -y qemu binfmt-support qemu-user-static

On Amazon Linux/CentOS:

.. code-block:: bash

    sudo yum install -y qemu-system-arm qemu qemu-user qemu-kvm qemu-kvm-tools \
        libvirt virt-install libvirt-python libguestfs-tools-c

Next, ensure ``binfmt_misc`` is configured for different platforms by running

.. code-block:: bash

    docker run --privileged --rm tonistiigi/binfmt --install all

To verify your Docker installation is ready for multi-platform builds you can run
the following and check that ``linux/arm64`` is listed under ``PLATFORMS``:

.. code-block:: bash

    docker buildx ls

    NAME/NODE    DRIVER/ENDPOINT STATUS  BUILDKIT      PLATFORMS
    default *    docker
      default    default         running v0.8+unknown  linux/amd64, linux/arm64

Finally, to build an EMR Serverless GSProcessing image for the ``arm64`` architecture you can run:

.. code-block:: bash

    bash docker/build_gsprocessing_image.sh --environment emr-serverless --architecture arm64

.. note::

    Building images for the first time under emulation using QEMU
    can be significantly slower than native builds
    (more than 20 minutes to build the GSProcessing ``arm64`` image).
    After the first build, follow-up builds that only change the GSProcessing code
    will take less than a minute thanks to Docker's caching.
    To speed up the build process you can build on an ARM-native instance,
    look into using ``buildx`` with multiple native nodes, or use cross-compilation.
    See `the official Docker documentation `_
    for details.

Push the image to the Amazon Elastic Container Registry (ECR)
-----------------------------------------------------------------

Once the image is built we can use the ``push_gsprocessing_image.sh`` script,
which creates an ECR repository if needed and pushes the image we just built.

The script again requires us to provide the intended execution environment using
the ``-e/--environment`` argument. By default it will create a repository named
``graphstorm-processing-`` in the ``us-west-2`` region,
on the default AWS account ``aws-cli`` is configured for,
and push the image tagged with the latest version of GSProcessing.

The script supports 4 optional arguments:

1. Image name/repository. (``-i/--image``) Default: ``graphstorm-processing-``
2. Image tag. (``-v/--version``) Default: the latest GSProcessing version, e.g. ``0.2.2``.
3. ECR region. (``-r/--region``) Default: ``us-west-2``.
4. AWS Account ID. (``-a/--account``) Default: Uses the account ID detected by the ``aws-cli``.

Example:

.. code-block:: bash

    bash docker/push_gsprocessing_image.sh -e sagemaker -r "us-west-2" -a "1234567890"

To push an EMR Serverless ``arm64`` image you'd similarly run:

.. code-block:: bash

    bash docker/push_gsprocessing_image.sh -e emr-serverless --architecture arm64 \
        -r "us-west-2" -a "1234567890"

.. _gsp-upload-data-ref:

Upload data to S3
-----------------

For distributed jobs we use S3 as our storage source and target, so before
running any example we'll need to upload our data to S3. To do so you will need
read/write access to an S3 bucket, and the requisite AWS credentials
and permissions.

We will use the AWS CLI to upload data, so make sure it is
`installed `_
and `configured `_
in your local environment.

Assuming ``graphstorm/graphstorm-processing`` is our current working
directory we can upload the data to S3 using:

.. code-block:: bash

    MY_BUCKET="enter-your-bucket-name-here"
    REGION="bucket-region" # e.g. us-west-2
    aws --region "${REGION}" s3 sync ./tests/resources/small_heterogeneous_graph/ \
        "s3://${MY_BUCKET}/gsprocessing-input"

.. note::

    Make sure you are uploading your data to a bucket
    that was created in the same region as the ECR image
    you pushed.

Launch a SageMaker Processing job using the example scripts.
------------------------------------------------------------

Once the setup is complete, you can follow the
:ref:`SageMaker Processing job guide`
to launch your distributed processing job using Amazon SageMaker resources.

Launch an EMR Serverless job using the example scripts.
------------------------------------------------------------

In addition to Amazon SageMaker you can also use EMR Serverless
as an execution environment, which lets you scale to even larger datasets
(recommended when your graph has 30B+ edges).
Its setup is more involved than Amazon SageMaker, so we only recommend
it for experienced AWS users.
Follow the :ref:`EMR Serverless job guide`
to launch your distributed processing job using EMR Serverless resources.