===============================================
Running partition jobs on Amazon EC2 Clusters
===============================================

Once the :ref:`distributed processing<gsprocessing_distributed_setup>` is completed,
users can start the partition jobs. This tutorial will provide instructions on how to setup an EC2 cluster and
start GSPartition jobs on it.

Create a GraphStorm Cluster
----------------------------

Setup instances of a cluster
.............................
A cluster contains several instances, each of which runs a GraphStorm Docker container. Before creating a cluster, we recommend to
follow the :ref:`Environment Setup <setup_docker>`. The guide shows how to build GraphStorm Docker images, and use a Docker container registry,
e.g. `AWS ECR <https://docs.aws.amazon.com/ecr/>`_ , to upload the GraphStorm image to an ECR repository, pull it on the instances in the cluster,
and finally start the image as a container.

.. note::

    If you are planning to use **parmetis** algorithm, please prepare your docker image using the following instructions:

    .. code-block:: bash

        git clone https://github.com/awslabs/graphstorm.git

        cd ./graphstorm/docker/

        bash docker/build_graphstorm_image.sh --environment local --use-parmetis

See ``bash docker/build_graphstorm_image.sh --help`` for more build options.

Setup a shared file system for the cluster
...........................................
A cluster requires a shared file system, such as NFS or `EFS <https://docs.aws.amazon.com/efs/>`_, mounted to each instance in the cluster, in which all GraphStorm containers can share data files, save model artifacts and prediction results.

`Here <https://github.com/dmlc/dgl/tree/master/examples/pytorch/graphsage/dist#step-0-setup-a-distributed-file-system>`_ is the instruction of setting up an NFS for a cluster. As the steps of setting an NFS could be various on different systems, we suggest users to look for additional information about NFS setting. Here are some available resources: `NFS tutorial <https://www.digitalocean.com/community/tutorials/how-to-set-up-an-nfs-mount-on-ubuntu-22-04>`_ by DigitalOcean, `NFS document <https://ubuntu.com/server/docs/service-nfs>`_ for Ubuntu.

For an AWS EC2 cluster, users can also use EFS as the shared file system. Please follow 1) `the instruction of creating EFS <https://docs.aws.amazon.com/efs/latest/ug/gs-step-two-create-efs-resources.html>`_; 2) `the instruction of installing an EFS client <https://docs.aws.amazon.com/efs/latest/ug/installing-amazon-efs-utils.html>`_; and 3) `the instructions of mounting the EFS filesystem <https://docs.aws.amazon.com/efs/latest/ug/efs-mount-helper.html>`_ to set up EFS.

After setting up a shared file system, we can keep all graph data in a shared folder. Then mount the data folder to the ``/path_to_data/`` of each instances in the cluster so that all GraphStorm containers can access the data.

Run a GraphStorm container
...........................
In each instance, use the following command to start a GraphStorm Docker container and run it as a backend daemon on cpu.

.. code-block:: shell

    docker run -v /path_to_data/:/data \
                      -v /dev/shm:/dev/shm \
                      --network=host \
                      -d --name test graphstorm:local-cpu service ssh restart

This command mounts the shared ``/path_to_data/`` folder to a container's ``/data/`` folder by which GraphStorm codes can access graph data and save the partition result.

Setup the IP Address File and Check Port Status
----------------------------------------------------------

Collect the IP address list
...........................
The GraphStorm Docker containers use SSH on port ``2222`` to communicate with each other. Users need to collect all IP addresses of all the instances and put them into a text file, e.g., ``/data/ip_list.txt``, which is like:

.. figure:: ../../../../../../tutorial/distributed_ips.png
    :align: center

.. note:: We recommend to use **private IP addresses** on AWS EC2 cluster to avoid any possible port constraints.

Put the IP list file into container's ``/data/`` folder.

Check port
................
The GraphStorm Docker container uses port ``2222`` to **ssh** to containers running on other machines without password. Please make sure the port is not used by other processes.

Users also need to make sure the port ``2222`` is open for **ssh** commands.

Pick one instance and run the following command to connect to the GraphStorm Docker container.

.. code-block:: bash

    docker container exec -it test /bin/bash

Users need to exchange the ssh key from each of GraphStorm Docker container to
the rest containers in the cluster: copy the keys from the ``/root/.ssh/id_rsa.pub`` from one container to ``/root/.ssh/authorized_keys`` in containers on all other containers.
In the container environment, users can check the connectivity with the command ``ssh <ip-in-the-cluster> -o StrictHostKeyChecking=no -p 2222``. Please replace the ``<ip-in-the-cluster>`` with the real IP address from the ``ip_list.txt`` file above, e.g.,

.. code-block:: bash

    ssh 172.38.12.143 -o StrictHostKeyChecking=no -p 2222

If successful, you should login to the container with ip 172.38.12.143.

If not, please make sure there is no restriction of exposing port 2222.


Launch GSPartition Jobs
-----------------------

Now we can ssh into the **leader node** of the EC2 cluster, and start GSPartition process with the following command:

.. code:: bash

    python3 -m graphstorm.gpartition.dist_partition_graph \
        --input-path ${LOCAL_INPUT_DATAPATH} \
        --metadata-filename ${METADATA_FILE} \
        --output-path ${LOCAL_OUTPUT_DATAPATH} \
        --num-parts ${NUM_PARTITIONS} \
        --partition-algorithm ${ALGORITHM} \
        --ip-config ${IP_CONFIG}

.. warning::
    1. Please make sure the both ``LOCAL_INPUT_DATAPATH`` and ``LOCAL_OUTPUT_DATAPATH`` are located on the shared filesystem.
    2. The number of instances in the cluster should be equal to ``NUM_PARTITIONS``.
    3. For users who only want to generate partition assignments instead of the partitioned DGL graph, please add ``--partition-assignment-only`` flag.

The arguments that ``graphstorm.gpartition.dist_partition_graph`` accepts are the following:

* ``--input-path str``: Path to input DGL chunked data directory. (required). The format for the data
  is available at : https://docs.dgl.ai/guide/distributed-preprocessing.html#chunked-graph-format
* ``--output-path str``: Path to store the partitioned data. (required)
* ``--num-parts int``: Number of partitions to generate. (required)
* ``--metadata-filename str``: Name for the chunked DGL data metadata file. (default: ``metadata.json``)
* ``--ssh-port int``: SSH Port. (default: 22)
* ``--dgl-tool-path str``: The path to `dgl/tools` code. (default: ``/root/dgl/tools``)
* ``--partition-algorithm str``: Partition algorithm to use. (choices: ``random``, ``parmetis``, default: ``random``)
* ``--ip-config str``: A file storing a list of IPs, one line for each instance of the partition cluster.
* ``--partition-assignment-only``: Only generate partition assignments for nodes, the process will not build the partitioned DGL graph.
* ``--logging-level str``: The logging level. The possible values: debug, info, warning, error. The default value is info. (default: info)
* ``--process-group-timeout``: Timeout[seconds] for operations executed against the process group. The default value is 1800.
* ``--use-graphbolt "true"/"false"``: ``New in v0.4``. Whether to convert the partitioned data to the GraphBolt format after creating the DistDGL graph.
  Requires installed DGL version to be at least ``2.1.0``. See :ref:`using-graphbolt-ref` for an example.
  (default: ``"false"``)