GSProcessing Getting Started

Installation

The project needs Python 3.9 and Java 8 or 11 installed. Below we provide brief guides for each requirement.

Install Python 3.9

The project uses Python 3.9. We recommend using PyEnv to have isolated Python installations.

With PyEnv installed you can create and activate a Python 3.9 environment using

pyenv install 3.9
pyenv local 3.9

Install GSProcessing from source

With a recent version of pip installed (we recommend pip>=21.3), you can simply run pip install . from the root directory of the project (graphstorm/graphstorm-processing), which should install the library into your environment and pull in all dependencies:

# Ensure Python is at least 3.9
python -V
cd graphstorm/graphstorm-processing
pip install .

Install GSProcessing using poetry

You can also create a local virtual environment using poetry. With Python 3.9 and poetry installed you can run:

cd graphstorm/graphstorm-processing
# This will create a virtual env under graphstorm-processing/.venv
poetry install
# This will activate the .venv
poetry shell

Install Java 8 or 11

Spark has a runtime dependency on the JVM to run, so you’ll need to ensure Java is installed and available on your system.

On MacOS you can install Java using brew:

brew install openjdk@11

On Linux it will depend on your distribution’s package manager. For Ubuntu you can use:

sudo apt install openjdk-11-jdk

On Amazon Linux 2 you can use:

sudo yum install java-11-amazon-corretto-headless
sudo yum install java-11-amazon-corretto-devel

To check if Java is installed you can use.

java -version

Example

See the provided example for an example of how to start with tabular data and convert them into a graph representation before partitioning and training with GraphStorm.

Running locally

For data that fit into the memory of one machine, you can run jobs locally instead of a cluster.

To use the library to process your data, you will need to have your data in a tabular format, and a corresponding JSON configuration file that describes the data. The input data need to be in CSV (with header(s)) or Parquet format.

The configuration file can be in GraphStorm’s GConstruct format, with the caveat that the file paths need to be relative to the location of the config file. Also note that you’ll need to convert all your input data to CSV or Parquet files.

See Relative file paths required for more details.

After installing the library, executing a processing job locally can be done using:

gs-processing \
    --config-filename gconstruct-config.json \
    --input-prefix /path/to/input/data \
    --output-prefix /path/to/output/data \
    --do-repartition True

Once this script completes, the data are ready to be fed into DGL’s distributed partitioning pipeline. See this guide for more details on how to use GraphStorm distributed partitioning and training on SageMaker.

See example for a detailed walkthrough of using GSProcessing to wrangle data into a format that’s ready to be consumed by the GraphStorm distributed training pipeline.

Running on AWS resources

GSProcessing supports Amazon SageMaker, EMR on EC2, and EMR Serverless as execution environments. To run distributed jobs on AWS resources we will have to build a Docker image and push it to the Amazon Elastic Container Registry, which we cover in distributed processing setup. We can then run either a SageMaker Processing job which we describe in running GSProcessing on SageMaker, an EMR on EC2 job which we describe in running GSProcessing on EMR EC2, or an EMR Serverless job that is covered in running GSProcessing on EMR Serverless.

Input configuration

GSProcessing supports both the GConstruct JSON configuration format, as well as its own GSProcessing config. You can learn about the GSProcessing JSON configuration in GSProcessing Input Configuration.

Re-applying feature transformations to new data

Often you will process your data at training time and run inference at later dates. If your data changes in the meantime. e.g. new values appear in a categorical feature, you’ll need to ensure no new values appear in the transformed data, as the trained model relies on pre-existing values only.

To achieve that, GSProcessing creates an additional file in the output, named precomputed_transformations.json. To ensure the same transformations are applied to new data, you can copy this file to the top-level path of your new input data, and GSProcessing will pick up any transformations there to ensure the produced data match the ones that were used to train your model.

Currently, we only support re-applying transformations for categorical features.

Developer guide

To get started with developing the package refer to developer guide.

Footnotes