GraphStorm Graph Construction

To use GraphStorm’s graph construction pipeline on a single machine or in a distributed environment, users should prepare their raw input data according to GraphStorm’s specifications. These specifications are described in detail in the Input Raw Data Explanations section.

Once the raw data is ready, users can run GraphStorm’s single-machine graph construction CLIs to handle most common academic graphs or small graphs sampled from enterprise data, typically with millions of nodes and up to one billion edges. Machines with large CPU memory are recommended. As a general guideline, plan for 1 TB of memory for a graph with one billion edges.

Many production-level enterprise graphs contain billions of nodes and edges, with features that have hundreds or thousands of dimensions. GraphStorm’s distributed graph construction CLIs help users manage these large, complex graphs, which is particularly useful for building automated graph data processing pipelines in production environments. The distributed graph construction CLIs can run on several Amazon infrastructures, including Amazon SageMaker, EMR Serverless, and EMR on EC2.
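
As a rough illustration of the sizing guidance above, the following Python sketch turns the "1 TB per one billion edges" guideline into a quick estimate. It is not part of GraphStorm: the function names and the returned labels are hypothetical, and the linear scaling of memory with edge count is an assumption made only for this example. Actual memory needs also depend on node counts and feature dimensions.

```python
# Rule of thumb from the text: about 1 TB of CPU memory for one billion edges.
TB_PER_BILLION_EDGES = 1.0
# Upper end of the single-machine range mentioned in the text.
SINGLE_MACHINE_MAX_EDGES = 1_000_000_000


def estimate_memory_tb(num_edges: int) -> float:
    """Estimate CPU memory in TB, ASSUMING memory scales linearly with edges."""
    if num_edges < 0:
        raise ValueError("num_edges must be non-negative")
    return num_edges / 1_000_000_000 * TB_PER_BILLION_EDGES


def suggest_mode(num_edges: int) -> str:
    """Return a hypothetical label for the construction mode to consider."""
    if num_edges <= SINGLE_MACHINE_MAX_EDGES:
        return "single-machine"
    return "distributed"


if __name__ == "__main__":
    for edges in (250_000_000, 1_000_000_000, 5_000_000_000):
        print(f"{edges:>13,} edges: ~{estimate_memory_tb(edges):.2f} TB, "
              f"consider {suggest_mode(edges)} graph construction")
```

Treat the result as a starting point for capacity planning only; graphs with wide features may need considerably more memory than the edge count alone suggests.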