Developer Guide
The project is set up using poetry to make easier for developers to
jump into the project.
The steps we recommend are:
Install JDK 8, 11
PySpark requires a compatible Java installation to run, so you will need to ensure your active JDK is using either Java 8 or 11.
On MacOS you can do this using brew:
brew install openjdk@11
On Linux it will depend on your distribution’s package manager. For Ubuntu you can use:
sudo apt install openjdk-11-jdk
On Amazon Linux 2 you can use:
sudo yum install java-11-amazon-corretto-headless
sudo yum install java-11-amazon-corretto-devel
Install pyenv
pyenv is a tool to manage multiple Python version installations. It
can be installed through the installer below on a Linux machine:
curl -L https://github.com/pyenv/pyenv-installer/raw/master/bin/pyenv-installer | bash
or use brew on a Mac:
brew update
brew install pyenv
For more info on pyenv see its documentation.
Create a Python 3.9 env and activate it.
We use Python 3.9 in our images so this most closely resembles the execution environment on our Docker images that will be used for distributed training.
pyenv install 3.9
pyenv global 3.9
Note: We recommend not mixing up
condaandpyenv. When developing for this project, simplyconda deactivateuntil there’s nocondaenv active (evenbase) and just rely onpyenvandpoetryto handle dependencies.
Install poetry
poetry is a dependency and build management system for Python. To install it
use:
curl -sSL https://install.python-poetry.org | python3 -
Install dependencies through poetry
Now we are ready to install our dependencies through poetry.
We have split the project dependencies into the “main” dependencies that
poetry installs by default, and the dev dependency group that
installs that dependencies that are only needed to develop the library.
On a POSIX system (tested on Ubuntu, CentOS, MacOS) run:
# Install all dependencies into local .venv
poetry install --with dev
Once all dependencies are installed you should be able to run the unit tests for the project and continue with development using:
poetry run pytest ./graphstorm-processing/tests
You can also activate and use the virtual environment using:
poetry shell
# We're now using the graphstorm-processing-py3.9 env so we can just run
pytest ./graphstorm-processing/tests
To learn more about poetry see its documentation
Use black to format code [optional]
We use black to
format code in this project. black is an opinionated formatter that
helps speed up development and code reviews. It is included in our
dev dependencies so it will be installed along with the other dev
dependencies.
To use black in the project you can run (from the project’s root,
same level as pyproject.toml)
# From the project's root directory, graphstorm-processing run:
black .
To get a preview of the changes black would make you can use:
black . --diff --color
You can auto-formatting with black to VSCode using the Black
Formatter
Use mypy and pylint to lint code
We include the mypy and pylint linters as a dependency under the dev group
of dependencies. These linters perform static checks on your code and
can be used in a complimentary manner.
We recommend using VSCode and enabling the mypy linter to get in-editor annotations.
You can also lint the project code through:
poetry run mypy ./graphstorm_processing
To learn more about mypy and how it can help development
see its documentation.
Our goal is to minimize mypy errors as much as possible for the
project. New code should be linted and not introduce additional mypy
errors. When necessary it’s OK to use type: ignore to silence
mypy errors inline, but this should be used sparingly.
As a project, GraphStorm requires a 10/10 pylint score, so ensure your code conforms to the expectation by running
pylint --rcfile=/path/to/graphstorm/tests/lint/pylintrc
on your code before commits. To make this easier we include a pre-commit hook below.
Use a pre-commit hook to ensure black and pylint run before commits
To make code formatting and pylint checks easier for graphstorm-processing
developers, we recommend using a pre-commit hook.
We include pre-commit in the project’s dev dependencies, so once
you have activated the project’s venv (poetry shell) you can just
create a file named .pre-commit-config.yaml with the following contents:
# .pre-commit-config.yaml
repos:
- repo: https://github.com/psf/black
rev: 23.7.0
hooks:
- id: black
language_version: python3.9
files: 'graphstorm_processing\/.*\.pyi?$|tests\/.*\.pyi?$|scripts\/.*\.pyi?$'
exclude: 'python\/.*\.pyi'
- repo: local
hooks:
- id: pylint
name: pylint
entry: pylint
language: system
types: [python]
args:
[
"--rcfile=./tests/lint/pylintrc"
]
And then run:
pre-commit install
which will install the black and pylint hooks into your local repository and
ensure it runs before every commit.
Note
The pre-commit hook will also apply to all commits you make to the root
GraphStorm repository. Since that Graphstorm doesn’t use black, you might
want to remove the black hook. You can do so from the root repo
using rm -rf .git/hooks.
Both projects use pylint to check Python files so we’d still recommend using
that hook even if you’re doing development for both GSProcessing and GraphStorm.