Virginia Tech provides computing resources suitable for simulations and machine learning model training through Advanced Research Computing (ARC). ARC offers a user-friendly guide on how to use its systems, but I think it is worth summarizing the most essential parts so you can use it effectively. I also refer to Ramashish Gaurav‘s blog post about how to submit jobs as well as how to use the resources in interactive mode.
This link contains all the ARC workshops that are available on the internet.
- Advanced Research Computing (ARC) Overview, 22 Spring: Slides, Video
- Connect to ARC systems and run your first jobs, 22 Spring: Slides, Video
- Get your software/code to run on ARC, 22 Spring: Slides, Video
- Monitoring Resource Utilization and Job Efficiency
- Getting the Best Data Storage Performance on ARC Filesystems
- Launching in Parallel
First you need to create your account here, and you will then be asked to agree to the terms and conditions when requesting access. You will receive an email confirming that the account has been created. In my case I could use the Infer and TinkerCliffs clusters; each name refers to a different cluster.
Name | Specifications |
TinkerCliffs | 316 nodes * 128 cores (AMD EPYC Rome) + 16 nodes * 96 cores (Intel Cascade Lake-AP) |
Infer | 16 nodes * 32 cores (Skylake) + 1 NVIDIA T4 (2560 CUDA + 320 Tensor cores); 40 nodes * 28 cores (Broadwell) + 2 NVIDIA P100 (3560 CUDA); 40 nodes * 24 cores (Skylake) + 2 NVIDIA V100 (5120 CUDA + 640 Tensor cores) |
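Once you have access, you can also verify the partition layout and node hardware yourself. The following is a small sketch using generic Slurm sinfo format options (partition names and counts will differ between clusters):
[user@server ~]$ sinfo -o "%P %D %c %G %m"   #partition, node count, CPUs per node, GPUs (gres), memory per node (MB)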
Access
You can use one of the following lines to log in to the ARC servers. Please be reminded that you don't have access to the servers off campus unless you connect through the school's proxy/VPN.
ssh user@infer1.arc.vt.edu
ssh user@tinkercliffs1.arc.vt.edu
ssh user@tinkercliffs2.arc.vt.edu
When you enter your PID and password (the same credentials used for logging in to VT services), you will receive a call on your phone for two-factor authentication; just pick up the call and press a key. Here is the output after logging in.
+---------------------------------------------------------------------------+
| This computer is the property of Virginia Polytechnic Institute and State |
| University. Use of this equipment implies agreement to the university’s |
| Acceptable Use Policy (Policy 7000). For more information, please visit: |
| https://vt.edu/acceptable-use.html |
+---------------------------------------------------------------------------+
+---------------------------------------------------------------------------+
| NOTE: VT Enterprise Directory Password authentication requires a DUO |
| second factor challenge. After your password is provided, you |
| will receive a DUO challenge. |
+---------------------------------------------------------------------------+
Use the screen command to detach and reattach shell sessions on the server, so that you don't have to worry about incidents like a sudden loss of network connectivity killing time-consuming commands running on the ARC server. After logging in to the server, your public key can be added to the authorized_keys file so that it is easier to log in without a password.
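As a minimal sketch of that workflow (the key file and session name below are just examples):
#on your local machine: copy your public key to the login node (asks for the password once)
ssh-copy-id -i ~/.ssh/id_ed25519.pub user@tinkercliffs1.arc.vt.edu
#on the login node: run long commands inside a named screen session
screen -S build      #start a session named "build"; detach with Ctrl-a d
screen -ls           #list existing sessions
screen -r build      #reattach to the "build" session later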
Software Setup
This section is placed near the top of the article because a user should first know how to set up the specific software they need; it then becomes more logical to understand how to submit customized jobs to the cluster for a computing task that depends on a specific software version. In addition, this section introduces the module command, which is really helpful for managing software environments when submitting jobs and running tasks.
module is based on the EasyBuild framework1. The man page can also be found here2.
https://video.vt.edu/media/Building+Custom+Software+Modules+Manually+on+ARC%27s+Resources/1_ylh24w9q
I use Python as an example of how to set up software.
Python Environment as an example
[user@server ~]$ setup_app --help
setup_app: Setup directory/modulefile for custom software installation
Usage: setup_app package version
Options:
--base dir,--base=dir Specify other directory for installation
--system System-wide install in /apps (arc personnel only, overridden by --base)
--help,-h Print this usage message
Examples:
setup_app R 4.0.2-foss-2020b #home directory install
setup_app --base=/projects/myproject R 4.0.2-foss-2020b #install to /projects/myproject
setup_app --system R 4.0.2-foss-2020b #system-wide install (arc personnel only)
[user@server ~]$ which setup_app
/apps/useful_scripts/bin/setup_app
Based on the help text, this command will set up the software environment in the user's home folder if no other options such as --base have been passed to it.
[user@server ~]$ setup_app python3 3.12.4_generic
Create directories /home/user/apps/tinkercliffs-rome/python3/3.12.4_generic and /home/user/easybuild/modules/tinkercliffs-rome/all/python3? y
Done. To finish your build:
1. Install your app in /home/user/apps/tinkercliffs-rome/python3/3.12.4_generic/
2. Edit the modulefile in /home/user/easybuild/modules/tinkercliffs-rome/all/python3/3.12.4_generic.lua
- Set or remove modules to load in the load() line
- Edit description and URL
- Check the variable names
- Edit paths (some packages don't have, e.g., an include/)
Note: You may need to refresh the cache, e.g.,
module --ignore_cache spider python3
to find the module the first time.
Once the directories are created, further steps need to be done so that the environment is properly deployed. If an option like --base is used, there are additional steps indicated in the output of the setup_app command; just follow those instructions and things will be fine.
[user@server Python-3.12.4]$ module load foss/2023b
[user@server Python-3.12.4]$ module list
Currently Loaded Modules:
1) shared 12) binutils/2.40-GCCcore-13.2.0 23) PMIx/4.2.6-GCCcore-13.2.0
2) slurm/slurm/23.02.7 13) GCC/13.2.0 24) UCC/1.2.0-GCCcore-13.2.0
3) apps 14) numactl/2.0.16-GCCcore-13.2.0 25) OpenMPI/4.1.6-GCC-13.2.0
4) site/tinkercliffs/easybuild/setup 15) XZ/5.4.4-GCCcore-13.2.0 26) OpenBLAS/0.3.24-GCC-13.2.0
5) cray 16) libxml2/2.11.5-GCCcore-13.2.0 27) FlexiBLAS/3.3.1-GCC-13.2.0
6) craype-x86-rome 17) libpciaccess/0.17-GCCcore-13.2.0 28) FFTW/3.3.10-GCC-13.2.0
7) craype-network-infiniband 18) hwloc/2.9.2-GCCcore-13.2.0 29) gompi/2023b
8) useful_scripts 19) OpenSSL/1.1 30) FFTW.MPI/3.3.10-gompi-2023b
9) DefaultModules 20) libevent/2.1.12-GCCcore-13.2.0 31) ScaLAPACK/2.2.0-gompi-2023b-fb
10) GCCcore/13.2.0 21) UCX/1.15.0-GCCcore-13.2.0 32) foss/2023b
11) zlib/1.2.13-GCCcore-13.2.0 22) libfabric/1.19.0-GCCcore-13.2.0
The following steps show how the Python source code is downloaded, extracted, configured to install into the folder created by the setup_app command, and finally built and installed with make. It should be noted that different software packages have their own ways of configuring the build and installation process; in this case, I refer to my previous blog3.
[user@server w_python]$ wget https://www.python.org/ftp/python/3.12.4/Python-3.12.4.tgz
[user@server w_python]$ tar -xzvf Python-3.12.4.tgz
[user@server w_python]$ cd Python-3.12.4
[user@server Python-3.12.4]$ ./configure --prefix='/home/user/apps/tinkercliffs-rome/python3/3.12.4_generic' --enable-optimizations
[user@server Python-3.12.4]$ make -j 64
[user@server Python-3.12.4]$ make install
The Lua module configuration below is generated by default for the Python installation. I presume this file is produced by the setup_app command as a general template with the specific paths plugged in.
whatis("Name: python3")
whatis("Version: 3.12.4_generic")
whatis("Description: PYTHON3DESCRIPTION")
whatis("URL: https://www.python.org")
help([[
PYTHON3DESCRIPTION
Define Environment Variables:
$EBROOTPYTHON3 - root
$PYTHON3_DIR - root
$PYTHON3_BIN - binaries
$PYTHON3_INC - includes
$PYTHON3_LIB - libraries
$PYTHON3_LIB64 - libraries
Prepend Environment Variables:
PATH += $PYTHON3_BIN
MANPATH += $PYTHON3_DIR/share/man
INCLUDE += $PYTHON3_INC
LD_LIBRARY_PATH += $PYTHON3_LIB
LD_LIBRARY_PATH += $PYTHON3_LIB64
]])
setenv("EBROOTPYTHON3", "/home/wxm/apps/tinkercliffs-rome/python3/3.12.4_generic")
setenv("PYTHON3_DIR", "/home/wxm/apps/tinkercliffs-rome/python3/3.12.4_generic")
setenv("PYTHON3_BIN", "/home/wxm/apps/tinkercliffs-rome/python3/3.12.4_generic/bin")
setenv("PYTHON3_INC", "/home/wxm/apps/tinkercliffs-rome/python3/3.12.4_generic/include")
setenv("PYTHON3_LIB", "/home/wxm/apps/tinkercliffs-rome/python3/3.12.4_generic/lib64")
prepend_path("PATH", "/home/wxm/apps/tinkercliffs-rome/python3/3.12.4_generic/bin")
prepend_path("MANPATH", "/home/wxm/apps/tinkercliffs-rome/python3/3.12.4_generic/share/man")
prepend_path("INCLUDE", "/home/wxm/apps/tinkercliffs-rome/python3/3.12.4_generic/include")
prepend_path("LD_LIBRARY_PATH", "/home/wxm/apps/tinkercliffs-rome/python3/3.12.4_generic/lib")
prepend_path("LD_LIBRARY_PATH", "/home/wxm/apps/tinkercliffs-rome/python3/3.12.4_generic/lib64")
load("foss/2023b")
The module command can check which specific versions of software are available on the system. Here is the shell history showing how to use the module command to load the Python environment that was just installed.
[user@server bin]$ module reset
Resetting modules to system default. Reseting $MODULEPATH back to system default. All extra directories will be removed from $MODULEPATH.
[user@server bin]$ module load python3/3.12.4_generic
[user@server bin]$ module list
Currently Loaded Modules:
1) shared 12) binutils/2.40-GCCcore-13.2.0 23) PMIx/4.2.6-GCCcore-13.2.0
2) slurm/slurm/23.02.7 13) GCC/13.2.0 24) UCC/1.2.0-GCCcore-13.2.0
3) apps 14) numactl/2.0.16-GCCcore-13.2.0 25) OpenMPI/4.1.6-GCC-13.2.0
4) site/tinkercliffs/easybuild/setup 15) XZ/5.4.4-GCCcore-13.2.0 26) OpenBLAS/0.3.24-GCC-13.2.0
5) cray 16) libxml2/2.11.5-GCCcore-13.2.0 27) FlexiBLAS/3.3.1-GCC-13.2.0
6) craype-x86-rome 17) libpciaccess/0.17-GCCcore-13.2.0 28) FFTW/3.3.10-GCC-13.2.0
7) craype-network-infiniband 18) hwloc/2.9.2-GCCcore-13.2.0 29) gompi/2023b
8) useful_scripts 19) OpenSSL/1.1 30) FFTW.MPI/3.3.10-gompi-2023b
9) DefaultModules 20) libevent/2.1.12-GCCcore-13.2.0 31) ScaLAPACK/2.2.0-gompi-2023b-fb
10) GCCcore/13.2.0 21) UCX/1.15.0-GCCcore-13.2.0 32) foss/2023b
11) zlib/1.2.13-GCCcore-13.2.0 22) libfabric/1.19.0-GCCcore-13.2.0 33) python3/3.12.4_generic
[user@server bin]$ which python3
~/apps/tinkercliffs-rome/python3/3.12.4_generic/bin/python3
[user@server bin]$ which python3.12
~/apps/tinkercliffs-rome/python3/3.12.4_generic/bin/python3.12
Additional module usage
module list #list currently loaded modules
module load intel/2019b #load a specific module/version
module show glew #show info of a loaded module
module display glew #same as module show
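A few more module subcommands I find useful; these are standard Lmod commands rather than anything ARC-specific:
module avail             #list modules that can be loaded directly
module spider python3    #search the whole module tree, including custom modules
module purge             #unload all currently loaded modules
module reset             #restore the default set of modules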
Scheduler and Cluster Resource Manager (SLURM)
The clusters use Slurm (originally short for Simple Linux Utility for Resource Management) as the scheduler and cluster resource manager4. https://slurm.schedmd.com/pdfs/summary.pdf
https://video.vt.edu/media/ARCA+Interactive+and+Batch+Jobs/1_doz5ylqg
Users only log in to the login node to access the computing resources, which means the actual jobs should not be run on the login node but on the compute nodes within the cluster, through a scheduler. This principle explains why a scheduler is needed rather than running commands directly.
Ramashish Gaurav's blog post discusses how to use Slurm on Advanced Research Computing. In essence, ARC accepts jobs through Slurm along with an estimate of the hardware resources needed. There are two ways of submitting jobs: batch jobs via a bash script, and interactive mode, for example Jupyter notebook sessions.
interactive mode
Interactive mode is helpful for running Jupyter notebook sessions, debugging code, and other on-the-go tasks that need instant feedback. Basically it is useful for dealing with tasks that are still half-baked, where you are not completely sure the code works flawlessly at scale.
The following command shows how to start an interactive session; it takes some time for Slurm to allocate the resources for it.
[user@login ~]$ interact -N1 --ntasks-per-node=48 -t240:00 -A personal --partition=interactive_q
srun: job 2447900 queued and waiting for resources
srun: job 2447900 has been allocated resources
[user@tc307 ~]$
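The interact command is an ARC-provided wrapper; if it is unavailable, a roughly equivalent generic Slurm invocation is the sketch below, reusing the partition and account names from the example above:
[user@login ~]$ srun -N 1 --ntasks-per-node=48 -t 240:00 -A personal --partition=interactive_q --pty bash
Inside the allocated session you can then work as in a normal shell; for example, the transcript below downloads, builds, and runs the mt-dgemm benchmark.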
[user@tc307 ~]$ wget https://portal.nersc.gov/project/m888/apex/mt-dgemm_160114.tgz
[user@tc307 ~]$ tar -xzvf mt-dgemm_160114.tgz
./mt-dgemm/
./mt-dgemm/Makefile
./mt-dgemm/mt-dgemm.c
./mt-dgemm/README.APEX
[user@tc307 ~]$ cd mt-dgemm
[user@tc307 mt-dgemm]$ make
[user@tc307 mt-dgemm]$ ./mt-dgemm
Matrix size defaulted to 256
Alpha = 1.000000
Beta = 1.000000
Allocating Matrices...
Allocation complete, populating with values...
Performing multiplication...
Calculating matrix check...
===============================================================
Final Sum is: 256.033333
-> Solution check PASSED successfully.
Memory for Matrices: 1.500000 MB
Multiply time: 1.409793 seconds
FLOPs computed: 1010565120.000000
GFLOP/s rate: 0.716818 GF/s
===============================================================
batch job
Unlike interactive mode, batch jobs are designed for time-consuming tasks where you already have confidence in your code, so that the outcome of the job will be fruitful. In this way, multiple jobs can be scheduled through Slurm without human interaction or intervention.
The following is an example bash script for submitting a batch job.
#!/bin/bash
#SBATCH -N1
#SBATCH --ntasks-per-node=128
#SBATCH -t 30:00
#SBATCH -p normal_q
#SBATCH -A personal
#load module
module reset
module load intel/2019b
#compile
make
#run
export OMP_NUM_THREADS=128
./mt-dgemm 4000
In the above script, the lines that look like comments (the #SBATCH directives) are actually parsed by Slurm to allocate the job. I found that it is acceptable to drop the -t option, as it is really difficult to estimate how much time a program will take to complete.
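Since the Infer cluster has GPU nodes, here is a minimal sketch of what a GPU batch script could look like; the partition name t4_normal_q and the script train.py are my own placeholders, so check ARC's documentation and sinfo for the actual values:
#!/bin/bash
#SBATCH -N1
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:1       #request one GPU (generic Slurm syntax)
#SBATCH -t 60:00
#SBATCH -p t4_normal_q     #assumed GPU partition on Infer; verify with sinfo
#SBATCH -A personal
#load module
module reset
#run
nvidia-smi                 #confirm the allocated GPU is visible
python3 train.py           #placeholder for your own script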
The following commands are used to check on the submitted jobs.
[user@login mt-dgemm]$ sbatch mt-dgemm.sh
Submitted batch job 2447928
[user@login mt-dgemm]$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2447928 normal_q mt-dgemm user PD 0:00 1 (AssocGrpBillingMinutes)
2447933 normal_q mt-dgemm user R 0:05 1 tc004
[user@login mt-dgemm]$ scancel 2447928
[user@login mt-dgemm]$ skill 2447933
Using squeue, all of your submitted jobs can be printed out for inspection. It should be noted that the output of a given job is stored in a file named slurm-JobID.out, for example slurm-2450827.out.
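To follow a job's output and check how efficiently it used its allocation, standard Slurm tools can be used; seff is a contributed script that may or may not be installed on a given cluster:
[user@login mt-dgemm]$ tail -f slurm-2450827.out   #follow the job output as it is written
[user@login mt-dgemm]$ seff 2450827                #CPU/memory efficiency summary after the job finishes
[user@login mt-dgemm]$ sacct -j 2450827 --format=JobID,JobName,Elapsed,MaxRSS,State   #accounting record for the job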
Accounting
Useful commands for checking storage quotas, partitions, queued jobs, and allocation usage: quota, scontrol show part, squeue, showusage, tcgetusage <accountname>, showqos.
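A sketch of how a few of these can be invoked; quota, scontrol, and squeue are standard Linux/Slurm commands, while showusage, tcgetusage, and showqos are ARC-specific wrappers whose options I have not covered here, so check their help output:
[user@login ~]$ quota                              #filesystem quota usage
[user@login ~]$ scontrol show partition normal_q   #limits and nodes of a partition
[user@login ~]$ squeue -u $USER                    #your queued and running jobs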