Virginia Tech provides computing resources suitable for simulations and machine learning model training through Advanced Research Computing (ARC). ARC offers a user-friendly guide on how to use its systems, but I think it is worth summarizing the most essential parts so you can use it effectively. I also refer to Ramashish Gaurav‘s blog post about how to submit jobs as well as how to use the resources in interactive mode.
This link contains all the ARC workshops that are available on the internet.
- Advanced Research Computing (ARC) Overview, 22 Spring: Slides, Video
- Connect to ARC systems and run your first jobs, 22 Spring: Slides, Video
- Get your software/code to run on ARC, 22 Spring: Slides, Video
- Monitoring Resource Utilization and Job Efficiency
- Getting the Best Data Storage Performance on ARC Filesystems
- Launching in Parallel
First you need to create your account here, and you will then be asked to agree to the terms and conditions when requesting access. You will receive an email confirming that the account has been created. In my case I could use the Infer and TinkerCliffs clusters; each name refers to a different cluster.
Name | Specifications |
TinkerCliffs | 316 nodes * 128 cores (AMD EPYC Rome) + 16 nodes * 96 cores (Intel Cascade Lake-AP) |
Infer | 16 nodes * 32 cores (Skylake) + 1 NVIDIA T4 (2560 CUDA + 320 Tensor cores); 40 nodes * 28 cores (Broadwell) + 2 NVIDIA P100 (3560 CUDA); 40 nodes * 24 cores (Skylake) + 2 NVIDIA V100 (5120 CUDA + 640 Tensor cores) |
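Once you have access, you can also verify the partition layout and node hardware yourself. The following is a small sketch using generic Slurm sinfo format options (partition names and counts will differ between clusters):
[user@server ~]$ sinfo -o "%P %D %c %G %m"   #partition, node count, CPUs per node, GPUs (gres), memory per node (MB)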
Access
You can use one of the following lines to log in to the ARC servers. Please be reminded that you don't have access to the servers off campus unless you connect through the school's proxy/VPN.
ssh user@infer1.arc.vt.edu
ssh user@tinkercliffs1.arc.vt.edu
ssh user@tinkercliffs2.arc.vt.edu
When you enter your PID and password (the same credentials used for logging in to VT services), you will receive a call on your phone for two-factor authentication; just pick up the call and press a key. Here is the output after logging in.
+---------------------------------------------------------------------------+
| This computer is the property of Virginia Polytechnic Institute and State |
| University. Use of this equipment implies agreement to the university’s |
| Acceptable Use Policy (Policy 7000). For more information, please visit: |
| https://vt.edu/acceptable-use.html |
+---------------------------------------------------------------------------+
+---------------------------------------------------------------------------+
| NOTE: VT Enterprise Directory Password authentication requires a DUO |
| second factor challenge. After your password is provided, you |
| will receive a DUO challenge. |
+---------------------------------------------------------------------------+
Use the screen command to detach and reattach shell sessions on the server, so that you don't have to worry about incidents like a sudden loss of network connectivity killing time-consuming commands running on the ARC server. After logging in to the server, your public key can be added to the authorized_keys file so that it is easier to log in without a password.
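As a minimal sketch of that workflow (the key file and session name below are just examples):
#on your local machine: copy your public key to the login node (asks for the password once)
ssh-copy-id -i ~/.ssh/id_ed25519.pub user@tinkercliffs1.arc.vt.edu
#on the login node: run long commands inside a named screen session
screen -S build      #start a session named "build"; detach with Ctrl-a d
screen -ls           #list existing sessions
screen -r build      #reattach to the "build" session later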
Software Setup
This section is placed near the top of the article because a user should first know how to set up the specific software they need; it then becomes more logical to understand how to submit customized jobs to the cluster for a computing task that depends on a specific software version. In addition, this section introduces the module command, which is really helpful for managing software environments when submitting jobs and running tasks.
module is based on the EasyBuild framework1. The man page can also be found here2.
https://video.vt.edu/media/Building+Custom+Software+Modules+Manually+on+ARC%27s+Resources/1_ylh24w9q
I use Python as an example of how to set up software.
Python Environment as an example
[user@server ~]$ setup_app --help
setup_app: Setup directory/modulefile for custom software installation
Usage: setup_app package version
Options:
--base dir,--base=dir Specify other directory for installation
--system System-wide install in /apps (arc personnel only, overridden by --base)
--help,-h Print this usage message
Examples:
setup_app R 4.0.2-foss-2020b #home directory install
setup_app --base=/projects/myproject R 4.0.2-foss-2020b #install to /projects/myproject
setup_app --system R 4.0.2-foss-2020b #system-wide install (arc personnel only)
[user@server ~]$ which setup_app
/apps/useful_scripts/bin/setup_app
Based on the help text, this command will set up the software environment in the user's home folder if no other options such as --base have been passed to it.
[user@server ~]$ setup_app python3 3.12.4_generic
Create directories /home/user/apps/tinkercliffs-rome/python3/3.12.4_generic and /home/user/easybuild/modules/tinkercliffs-rome/all/python3? y
Done. To finish your build:
1. Install your app in /home/user/apps/tinkercliffs-rome/python3/3.12.4_generic/
2. Edit the modulefile in /home/user/easybuild/modules/tinkercliffs-rome/all/python3/3.12.4_generic.lua
- Set or remove modules to load in the load() line
- Edit description and URL
- Check the variable names
- Edit paths (some packages don't have, e.g., an include/)
Note: You may need to refresh the cache, e.g.,
module --ignore_cache spider python3
to find the module the first time.
Once the directories are created, further steps need to be done so that the environment is properly deployed. If an option like --base is used, there are additional steps indicated in the output of the setup_app command; just follow those instructions and things will be fine.
[user@server Python-3.12.4]$ module load foss/2023b
[user@server Python-3.12.4]$ module list
Currently Loaded Modules:
1) shared 12) binutils/2.40-GCCcore-13.2.0 23) PMIx/4.2.6-GCCcore-13.2.0
2) slurm/slurm/23.02.7 13) GCC/13.2.0 24) UCC/1.2.0-GCCcore-13.2.0
3) apps 14) numactl/2.0.16-GCCcore-13.2.0 25) OpenMPI/4.1.6-GCC-13.2.0
4) site/tinkercliffs/easybuild/setup 15) XZ/5.4.4-GCCcore-13.2.0 26) OpenBLAS/0.3.24-GCC-13.2.0
5) cray 16) libxml2/2.11.5-GCCcore-13.2.0 27) FlexiBLAS/3.3.1-GCC-13.2.0
6) craype-x86-rome 17) libpciaccess/0.17-GCCcore-13.2.0 28) FFTW/3.3.10-GCC-13.2.0
7) craype-network-infiniband 18) hwloc/2.9.2-GCCcore-13.2.0 29) gompi/2023b
8) useful_scripts 19) OpenSSL/1.1 30) FFTW.MPI/3.3.10-gompi-2023b
9) DefaultModules 20) libevent/2.1.12-GCCcore-13.2.0 31) ScaLAPACK/2.2.0-gompi-2023b-fb
10) GCCcore/13.2.0 21) UCX/1.15.0-GCCcore-13.2.0 32) foss/2023b
11) zlib/1.2.13-GCCcore-13.2.0 22) libfabric/1.19.0-GCCcore-13.2.0
The following steps show how the Python source code is downloaded, extracted, configured to install into the folder created by the setup_app command, and finally built and installed with make. It should be noted that different software packages have their own ways of configuring the build and installation process; in this case, I refer to my previous blog3.
[user@server w_python]$ wget https://www.python.org/ftp/python/3.12.4/Python-3.12.4.tgz
[user@server w_python]$ tar -xzvf Python-3.12.4.tgz
[user@server w_python]$ cd Python-3.12.4
[user@server Python-3.12.4]$ ./configure --prefix='/home/user/apps/tinkercliffs-rome/python3/3.12.4_generic' --enable-optimizations
[user@server Python-3.12.4]$ make -j 64
[user@server Python-3.12.4]$ make install
The Lua module configuration below is generated by default for the Python installation. I presume this file is produced by the setup_app command as a general template with the specific paths plugged in.
whatis("Name: python3")
whatis("Version: 3.12.4_generic")
whatis("Description: PYTHON3DESCRIPTION")
whatis("URL: https://www.python.org")
help([[
PYTHON3DESCRIPTION
Define Environment Variables:
$EBROOTPYTHON3 - root
$PYTHON3_DIR - root
$PYTHON3_BIN - binaries
$PYTHON3_INC - includes
$PYTHON3_LIB - libraries
$PYTHON3_LIB64 - libraries
Prepend Environment Variables:
PATH += $PYTHON3_BIN
MANPATH += $PYTHON3_DIR/share/man
INCLUDE += $PYTHON3_INC
LD_LIBRARY_PATH += $PYTHON3_LIB
LD_LIBRARY_PATH += $PYTHON3_LIB64
]])
setenv("EBROOTPYTHON3", "/home/wxm/apps/tinkercliffs-rome/python3/3.12.4_generic")
setenv("PYTHON3_DIR", "/home/wxm/apps/tinkercliffs-rome/python3/3.12.4_generic")
setenv("PYTHON3_BIN", "/home/wxm/apps/tinkercliffs-rome/python3/3.12.4_generic/bin")
setenv("PYTHON3_INC", "/home/wxm/apps/tinkercliffs-rome/python3/3.12.4_generic/include")
setenv("PYTHON3_LIB", "/home/wxm/apps/tinkercliffs-rome/python3/3.12.4_generic/lib64")
prepend_path("PATH", "/home/wxm/apps/tinkercliffs-rome/python3/3.12.4_generic/bin")
prepend_path("MANPATH", "/home/wxm/apps/tinkercliffs-rome/python3/3.12.4_generic/share/man")
prepend_path("INCLUDE", "/home/wxm/apps/tinkercliffs-rome/python3/3.12.4_generic/include")
prepend_path("LD_LIBRARY_PATH", "/home/wxm/apps/tinkercliffs-rome/python3/3.12.4_generic/lib")
prepend_path("LD_LIBRARY_PATH", "/home/wxm/apps/tinkercliffs-rome/python3/3.12.4_generic/lib64")
load("foss/2023b")
The module command can check which specific versions of software are available on the system. Here is the shell history showing how to use the module command to load the Python environment that was just installed.
[user@server bin]$ module reset
Resetting modules to system default. Reseting $MODULEPATH back to system default. All extra directories will be removed from $MODULEPATH.
[user@server bin]$ module load python3/3.12.4_generic
[user@server bin]$ module list
Currently Loaded Modules:
1) shared 12) binutils/2.40-GCCcore-13.2.0 23) PMIx/4.2.6-GCCcore-13.2.0
2) slurm/slurm/23.02.7 13) GCC/13.2.0 24) UCC/1.2.0-GCCcore-13.2.0
3) apps 14) numactl/2.0.16-GCCcore-13.2.0 25) OpenMPI/4.1.6-GCC-13.2.0
4) site/tinkercliffs/easybuild/setup 15) XZ/5.4.4-GCCcore-13.2.0 26) OpenBLAS/0.3.24-GCC-13.2.0
5) cray 16) libxml2/2.11.5-GCCcore-13.2.0 27) FlexiBLAS/3.3.1-GCC-13.2.0
6) craype-x86-rome 17) libpciaccess/0.17-GCCcore-13.2.0 28) FFTW/3.3.10-GCC-13.2.0
7) craype-network-infiniband 18) hwloc/2.9.2-GCCcore-13.2.0 29) gompi/2023b
8) useful_scripts 19) OpenSSL/1.1 30) FFTW.MPI/3.3.10-gompi-2023b
9) DefaultModules 20) libevent/2.1.12-GCCcore-13.2.0 31) ScaLAPACK/2.2.0-gompi-2023b-fb
10) GCCcore/13.2.0 21) UCX/1.15.0-GCCcore-13.2.0 32) foss/2023b
11) zlib/1.2.13-GCCcore-13.2.0 22) libfabric/1.19.0-GCCcore-13.2.0 33) python3/3.12.4_generic
[user@server bin]$ which python3
~/apps/tinkercliffs-rome/python3/3.12.4_generic/bin/python3
[user@server bin]$ which python3.12
~/apps/tinkercliffs-rome/python3/3.12.4_generic/bin/python3.12
Additional module usage
module list #list currently loaded modules
module load intel/2019b #load a specific module/version
module show glew #show info of a loaded module
module display glew #same as module show
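A few more module subcommands I find useful; these are standard Lmod commands rather than anything ARC-specific:
module avail             #list modules that can be loaded directly
module spider python3    #search the whole module tree, including custom modules
module purge             #unload all currently loaded modules
module reset             #restore the default set of modules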
Scheduler and Cluster Resource Manager (SLURM)
The clusters use Slurm (originally short for Simple Linux Utility for Resource Management) as the scheduler and cluster resource manager4. https://slurm.schedmd.com/pdfs/summary.pdf
https://video.vt.edu/media/ARCA+Interactive+and+Batch+Jobs/1_doz5ylqg
Users only log in to the login node to access the computing resources, which means the actual jobs should not be run on the login node but on the compute nodes within the cluster, through a scheduler. This principle explains why a scheduler is needed rather than running commands directly.
Ramashish Gaurav's blog post discusses how to use Slurm on Advanced Research Computing. In essence, ARC accepts jobs through Slurm along with an estimate of the hardware resources needed. There are two ways of submitting jobs: batch jobs via a bash script, and interactive mode, for example Jupyter notebook sessions.
interactive mode
Interactive mode is helpful for running Jupyter notebook sessions, debugging code, and other on-the-go tasks that need instant feedback. Basically it is useful for dealing with tasks that are still half-baked, where you are not completely sure the code works flawlessly at scale.
The following command shows how to start an interactive session; it takes some time for Slurm to allocate the resources for it.
[user@login ~]$ interact -N1 --ntasks-per-node=48 -t240:00 -A personal --partition=interactive_q
srun: job 2447900 queued and waiting for resources
srun: job 2447900 has been allocated resources
[user@tc307 ~]$
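The interact command is an ARC-provided wrapper; if it is unavailable, a roughly equivalent generic Slurm invocation is the sketch below, reusing the partition and account names from the example above:
[user@login ~]$ srun -N 1 --ntasks-per-node=48 -t 240:00 -A personal --partition=interactive_q --pty bash
Inside the allocated session you can then work as in a normal shell; for example, the transcript below downloads, builds, and runs the mt-dgemm benchmark.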
[user@tc307 ~]$ wget https://portal.nersc.gov/project/m888/apex/mt-dgemm_160114.tgz
[user@tc307 ~]$ tar -xzvf mt-dgemm_160114.tgz
./mt-dgemm/
./mt-dgemm/Makefile
./mt-dgemm/mt-dgemm.c
./mt-dgemm/README.APEX
[user@tc307 ~]$ cd mt-dgemm
[user@tc307 mt-dgemm]$ make
[user@tc307 mt-dgemm]$ ./mt-dgemm
Matrix size defaulted to 256
Alpha = 1.000000
Beta = 1.000000
Allocating Matrices...
Allocation complete, populating with values...
Performing multiplication...
Calculating matrix check...
===============================================================
Final Sum is: 256.033333
-> Solution check PASSED successfully.
Memory for Matrices: 1.500000 MB
Multiply time: 1.409793 seconds
FLOPs computed: 1010565120.000000
GFLOP/s rate: 0.716818 GF/s
===============================================================
batch job
Unlike interactive mode, batch jobs are designed for time-consuming tasks where you already have confidence in your code, so that the outcome of the job will be fruitful. In this way, multiple jobs can be scheduled through Slurm without human interaction or intervention.
The following is an example bash script for submitting a batch job.
#!/bin/bash
#SBATCH -N1
#SBATCH --ntasks-per-node=128
#SBATCH -t 30:00
#SBATCH -p normal_q
#SBATCH -A personal
#load module
module reset
module load intel/2019b
#compile
make
#run
export OMP_NUM_THREADS=128
./mt-dgemm 4000
In the above script, the lines that look like comments (the #SBATCH directives) are actually parsed by Slurm to allocate the job. I found that it is acceptable to drop the -t option, as it is really difficult to estimate how much time a program will take to complete.
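Since the Infer cluster has GPU nodes, here is a minimal sketch of what a GPU batch script could look like; the partition name t4_normal_q and the script train.py are my own placeholders, so check ARC's documentation and sinfo for the actual values:
#!/bin/bash
#SBATCH -N1
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:1       #request one GPU (generic Slurm syntax)
#SBATCH -t 60:00
#SBATCH -p t4_normal_q     #assumed GPU partition on Infer; verify with sinfo
#SBATCH -A personal
#load module
module reset
#run
nvidia-smi                 #confirm the allocated GPU is visible
python3 train.py           #placeholder for your own script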
The following commands are used to check on the submitted jobs.
[user@login mt-dgemm]$ sbatch mt-dgemm.sh
Submitted batch job 2447928
[user@login mt-dgemm]$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2447928 normal_q mt-dgemm user PD 0:00 1 (AssocGrpBillingMinutes)
2447933 normal_q mt-dgemm user R 0:05 1 tc004
[user@login mt-dgemm]$ scancel 2447928
[user@login mt-dgemm]$ skill 2447933
Using squeue, all of your submitted jobs can be printed out for inspection. It should be noted that the output of a given job is stored in a file named slurm-JobID.out, for example slurm-2450827.out.
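To follow a job's output and check how efficiently it used its allocation, standard Slurm tools can be used; seff is a contributed script that may or may not be installed on a given cluster:
[user@login mt-dgemm]$ tail -f slurm-2450827.out   #follow the job output as it is written
[user@login mt-dgemm]$ seff 2450827                #CPU/memory efficiency summary after the job finishes
[user@login mt-dgemm]$ sacct -j 2450827 --format=JobID,JobName,Elapsed,MaxRSS,State   #accounting record for the job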
Accounting
Useful commands for checking storage quotas, partitions, queued jobs, and allocation usage: quota, scontrol show part, squeue, showusage, tcgetusage <accountname>, showqos.
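A sketch of how a few of these can be invoked; quota, scontrol, and squeue are standard Linux/Slurm commands, while showusage, tcgetusage, and showqos are ARC-specific wrappers whose options I have not covered here, so check their help output:
[user@login ~]$ quota                              #filesystem quota usage
[user@login ~]$ scontrol show partition normal_q   #limits and nodes of a partition
[user@login ~]$ squeue -u $USER                    #your queued and running jobs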