Unofficial Guide to hpc-cluster at OSU.
This is an unofficial guide on how to request resources and use them fairly for research.
Available Nodes
sinfo -o "%P %N %c %G %l %f"
PARTITION | NODELIST | CPUS | GRES | TIMELIMIT | AVAIL_FEATURES |
---|---|---|---|---|---|
share | cn-d-5 | 40 | (null) | 7-00:00:00 | broadwell |
share | cn-h-8,cn-i-5 | 24+ | (null) | 7-00:00:00 | skylake |
share | cn-0-[1-9],cn-1-[1-9],cn-2-[1-9],cn-3-12,cn-4-[3-4],cn-5-[1-9] | 12 | (null) | 7-00:00:00 | x5650 |
share | cn-7-[1-2,4-9,11-15],cn-8-[1-8,11-13,15] | 8 | (null) | 7-00:00:00 | harpertown |
share | cn-9-[2-4] | 12 | (null) | 7-00:00:00 | sandybridge |
share | cn-e-[7-8] | 20 | (null) | 7-00:00:00 | haswell |
share | cn-g-[01-08] | 8 | (null) | 7-00:00:00 | opteron |
share | cn-h-4 | 40 | gpu:2(S:0-1) | 7-00:00:00 | skylake,gtx1080 |
share | cn-j-4 | 32 | gpu:2(S:0-1) | 7-00:00:00 | skylake,g6130,gtx980 |
sharegpu | cn-f-5 | 20 | gpu:2 | 7-00:00:00 | haswell,e5-2660v3,m60 |
sharegpu | cn-h-4 | 40 | gpu:2(S:0-1) | 7-00:00:00 | skylake,gtx1080 |
sharegpu | cn-j-4 | 32 | gpu:2(S:0-1) | 7-00:00:00 | skylake,g6130,gtx980 |
sharegpu | compute-gpu | 24 | gpu:2 | 7-00:00:00 | haswell,m60 |
dgx2 | compute-dgx2-[1-6] | 96 | gpu:16(S:0-1) | 7-00:00:00 | skylake,v100 |
gpu | cn-gpu[1-2] | 20 | gpu:8 | 7-00:00:00 | haswell,gtx1080 |
gpu | cn-gpu5 | 40 | gpu:8(S:0-1) | 7-00:00:00 | skylake,g6248,rtx8000 |
gpu | compute-gpu[3-4] | 20 | gpu:8(S:0-1) | 7-00:00:00 | haswell,rtx2080 |
dgx | compute-dgx2-[4-6] | 96 | gpu:16(S:0-1) | 7-00:00:00 | skylake,v100 |
dgxs | compute-dgxs-[1-3] | 40 | gpu:4 | 2-00:00:00 | broadwell,v100 |
class | compute-dgxs-[1-3] | 40 | gpu:4 | 1:00:00 | broadwell,v100 |
eecs | cn-0-[1-9],cn-1-[1-13],cn-2-[1-12] | 12+ | (null) | 7-00:00:00 | x5650 |
eecs3 | cn-f-[1-4] | 20 | (null) | 7-00:00:00 | haswell,e5-2660v3 |
eecs3 | cn-f-5 | 20 | gpu:2 | 7-00:00:00 | haswell,e5-2660v3,m60 |
mime1 | cn-a-[1-2] | 16 | (null) | 7-00:00:00 | sandybridge |
mime2 | cn-c-[01-06] | 16 | (null) | 7-00:00:00 | ivybridge,e5-2650v2 |
mime2 | cn-c-[11-12] | 16 | (null) | 7-00:00:00 | haswell,e5-2630v3 |
mime3 | cn-d-[1-5] | 28+ | (null) | 15-00:00:00 | broadwell |
mime4 | cn-l-[1-2] | 20 | (null) | 7-00:00:00 | skylake |
mime4 | cn-e-1 | 20 | gpu:1 | 7-00:00:00 | haswell,k40m |
mime4 | cn-e-[2-8] | 20 | (null) | 7-00:00:00 | haswell |
mime5 | cn-h-[1-3] | 40 | gpu:2(S:0-1) | 7-00:00:00 | skylake,gtx1080 |
mime5 | cn-h-[5-7] | 44 | (null) | 7-00:00:00 | skylake |
mime6 | cn-j-4 | 32 | gpu:2(S:0-1) | 7-00:00:00 | skylake,g6130,gtx980 |
mime7 | cn-k-[1-2] | 20 | (null) | 7-00:00:00 | skylake |
cbee | cn-b-[1-6] | 16 | (null) | 7-00:00:00 | ivybridge,e5-2650v2 |
cbee | cn-i-[1-5] | 24 | (null) | 7-00:00:00 | skylake |
forestry | cn-9-[1-4] | 12 | (null) | 7-00:00:00 | sandybridge |
cascades | cn-m-1 | 8 | gpu:6(S:0-1) | 7-00:00:00 | skylake,t4 |
cascades | cn-m-2 | 16 | gpu:2(S:0-1) | 7-00:00:00 | skylake,rtx6000 |
nacse | cn-3-[1-11] | 12 | (null) | 7-00:00:00 | x5650 |
nacse | cn-n-[1-6] | 36 | (null) | 7-00:00:00 | skylake |
nerhp | cn-4-[3-4] | 12 | (null) | 7-00:00:00 | x5650 |
nerhp | cn-4-[5-8] | 28 | (null) | 7-00:00:00 | skylake,gold |
chem | cn-g-[01-08] | 8 | (null) | 7-00:00:00 | opteron |
matsci | cn-5-[2-9] | 12 | (null) | 7-00:00:00 | x5650 |
preempt | cn-gpu[1-2] | 20 | gpu:8 | 7-00:00:00 | haswell,gtx1080 |
preempt | cn-m-1 | 8 | gpu:6(S:0-1) | 7-00:00:00 | skylake,t4 |
preempt | compute-gpu[3-4] | 20 | gpu:8(S:0-1) | 7-00:00:00 | haswell,rtx2080 |
preempt | cn-1-[10-13],cn-2-[10-12],cn-5-[10-12] | 12+ | (null) | 7-00:00:00 | x5650 |
preempt | cn-4-[5-8] | 28 | (null) | 7-00:00:00 | skylake,gold |
preempt | cn-9-1 | 12 | (null) | 7-00:00:00 | sandybridge |
preempt | cn-b-[1-5] | 16 | (null) | 7-00:00:00 | ivybridge,e5-2650v2 |
preempt | cn-c-[11-12] | 16 | (null) | 7-00:00:00 | haswell,e5-2630v3 |
preempt | cn-h-[1-3] | 40 | gpu:2(S:0-1) | 7-00:00:00 | skylake,gtx1080 |
preempt | cn-h-[5-7],cn-i-[1-4] | 24+ | (null) | 7-00:00:00 | skylake |
preempt | cn-m-2 | 16 | gpu:2(S:0-1) | 7-00:00:00 | skylake,rtx6000 |
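If you care about a particular CPU generation or GPU model, the names in the AVAIL_FEATURES column can be passed to Slurm as a constraint. A minimal sketch using a partition and feature from the table above (adjust the time and CPU count to your needs; depending on your group you may also need the -A flag shown in the examples further down):
srun -p share --constraint=skylake --time 1-0 -c 4 bash
Here --constraint (short form -C) restricts the job to nodes whose feature list contains skylake.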
Getting Started
Obtain a CoE account
To use the CoE HPC cluster, you must have an engineering account and be sponsored by a principal investigator.
SSH access
You must use an SSH client app to connect to the CoE HPC cluster. If you are using Linux or a Mac computer, then you can just use the ssh command to connect to any one of the three CoE HPC submit hosts (submit-a, submit-b, submit-c), e.g.:
ssh myEngrAcct@submit-a.hpc.engr.oregonstate.edu
If you are using Windows, you need to download an SSH client such as MobaXterm or PuTTY.
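If you connect often, an entry in your local ~/.ssh/config saves typing; this is only a convenience sketch (the alias hpc is arbitrary, and myEngrAcct stands in for your engineering account):
Host hpc
    HostName submit-a.hpc.engr.oregonstate.edu
    User myEngrAcct
After that, ssh hpc is enough to reach the submit host.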
Reserving cluster resources with screen:
You can use either screen or tmux to keep the session alive while running long jobs such as ML model training. Once the resource is allocated, you can ssh directly to the allocated node and run your tasks for up to the TIMELIMIT shown in the table above.
Here's a quick guide on how to do that:
- Log in to one of the CoE HPC submit hosts (submit-a, submit-b, submit-c).
- On the command prompt, type screen.
- Run the desired program, e.g.:
srun -A eecs -p dgx2 --time 7-0 -c 4 --gres=gpu:1 --mail-user=mishrash@oregonstate.edu --mail-type=ALL bash
This requests 7 days of walltime, 4 CPUs, and 1 GPU (a non-interactive sbatch alternative is sketched after this list).
- Use the key sequence Ctrl-a followed by Ctrl-d to detach from the screen session.
- Exit the submit host (type exit or press Ctrl-d); the detached screen session keeps running.
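If you do not need an interactive shell at all, a batch job avoids keeping a screen session around. Here is a minimal sketch, assuming the same eecs account and dgx2 partition as above; the script name train.sh and the python command are placeholders for whatever you actually run:
#!/bin/bash
#SBATCH -A eecs
#SBATCH -p dgx2
#SBATCH --time=7-00:00:00
#SBATCH -c 4
#SBATCH --gres=gpu:1
#SBATCH --mail-user=mishrash@oregonstate.edu
#SBATCH --mail-type=ALL
module load cuda/10.1   # load whatever toolchain your job needs
python train.py         # placeholder for your actual workload
Save this as train.sh and submit it with sbatch train.sh; Slurm runs it on the allocated node and emails you when it starts and finishes.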
Whether you used srun or sbatch, at this point you might get the resource allocation immediately, or you may have to wait until the requested resources are available. Once the resources have been allocated you should receive an email (if --mail-user and --mail-type were set). You can check which node was assigned as follows:
- Log in to any submit host.
- Type
squeue -u mishrash
- The output will look something like:
68985 dgx2 bash mishrash R 2:46:24 1 compute-dgx2-6
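The last column is the node your job landed on. If you only want that node name (for example to script the ssh step below), the standard squeue format options can give it directly:
squeue -u mishrash -h -o "%N"
Here -h suppresses the header and %N prints only the node list.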
To reattach the screen session (optional):
- Log in to the same submit host.
- Type
screen -ls
and copy the session id.
- Reattach to the screen session by typing screen -r <session id>.
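If you prefer tmux (mentioned above), the workflow is essentially the same; a sketch with a hypothetical session name hpc:
tmux new -s hpc        # start a named session, then run srun inside it
# detach with Ctrl-b followed by d
tmux ls                # back on the same submit host, list sessions
tmux attach -t hpc     # reattach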
Log into the allocated resource
Once the resource has been allocated, you may log into the allocated node directly, as if you had direct access to it, until the TIMELIMIT expires. You may open as many terminals as required for multiple tasks.
For example:
ssh mishrash@compute-dgx2-6.hpc.engr.oregonstate.edu
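Once you are on a GPU node, it is worth confirming that the GPU(s) you requested are actually visible before starting a long run:
nvidia-smi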
Tips and Tricks:
- Type
module avail
to see the available versions of compilers and toolkits such as gcc, cuda, etc. Load the required package with module load cuda/10.1 or module load gcc/7.5.
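A few more standard module subcommands that come in handy (these are generic environment-modules commands, not cluster-specific):
module list              # show what is currently loaded
module unload cuda/10.1  # remove a single module
module purge             # start from a clean environment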
- VNC on the remote server:
- After logging into the remote server, type
vncserver -geometry 1024x720
- On your local machine, open a new terminal and set up port forwarding with
ssh -L5901:localhost:5901 mishrash@compute-dgx2-6.hpc.engr.oregonstate.edu
- Download and install a VNC viewer and create a new connection to the VNC server at localhost:5901.
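When you are done, shut the VNC server down so it does not keep running on the node. The display number below is an assumption (the first vncserver you start is usually :1, which maps to port 5901):
vncserver -kill :1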
- Check cluster load with
sinfo -o "%n %c %e %G %O" -p dgx2
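To see who is currently running on a partition (another rough gauge of contention), you can also run:
squeue -p dgx2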
- Allocate a specific node with
srun -A eecs -p dgx2 -w compute-dgx2-2 --time 7-0 -c 8 --gres=gpu:1 bash
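Before pinning a job to a particular node like this, you can inspect that node's current state, features, and GPU inventory with the standard scontrol command (the node name here is just an example):
scontrol show node compute-dgx2-2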
With great power comes great responsibility!
Currently the Slurm limits are as follows:
- 32 total jobs, 16 running jobs
- 16 gpus per user at a time (e.g. either 1 job with 16 gpus or 2 jobs with 8, or 4 jobs with 4, etc.)
- 32 cpus per user at a time
- 12 hours default walltime limit, 7 days maximum walltime limit
These limits still give faculty and students immense compute capability: a single user can hold up to 16 V100 GPUs at once, which on the DGX-2 nodes amounts to 512 GB of GPU memory (16 × 32 GB). However, these resources are shared with hundreds of students and faculty members, so it is highly recommended to request only the resources you actually need, not everything you are allowed to.
Some ways to ensure fair usage could be:
- Check the utilization of the resources you already hold before requesting additional GPU resources (see the sketch after this list).
- For example, when tuning hyperparameters, consider running several experiments on the same GPU with a smaller batch size to make full use of its memory before allocating an additional GPU; someone else might need that GPU too.
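A quick way to check whether your current GPU is actually saturated before asking for another one (run these on the allocated GPU node; the 5-second watch interval is arbitrary):
watch -n 5 nvidia-smi
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv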