About Slurm Jobs | GPUSOROBAN Compute Cluster
This article explains the basics of Slurm jobs on the GPUSOROBAN compute cluster.
After logging in to the job execution server over SSH, you can run jobs with Slurm commands.
Basic operation check
sinfo -s
#Command overview
Outputs information about partitions (virtual groups of compute nodes).
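Some other commonly used forms (an illustrative sketch, not an exhaustive list):
sinfo         # default view: one line per partition/state combination
sinfo -N -l   # node-oriented long format: one line per node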
#Output example
Note: the -s option condenses the output into a summary view.
asuserXXXXX@kra-loginXX:~$ sinfo -s
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
GPU* up 7-00:00:00 0/18/0/18 kra-[XXX-XXX,XXX-XXX]
Grp1 up 7-00:00:00 0/2/0/2 kra-[XXX-XXX]
#How to read the output
In the summary view, NODES(A/I/O/T) shows node counts as Allocated/Idle/Other/Total.
For details, see the sinfo page on the official Slurm website.
sbatch
#Command overview
Submits a batch job (script) to Slurm.
A job script must be created in advance.
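The contents of the sample script are not shown in this article. The following is a minimal sketch of what a script like the 8gpu.sh used below might contain: the job name test-hpl and the hpl.sh step appear in the outputs in this article, the partition GPU matches the sinfo output above, and all other values are assumptions.

#!/bin/bash
#SBATCH --job-name=test-hpl    # job name (shown in squeue)
#SBATCH --partition=GPU        # partition to submit to
#SBATCH --nodes=1              # number of nodes
#SBATCH --gpus-per-node=8      # request 8 GPUs on the node
#SBATCH --time=01:00:00        # wall-clock time limit
#SBATCH --output=%x-%j.out     # log file (%x = job name, %j = job ID)

srun ./hpl.sh                  # run the application as a job step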
#Output example
Running the prepared sample job script 8gpu.sh:
asuserXXXXX@kra-loginXX:~/n/shell$ sbatch ./8gpu.sh
Submitted batch job 1115
asuserXXXXX@kra-loginXX:~/n/shell$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1115 GPU test-hpl asuserXX R 0:12 1 kra-XXX
When submission succeeds, the job ID is printed after "Submitted batch job".
#Commonly used options
Commonly used options are shown below.
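(An illustrative selection; this is not an exhaustive or platform-specific list, and the values are placeholders.)
--job-name=NAME      # job name shown in squeue
--partition=NAME     # partition to submit to (e.g. GPU)
--nodes=N            # number of nodes to allocate
--ntasks=N           # number of tasks (processes) to run
--gpus-per-node=N    # GPUs to request on each node
--time=D-HH:MM:SS    # wall-clock time limit
--output=FILE        # file for stdout (%j expands to the job ID)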
For details, see the sbatch page on the official Slurm website.
squeue
#Command overview
Lists the status of jobs and job steps in the Slurm job queue (running, pending, and so on).
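Useful filtering options (illustrative; the job ID and partition name are placeholders):
squeue -u $USER   # show only your own jobs
squeue -j 1115    # show a specific job ID
squeue -p GPU     # show jobs in a specific partition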
#Output example
Displays the state of the job queue.
asuserXXXXX@kra-login02:~/n/shell$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1115 GPU test-hpl asuser10 R 0:12 1 kra-101
#How to read the output
For details, see the squeue page on the official Slurm website.
srun
#Command overview
Runs a computation on resources allocated by Slurm (an interactive job).
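For example, a quick interactive GPU check on a single node might look like this (a sketch; the GPU count of 8 is an assumption based on the 8-GPU job script used elsewhere in this article):
srun --nodes=1 --gpus-per-node=8 nvidia-smi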
#Output example
asuserXXXXX@kra-login02:~/n/shell$ srun --nodes=10 hostname
srun: job 1120 queued and waiting for resources
srun: job 1120 has been allocated resources
kra-109
kra-110
kra-106
kra-112
kra-105
kra-108
kra-113
kra-114
kra-111
kra-107
scancel
#Command overview
Cancels a running job.
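Jobs can also be selected by attributes rather than by job ID (illustrative; the values are placeholders):
scancel 1115              # cancel a specific job
scancel -u $USER          # cancel all of your own jobs
scancel --name=test-hpl   # cancel jobs with a given name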
#Output example
asuserXXXXX@kra-login02:~/n/shell$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1115 GPU test-hpl asuser10 R 2:37 1 kra-101
asuserXXXXX@kra-login02:~/n/shell$ scancel 1115
asuserXXXXX@kra-login02:~/n/shell$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
sstat
#Command overview
Displays statistics (CPU time, memory usage, and so on) for running jobs.
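The default output is very wide (see below), so it is often more practical to select specific fields with --format (a sketch using standard sstat field names; the job ID is a placeholder):
sstat -j 1115 --format=JobID,NTasks,AveCPU,MaxRSS,MaxDiskRead,MaxDiskWrite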
#Output example
asuserXXXXX@kra-login02:~/n/shell$ sstat 1115
JobID MaxVMSize MaxVMSizeNode MaxVMSizeTask AveVMSize MaxRSS MaxRSSNode MaxRSSTask AveRSS MaxPages MaxPagesNode MaxPagesTask AvePages MinCPU MinCPUNode MinCPUTask AveCPU NTasks AveCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov ConsumedEnergy MaxDiskRead MaxDiskReadNode MaxDiskReadTask AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask AveDiskWrite TRESUsageInAve TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask TRESUsageInMin TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot TRESUsageOutAve TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutMin TRESUsageOutMinNode TRESUsageOutMinTask TRESUsageOutTot
------------ ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ------------- ------------- ------------- -------------- ------------ --------------- --------------- ------------ ------------ ---------------- ---------------- ------------ -------------- -------------- ------------------ ------------------ -------------- ------------------ ------------------ -------------- --------------- --------------- ------------------- ------------------- --------------- ------------------- ------------------- ---------------
1115.0 0 kra-101 0 0 3529368K kra-101 4 3454863.5+ 1625 kra-101 5 1071 00:00:20 kra-101 2 00:00:20 8 3.98M Unknown Unknown Unknown 0 78423799 kra-101 4 78387822 8659820 kra-101 0 8657683 cpu=00:00:20,+ cpu=00:00:20,+ cpu=kra-101,energ+ cpu=00:00:00,fs/d+ cpu=00:00:20,+ cpu=kra-101,energ+ cpu=00:00:00,fs/d+ cpu=00:02:46,+ energy=0,fs/di+ energy=0,fs/di+ energy=kra-101,fs/+ fs/disk=0 energy=0,fs/di+ energy=kra-101,fs/+ fs/disk=2 energy=0,fs/di+
sacct
#Command overview
Displays the job history (accounting data).
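Useful variants (illustrative; the job ID and date are placeholders):
sacct -j 1115                               # history for a specific job
sacct -S 2024-01-01                         # jobs started since a given date
sacct --format=JobID,JobName,State,Elapsed  # choose the output columns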
#Output example
asuserXXXXX@kra-login02:~/n/shell$ sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1053 test-hpl.+ GPU asuser100+ 28 FAILED 1:0
1053.batch batch asuser100+ 28 FAILED 1:0
1053.extern extern asuser100+ 28 COMPLETED 0:0
1053.0 hpl.sh asuser100+ 28 FAILED 1:0
1054 test-hpl.+ GPU asuser100+ 28 FAILED 1:0
1054.batch batch asuser100+ 28 FAILED 1:0
1054.extern extern asuser100+ 28 COMPLETED 0:0
1054.0 hpl.sh asuser100+ 28 FAILED 1:0
1055 test-hpl.+ GPU asuser100+ 28 FAILED 1:0
1055.batch batch asuser100+ 28 FAILED 1:0
1055.extern extern asuser100+ 28 COMPLETED 0:0
1055.0 hpl.sh asuser100+ 28 FAILED 1:0
1057 test-hpl.+ GPU asuser100+ 28 COMPLETED 0:0
1057.batch batch asuser100+ 28 COMPLETED 0:0
1057.extern extern asuser100+ 28 COMPLETED 0:0
1057.0 nvidia-smi asuser100+ 28 COMPLETED 0:0
1059 test-hpl.+ GPU asuser100+ 224 COMPLETED 0:0
1059.batch batch asuser100+ 224 COMPLETED 0:0
1059.extern extern asuser100+ 224 COMPLETED 0:0
1059.0 nvidia-smi asuser100+ 224 COMPLETED 0:0
1060 test-hpl.+ GPU asuser100+ 224 COMPLETED 0:0
1060.batch batch asuser100+ 224 COMPLETED 0:0
1060.extern extern asuser100+ 224 COMPLETED 0:0
1060.0 hpl.sh asuser100+ 224 COMPLETED 0:0
1062 test-hpl.+ GPU asuser100+ 224 COMPLETED 0:0
1062.batch batch asuser100+ 224 COMPLETED 0:0
1062.extern extern asuser100+ 224 COMPLETED 0:0
1062.0 hpl.sh asuser100+ 224 COMPLETED 0:0
1064 test-hpl.+ GPU asuser100+ 224 FAILED 1:0
1064.batch batch asuser100+ 224 FAILED 1:0
1064.extern extern asuser100+ 224 COMPLETED 0:0
1064.0 hpl.sh asuser100+ 224 FAILED 1:0
1065 test-hpl.+ GPU asuser100+ 224 COMPLETED 0:0
1065.batch batch asuser100+ 224 COMPLETED 0:0
1065.extern extern asuser100+ 224 COMPLETED 0:0
1065.0 hpl.sh asuser100+ 224 COMPLETED 0:0
1066 test-hpl.+ GPU asuser100+ 28 FAILED 1:0
1066.batch batch asuser100+ 28 FAILED 1:0
1066.extern extern asuser100+ 28 COMPLETED 0:0
1066.0 hpl.sh asuser100+ 28 FAILED 1:0
1068 test-hpl.+ GPU asuser100+ 224 COMPLETED 0:0
1068.batch batch asuser100+ 224 COMPLETED 0:0
1068.extern extern asuser100+ 224 COMPLETED 0:0
1068.0 hpl.sh asuser100+ 224 COMPLETED 0:0
1083 test-hpl.+ GPU asuser100+ 224 COMPLETED 0:0
1083.batch batch asuser100+ 224 COMPLETED 0:0
1083.extern extern asuser100+ 224 COMPLETED 0:0
1083.0 hpl.sh asuser100+ 224 COMPLETED 0:0
1090 test-hpl.+ GPU asuser100+ 224 COMPLETED 0:0
1090.batch batch asuser100+ 224 COMPLETED 0:0
1090.extern extern asuser100+ 224 COMPLETED 0:0
1090.0 hpl.sh asuser100+ 224 COMPLETED 0:0
1096 test-hpl.+ GPU asuser100+ 224 COMPLETED 0:0
1096.batch batch asuser100+ 224 COMPLETED 0:0
1096.extern extern asuser100+ 224 COMPLETED 0:0
1096.0 hpl.sh asuser100+ 224 COMPLETED 0:0
1104 test-hpl.+ GPU asuser100+ 224 COMPLETED 0:0
1104.batch batch asuser100+ 224 COMPLETED 0:0
1104.extern extern asuser100+ 224 COMPLETED 0:0
1104.0 hpl.sh asuser100+ 224 COMPLETED 0:0
1108 test-hpl.+ GPU asuser100+ 224 COMPLETED 0:0
1108.batch batch asuser100+ 224 COMPLETED 0:0
1108.extern extern asuser100+ 224 COMPLETED 0:0
1108.0 hpl.sh asuser100+ 224 COMPLETED 0:0
1115 test-hpl.+ GPU asuser100+ 224 RUNNING 0:0
1115.batch batch asuser100+ 224 RUNNING 0:0
1115.extern extern asuser100+ 224 RUNNING 0:0
1115.0 hpl.sh asuser100+ 224 RUNNING 0:0
salloc
#Command overview
Allocates resources for a job; in the example below, an interactive shell then starts on an allocated node.
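A sketch of a typical request with an explicit time limit (the limit value is illustrative):
salloc --nodes=1 --ntasks=1 --time=01:00:00
Once the allocation is granted, commands such as srun run inside it, as the example below shows.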
#Output example
asuserXXXXX@kra-login02:~/n/shell$ salloc --nodes=1 --ntasks=1
salloc: Granted job allocation 1122
salloc: Waiting for resource configuration
salloc: Nodes kra-XXX are ready for job
asuserXXXXX@kra-103:~/n/shell$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
GPU* up 7-00:00:00 1 mix kra-103
GPU* up 7-00:00:00 1 alloc kra-104
GPU* up 7-00:00:00 16 idle kra-[101-102,105-114,117-120]
Grp1 up 7-00:00:00 2 idle kra-[115-116]
asuserXXXXX@kra-103:~/n/shell$ srun hostname
kra-103
asuserXXXXX@kra-103:~/n/shell$
asuserXXXXX@kra-103:~/n/shell$ exit
exit
salloc: Relinquishing job allocation 1122
Behavior when requests exceed available resources
#Execution overview
Error messages displayed when a job requests more resources than are available.
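To check what a node can provide before submitting, per-node resources can be listed with sinfo output format specifiers (%P = partition, %N = nodes, %c = CPUs, %m = memory in MB, %G = generic resources such as GPUs):
sinfo -o "%P %N %c %m %G"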
#Output example
Node count / GPU count
asuserXXXXX@kra-login02:~/n/shell$ srun --nodes=19 hostname
srun: error: Unable to allocate resources: Requested node configuration is not available
Memory
asuserXXXXX@kra-login02:~/n/shell$ srun --nodes=1 --mem=3938G hostname
srun: error: Memory specification can not be satisfied
srun: error: Unable to allocate resources: Requested node configuration is not available
Core count
asuserXXXXX@kra-login02:~/n/shell$ srun --nodes=1 --cpus-per-task=225 hostname
srun: error: CPU count per node can not be satisfied
srun: error: Unable to allocate resources: Requested node configuration is not available

