.. _slurm-usage-guide:

=================
Slurm usage guide
=================

.. contents::

Introduction
------------

Slurm is an open-source job scheduling system for Linux clusters, most frequently used for high-performance computing (HPC) applications. This guide covers the basics of getting started with Slurm as a user. For more information, the `Slurm documentation <https://slurm.schedmd.com/documentation.html>`_ is a good place to start.

Slurm is deployed on Kosmos, which means a Slurm daemon runs on each compute system. Users do not log in directly to the compute systems to do their work. Instead, they run Slurm commands (e.g. ``srun``\ , ``sinfo``\ , ``scancel``\ , ``scontrol``\ ) from a Slurm login node. These commands communicate with the ``slurmd`` daemons on each host to perform work.

The control node is ``kosmos``. You should log in to this node to run any job. You can log in to a compute node only while one of your jobs is running there, but this should rarely be necessary.

Node Partition and GPU Configuration
------------------------------------

The following table summarizes the node partitions and their GPU/CPU configurations within the cluster:

+--------------+-------------+---------------------------------+------+------+---------+
| PARTITION    | NODE        | GPU TYPE                        | CPUS | GPUS | MEM (G) |
+==============+=============+=================================+======+======+=========+
| a100         | euctemon    | NVIDIA A100 80GB                | 128  | 8    | 980     |
+--------------+-------------+---------------------------------+------+------+---------+
| a100         | eudoxus     | NVIDIA A100 80GB                | 128  | 8    | 980     |
+--------------+-------------+---------------------------------+------+------+---------+
| a6000        | aristarchus | NVIDIA RTX A6000 48GB           | 128  | 8    | 980     |
+--------------+-------------+---------------------------------+------+------+---------+
| a6000        | galileo     | NVIDIA RTX A6000 48GB           | 128  | 8    | 490     |
+--------------+-------------+---------------------------------+------+------+---------+
| a6000        | ptolemaeus  | NVIDIA RTX A6000 48GB           | 128  | 8    | 980     |
+--------------+-------------+---------------------------------+------+------+---------+
| cpu          | gaia        | \-                              | 256  | \-   | 490     |
+--------------+-------------+---------------------------------+------+------+---------+
| p6000        | mariecurie  | NVIDIA Quadro P6000 24GB        | 40   | 4    | 122     |
+--------------+-------------+---------------------------------+------+------+---------+
| rtx2080ti    | alanturing  | NVIDIA GeForce RTX 2080 Ti 11GB | 40   | 8    | 489     |
+--------------+-------------+---------------------------------+------+------+---------+
| rtx2080ti    | hamilton    | NVIDIA GeForce RTX 2080 Ti 11GB | 40   | 8    | 244     |
+--------------+-------------+---------------------------------+------+------+---------+
| rtx2080ti_sm | carlos      | NVIDIA GeForce RTX 2080 Ti 11GB | 24   | 2    | 91      |
+--------------+-------------+---------------------------------+------+------+---------+
| rtx2080ti_sm | plato       | NVIDIA GeForce RTX 2080 Ti 11GB | 24   | 2    | 122     |
+--------------+-------------+---------------------------------+------+------+---------+
| rtx2080ti_sm | schrodinger | NVIDIA GeForce RTX 2080 Ti 11GB | 24   | 2    | 91      |
+--------------+-------------+---------------------------------+------+------+---------+
| rtx8000      | roentgen    | NVIDIA Quadro RTX 8000 48GB     | 40   | 4    | 244     |
+--------------+-------------+---------------------------------+------+------+---------+

SLURM QoS Specifications for Our Cluster
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. _slurm-qos-options:

If you encounter errors when launching new jobs because of an invalid QoS specification, add the appropriate option from the table below to your ``srun`` command:

+----------------------+---------------------+--------------+
| Partition            | QoS Specification   | Maximum GPUs |
+======================+=====================+==============+
| a6000 and rtx8000    | ``--qos=a6000_qos`` | 2            |
+----------------------+---------------------+--------------+
| a100                 | ``--qos=a100_qos``  | 2            |
+----------------------+---------------------+--------------+
| All other partitions | ``--qos=rtx_qos``   | 4            |
+----------------------+---------------------+--------------+

For the a6000 and rtx8000 partitions (both have 48GB GPUs), use ``--qos=a6000_qos``, which allows a maximum of 2 GPUs. Similarly, for the a100 partition, use ``--qos=a100_qos``, also limited to 2 GPUs. For all other partitions, use ``--qos=rtx_qos``, which allows up to 4 GPUs. These QoS specifications ensure that your jobs are assigned to the appropriate partitions with the correct number of GPUs.

Note: other QoS specifications may exist for certain users or groups. If you belong to such a group, consult the guidelines provided by your system administrators to determine the appropriate QoS specification for your jobs.

SLURM Job Submission Options
----------------------------

.. _slurm-options:

The following is a list of commonly used SLURM arguments for job submissions:

- ``--job-name=``: Specifies the name for the job.
- ``--partition=``: Specifies the partition to run the job on.
- ``--nodes=``: Specifies the number of nodes to allocate for the job.
- ``--nodelist=``: Specifies a comma-separated list of specific nodes to run the job on.
- ``--ntasks=``: Specifies the total number of tasks to run for the job.
- ``--ntasks-per-node=``: Specifies the number of tasks to run per node.
- ``--cpus-per-task=``: Specifies the number of CPUs to allocate per task.
- ``--gpus=``: Specifies the number of GPUs to allocate. This cannot exceed the limit set by the ``--qos`` option in use.
- ``--mem=``: Specifies the memory requirement for the job.
- ``--time=
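As a sketch of how these options fit together, here is an example batch script. It is illustrative only: the job name, script name (``train.py``), and resource values are hypothetical; the partition/QoS pairing follows the table above, and you should substitute your own requirements.

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name=example-train   # name shown in squeue (example value)
   #SBATCH --partition=a6000          # one of the partitions listed above
   #SBATCH --qos=a6000_qos            # QoS matching the partition (see QoS table)
   #SBATCH --nodes=1
   #SBATCH --ntasks=1
   #SBATCH --cpus-per-task=8
   #SBATCH --gpus=2                   # must not exceed the QoS maximum (2 here)
   #SBATCH --mem=64G
   #SBATCH --time=04:00:00            # walltime limit in HH:MM:SS

   # launch the task under this allocation; train.py is a placeholder
   srun python train.py

Submit the script with ``sbatch`` and monitor it with ``squeue``. The same options also work with ``srun`` directly for an interactive session, e.g. ``srun --partition=rtx2080ti --qos=rtx_qos --gpus=1 --pty bash``.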