# Getting Started

Before you begin, make sure you have the following:

- Access to an HPC cluster with Slurm installed.
- Basic knowledge of Linux command-line operations.

## Slurm Scheduler

The Simple Linux Utility for Resource Management (SLURM) is commonly used for job scheduling and resource management on high-performance computing (HPC) clusters. Slurm operates on the concepts of jobs, nodes, and partitions. Familiarize yourself with these key terms:

- **Job**: A computational task submitted to the cluster.
- **Node**: A computing resource that performs tasks as part of a job.
- **Partition**: A logical division of the cluster resources.

## Connecting to the HPC Cluster

Use SSH to connect to the HPC cluster:

```
ssh username@hpc-cluster.example.com
```

Replace `username` with your username and `hpc-cluster.example.com` with the actual address of your HPC cluster.

## Submitting a Job

To submit a job, run the `sbatch` command followed by your script file:

```
sbatch script.sh
```

where `script.sh` is the name of your shell script. The script should include the commands necessary for running your analysis or simulation. The output provides a unique job ID (e.g., 12345678). You can view more information about your job using the `squeue` command; see the monitoring example after the sample scripts below.

## Example Slurm Script for the CPUQ Partition

```
#!/bin/bash
#SBATCH --job-name=MyJob             # Job name
#SBATCH --partition=CPUQ             # CPU partition
#SBATCH --output=output.%j           # Standard output (%j = job ID)
#SBATCH --error=error.%j             # Standard error (%j = job ID)
#SBATCH --ntasks=4                   # Total number of tasks (e.g. MPI processes)
#SBATCH --nodes=1                    # Number of nodes
#SBATCH --ntasks-per-node=4          # Tasks per node
#SBATCH --time=00-00:10:00           # Walltime (DD-HH:MM:SS)

# Optional: Load required modules
# module load <module-name>

# Run the CPU job
python3 test.py
```

## Example Slurm Script for the HGXQ Partition (GPU)

```
#!/bin/bash
#SBATCH --job-name=patchLM               # Job name
#SBATCH --partition=HGXQ                 # GPU partition
#SBATCH --gres=gpu:1                     # Request 1 GPU
#SBATCH --nodes=1                        # Number of nodes
#SBATCH --ntasks-per-node=1              # Tasks per node (1 for single-GPU job)
#SBATCH --cpus-per-task=4                # CPUs allocated per task
#SBATCH --mem=100GB                      # Memory per node
#SBATCH --time=00-01:00:00               # Walltime (DD-HH:MM:SS)
#SBATCH --output=output/slurm_%j.out     # Output file (%j = job ID) in output dir
#SBATCH --error=output/slurm_%j.err      # Error file

# Optional: Load modules
# module purge
# module load <module-name>              # Example: load a Python module

# Activate environment (conda or venv)
source venv/bin/activate                 # Adjust path as needed

# Run the GPU-enabled Python script
python3 classifier.py                    # Replace with your script
```

These are basic examples, and you may need to customize them according to your specific needs, such as adjusting resource requirements, module loading, and paths.
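After `sbatch` prints the job ID, you can check on the job with `squeue`, as mentioned in the "Submitting a Job" section. A minimal sketch, reusing the illustrative job ID 12345678 from above:

```
# List all of your own jobs currently in the queue
squeue -u $USER

# Show the entry for one specific job (12345678 is the example ID from above)
squeue -j 12345678
```

While the job is queued or running, `squeue` reports its state, partition, elapsed time, and the node(s) it is using; once the job finishes it no longer appears in the queue.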
You can specify the following parameters in your Slurm script:

- **--job-name=**: Specifies a name for the job. This name is used to identify the job in the queue and in the output and error files.
- **--output=**: Specifies the file where the standard output of the job is written. You can use `%j` in the filename, and Slurm will replace it with the job ID.
- **--error=**: Specifies the file where the standard error of the job is written. Like `--output`, `%j` can be used in the filename.
- **--partition=**: Specifies the partition (queue) to which the job is submitted. Partitions group nodes with similar characteristics.
- **--ntasks=**: Specifies the total number of tasks (processes) to run. This is often used in parallel computing with MPI.
- **--nodes=**: Specifies the number of nodes requested for the job. If not specified, Slurm may allocate the tasks across nodes as needed.
- **--ntasks-per-node=**: Specifies the number of tasks (processes) to run per node.
- **--cpus-per-task=**: Specifies the number of CPU cores to allocate per task. This can be used to control the number of threads per task.
- **--gres=gpu:**: Specifies the number of GPUs to allocate to the job. This is required when running GPU-accelerated applications; Slurm reserves GPUs as generic resources (GRES). For example, `--gres=gpu:1` allocates one GPU for the job.
- **--time=**: Specifies the maximum walltime for the job. The examples above use the DD-HH:MM:SS format; the job is terminated when this limit is reached.
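To illustrate how `--ntasks` and `--cpus-per-task` fit together, here is a minimal sketch of a single-task, multi-threaded job header. The script name `threaded_app.py` is a hypothetical placeholder, and the example assumes the application honours `OMP_NUM_THREADS` (as OpenMP-based code does):

```
#!/bin/bash
#SBATCH --job-name=ThreadedJob       # Job name
#SBATCH --partition=CPUQ             # CPU partition (as in the example above)
#SBATCH --ntasks=1                   # One task (a single process)
#SBATCH --cpus-per-task=4            # Four CPU cores for that task
#SBATCH --time=00-00:10:00           # Walltime (DD-HH:MM:SS)

# Slurm exports the per-task CPU count; pass it on to the threaded application.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

python3 threaded_app.py              # Hypothetical script - replace with your own
```

The total core reservation is `--ntasks` multiplied by `--cpus-per-task`, so this sketch reserves four cores on a single node for one multi-threaded process.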