PBS Batch Job Scheduler
What is PBS? The Portable Batch System (PBS) is software that performs job scheduling. Its main task is to allocate computational tasks (batch jobs) among the available computing resources within a cluster, and it offers ways to monitor and control the workload on those resources. With PBS, jobs are scheduled for execution according to scheduling policies that attempt to fully utilize system resources without overcommitting them, while being fair to all users. For more information about PBS, see the online manual pages, which can be viewed with the man command (for example, man qsub).
Job details are typically contained in a submit file. Unlike interactive jobs, batch jobs are controlled via scripts. These scripts tell the system which resources a job will require and how long the resources will be needed. You can control the resources used for your jobs. Examples of user-controlled resources are the number of nodes, the number of cores and memory requirements.
PBS Job Queues
There are several PBS queues on the cluster. All users should be aware of the various queues and should avoid using queues set up for specific labs without first asking that lab's PI. The queues can be listed with qstat -Q:
$ qstat -Q
Queue              Max   Tot   Ena   Str   Que   Run   Hld   Wat   Trn   Ext T
----------------   ---   ---   ---   ---   ---   ---   ---   ---   ---   --- -
plaut                0     0   yes   yes     0     0     0     0     0     0 E
dsi                  0     0   yes   yes     0     0     0     0     0     0 E
default              0    71   yes   yes  -111   156     0     0     0     0 E
loprio               0     0   yes   yes     0     0     0     0     0     0 E
- plaut - This queue maps to the nodes that were the plaut nodes on the hawk cluster. This queue is reserved for David Plaut's Lab. Plaut lab members should consider using this queue instead of the default queue to keep the default queue open to other users.
- dsi - This queue maps to the nodes that were the dsi nodes on the hawk cluster. This queue is reserved for Timothy Verstynen's Lab. Verstynen lab members should consider using this queue instead of the default queue to keep the default queue open to other users.
- default - This is the default queue, which maps to all nodes that were the CNBC nodes on the hawk cluster and the psycho nodes before the two clusters were merged. It is available to all cluster users. As its name suggests, when no queue is specified, your job is submitted to this queue.
- loprio - This is the low priority queue. It encompasses all nodes on the cluster and is available to all users. It is low priority in that jobs in the higher priority queues (plaut, dsi and default) take precedence: jobs in this queue must wait until those other jobs have been scheduled. There is also a wall time limit on this queue, which ensures that users submitting jobs to the higher priority queues don't have to wait too long for their jobs to be scheduled.
Format for submitting a job (1 core on 1 node):
qsub <script name>
Format for requesting a particular queue:
qsub -q <queuename> <script name>
Example command for submitting a job on the low priority queue:
qsub -q loprio <script name>
Example command requesting an interactive session with X forwarding (an X session) on the low priority queue:
qsub -I -X -q loprio
Reasonable Usage Policies for Running Jobs
Configuration settings prevent users from submitting jobs to just any queue on the system. Users are expected to get permission from the PI before submitting to queues that they would not ordinarily be privileged to use. Once you have received permission, please email me and copy the PI so that you can be added to the queue.
The PBS job scheduler does a very good job of fairly distributing the workload of multiple users' jobs, but it has some constraints. If one user submits a large number of long jobs (more than 4 hours each) that fill most or all of the available nodes, other users will have to wait at least that long before their jobs are scheduled. We need your cooperation so that you are not responsible for "taking over the cluster" and preventing others from having fair access to nodes to do their work.
Here are a couple of recommendations:
- Break your work up into small jobs. Figure out a way to split one long job into 3 or 4 smaller jobs. It is better to have 160 4-hour jobs than 40 16-hour jobs.
- If you cannot break your job into small pieces, submit only 4 or 5 jobs at a time instead of 20 or 40. After your jobs have finished, submit more of them. It really isn't fair to grab all of the nodes for a long period of time, unless you monitor the queue and kill some of your jobs if others submit jobs that aren't being scheduled.
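The "submit a few at a time" recommendation could be sketched as a small wrapper like the one below. This is only an illustration, not a tool installed on the cluster: the names MAX_ACTIVE, QSUB, count_active_jobs and top_up_queue are hypothetical, the limit of 5 is simply taken from the advice above, and QSUB defaults to "echo qsub" so that the commands are printed rather than submitted.

```shell
#!/bin/bash
# Sketch: keep at most MAX_ACTIVE of your own jobs in the queue at once.
MAX_ACTIVE="${MAX_ACTIVE:-5}"   # hypothetical limit, chosen by you
QSUB="${QSUB:-echo qsub}"       # dry run by default; set QSUB=qsub to really submit

# Count job lines in "qstat -u <user>" output read from stdin.
# Job lines are those whose first field starts with a numeric job ID.
count_active_jobs() {
    awk '$1 ~ /^[0-9]+\./ { n++ } END { print n + 0 }'
}

# Submit pending scripts until MAX_ACTIVE jobs are active.
# $1 = number of jobs already active; remaining arguments = scripts to submit.
top_up_queue() {
    local active="$1"; shift
    local script
    for script in "$@"; do
        [ "$active" -ge "$MAX_ACTIVE" ] && break
        $QSUB "$script"
        active=$((active + 1))
    done
}

# Dry-run example: 3 jobs are already active, so only 2 more are submitted.
top_up_queue 3 job01.sh job02.sh job03.sh job04.sh

# On the cluster the first argument would come from qstat, for example:
#   top_up_queue "$(qstat -u "$USER" | count_active_jobs)" job*.sh
```

Re-running the last command from time to time (by hand or from a loop) tops the queue back up as earlier jobs finish, without ever holding more than MAX_ACTIVE slots.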
Jobs are submitted to be run under PBS via the qsub command. For complete details on using qsub, see the online manual page, which can be viewed by executing the command man qsub.
- qsub [options] <PBS script>: will submit a job to the PBS batch scheduler. The script contains the information needed by PBS to allocate resources for your job. It includes directions for handling standard I/O streams, and instructions to run the job.
It is mandatory to use a scriptfile since you cannot submit binary files directly to Torque/PBS. The basic structure of the script is shown below:
#PBS Script options    # Optional PBS directives
shellscripts           # Optional shell commands
application            # Application itself
As seen above, options ('directives') are specified in the Torque/PBS script on lines that begin with "#PBS".
The following table shows a summary of some commonly used directives in Torque/PBS. For further information, please refer to the Torque manual.
Directive                     Description
#PBS -N testjob               Job name to be used in Torque/PBS
#PBS -M user@domain           Sends email notifications to user@domain
#PBS -m e                     Mail is sent at the end of the job
#PBS -e error-file            Redirects standard error to error-file
#PBS -o ~/out                 Redirects standard output to $HOME/out
#PBS -q all.q                 Specifies the queue, e.g. all.q
#PBS -d /home/testuser        Sets the working directory
#PBS -l walltime=00:30:00     Resource request, e.g. 30 minutes of run time
#PBS -l nodes=2:ppn=2         Requests two nodes with two processors each
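Put together, a few of the directives from the table might appear in a job script as follows. This is a sketch only: the file name testjob.sh and the final echo line are placeholders, and the values are the illustrative ones from the table. The block merely writes the script to disk and lists its directives, so it can be inspected before being handed to qsub.

```shell
#!/bin/bash
# Write a small example job script combining several directives from the table.
# The quoted 'EOF' keeps the shell from expanding anything inside the script text.
cat > testjob.sh <<'EOF'
#!/bin/bash
#PBS -N testjob
#PBS -M user@domain
#PBS -m e
#PBS -l walltime=00:30:00
#PBS -l nodes=2:ppn=2
echo "Running on $(hostname)"
EOF

# Show the directive lines that PBS would read from the file.
grep '^#PBS' testjob.sh
```

The directives must come before the first executable command in the script; PBS stops looking for "#PBS" lines once the commands begin.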
The job script that you submit to PBS with the "qsub" command has directives specified in it. You can also add options (e.g. specify a job queue) on the command line. The queues available on the cluster can be displayed by executing 'qstat -q':
# qstat -q
Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
default            --      --       --      --   13 -18 --   E R
plaut              --      --       --      --    0   0 --   E R
dsi                --      --       --      --    0   0 --   E R
loprio             --      --       --      --    0   0 --   E R
If you don't specify the queue, your job will be submitted to the default queue. The queues are defined as:
- default: schedules jobs on all nodes except the plaut and dsi nodes. (YOU DO NOT HAVE TO SPECIFY THIS QUEUE WHEN SUBMITTING A JOB)
- plaut: schedules jobs only on the plaut nodes.
- dsi: schedules jobs only on the dsi nodes.
- loprio: schedules jobs on all of the nodes on the cluster, but jobs in the other queues have higher priority. Use this queue if you would like to take full advantage of all of the nodes on the cluster and don't mind others jumping ahead of you when they submit to the higher priority queues.
To submit your job to a queue other than the default queue, you must specify the queue name. For example, the command below submits a job to the queue "plaut". Upon successful submission, Torque/PBS responds with the job ID assigned to your job.
qsub -q plaut testjob.sh
To display the status of your job(s), use the command 'qstat'. Because 'qstat' by default shows all jobs currently submitted to the cluster, it is convenient to specify the job ID of your job or to restrict the output to your username (e.g. 'testuser'):
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
95.psych-o                testjob.sh       testuser               0 R special
# qstat -u testuser
                                                                               Req'd  Req'd   Elap
Job ID               Username    Queue    Jobname          SessID NDS   TSK    Memory Time  S Time
-------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
95.psych-o.hpc1      testuser    default  testuser.sh       10000    --     --     -- 08:00 R 00:36
When a job runs (whether it succeeds or not), its output is written to your current directory, in files whose names end with the job number assigned by Torque/PBS. By default, there is an error file and an output file for every job.
The output file most likely contains your application's output, whereas the error file can be useful for analysing jobs that failed.
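As an illustration of the naming, the sketch below derives the two file names for a job. It assumes the usual Torque convention of <job name>.o<job number> for standard output and <job name>.e<job number> for standard error, with the job name defaulting to the script name. The helper default_output_files is hypothetical, and the sample job ID is the one from the qstat listing above.

```shell
#!/bin/bash
# Derive the default Torque/PBS output and error file names for a job.
# $1 = job name (by default the script name), $2 = job ID as printed by qsub.
default_output_files() {
    local jobname="$1"
    local seq="${2%%.*}"          # "95.psych-o" -> "95"
    echo "${jobname}.o${seq}"     # standard output
    echo "${jobname}.e${seq}"     # standard error
}

default_output_files testjob.sh 95.psych-o
```

For the sample job this prints testjob.sh.o95 and testjob.sh.e95. If the -o, -e or -j directives are used, the names and the number of files will differ.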
Jobs, whether running or queued, can be deleted using 'qdel <job_id>'. Note that you can only delete your own jobs.
Here are a few options for monitoring your jobs. More details and options can be found on the man pages.
- qstat -a: Provides the status of all jobs on the system
- qstat -f: Lists full job details, displaying more information about a particular job.
- qstat -n: Lists nodes per job.
- qstat -q: Lists queues and their load
- showq: This command will display all jobs from all of the queues and users.
Cleanup failed jobs
How to STOP the job
- qdel <job_id>: Removes job in orderly manner.
- qdel -p <job_id>: Purges the job from the queue (but does not do any cleanup).
- qsig -s 0 <job_id>: Sends job process a signal to exit.
How to Cleanup the processes
- rocks run host compute 'killall -9 -u <userid>': Command to kill all <userid> processes on the compute nodes.
- rocks run host compute 'killall -9 -u <userid>': Verifies whether any orphaned processes are running.
- diagnose -j <jobid>: This will provide information about why a job won't run.
- tracejob: Provides historical data about a job.
CNBC users are accustomed to running interactive jobs on backend compute nodes. Running batch jobs through the job scheduler is highly encouraged, so that resources are utilized fully and fairly. However, PBS does allow users to request interactive jobs.
qsub -I : This command starts an interactive job. After you enter the command, you will have to wait for PBS to start the job. As with any job, your interactive job will wait in the queue until the specified number of nodes is available. If you request a small number of nodes, the wait will be shorter. Once the job starts, you will see something like this (note that we are requesting 2 CPUs in this example):
[<username>@<headnode> ~]$ qsub -I -l ncpus=2
qsub: waiting for job <some job name> to start
qsub: job <some job name> ready
[<username>@<compute node> ~]$
When you are done with your interactive commands, you can use the exit command to end the job:
[<username>@<compute node> ~]$ exit
qsub: job <some job name> completed
PBS keeps track of the CPU time used by all requested nodes until you end the job, and uses it in its calculations to determine how to distribute resources fairly among users.
Standard error and standard output are printed to the terminal. They can be redirected to a file or piped to another command using Unix shell syntax.
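For instance, inside an interactive session the two streams might be captured as sketched below. The function my_program is only a stand-in for a real application, and the file names run.out, run.err and run.log are arbitrary.

```shell
#!/bin/bash
# Stand-in for a real program: writes one line to stdout and one to stderr.
my_program() {
    echo "result: ok"
    echo "warning: example message" >&2
}

# Send standard output and standard error to separate files.
my_program > run.out 2> run.err

# Merge both streams, show them on the terminal, and keep a copy in run.log.
my_program 2>&1 | tee run.log
```

The first form keeps the terminal quiet. The second is convenient when you want to watch the program while still keeping a record of what it printed.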
Some use cases for interactive sessions
- Interactive Matlab sessions (for debugging or analysis). Use the -X option to request an X session.
- Anything requiring a Graphical User Interface (GUI). Use the -X option to request an X session.
- Compiling code of large projects that are resource intensive.
- Any other resource intensive interactive task (to avoid using CPU cycles on the headnode)
PBS has an attribute, "walltime", which the scheduler uses to assign jobs to nodes. The idea is that if you accurately communicate your job's time requirements to PBS, PBS can allocate resources most effectively. You should therefore estimate how long your job will run. If you do not specify the wall clock time required by your run (e.g. walltime=45:00), PBS will terminate your job after the default wall time.
Grossly overstating your walltime can cause PBS to penalize you: your job may be delayed in the queue longer than necessary. A modest amount of overestimation (10-20%) is probably ideal. Underestimating walltime will cause your job to be terminated prematurely, which is a huge penalty!
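One way to arrive at a walltime request is to time a representative run and add a margin in the suggested 10-20% range. The helper below is a minimal sketch of that arithmetic. The name walltime_with_margin and the 15% default are assumptions made for illustration, not cluster policy.

```shell
#!/bin/bash
# Turn a measured runtime (in seconds) into a walltime request with a safety margin.
# MARGIN_PERCENT is an assumption; the text above suggests 10-20%.
MARGIN_PERCENT="${MARGIN_PERCENT:-15}"

walltime_with_margin() {
    local seconds="$1"
    # Integer arithmetic, rounded up so the margin is never lost.
    local padded=$(( (seconds * (100 + MARGIN_PERCENT) + 99) / 100 ))
    printf '%02d:%02d:%02d\n' $((padded / 3600)) $((padded % 3600 / 60)) $((padded % 60))
}

walltime_with_margin 3600   # a run measured at one hour
```

For a run measured at one hour this prints 01:09:00, which could then be used as "#PBS -l walltime=01:09:00".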
The default wall time on the CNBC Cluster is: TBD
All jobs are submitted with the default wall time unless the user specifies a wall time in their PBS job script.
PBS scripts are rather simple. PBS parses lines that begin with "#PBS" as directives; ordinary comments are written with a double "##", which PBS ignores.
PBS collects STDOUT and STDERR from your program, and it is recommended that you let PBS take care of your output rather than redirecting it yourself to a particular directory (e.g. your directory on /data2). PBS manages the output by writing the files to the local disk while the job is running; after the job has completed, it copies the files to the directory you designate. Writing to the local disk improves the performance of your program, but it does mean that you cannot see the output from your program while it is running. Another advantage is that if there is a problem with the remote file system while your program is running, the program is unaffected and continues to run. With parallel programs, keep in mind that the local disk is local to each compute node, so if each task needs to read and write files, you need to distribute them to, or gather them from, all of the compute nodes as necessary. Finally, if your program uses scratch files, it may be worthwhile to set the scratch directory to a local disk (with parallel programs, make sure that the data isn't shared, or this won't work).
An important consideration when creating your PBS script is file input and output. If you reference your input files a lot, it is worth your time to add commands to your PBS script to create a local directory in /tmp (mkdir /tmp/$PBS_JOBID) and copy your files over to that directory before starting your program.
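The staging steps described above might look roughly like this inside a job script. It is a sketch under stated assumptions: INPUT_DIR and RESULT_DIR are hypothetical locations, the "ls" line stands in for the real program, and the fallback value for the job ID exists only so that the sketch can also be tried outside PBS (PBS itself sets PBS_JOBID for real jobs).

```shell
#!/bin/bash
# Sketch of staging input files to node-local disk inside a PBS job script.
JOBID="${PBS_JOBID:-manual.$$}"             # fallback when run outside PBS
SCRATCH="/tmp/${JOBID}"
INPUT_DIR="${INPUT_DIR:-$PWD/input}"        # hypothetical location of your inputs
RESULT_DIR="${RESULT_DIR:-$PWD/results}"    # hypothetical destination for results

mkdir -p "$SCRATCH" "$RESULT_DIR"
# Remove the scratch directory when the script exits, even after an error.
trap 'rm -rf "$SCRATCH"' EXIT

# Copy the inputs once, then work entirely on the local disk.
if [ -d "$INPUT_DIR" ]; then
    cp -r "$INPUT_DIR"/. "$SCRATCH"/
fi
cd "$SCRATCH" || exit 1

# Placeholder for the real program: here we only list the staged files.
ls -A > staged_files.txt

# Copy results back before the scratch directory is removed.
cp staged_files.txt "$RESULT_DIR"/
```

Cleaning up in a trap matters on a shared cluster: /tmp on a compute node is shared by every job that lands there, and leftover directories from failed jobs can eventually fill it.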
You can request specific attributes, such as the number of nodes, memory or job runtime. Memory requests should be set per process using the pmem resource. Set the walltime to a value close to, but slightly longer than, the time you expect the job to run. To request 2 nodes with 2 processors each, with each process using a maximum of 2 GB of memory, for 1 hour:
#PBS -l nodes=2:ppn=2,pmem=2000mb,walltime=1:00:00
When requesting memory, specify the amount in mb rather than gb, and count 1gb as 1000mb. For example, our 2gb nodes have only about 2025mb available after system usage, so a request for 2gb (2048mb) will never be honored. In a nutshell, round down a little in your memory approximations.
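The rounding rule can be checked with a few lines of shell arithmetic. The helpers pmem_request and fits_on_node are hypothetical names used only for this sketch, and 2025 is the approximate figure quoted above for the 2gb nodes. Other node types would need their own value.

```shell
#!/bin/bash
# Build a per-process memory request, counting 1 GB as 1000 MB (not 1024 MB).
pmem_request() {
    local gb="$1"
    echo "pmem=$((gb * 1000))mb"
}

# Check whether a request (in MB) fits in the memory a node really has free.
AVAILABLE_MB="${AVAILABLE_MB:-2025}"   # approximate figure for the 2gb nodes
fits_on_node() {
    [ "$1" -le "$AVAILABLE_MB" ]
}

pmem_request 2
fits_on_node 2000 && echo "2000mb can be scheduled"
fits_on_node 2048 || echo "2048mb can never be scheduled"
```

The first line of output, pmem=2000mb, is the form used in the "#PBS -l" request shown earlier. The other two lines illustrate why the 1024-based figure would leave the job waiting indefinitely.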
Example PBS Job Script:
##------------------------------------------ Start of Example #1 --------------------------------------------------
## Specify the shell for PBS (required).
## Notice the double # for comments.
## You can also use /bin/csh.
#PBS -S /bin/sh
## Name the job (optional; the default is the script name).
#PBS -N testjob
## Merge stderr to stdout (optional, otherwise they're in separate files)
#PBS -j oe
## Specify the output filename explicitly (optional; the default is named
## from the job ID, in the directory where qsub was run.)
#PBS -o /path/to/output/directory/testjob.out
## Request 1 node to run 1 process in the queue.
#PBS -l nodes=1:ppn=1
## Request mail when job ends, or is aborted (optional, default is "a" only)
#PBS -m ae
## To start in a certain directory; you probably need to change this placeholder path.
cd /path/to/working/directory
## Below are the actual commands to be executed (i.e. to do the work).
echo "Test job starting at `date`"
/path/to/application/or/script/
echo "Test job finished at `date`"
##------------------------------------------ End of Example #1--------------------------------------------------
-- David Pane - 2015-04-22