Sun Grid Engine Quick Start


Using Sun Grid Engine 6

If you share a cluster with other users, a batch scheduler allows for optimal sharing among users. Grid Engine is a robust batch scheduler that can handle large workloads across entire organizations. More detailed information can be found in the Administrators Guide (pdf), the Users Guide, and the Release Notes.

Administration Help

Qmon

Node/Queue Errors

To see the error, look in the node log file /opt/sge6/default/spool/sge_execd/node##/messages, where ## is the node number. You can also run qstat -f -u "*", which shows all queues and their states. If a queue@node is in an error state, it will not accept jobs. Grid Engine can also be configured to email errors to an administrator. To clear the state, run:

qmod -c '*@node02'

The * clears all queues that contain node02.

Seeing "E" in the state column of qstat? E state errors usually mean that an attempt to start a job failed in a spectacular manner, and the Grid Engine qmaster decided to close off the queue instance to new jobs. This is an important Grid Engine protective measure designed to keep your remaining pending jobs from a "black hole" draining effect, in which they all get successively dispatched to the "bad" node and die instantly with errors. There are different causes of state E; in most cases the root cause is some large, systemic hardware or OS-level error or misconfiguration. Typical examples include:

* The username of the job submitter does not exist on the execution host (extremely common)
* Shared filesystem failure
* Parallel jobs: syntax errors or bad commands in "start_proc_args" or "stop_proc_args" as defined within the parallel environment (PE)
* Serial jobs: syntax errors, or a "prolog" or "epilog" script that does not exit with status code 0
* Serious path or path_alias problems (paths that exist on the submit host are different on the remote execution host, or have been improperly aliased)
* Network, routing, or DNS errors that are interfering with LDAP, NIS, or DNS

I have seen a few cases of actual jobs crashing and causing queue instance state "E". Usually this seems to occur when the job itself has crashed and taken out its parent process (the sge_shepherd daemon).
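As a concrete example, a minimal session to spot and clear an error state might look like the following. This is a sketch: node02 and all.q are placeholder host and queue names, and qmod requires SGE administrator or operator rights.

```shell
# List all queue instances and their states; look for "E" in the state column
qstat -f -u "*"

# Clear the error state on every queue instance containing node02
qmod -c '*@node02'

# Or clear just a single queue instance (all.q is an example queue name)
qmod -c 'all.q@node02'
```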
If your job is failing badly enough to wipe out the parent sge_shepherd process, then SGE will usually toggle the queue instance into "E" state. This is still a fairly rare occurrence, so if you are trying to debug this situation I'd recommend first looking at hardware and OS-level issues before looking too closely at the job as a root cause.

State "E" does not go away automatically

One big message to impart is that E states are persistent and never go away on their own (unlike many SGE queue and job states, which clear automatically). State "E" will persist through hardware reboots and Grid Engine restarts. The state has to be manually cleared by a Grid Engine administrator. Again, the reason for this is that SGE wants a human to investigate the root cause first, in case there is potential for the "black hole" effect mentioned above. If you think the problem was transient, you can clear the queues and see what happens with your pending jobs; the command is "qmod -c (queue instance)". To globally clear all E states in your SGE cluster:

qmod -c '*'

Troubleshooting and Diagnosing

* Run qstat -f -explain E to get an explanation of the error
* Examine the node itself and the OS logs, with an eye toward entries relating to permissions, failures, or access errors
* Try to log in to the node in question using a username associated with a failed job. This will help diagnose any username, authentication, or access issues
* Look in the job output directory if it is available. Output from failed jobs can be extremely useful, especially if there is a path, environment, or permission problem
* Examine the SGE logs, with particular focus on the messages file created by the sge_execd on the execution host in question
* If all else fails, SGE daemons will write log files to /tmp when they can't write to their normal spool location.
Seeing recent SGE event data in /tmp instead of your normal spool location is a good indication of filesystem or permission errors. I'll try to keep this page updated in the future with new information and troubleshooting hints.

Scheduling Concepts

Priority
A job's priority determines its position in the queue. A user can enqueue tasks with a lower than normal priority. One could use this to prioritize one's own tasks, or as an altruistic act to improve other users' access to the system. The SGE system, administrative users, and operators are allowed to increase a job's priority.

Urgency or deadlines
A job's urgency can be specified by setting the deadline option. As a deadline approaches, the increasing urgency is reflected in a dynamic increase in the job's priority. Set the deadline to help ensure that your task is completed before a given time.

Share factor
The share factor is set according to how much use any one user has made of the system during a predefined period. Any given job will get a higher priority than those of other users who have submitted more jobs during the integration period. This period is a configuration option on the SGE master.

Load-balancing
Tasks are distributed evenly among the available compute nodes. No one node should become too heavily loaded or go unused.

Execution slots
Instead of allocating processors from a machine to the compute farm, SGE allocates slots. Each job needs a slot before it will run.

First-fit
A job can specify the resources it needs to complete, e.g., memory requirements, a specific CPU family, shared data needs, etc.
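The priority and deadline concepts above map onto qsub options. A hedged sketch (sleeper1.sh and the date are example values; deadline scheduling must be enabled for your user by the administrator):

```shell
# Submit with lowered priority. Ordinary users may only lower priority;
# the range available to users runs from 0 down to -1023.
qsub -p -100 sleeper1.sh

# Submit with a deadline; the date format is [[CC]YY]MMDDhhmm[.SS]
# (here: 15 January 2026, 17:00)
qsub -dl 202601151700 sleeper1.sh
```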

Basic Commands

At a basic level, Sun Grid Engine (SGE) is very easy to use. The following sections describe the commands you need to submit simple jobs to the Grid Engine. The commands that will be most useful to you are qsub, qstat, qhost, and qdel. A more convenient queue status package called userstat combines qstat, qhost, and qdel into a simple, easy-to-use "top"-like interface. Each is described below. Additional information on these commands is available by using man command-name.

Important: before you use SGE you must load the SGE module:

$ module load sge6

You can include this command in your .bashrc or .cshrc file to avoid entering it each time you login.

Submitting a job to the queue: qsub
Qsub is used to submit a job to SGE. The qsub command has the following syntax:

qsub [ options ] [ scriptfile | -- [ script args ]] 

Binary files may not be submitted directly to SGE. For example, if we wanted to submit the "date" command to SGE we would need a script that looks like:

#!/bin/bash 
/bin/date 

If the script were called sge-date, then we could simply run the following:

$ qsub sge-date 

SGE will then run the program, and place two files in your current directory:

sge-date.e# 
sge-date.o# 

where # is the job number assigned by SGE. The sge-date.e# file contains the output from standard error, and the sge-date.o# file contains the output from standard output. The following basic options may be used when submitting a job.

-A [account name] -- Specify the account under which to run the job 
-N [name] -- The name of the job 
-l h_rt=hr:min:sec -- Maximum walltime for this job 
-r [y,n] -- Should this job be re-runnable (default y) 
-pe [type] [num] -- Request [num] slots from parallel environment [type]. 
-cwd -- Place the output files (.e,.o) in the current working directory. 
     The default is to place them in the user's home directory. 
-S [shell path] -- Specify the shell to use when running the job script 

Although it is possible to use command line options and script wrappers to submit jobs, it is usually more convenient to use just a single script to include all options for the job. The next section describes how this is done.

Job Scripts
The most convenient method to submit a job to SGE is to use a "job script". The job script allows all options and the program file to be placed in a single file. The following script will report the node on which it is running, sleep for 60 seconds, then exit. It also reports the start/end date and time, and sends an email to the user when the job starts and when the job finishes. Other SGE options are set as well.

#!/bin/sh
#
# Usage: sleeper1.sh [time]
#        default for time is 60 seconds

# -- our name ---
#$ -N Sleeper1
#$ -S /bin/sh
# Make sure that the .e and .o file arrive in the
# working directory
#$ -cwd
#Merge the standard out and standard error to one file
#$ -j y
/bin/echo Here I am: `hostname`. Sleeping now at: `date`
/bin/echo Running on host: `hostname`.
/bin/echo In directory: `pwd`
/bin/echo Starting on: `date`
# Send mail at submission and completion of script
#$ -m be
#$ -M deadline@kronos
time=60
if [ $# -ge 1 ]; then
   time=$1
fi
sleep $time

echo Now it is: `date`

The "#$" is used in the script to indicate an SGE option. If we name the script sleeper1.sh and then submit it to SGE as follows:

qsub sleeper1.sh

The output will be in the file Sleeper1.o#, where # is the job number assigned by SGE. When submitting MPI or PVM jobs, we will need additional information in the job script. See below.

Preserving Your Environment
If you want to make sure your current environment variables are used in your SGE jobs, include the following in your submit script:

#$ -V
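A short sketch of a submit script that carries your environment along (MYVAR and the script name env-test.sh are just illustrations):

```shell
#!/bin/bash
# Export the submitting shell's environment to the job
#$ -V
# Put .o/.e files in the current directory, merged into one
#$ -cwd
#$ -j y
# With -V, variables exported at submit time are visible to the job
echo "MYVAR is: $MYVAR"
```

Submitted with export MYVAR=hello; qsub env-test.sh, the job's output file should show the variable's value from your login shell.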

Queue Status: qstat
Queue status for your own jobs can be found by issuing the "qstat" command (the equivalent of qstat -u $USER). To show all users' jobs, use qstat -u "*". You can use the alias command to make this the default behavior by adding the following to your .bashrc file: alias qstat='qstat -u "*"'. Example output is shown below.

job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID
---------------------------------------------------------------------------------------------
    135     0 Sleeper2   deadline-loc r     08/08/2006 16:27:15 node000.q  MASTER
            0 Sleeper2   deadline-loc r     08/08/2006 16:27:15 node000.q  SLAVE
    138     0 Sleeper1   deadline-loc r     08/08/2006 16:26:59 node000.q  MASTER
    132     0 Sleeper4   deadline-loc t     08/08/2006 16:27:15 node002.q  MASTER
            0 Sleeper4   deadline-loc t     08/08/2006 16:27:15 node002.q  SLAVE
    135     0 Sleeper2   deadline-loc r     08/08/2006 16:27:15 node002.q  SLAVE
    132     0 Sleeper4   deadline-loc t     08/08/2006 16:27:15 node003.q  SLAVE
    136     0 Sleeper2   deadline-loc t     08/08/2006 16:27:15 node003.q  SLAVE
    132     0 Sleeper4   deadline-loc t     08/08/2006 16:27:15 node004.q  SLAVE
    136     0 Sleeper2   deadline-loc t     08/08/2006 16:27:15 node004.q  MASTER
            0 Sleeper2   deadline-loc t     08/08/2006 16:27:15 node004.q  SLAVE
    132     0 Sleeper4   deadline-loc t     08/08/2006 16:27:15 node006.q  SLAVE
    139     0 Sleeper1   deadline-loc t     08/08/2006 16:27:15 node006.q  MASTER
    137     0 Sleeper4   deadline-loc qw    08/08/2006 16:23:24
    140     0 Sleeper2   deadline-loc qw    08/08/2006 16:23:25
    141     0 Sleeper4   deadline-loc qw    08/08/2006 16:23:25
    142     0 Sleeper2   deadline-loc qw    08/08/2006 16:23:25
    143     0 Sleeper1   deadline-loc qw    08/08/2006 16:23:25
    144     0 Sleeper1   deadline-loc qw    08/08/2006 16:23:25

More detailed information can be obtained by using the -f option. For parallel jobs, the output is not very easy to understand; see userstat for a better display of the data. In the above listing, the state is either qw (queued, waiting), t (transferring), or r (running).

Deleting a Job: qdel
Jobs may be deleted by using the qdel command as follows:

$ qdel job-id

The job-id is the job number assigned by SGE when you submit the job using qsub. You can only delete your own jobs.
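For example, to delete job 135 from the qstat listing above, and to force-delete it if it will not die cleanly (force deletion of other users' jobs requires administrator rights):

```shell
qdel 135        # normal delete
qdel -f 135     # force delete a stuck job
```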

Host/Node Status: qhost
Node or host status can be obtained by using the qhost command. An example listing is shown below.

HOSTNAME             ARCH       NPROC  LOAD   MEMTOT   MEMUSE   SWAPTO   SWAPUS
-------------------------------------------------------------------------------
global               -              -     -        -        -        -        -
node000              lx24-amd64     2  0.00     3.8G    35.8M      0.0      0.0
node001              lx24-amd64     2  0.00     3.8G    35.2M      0.0      0.0
node002              lx24-amd64     2  0.00     3.8G    35.7M      0.0      0.0
node003              lx24-amd64     2  0.00     3.8G    35.6M      0.0      0.0
node004              lx24-amd64     2  0.00     3.8G    35.7M      0.0      0.0

Summary Status: userstat
Userstat displays node statistics and the batch queue for a cluster. It uses the output from the qhost and qstat commands. The display has two main windows. The top window is the batch queue and the bottom window is the nodes window. An example display is shown below.

[Userstat display]

The general commands are as follows:

q - to quit userstat 
h - to get a help screen 
b - to make the batch queue window active (default) 
n - to make the nodes window active 
spacebar - update windows (windows also update automatically every 20 seconds) 
up_arrow - to move though the jobs or nodes window 
down_arrow - to move though the jobs or nodes window 
Pg Up/Down - move a whole page in the jobs or nodes window 

Queue Window Commands:

j - sort on job-ID 
u - sort on user name 
p - sort on program name 
a - redisplay all jobs 
return - on a job will display only the nodes for that job 
d - delete a job from the queue (you must own the job or be root) 

Nodes Window Commands:

s - sort on system hostname 
a - redisplay all hosts 

Both windows can scroll. A "+" indicates that the window will scroll further. You can use the Up and Down Arrow keys or Page Up and Page Down keys to move through the list. Both windows are updated by pressing the space bar. The windows will update automatically after 20 seconds if the space bar has not been pressed.

The Batch Queue Window has the following features. The top line shows the total jobs in the queue and the number of active jobs. Active jobs may be running or transferring into or out of the cluster. An entry for each job is displayed in the batch window. The JobID, Priority, Name, User, State, Submission/Start time and the number of CPUs is provided. Only active jobs have CPUs assigned to them.

The nodes for a specific job can be viewed by placing the cursor on the job name and then pressing the Return (Enter) key. The following figure illustrates this command.

[Userstat display]

A job may be deleted by placing the cursor on the job and entering d. The user must confirm the deletion by entering y. In some cases, a delete will not work. In these cases, the user may be required to enter a qdel -f JOB-ID by hand or request that root delete stubborn jobs.

Simple sorting can be done on the Batch Queue window. See the j,u, and p commands above. The a command will restore all jobs to the batch queue window.

The Hosts (Node) Window has the following features. The top line shows the total nodes and number of down nodes (i.e. nodes that are not part of the batch system). If the nodes for a specific job are being displayed, the JOB-ID will be displayed on this line. The next line is a summary line for the cluster. Total CPUs, Load (5 minute), Memory, Memory Use, Swap, and Swap Use are shown. Below this line are the numbers for each individual node. If a node is down, it has "*" around the node name. (i.e. *nodename*)

Nodes can be sorted by nodename, or by job (by pressing the Return key on a job in the batch queue window). All nodes can be redisplayed by entering a in the Hosts window.

Parallel Submit Scripts

Submitting parallel jobs is very similar to submitting single node jobs (as shown above). A parallel job needs a parallel environment (PE) assigned to the script. The following is an annotated script for submitting an MPICH job to SGE.
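You can check which parallel environments are configured on your cluster with qconf; the names vary from site to site, so "mpich" below is only an example:

```shell
# List the names of all configured parallel environments
qconf -spl

# Show the configuration of one PE (slot count, start/stop procedures)
qconf -sp mpich
```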

#!/bin/sh
#
# EXAMPLE MPICH SCRIPT FOR SGE
# Modified by Basement Supercomputing 1/2/2006 DJE
# To use, change "MPICH_JOB", "NUMBER_OF_CPUS"
# and "MPICH_PROGRAM_NAME" to real values.
#
# Your job name
#$ -N MPICH_JOB
#
# Use current working directory
#$ -cwd
#
# Join stdout and stderr
#$ -j y
#
# pe request for MPICH. Set your number of processors here.
# Make sure you use the "mpich" parallel environment.
#$ -pe mpich NUMBER_OF_CPUS
#
# Run job through bash shell
#$ -S /bin/bash
#
# The following is for reporting only. It is not really needed
# to run the job. It will show up in your output file.
echo "Got $NSLOTS processors."
echo "Machines:"
cat $TMPDIR/machines
# Adjust MPICH procgroup to ensure smooth shutdown
export MPICH_PROCESS_GROUP=no
#
# Use full pathname to make sure we are using the right mpirun
/opt/mpi/tcp/mpich-gnu4/bin/mpirun -np $NSLOTS -machinefile $TMPDIR/machines MPICH_PROGRAM_NAME

The important option is the -pe line in the submit script. It must be set to the MPI parallel environment for which you compiled your program. Example submit scripts are available for each supported MPI environment.

To use SGE with MPI, simply copy the appropriate script to your working directory, edit it to fill in the appropriate variables, rename it to reflect your program, and use qsub to submit it to SGE.


This page, and all contents, are Copyright © 2007,2008 by Basement Supercomputing, Bethlehem, PA, USA, All Rights Reserved. This notice must appear on all copies (electronic, paper, or otherwise) of this document.