Lens Software

A sample PBS submit script for running Lens appears below, together with the server script it calls. Setup has a few parts.

First, set the following environment variables in the .bashrc file in your home directory.

export LENSDIR=/data2/plautlab/Lens
export HOSTTYPE=x86_64
export PATH=$PATH:$LENSDIR/Bin/$HOSTTYPE
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$LENSDIR/Bin/${HOSTTYPE}
export TCL_LIBRARY=/data2/plautlab/Lens/Bin/x86_64/

The next time you start a shell session, these variable declarations will be read and passed on to the shell environment. To make your current session read the file now, type the following in the shell you are using:

source ~/.bashrc

Alternatively, log out of the cluster and log back in so that the new settings take effect.
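
To confirm that the settings were picked up, a small check along the following lines can help. This is only a sketch: the function name check_lens_env is made up for illustration, and it tests only the three variables named above (PATH and LD_LIBRARY_PATH are not inspected).

```shell
#!/bin/sh
# check_lens_env is a hypothetical helper, not part of Lens or PBS.
# It prints one line for each Lens-related variable that is unset or empty,
# and returns 0 only when all of them are set.
check_lens_env() {
    missing=0
    for name in LENSDIR HOSTTYPE TCL_LIBRARY; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set"
            missing=1
        fi
    done
    return $missing
}
```

After sourcing ~/.bashrc, running check_lens_env should print nothing; any line it prints names a variable that still needs to be exported.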

The basic script is test.qsub, and the two important lines in that script are the line that sets the number of processors to use and the line calling the Lens software.

Currently, the script reads '#PBS -l nodes=4:ppn=4', which tells PBS to use 4 of the compute nodes (out of the current 7 on psych-o) and 4 processors on each node (out of the 8 processors per node), for 16 processors in total. Lens uses the first processor as the Lens server and the rest (15 in this case) as clients for computation. You can change these numbers in the script to set the number of nodes or processors you want. For example, nodes=2:ppn=8 would also give you 16 processors, using all the processors on 2 of the nodes.
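
The arithmetic behind a given request can be sketched as follows. The function name lens_clients is made up for illustration; it simply multiplies nodes by ppn and subtracts the one processor used by the Lens server.

```shell
#!/bin/sh
# lens_clients is a hypothetical helper for planning a request.
# Usage: lens_clients NODES PPN
# Prints the number of Lens client processes: NODES * PPN processors,
# minus the one processor that runs the Lens server.
lens_clients() {
    echo $(( $1 * $2 - 1 ))
}

lens_clients 4 4   # the request in the current script
lens_clients 2 8   # same processor count, packed onto 2 nodes
```

Both calls describe 16 processors, so both leave 15 clients; only the placement across nodes differs.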


The other important line is: lens -b server.qsub.tcl > server.out

You can alter this line to use different input or output files. server.qsub.tcl has been changed from the default server.tcl file, so you no longer need to edit it to list which machines to use for the job. That is now handled through PBS, by setting the number of processors in test.qsub. server.qsub.tcl relies on an additional small script, 'tailnodelist', which must be in the same location as server.qsub.tcl (see below), along with the rest of the scripts.
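
The contents of 'tailnodelist' are not shown on this page. Judging by its name and by the fact that the first processor is kept for the Lens server, it presumably prints every entry of the PBS node file except the first. The sketch below is a guess at that behaviour, not the actual script; the function name tail_node_list is made up. It relies on the node file holding one hostname per line, one line per allocated processor.

```shell
#!/bin/sh
# Hypothetical stand-in for tailnodelist (an assumption, not the real script).
# Usage: tail_node_list NODEFILE
# Prints every line of NODEFILE except the first, so the first processor
# stays free for the Lens server and the rest become client machines.
tail_node_list() {
    tail -n +2 "$1"
}
```

With nodes=2:ppn=2, for example, the node file would hold four lines, and the sketch would pass the last three to Lens as client machines.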


So to run Lens via PBS:

  1. Edit test.qsub to set the requested number of nodes/processors, as needed.
  2. Edit test.qsub to change the input or output files, as needed.
  3. Run the command 'qsub test.qsub'

You can always make different versions of test.qsub (with different filenames) and run each of them with the 'qsub' command.
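
One way to produce such a variant without editing by hand is to rewrite the resource line with sed. This is a sketch only: the function name make_variant and the file name test-2x8.qsub are made up for illustration, and it assumes the script contains a single '#PBS -l nodes=...' line like the one shown below.

```shell
#!/bin/sh
# make_variant is a hypothetical helper.
# Usage: make_variant SOURCE DEST NODES PPN
# Copies SOURCE to DEST, replacing the '#PBS -l nodes=...' line with the
# requested node and processor counts. SOURCE itself is left unchanged.
make_variant() {
    sed "s/^#PBS -l nodes=.*/#PBS -l nodes=$3:ppn=$4/" "$1" > "$2"
}

# Example (commented out so the sketch runs without PBS installed):
# make_variant test.qsub test-2x8.qsub 2 8 && qsub test-2x8.qsub
```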

Begin test.qsub


#!/bin/sh
# specify a jobname
#PBS -N test

# specify number of nodes (ppn should be 8 to reserve all cores on the node)
#PBS -l nodes=4:ppn=4

# misc other PBS settings:
#PBS -j eo

# echo "Moving to plautlab directory."
cd /data2/plautlab/Lens
# echo "Beginning lens job."
lens -b server.qsub.tcl > server.out
# echo "Lens job completed."

End test.qsub


Begin server.qsub.tcl


# This script is run on the server with the following (assuming "executable" below is set to "lens")
# ./lens -b server.tcl > server.out &

# LD_LIBRARY_PATH on both server and client machines must include current directory
# (containing libtcl8.3.so and libtk8.3.so)

# The base of the network file name (and its directory)
set filename rand100
set workingDirectory /data2/plautlab/Lens
set networkScript $workingDirectory/$filename.in
set clientScript client.tcl
set fixedPort 2001
set executable $workingDirectory/lens

# Starting epoch, total number to run, and learning algorithm
set epoch 0
set nepochs 100
set algorithm dougsMomentum

# number of epochs to run only steepestDescent before switching to the specified algorithm
set nsteepest 0

# Checkpointing of weight files
set checkpointInterval 1000
set minSaveInterval 100
set maxSaveInterval 100

# Here is where you list the client machines.
# To run two or more client processes on the same machine, just list it multiple times.
#set clientMachines {
#compute-0-10 compute-0-10
#}
set clientMachines [exec $workingDirectory/tailnodelist $::env(PBS_NODEFILE)]
#echo "List of client machines:"
#echo $clientMachines

#############################################################################
# shouldn't need to change anything below this
#############################################################################

proc sourceIfExists {file} {
if { [file exists $file] } {
 puts "Reading parameter file $file"
 puts [source $file]
}
}

proc checkpoint { filename } {
 global checkpointInterval
 global minSaveInterval
 global maxSaveInterval
 set epoch [getObj totalUpdates]
 set saveInterval [expr int(pow(10,floor(log10($epoch))))]
 if { $saveInterval < $minSaveInterval } {
 set saveInterval $minSaveInterval
 } elseif { $saveInterval > $maxSaveInterval } {
 set saveInterval $maxSaveInterval
 }
 if { [expr $epoch % $saveInterval] == 0 } {
 puts "Saving weights to $filename.$epoch.wt.bz2"
 saveWeights $filename.$epoch.wt.bz2 -values 3
 } elseif { [expr $epoch % $checkpointInterval] == 0 } {

 puts "Checkpointing to $filename.ckp.wt.bz2"
 saveWeights $filename.ckp.wt.bz2 -values 3
 }
 sourceIfExists $filename.prm
 sourceIfExists $filename.$epoch.prm
}

# Start the server
set port [startServer $fixedPort]
set hostname [exec hostname -f]

# Write the customized client script
set customClientScript $workingDirectory/client[getSeed].tcl
regsub -all / $networkScript \\/ netScript
sed "s/SCRIPT/$netScript/; s/SERVER/$hostname/; s/PORT/$port/" \
 $clientScript > $customClientScript
echo "Copying script to clients"
foreach client $clientMachines {
echo "$client"
scp $customClientScript $client:$workingDirectory
}

# Here we define a command for launching clients using ssh
proc launchClients {executable customClientScript machines} {
 global env
 set i 0
 foreach client $machines {
 puts " launching on $client..."
 exec /usr/bin/ssh $client -n \
 "$executable -batch $customClientScript > /dev/null" &
 incr i
 }
 return $i
}

# Now we use the command
puts "Launching clients..."
set numClients [launchClients $executable $customClientScript $clientMachines]

# Load the network and training set
source $networkScript
setObj postUpdateProc { checkpoint $filename }

# maybe load previously saved weights (to restart)
if { [file exists wts/$filename.$epoch.wt.bz2] } {
 loadWeights wts/$filename.$epoch.wt
 echo loadWeights wts/$filename.$epoch.wt
} elseif { [file exists $filename.$epoch.wt.bz2] } {
 loadWeights $filename.$epoch.wt
 echo loadWeights $filename.$epoch.wt
}
set epoch [getObj totalUpdates]

# Now wait for the clients to connect.
puts "Waiting for $numClients clients..."
waitForClients $numClients

# Start training and wait for it to finish,
# but don't wait if it didn't start correctly.
puts "Training..."

if { $epoch < $nsteepest } {
 puts [trainParallel [expr $nsteepest - $epoch] -algorithm steepest ]
 set epoch $nsteepest
}
puts [trainParallel [expr $nepochs - $epoch] -algorithm $algorithm ]

# Now break the barrier holding the clients so they can exit.
puts "Releasing clients..."
waitForClients
exec rm [glob $customClientScript]
puts "Ba-bye"
exit

End server.qsub.tcl
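
The step in server.qsub.tcl that writes the customized client script fills in three placeholders (SCRIPT, SERVER, and PORT) with sed, after escaping the slashes in the network script path so they do not end the sed expression early. The same idea can be tried on its own in the shell, as sketched below. The function name customize_client is made up, and because client.tcl is not shown on this page, any template you feed it is your own stand-in.

```shell
#!/bin/sh
# customize_client is a hypothetical helper mirroring the sed step above.
# Usage: customize_client TEMPLATE NETWORK_SCRIPT SERVER_HOST PORT
# Prints TEMPLATE with the first SCRIPT, SERVER and PORT on each line
# replaced by the given values.
customize_client() {
    # Turn each "/" in the path into "\/" so that it is safe inside s/.../.../
    escaped=$(printf '%s\n' "$2" | sed 's|/|\\/|g')
    sed "s/SCRIPT/$escaped/; s/SERVER/$3/; s/PORT/$4/" "$1"
}
```

Redirecting the output to a new file gives the equivalent of the per-job client script that the server copies to each client machine.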


-- David Pane - 2015-06-01

Topic revision: r2 - 2015-06-02 - dpane