Bad WN
What is a "bad workernode"?
Imagine the following situation: a workernode suffers from a serious problem (such as a full /scratch partition, for example) that makes every new job crash quickly. Since each job on this node ends rapidly, the node is quickly ready for a new job, which will again crash, and so on. As a consequence, the workernode may empty the queue of the PBS server, crashing all the jobs. Such a node acts like a kind of "black hole"...
The bad_wn.pl script
As a remedy against such "black holes", a bad_wn.pl script was created and placed in the /etc/profile.d directory of each workernode. Since pbs_mom opens a new session for each job, bad_wn.pl (like every script in /etc/profile.d) is executed before the job really starts. The role of bad_wn.pl is to detect serious problems on the workernode. If such a problem is detected, the script sends an email to the grid administrators describing the problem, and then it sleeps for 14 days. The sleep prevents the job from starting.
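To make the mechanism concrete, here is a minimal shell sketch of the kind of check bad_wn.pl performs (the real script is written in Perl and may test more than disk space); the 95% threshold and the admin address are illustrative assumptions, not values taken from the real script:

```shell
#!/bin/sh
# Sketch of a /etc/profile.d-style "bad workernode" check.
# THRESHOLD and the mail address are assumptions for illustration.

THRESHOLD=95   # usage percentage above which /scratch counts as "full"

# True when the partition holding $1 is at least THRESHOLD percent full.
partition_full() {
    pct=$(df -P "$1" 2>/dev/null | awk 'NR==2 {sub(/%/, "", $5); print $5}')
    [ -n "$pct" ] && [ "$pct" -ge "$THRESHOLD" ]
}

bad_wn_check() {
    if partition_full /scratch; then
        # Describe the problem to the grid administrators...
        echo "/scratch is full on $(hostname)" \
            | mail -s "bad workernode: $(hostname)" grid-admin@example.org
        # ...then block this login session for 14 days, so the job
        # waiting behind /etc/profile.d never starts.
        sleep $((14 * 24 * 3600))
    fi
}

bad_wn_check   # runs for every session that pbs_mom opens
```

Because the sleep happens inside the login session, the job slot stays occupied without the job ever running, which is exactly what stops the node from draining the queue.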
What to do with a workernode signaled as "bad"
Here is the procedure to cure a "bad workernode":
- Log on to the PBS server and put the workernode offline (pbsnodes -o wn1.iihe.ac.be)
- Log on to the workernode and fix the problem described in the email
- Kill the sleeping bad_wn.pl script (pkill -TERM -f bad_wn, or ps aux | grep '[b]ad_wn' | awk '{print $2}' | xargs kill -TERM). The waiting jobs will then start.
- Watch carefully the evolution of the critical parameters of the workernode, such as free space on the partitions (df -ah) and CPU/memory usage (top). If everything evolves normally, you can go on to the last step:
- On the PBS server, put the workernode back online (pbsnodes -c wn1.iihe.ac.be).
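The steps above, as seen from the PBS server, can be sketched as a small helper function. This is only a sketch: it assumes passwordless SSH from the PBS server to the workernode, and it cannot automate step 2 (fixing the actual problem) or step 4 (watching the node), which remain manual. Prefixing the call with RUN=echo gives a dry run that only prints the command sequence:

```shell
#!/bin/sh
# Hypothetical helper wrapping the "cure" procedure; assumes
# passwordless SSH to the node. Set RUN=echo for a dry run.
RUN=${RUN:-}

cure_bad_wn() {
    wn=$1
    $RUN pbsnodes -o "$wn"                # 1. take the node offline
    # 2. log on to the node and fix the problem described in the email
    $RUN ssh "$wn" pkill -TERM -f bad_wn  # 3. unblock the waiting jobs
    # 4. watch df -ah and top on the node; when all looks sane:
    $RUN pbsnodes -c "$wn"                # 5. bring the node back online
}

# Example (dry run): RUN=echo cure_bad_wn wn1.iihe.ac.be
```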