Maintenance script: Created page with " === The great steps of kernel update in brief === Kernel critical update is normally necessary only on UIs and WNs, since these are the only machines where users can get dir..."

2015-08-26T12:28:44Z


=== The main steps of a kernel update in brief ===
A critical kernel update is normally necessary only on UIs and WNs, since these are the only machines where users can get direct access and/or execute programs.

The first step is to upload the kernel update RPM to the BEgrid repository and to update the corresponding Quattor template. This step is fully described [http://mon.iihe.ac.be/trac/t2b/wiki/GridAdminSurvivalGuide#AddinganewkernelupdatetoBEgridrepo here].

The next step is to modify the nlist OS_KERNEL_VERSION_DEFAULT in Quattor, in the template os/kernel_version_arch. This nlist maps each OS to its default kernel version.

The last step is to reboot all the WNs and UIs.

=== A script to automate the reboot of all the workernodes ===
Rebooting all the workernodes by hand without putting the site in downtime is a long and dull task. Because you cannot reboot a workernode on which jobs are running, you must first put it offline, then wait until it is drained (which can take a few days if long jobs are running), then reboot it, wait until it is up again, and finally put it back online, provided that the machine is running the expected kernel version. Applying this procedure to tens of workernodes requires good organization and a lot of patience, and it involves many repetitive tasks. That is why we decided to write a script that automates the reboot of all the workernodes.

To write the script, we started from the following idea: during the reboot process, each node passes through a sequence of states:
<pre>
initial state (offline or online) --> offline and draining --> offline and drained --> rebooting --> rebooted --> back to initial state (offline or online)
</pre>
If a machine fails during the transition between two states, it falls into a special "error" state. Each state is represented by a file, and during the process the machine names migrate from file to file as the corresponding machines move from one state to the next. The human operator, whose job was to check periodically the actual state of each machine and to act accordingly, is replaced by a cron job.
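The file-per-state idea can be illustrated with a minimal sketch. It is written in Python for readability and is not the Perl code of reboot_wns; the state file names (<code>initial</code>, <code>draining</code>, <code>drained</code>, <code>rebooting</code>, <code>rebooted</code>, <code>restored</code>, <code>error</code>) and all function names are assumptions made for this example.

```python
from pathlib import Path

# Assumed state names, in the order of the diagram above. "initial" stands for
# the node's starting state and "restored" for "back to initial state".
PIPELINE = ["initial", "draining", "drained", "rebooting", "rebooted", "restored"]
ERROR = "error"


def read_nodes(state_dir, state):
    """Return the node names listed in the file of the given state."""
    path = Path(state_dir) / state
    if not path.exists():
        return []
    return [line.strip() for line in path.read_text().splitlines() if line.strip()]


def write_nodes(state_dir, state, nodes):
    """Overwrite the file of the given state with one node name per line."""
    (Path(state_dir) / state).write_text("".join(node + "\n" for node in nodes))


def move_node(state_dir, node, src, dst):
    """Move a node name from the file of state src to the file of state dst."""
    remaining = read_nodes(state_dir, src)
    if node not in remaining:
        raise ValueError(f"{node} is not in state {src!r}")
    remaining.remove(node)
    write_nodes(state_dir, src, remaining)
    write_nodes(state_dir, dst, read_nodes(state_dir, dst) + [node])


def advance(state_dir, node, state, transition_ok):
    """Apply one transition: the next state on success, 'error' on failure."""
    position = PIPELINE.index(state)
    if position == len(PIPELINE) - 1:
        raise ValueError(f"{state!r} is the final state")
    target = PIPELINE[position + 1] if transition_ok else ERROR
    move_node(state_dir, node, state, target)
    return target
```

In this sketch, a cron job would call <code>advance</code> once per node and per run, passing the result of whatever check applies to the current state (for example, whether the node has finished draining).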

=== How to use the reboot_wns script ===
The reboot_wns script is written in Perl. You can get some help by typing:
<pre>
perldoc reboot_wns.pl
</pre>
==== Configuration ====
The general parameters of the script are set in the reboot.conf file. Here is an example:
<pre>
# Name of the reboot script
reboot_script_name=reboot.pl
# Define the periodicity in minutes for executing the reboot script
reboot_script_period=2
# Define the maximum number of WNs that can be offline simultaneously
max_offline_wns=5
# Define the name of the CE
ce_name=gridce
# Define the wanted kernel version
good_kernel_version=2.6.9-89.29.1
</pre>
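As an illustration of the file format only, the following Python sketch reads such a <code>key=value</code> file and converts the two numeric parameters to integers. It does not reproduce how the Perl script parses reboot.conf; the function names and the error handling are assumptions.

```python
# Parameters from the example above that hold integer values.
INT_KEYS = ("reboot_script_period", "max_offline_wns")


def parse_conf(text):
    """Parse 'key=value' lines, skipping blank lines and '#' comment lines."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {raw!r}")
        conf[key.strip()] = value.strip()
    return conf


def load_settings(text):
    """Parse the configuration and convert the numeric parameters to int."""
    conf = parse_conf(text)
    for key in INT_KEYS:
        conf[key] = int(conf[key])
    return conf


EXAMPLE = """\
# Name of the reboot script
reboot_script_name=reboot.pl
# Define the periodicity in minutes for executing the reboot script
reboot_script_period=2
# Define the maximum number of WNs that can be offline simultaneously
max_offline_wns=5
# Define the name of the CE
ce_name=gridce
# Define the wanted kernel version
good_kernel_version=2.6.9-89.29.1
"""

settings = load_settings(EXAMPLE)
```

Keeping the kernel version as a string avoids any numeric interpretation of a value such as <code>2.6.9-89.29.1</code>.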
You must also create a list of the workernodes you want to reboot:
<pre>
[root@ccq stephane]# cat all_wns
node11-1.wn.iihe.ac.be
node12-7.wn.iihe.ac.be
node12-8.wn.iihe.ac.be
node14-1.wn.iihe.ac.be
node14-10.wn.iihe.ac.be
</pre>
==== Initialisation ====

==== Start ====
==== Stop ====

==== Known problems ====
*Nodes for which a problem occurred during the process are put in the error list, and the reboot cron job counts them as offline nodes. This can lead to the following situation. Suppose you set the maximum number of offline nodes to 5, and 10 nodes are found to be down during initialisation. Because the maximum number of offline nodes is already exceeded, the reboot cron script will never put any node offline, and it will run forever doing nothing. To avoid this, comment out the nodes that are down in your initial list. That way, they won't be considered.
*A node may fail to come back up after a reboot. Such a node stays in the list of rebooting nodes forever and never moves to the rebooted status. As a result, the reboot process never ends, and it never prompts you to run --stop.
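The arithmetic behind the first problem can be sketched as follows. This is an illustrative Python example, not the script's code: it assumes that <code>#</code> is the comment marker in the node list and that every node in an offline state, including the error list, counts against the limit. All function names are hypothetical.

```python
def read_node_list(text):
    """Return node names, skipping blank lines and lines commented out with '#'."""
    nodes = []
    for raw in text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            nodes.append(line)
    return nodes


def free_offline_slots(max_offline_wns, offline_counts):
    """Return how many more nodes may be put offline right now.

    offline_counts maps a state name to the number of nodes in that state.
    Every state passed here is counted as offline, including 'error'.
    """
    used = sum(offline_counts.values())
    return max(0, max_offline_wns - used)


# 10 nodes down at initialisation with a limit of 5: no slot is ever free.
stuck = free_offline_slots(5, {"error": 10})

# Same limit once the down nodes are commented out of the initial list.
healthy = free_offline_slots(5, {"error": 0})
```

Here <code>stuck</code> is 0, so no node can ever be drained, while <code>healthy</code> is 5.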

=== Code in CVS ===
*The scripts are available in the CERN CVS repository. An introduction to the CMS CVS repository (what it is and how to use it) can be found [https://twiki.cern.ch/twiki/bin/view/CMSPublic/WorkBookSoftware here].
*The code can be found [http://cmssw.cvs.cern.ch/cgi-bin/cmssw.cgi/UserCode/T2B_IIHE/ here]

{{TracNotice|{{PAGENAME}}}}

