GridAdminSurvivalGuide

From T2B Wiki

How to access webpages whose access is restricted to a network ?

Imagine, for example, that you want to access http://vocms21.cern.ch:8888/logginfo/, which is only reachable from CERN's network (e.g. from lxplus.cern.ch). The solution is to create a SOCKS proxy with SSH. Here is the procedure :

  1. Open the SOCKS proxy through a machine in the network that is authorized to access the Web pages (in our example : lxplus.cern.ch) :
ssh -D 12345 lxplus.cern.ch
The port number 12345 is an arbitrary choice. Simply choose a port well above 1024 to avoid privileged ports.
  2. Configure your browser to use the proxy. In the case of Firefox, it goes like this :
Menu Edit > Preferences > Advanced > Network > button "Settings..." > select "Manual proxy configuration" :
--> SOCKS Host : localhost
--> Port : 12345
  3. You can now access the page with your browser !

KVM management console

  1. Connect as root to CCQ with the X forwarding option :
ssh -X root@ccq
  2. Launch the IPMIView binary :
./IPMIView20.bin


Special case: behar07X series

Unfortunately, these steps need to be repeated every time, as the session given out by the IPMI is only valid for a limited amount of time.

  • Get the file sess_<Some Random string>.jnlp : go to the dedicated IPMI web page, then
-> Remote Control
-> Console redirection
-> Launch Console
-> Save
  • Upload it to ccq.
  • Log on to ccq with X forwarding :
ssh -X ccq
  • Run :
/usr/java/jdk1.6.0_22/bin/javaws sess_<Some Random string>.jnlp


Add some RPMs to the BEgrid repository with swrep

Adding new RPMs to the BEgrid repository is a two-step process :

  1. Uploading the RPMs to the BEgrid Web server (quattor.begrid.be);
  2. Updating, in the central Quattor SVN, the templates describing the content of the BEgrid repository.

Now, let's have a look at the BEgrid repository : http://quattor.begrid.be/begrid/Central_BEGrid_Repository/

You can see from the directory names how it is organized : each directory corresponds to one architecture combined with a general group of RPMs. The possible architectures are : i386, x86_64 and noarch. The possible groups of RPMs are : cos3_3X, cos4_4X, cos5_5X, dag_el3, dag_el4, etc. A combination <architecture>_<group_of_software> is called a "platform".

Each of these directories in the BEgrid repository has its content described by a template in Quattor. These templates are located under : cfg/sites/begrid/repository/. The name of each template is simply the name of the corresponding directory in the BEgrid repository, prefixed by "cb-".
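
The naming rule can be written down in a few lines of Python. This is only an illustrative sketch : the function names are made up, and the ".tpl" extension is taken from the template file names that appear further down this page.

```python
def platform_name(arch, group):
    """Combine an architecture and a group of RPMs into a platform name."""
    return "%s_%s" % (arch, group)


def template_path(arch, group):
    """Path of the Quattor template describing that platform's directory."""
    return "cfg/sites/begrid/repository/cb-%s.tpl" % platform_name(arch, group)


print(platform_name("i386", "dag_el4"))
# i386_dag_el4
print(template_path("x86_64", "sl5_5X"))
# cfg/sites/begrid/repository/cb-x86_64_sl5_5X.tpl
```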

To interact with the BEgrid repository, you must use the tool swrep. It is advised to work from CCQ3. The following use cases show in detail how to use the swrep command.

Adding a new kernel update to BEgrid repo

Once you are logged in as root on ccq3, go to the following directory :

cd /opt/CB5/tmp/src/begrid/cb-client/cb-client-swrep

Then, create a new directory where you will download the kernel update :

mkdir rpm-kernel-update
cd rpm-kernel-update
wget http://linuxsoft.cern.ch/cern/slc5X/x86_64/yum/testing/kernel-2.6.18-194.11.3.el5.cve20103081.x86_64.rpm

Note that there are generally several kernel "flavours" (largesmp, smp, xenU,...) for each release : don't forget to download them all !

Now, going back to the parent directory, just issue the following command :

cd ..
./swrep.py --debug --mode=up --plat x86_64_sl5_5X,/sl5/sl53/updates --dir=/opt/CB5/tmp/src/begrid/cb-client/cb-client-swrep/rpm-kernel-update/

The last command can be explained like this :

  • "--mode=up" means that we want to upload the RPMs located in the directory given by "--dir=...";
  • "--plat" : the generic syntax is "--plat <platform>,<area>", where <platform> is in fact <architecture>_<group_of_software> (as described above), and <area> is simply a kind of tag to organize things hierarchically inside each platform. (The area plays no role in Quattor.)

Going to http://quattor.begrid.be/begrid/Central_BEGrid_Repository/x86_64_sl5_5X/, you can check that the new RPM is present.

Now, you still have to update the Quattor template for the platform x86_64_sl5_5X. There are two ways to do this. You can edit the template repository/cb-x86_64_sl5_5X "manually", adding the following line to the nlist "contents" :

escape("kernel-2.6.18-194.11.3.el5.cve20103081-x86_64"),nlist("name","kernel","version","2.6.18-194.11.3.el5.cve20103081","arch","x86_64"),
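
To make the structure of this line explicit, here is a minimal Python sketch that derives it from the RPM file name. It is not part of swrep : the function name pan_entry is made up for illustration, and the code assumes the usual <name>-<version>-<release>.<arch>.rpm naming scheme.

```python
def pan_entry(rpm_filename):
    """Build the Pan 'contents' line for one RPM file name."""
    if not rpm_filename.endswith(".rpm"):
        raise ValueError("not an RPM file name: %s" % rpm_filename)
    stem = rpm_filename[:-len(".rpm")]
    # The architecture is the last dot-separated field.
    stem, arch = stem.rsplit(".", 1)
    # Name, version and release are separated by the last two dashes,
    # so dashes inside the package name itself are preserved.
    name, version, release = stem.rsplit("-", 2)
    full_version = "%s-%s" % (version, release)
    key = "%s-%s-%s" % (name, full_version, arch)
    return 'escape("%s"),nlist("name","%s","version","%s","arch","%s"),' % (
        key, name, full_version, arch)


print(pan_entry("kernel-2.6.18-194.11.3.el5.cve20103081.x86_64.rpm"))
```

For the kernel RPM downloaded above, this prints exactly the line shown before the sketch.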

If you want to avoid typos, it is advised to generate the updated template with the swrep command instead :

./swrep.py --mode=get --reg x86_64_sl5_5X

This will generate the updated template :

./repository/cb-x86_64_sl5_5X.tpl

that you still have to copy in your local checkout...

Adding RPM updates to BEgrid repo

We will take as an example the update of systemtap-runtime we had to do after critical vulnerability CVE-2010-4170 was discovered. Having simulated a "yum update systemtap-runtime", we have learnt that we must in fact update the following RPMs :

systemtap (to version 1.1-3.el5_5.3, arch. x86_64)
systemtap-runtime (to version 1.1-3.el5_5.3, arch. x86_64)
kernel-devel (to version 2.6.18-194.26.1.el5, arch. x86_64)

So, we log on as root to CCQ3, and we do the following (see previous section for detailed explanations) :

cd /opt/CB5/tmp/src/begrid/cb-client/cb-client-swrep
mkdir rpm-systemtap-update
cd rpm-systemtap-update
wget http://linuxsoft.cern.ch/cern/slc5X/updates/x86_64/RPMS/systemtap-1.1-3.el5_5.3.x86_64.rpm
wget http://linuxsoft.cern.ch/cern/slc5X/updates/x86_64/RPMS/systemtap-runtime-1.1-3.el5_5.3.x86_64.rpm
wget http://linuxsoft.cern.ch/cern/slc5X/updates/x86_64/RPMS/kernel-devel-2.6.18-194.26.1.el5.x86_64.rpm
cd ..
./swrep.py --debug --mode=up --plat x86_64_sl5_5X,/sl5/sl53/updates --dir=/opt/CB5/tmp/src/begrid/cb-client/cb-client-swrep/rpm-systemtap-update/
./swrep.py --mode=get --reg x86_64_sl5_5X

The last command above has generated the updated OS template :

./repository/cb-x86_64_sl5_5X.tpl

Its content has to replace the content of :

cfg/sites/begrid/repository/cb-x86_64_sl5_5X.tpl

in your local copy...

Useful command to debug after a runcheck

[TO BE DONE]

How to find the hardware, OS and IP of a machine in SCDB

[TO BE DONE]

Check information published in the BDII

If you want to check what's available on our CE (number of slots, application tags, supported VOs, names of the queues, etc.) :

ldapsearch -x -H ldap://cream02.iihe.ac.be:2170 -b mds-vo-name=resource,o=grid

Note : in fact, cream01 is now also our site-BDII, but to be queried as such, you must adapt the mds-vo-name :

ldapsearch -x -H ldap://cream02.iihe.ac.be:2170 -b mds-vo-name=BEgrid-ULB-VUB,o=grid

If you want to check what is published for our CEs by the BDII defined in the environment variable LCG_GFAL_INFOSYS :

lcg-infosites --vo cms ce | grep iihe

Build XML profiles of IIHE site without Eclipse

If your Eclipse is seriously broken, you might not be able to build the XML profiles of the site from it. But you can still build them with ant commands :

export ANT_HOME=/home/stgerard/workspace/central-begrid-v6/external/ant
export JAVA_HOME=/usr/java/jdk1.6.0_14
cd <location_of_build.xml_in_your_local_copy>
export ANT_OPTS="-Xmx512M" && <path_of_your_local_copy>/external/ant/bin/ant -f build.xml compile.profiles.iihe-glite

How to see debug messages during Quattor build ?

Note : what follows is only valid with panc 8.2.4 and above. You need to modify the quattor.build.xml file in your local copy. Let's say you want to see the messages coming from the debug functions in the template profile_se01.begrid.be. Here is how to set your variables in quattor.build.xml :

<property name="pan.debug.include" value="profile_se01.begrid.be" />
<property name="pan.debug.exclude" value="" />

As another example, to show debug messages from all templates except the spma and pan ones :

<property name="pan.debug.include" value=".*" />
<property name="pan.debug.exclude" value=".*/spma/.*|pan/.*" />
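
To preview which templates a given pair of patterns would select, the patterns can be tried out in Python. This sketch rests on two assumptions that are not confirmed by the panc documentation quoted here : that a template shows debug output when its name fully matches the include pattern and does not fully match the exclude pattern, and that an empty exclude pattern excludes nothing. The function name and the template names other than profile_se01.begrid.be are made up.

```python
import re


def shows_debug(template, include, exclude):
    """True if debug output would be shown under the assumed semantics."""
    included = re.fullmatch(include, template) is not None
    excluded = re.fullmatch(exclude, template) is not None
    return included and not excluded


# Second example above: everything except the spma and pan templates.
INCLUDE = ".*"
EXCLUDE = ".*/spma/.*|pan/.*"

for name in ["profile_se01.begrid.be", "components/spma/config", "pan/units"]:
    print(name, shows_debug(name, INCLUDE, EXCLUDE))
```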

Get the version of pan

The pan compiler for Eclipse is in a jar file :

cd <path_of_your_local_copy>/external/panc/lib
java -jar panc.jar

How to update Pan compiler ?

You first need the Pan RPM. You can download it here. Extract the content of the RPM : inside the directory ./usr/lib, you will find the file panc.jar. In your local copy of the Quattor templates, replace the jar file in external/panc/lib with it, and then commit to SVN.

How to get info about a job submitted locally ?

For jobs submitted locally, you won't find anything in /var/log/glite/glite-ce-cream.log* logs. The relevant PBS log files are located in the following directories :

/var/spool/pbs/server_priv/accounting/
/var/spool/pbs/server_logs/

You will find there one log file per day, the name of the file being a date. However, it is easier to use the tracejob command. Simply give it the job id, and it will parse these log files for you. Here is an example :

tracejob 1002455 -n 3

The option "-n 3" means that you want to look back over the last 3 days.

How to get info about a job submitted through the Grid ?

If you are lazy and you don't want to spend too much time in archeological digs in cream log files (/var/log/glite/glite-ce-cream.log*), then you can simply use the command glite-ce-job-status. However, to get information about jobs not belonging to you, you need the special "cream administrator" status. For this, you have to add your DN, surrounded by quotes ("") in the file /etc/grid-security/admin-list. Here is an example showing how to use the command from a UI :

voms-proxy-init --voms beapps
glite-ce-job-status -L 2 https://cream02.iihe.ac.be:8443/CREAM475582955

The option "-L 2" is for high verbosity. The main downside of this command is that you have to know the CREAM job id. (It can be inferred from the PBS job id by grepping the cream log files.)


How to add a new machine in the DNS

  • In case of a public IP, open a ticket with the support-iihe team at support-iihe@vub[NOSPAm].ac.be
  • In case of a private IP:
    • log on to ccq
    • edit /var/named/wn.zone (make sure the IP address is not already used)
    • edit /var/named/wn.rev
    • restart the named daemon
[root@ccq ~]# service named restart
Stopping named:                                            [  OK  ]
Starting named:                                            [  OK  ]

How to stop Quattor daemons

If you are busy testing manual changes on a Quattor-managed machine and you don't want these changes to be reverted by Quattor, then you have to kill two daemons :

service cdp-listend stop
service ncm-cdispd stop

Don't forget that if you reboot the machine, these daemons will be restarted. If you want to avoid this :

chkconfig --del cdp-listend
chkconfig --del ncm-cdispd

How to (re)create /dev/null

For a reason that is still not clear, /dev/null sometimes gets corrupted on a node : instead of being a device, it is a regular file :

[root@node19-4 ~]# ls -al /dev/null
-rw-r--r-- 1 root root 33 Aug 11 11:45 /dev/null

As a consequence, the node becomes a black hole. After having put the node offline, you can correct the problem like this :

rm /dev/null
mknod /dev/null c 1 3
chmod 0666 /dev/null

You should then see it again as a device :

[root@node19-4 ~]# ls -al /dev/null 
crw-rw-rw- 1 root root 1, 3 Aug 11 11:53 /dev/null
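
The same check can be scripted, for example to spot affected nodes before they start eating jobs. Here is a minimal sketch using only the Python standard library; the function name is made up, and the test simply mirrors the mknod arguments used above (character device, major 1, minor 3), which are the Linux values.

```python
import os
import stat


def is_null_device(path="/dev/null"):
    """True if path is a character device with major 1 and minor 3."""
    st = os.stat(path)
    if not stat.S_ISCHR(st.st_mode):
        # A regular file here is exactly the corruption described above.
        return False
    return (os.major(st.st_rdev), os.minor(st.st_rdev)) == (1, 3)


if __name__ == "__main__":
    print("ok" if is_null_device() else "/dev/null is corrupted")
```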

Note : The problem could be due to this bug of Torque.

Check the SE with lcg-* commands

Here is a useful sequence that can be run on a UI :

voms-proxy-init --voms beapps
export LFC_CATALOG_TYPE=lfc
export LFC_HOST=lfc01.begrid.be
lcg-cr -v --srm-timeout 180 --connect-timeout 10 --sendreceive-timeout 120 --bdii-timeout 20 --vo beapps -d maite.iihe.ac.be -l lfn:/grid/beapps/stgerard/test_lcg-cr /bin/bash
lcg-del -v --vo beapps -a lfn:/grid/beapps/stgerard/test_lcg-cr

How to ban a user

Open the template config/banned_users located in sites/iihe-production/config/, and add the DN to ban to the following comma-separated list :

variable LCAS_BANNED_USERS = list('/O=GetRID/O=abusers/CN=Endless Job','/C=TW/O=AS/OU=GRID/CN=Tz Ke Wu 164236');

When we started with the new CREAM CE, banning was not working on it because the following line was missing from /opt/glite/etc/lcas/lcas-glexec.db :

pluginname=/opt/glite/lib64/modules/lcas_userban.mod,pluginargs=ban_users.db

This has been corrected in Quattor.

How to find the local account of a grid user on the CREAM CE

[root@cream02 ~]# less /var/cream_sandbox/* | grep scia
drwx------   3 cms173     cms       15 Nov  2 11:52 _DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_asciaba_CN_430796_CN_Andrea_Sciaba_cms173/
drwx------   4 cmsprod034 cms       24 Sep  2 02:09 _DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_asciaba_CN_430796_CN_Andrea_Sciaba_cmsprod034/
drwx------ 103 cmss       cms     4096 Apr 15  2011 _DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_asciaba_CN_430796_CN_Andrea_Sciaba_cms_Role_lcgadmin_Capability_NULL_cmss/
drwx------ 103 cms173     cms     4096 May 18 11:16 _DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_asciaba_CN_430796_CN_Andrea_Sciaba_cms_Role_NULL_Capability_NULL_cms173/
drwx------ 103 cmspilo    cms     4096 Jun 17 15:28 _DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_asciaba_CN_430796_CN_Andrea_Sciaba_cms_Role_pilot_Capability_NULL_cmspilo/
drwx------ 103 cmsprod034 cms     4096 Apr  4  2011 _DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_asciaba_CN_430796_CN_Andrea_Sciaba_cms_Role_production_Capability_NULL_cmsprod034/
drwx------ 103 cmsprod006 cms     4096 May 20 02:40 _DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_sciaba_CN_430796_CN_Andrea_Sciaba_cms_Role_production_Capability_NULL_cmsprod006/
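
The local accounts are in the owner column (third field) of this listing. When a user is mapped to many accounts, a short script can summarise them. The following Python sketch parses "ls -l" style lines like the ones above; the function name is made up, and the sample only reuses two lines of the listing.

```python
SAMPLE = """\
drwx------   3 cms173     cms       15 Nov  2 11:52 _DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_asciaba_CN_430796_CN_Andrea_Sciaba_cms173/
drwx------ 103 cmspilo    cms     4096 Jun 17 15:28 _DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_asciaba_CN_430796_CN_Andrea_Sciaba_cms_Role_pilot_Capability_NULL_cmspilo/
"""


def local_accounts(listing, pattern):
    """Owners of the sandbox directories whose name contains pattern."""
    accounts = set()
    for line in listing.splitlines():
        fields = line.split()
        # Expected fields: mode, links, owner, group, size, month, day,
        # time or year, directory name.
        if len(fields) < 9 or not fields[0].startswith("d"):
            continue
        owner, name = fields[2], fields[-1]
        if pattern.lower() in name.lower():
            accounts.add(owner)
    return sorted(accounts)


print(local_accounts(SAMPLE, "scia"))
# ['cms173', 'cmspilo']
```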

Howto update a software repository at BELNET ?

Let's say we want to update the x86_64 EPEL5 repo. Here are the steps :

  • Log on to CCQ3
  • Go to the following directory :
cd /opt/CB5/tmp/src/begrid/cb-client/cb-client-swrep
  • Have a look at the content of the fill_swrep_server.conf file, to get the correct tag for the repo you want to update. For example, if you want to update the x86_64 EPEL5 repo, then you'll find the following line :
x86_64_epel_el5 /epel http://ftp-stud.hs-esslingen.de/pub/epel/5/x86_64/ ep5x
From this line, you see that the correct tag is "ep5x".
  • Type the following command :
./swrep.py --mode=fill --reg=ep5x --debug
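
Looking up the tag can also be scripted. The sketch below assumes that every useful line of fill_swrep_server.conf has the four whitespace-separated fields seen in the example (platform, area, source URL, tag); the meaning of the first three fields is inferred from that single line and from the --plat syntax, and the function name is made up.

```python
def find_tags(conf_text, platform):
    """Return the tags of all configuration lines for the given platform."""
    tags = []
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        fields = line.split()
        if len(fields) == 4 and fields[0] == platform:
            tags.append(fields[3])
    return tags


CONF = "x86_64_epel_el5 /epel http://ftp-stud.hs-esslingen.de/pub/epel/5/x86_64/ ep5x\n"
print(find_tags(CONF, "x86_64_epel_el5"))
# ['ep5x']
```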

Howto add a new TracWiki on mon

Let's say you want to create a new wiki called "mywiki". Then you must first type the following :

trac-admin /var/www/trac/mywiki initenv
chown -R apache:apache /var/www/trac/mywiki

Then you have to give the TRAC_ADMIN rights to a user (tracadminmywiki) :

trac-admin /var/www/trac/mywiki permission add tracadminmywiki TRAC_ADMIN

Create the file for all the passwords and set the password for the tracadmin user :

htpasswd -cm /etc/httpd/conf/mywiki.users tracadminmywiki

As you want to allow authentication by login/password, you will also have to edit /etc/httpd/conf.d/ssl.conf to add the following :

<Location /trac/mywiki>
  SetHandler mod_python
  PythonHandler trac.web.modpython_frontend
  PythonOption TracEnvParentDir /var/www/trac
  PythonOption TracUriRoot /trac

    AuthType Basic
    AuthName "MyWiki"
    AuthUserFile /etc/httpd/conf/mywiki.users
    Require valid-user
</Location>

and then also modify the LocationMatch directive in such a way that authentication by certificate is no longer mandatory. In our case, the following will work :

<LocationMatch "/trac/[^fvim]">
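
The character class is what exempts the new wiki : the pattern only matches paths whose first character after /trac/ is not f, v, i or m, so /trac/mywiki falls outside the block that requires a certificate. A quick way to convince yourself is to try the pattern in Python, whose regex syntax behaves the same as Apache's for such a simple expression. The function name and the wiki name "admin" are invented for illustration.

```python
import re

CERT_ONLY = re.compile(r"/trac/[^fvim]")


def needs_certificate(path):
    """True if the path falls under the certificate-only LocationMatch."""
    # LocationMatch patterns are not anchored, hence search() and not match().
    return CERT_ONLY.search(path) is not None


print(needs_certificate("/trac/mywiki"))  # False
print(needs_certificate("/trac/admin"))   # True
```

Note that any other wiki whose name starts with one of these four letters is exempted as well.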

Note : if you don't want to allow login/password authentication, then you don't have to modify /etc/httpd/conf.d/ssl.conf, but you have to replace tracadminmywiki by the DN of the tracadmin user in the previous commands.

To create a new user "mdupont" :

htpasswd -m /etc/httpd/conf/mywiki.users mdupont

Problem with Sindes during ks-post-install

Most common problems are :

  • wrong SINDES-ca-certificate for the Quattor client (maybe the one you are using is expired);
  • Sindes window is expired.

The expired Sindes window problem can be easily detected by having a look at the Sindes logfiles :

[root@q3 ~]# tail -f -n 400 /var/log/sindes/SINDES_Server_Everything.log
...
2012/08/13 14:32:12 INFO> GetCertificate.pm:95 SINDES::GetCertificate::info - [client ui01.iihe.ac.be] [FORBIDDEN] Access denied for ui01 (outside time-window): 1344861132> 1344853810
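
The two numbers at the end of this message look like Unix timestamps : the first one corresponds to the time of the log entry (14:32:12 local time is 12:32:12 UTC), so the second one is presumably the end of the window. Under that assumption, a few lines of Python tell you how long ago the window closed. The function name is made up.

```python
import re


def seconds_past_window(log_line):
    """Seconds between the assumed window end and the refused request."""
    m = re.search(r"\(outside time-window\): (\d+)\s*>\s*(\d+)", log_line)
    if m is None:
        return None  # not an "outside time-window" message
    request_time, window_end = int(m.group(1)), int(m.group(2))
    return request_time - window_end


LINE = ("2012/08/13 14:32:12 INFO> GetCertificate.pm:95 "
        "SINDES::GetCertificate::info - [client ui01.iihe.ac.be] [FORBIDDEN] "
        "Access denied for ui01 (outside time-window): 1344861132> 1344853810")
print(seconds_past_window(LINE))
# 7322
```

Here the request arrived a little more than two hours after the window had closed.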

Alternatively, you can list the access windows from the Sindes shell :

[root@q3 ~]# su - sindes
SINDESsh  > acl -print
+----------------------------------------------+
|       hostname       TTL        Request Right|
+----------------------------------------------+
|             tt       EXP                  YES|
|           ui01       EXP                  YES|
+----------------------------------------------+

In this case, you see that the TTL value for ui01 is EXP, which means that the Sindes window for ui01 has expired. You can extend it this way :

SINDESsh  > acl -set -grant -length 5000 -target ui01      
Setting acl for 1 host(s)
SINDESsh  > acl -print
+----------------------------------------------+
|       hostname       TTL        Request Right|
+----------------------------------------------+
|             tt       EXP                  YES|
|           ui01     01:23                  YES|
+----------------------------------------------+

You'll still find more troubleshooting tips here.

How to use the new Quattor Git repository

We will start from an example. Let's say that you want to update the Quattor aii RPMs on a BEgrid client using the latest releases from the Quattor SF Git repo. First, you download a copy of the code :

git clone git://quattor.git.sourceforge.net/gitroot/quattor/aii

Then, you move to the new aii directory, and you do the following :

cd aii/
mvn install

After the build, you'll find the RPMs in the target sub-directories. For example :

aii/aii-core/target/rpm/aii-server/RPMS/noarch/aii-server-3.0.2-1.noarch.rpm

How to test job submission with glite-wms-* commands (EMI)

In this example, we will test job submission in VO beapps. First, create a proxy :

voms-proxy-init --voms beapps

You need a job script, a JDL, and a file configuring the WMS endpoint. Here is the job script (job.sh) :

#!/bin/bash
hostname
sleep 60

(Yes, I know, that's a stupid job !) Here is the config file specifying the WMS endpoint (filename : wms.conf) :

[
     WMSclient = [
         #requirements = other.GlueCEStateStatus == "Production";
         #MyProxyServer = "myproxy.cern.ch";
         WMProxyEndpoints = {
             "https://wms1.grid.sara.nl:7443/glite_wms_wmproxy_server"
         };
         ListenerStorage = "/tmp";
         ErrorStorage = "/tmp";
         ShallowRetryCount = 3;
         PerusalFileEnable = false;
         rank =- other.GlueCEStateEstimatedResponseTime;
         OutputStorage = "/tmp";
         RetryCount = 0;
     ];
]

You may want to adapt this file by changing the WMS endpoint. To find a valid WMS server for your VO (beapps in our case) :

lcg-infosites --vo beapps wms

And the content of JDL (filename : job.jdl) :

[
    Type = "job";
    JobType = "normal";
    Executable = "job.sh";
    Arguments = "";
    StdOutput = "output";
    StdError = "error";
    OutputSandbox = {
              "error",
              "output"
    };
    InputSandbox = {"/user/sgerard/jobs_tests/grid/beapps/test_cream02_glite-wms/job.sh"};
    requirements = (other.GlueCEUniqueID == "cream02.iihe.ac.be:8443/cream-pbs-beapps");
]

In our case, we want to test our CREAM-CE, that's why we've added the requirements.

To submit the job, just type :

glite-wms-job-submit -a -c wms.conf job.jdl

Check the status with the command :

glite-wms-job-status <jobid>

How to test job submission with glite-ce-* commands (EMI)

First, create a proxy :

voms-proxy-init --voms beapps

You need a JDL. Here is an example :

[
VirtualOrganisation = "beapps";
Executable = "job.sh";
Arguments = "";
StdOutput = "output";
StdError = "error";
OutputSandbox = {
"output",
"error"
};
OutputSandboxBaseDestURI = "gsiftp://localhost";
JobType = "normal";
Type = "Job";
InputSandbox = {"/user/sgerard/jobs_tests/grid/beapps/test_cream02_glite-ce/job.sh"};
]

and here is the test job :

#!/bin/bash

hostname
sleep 60

To submit the job :

glite-ce-job-submit -a -r cream02.iihe.ac.be:8443/cream-pbs-beapps job.jdl

To get the status of the submitted job :

glite-ce-job-status <job_id>

Once the job is finished, you may want to get the output :

glite-ce-job-output <job_id>

Download non standard CRAB versions

Most versions of CRAB can be found via its web page : https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrab
Patched versions, however, are never listed on this page. To install one of them, you need to go to lxplus and get it from /afs/cern.ch/cms/ccs/wm/scripts/Crab/

Re-Initialize torque on cream02

Once upon a time, we changed the name of a torque server from cream02g to cream02, and from then on it was impossible for ncm-pbsserver to change the configuration of the pbs server (each time you tried to do something on the server involving pbs, you got an "Invalid credential" message). The explanation is the following : since the machine name has changed, the access permissions for root in serverdb are no longer valid. The solution is to re-initialize torque :

  1. Edit /var/spool/torque/server_name to name the head node. It is recommended to match the hostname in /etc/hostname for simplicity's sake.

2. Create and configure the torque server:

# pbs_server -t create
PBS_Server localhost.localdomain: Create mode and server database exists,
do you wish to continue y/(n)?y

3. Launch the script /var/torque/myninit.sh.

4. Launch the component ncm-pbsserver.

How to build your own modified version of a ncm-component

The best place to do it is on quattorrepository, where maven has been installed in /root directory.

Let's say you want to patch ncm-autofs and deploy the patched version on the cluster. First of all, you need to fetch the master version from Github :

mkdir autofs_patched
cd autofs_patched
git clone https://github.com/quattor/configuration-modules-core.git

Then go to the right place (where the pom.xml is located) :

cd configuration-modules-core/ncm-autofs

Once you have modified the code of the component (usually located in src/main/perl), you can run maven :

export M2_HOME=/root/maven/apache-maven-3.1.1
export M2=$M2_HOME/bin
export PATH=$M2:$PATH
mvn install

After maven has finished, you will find the rpm somewhere in ./target.

Problem with Firefox 31 when trying to access self-signed pages

This problem is described here. On this page, you will also find solutions to work around the problem.

Check if a machine certificate is grid compatible

This amounts to checking whether the issuer exists in /etc/grid-security/certificates. A quick way to do so is to generate the hash of the issuer's name and then to check whether this hash exists in /etc/grid-security/certificates :

openssl x509 -in <path_of_certificate> -noout -issuer_hash
locate <hash_obtained_with_previous_command>

If the locate command finds something in /etc/grid-security/certificates, it is grid compatible.
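
The lookup can also be done without locate, whose database may be out of date. The Python sketch below runs the same openssl command and then looks for files named after the hash. It assumes that the CA certificates are installed under their hash name with a numeric suffix (for example <hash>.0), which is the usual OpenSSL hashed-directory layout; the function names are made up.

```python
import glob
import os
import subprocess


def issuer_hash(cert_path):
    """Hash of the issuer's name, as printed by openssl."""
    out = subprocess.check_output(
        ["openssl", "x509", "-in", cert_path, "-noout", "-issuer_hash"])
    return out.decode().strip()


def ca_files_for(hash_value, ca_dir="/etc/grid-security/certificates"):
    """CA certificate files in ca_dir named <hash>.<number>."""
    return sorted(glob.glob(os.path.join(ca_dir, hash_value + ".[0-9]*")))
```

A certificate is then grid compatible if ca_files_for(issuer_hash(<path_of_certificate>)) returns a non-empty list.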

Renewing your certificate in Eclipse

Your Eclipse makes use of your personal certificate, whose path is given in ~/.subversion/servers. If you directly use the cert.p12 that you get from a backup made with Firefox, it will not work because the certificate doesn't have the correct format. To create a p12 with the correct format :

openssl pkcs12 -inkey userkey.pem -in usercert.pem -export -out cert.p12

(This command assumes that you are in a directory containing your usercert.pem and userkey.pem.)

