<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-GB">
	<id>https://t2bwiki.iihe.ac.be/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Admin</id>
	<title>T2B Wiki - User contributions [en-gb]</title>
	<link rel="self" type="application/atom+xml" href="https://t2bwiki.iihe.ac.be/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Admin"/>
	<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/Special:Contributions/Admin"/>
	<updated>2026-04-20T09:42:21Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.43.5</generator>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=Faq_t2b&amp;diff=1439</id>
		<title>Faq t2b</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=Faq_t2b&amp;diff=1439"/>
		<updated>2026-03-30T11:34:53Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Debugging SSH connection to mX machines: */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
=== List of the UIs / mX machines: ===&lt;br /&gt;
- mshort: m10 , m11 =&amp;gt; 20 minutes of CPU time per process &amp;lt;br&amp;gt;&lt;br /&gt;
- mlong: m4 to m9 =&amp;gt; no limit of CPU time per process&lt;br /&gt;
&lt;br /&gt;
=== Keep ssh connection to UI open: ===&lt;br /&gt;
Add option  &#039; &#039;&#039;&#039;-o ServerAliveInterval=100&#039;&#039;&#039; &#039; to your ssh command&lt;br /&gt;
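The same option can be made permanent in the SSH client configuration (a sketch; the Host pattern is an assumption, adjust it to the UIs you actually use):&lt;br /&gt;

```
# ~/.ssh/config -- send a keep-alive probe every 100 seconds
Host m*.iihe.ac.be
    ServerAliveInterval 100
```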
&lt;br /&gt;
&lt;br /&gt;
=== Debugging SSH connection to mX machines: ===&lt;br /&gt;
# Check permissions on ssh keys on your laptop:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ll $HOME/.ssh&lt;br /&gt;
-rw------- 1 rougny rougny    411 avr 29  2019 id_ed25519&lt;br /&gt;
-rw-r--r-- 1 rougny rougny    102 avr 29  2019 id_ed25519.pub&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: To have the correct permissions:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
chmod 600 $HOME/.ssh/id_ed25519&lt;br /&gt;
chmod 644 $HOME/.ssh/id_ed25519.pub&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: 2a. If that does not fix it, send us the output of those commands via chat/email, as well as the content of your public key to crosscheck with what is in our system:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ll $HOME/.ssh&lt;br /&gt;
&amp;gt; date &amp;amp;&amp;amp; ssh -vvv MYUSERNAME@m10.iihe.ac.be     &amp;lt;-- it needs to be a specific machine (not mshort/mlong) so that we can read the logs!&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: 2b. Also add your public IPv4 so that we can track your connection in the logs, via visiting for instance https://www.whatismyip.com/&lt;br /&gt;
: 2c. Just in case something went wrong: send us your public ssh key (the one ending in .pub!)&lt;br /&gt;
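The information requested in steps 2a-2c can be gathered in one go (a sketch assuming the ed25519 key names shown above; adjust the filenames to your own keys):&lt;br /&gt;

```shell
# Collect the SSH debug information to send to T2B support.
# Only the PUBLIC key (.pub) is printed; never share the private key.
ls -l "$HOME/.ssh"
cat "$HOME/.ssh/id_ed25519.pub" 2>/dev/null || echo "no id_ed25519.pub found"
date
```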
&lt;br /&gt;
=== MadGraph taking all the cores of a workernode ===&lt;br /&gt;
By default, MadGraph uses all the available cores. This kills the site.&lt;br /&gt;
&lt;br /&gt;
That is why you need to uncomment and set 2 variables in the &#039;&#039;&#039;mg5_configuration.txt&#039;&#039;&#039; file (not the &#039;&#039;&#039;dat&#039;&#039;&#039; file): &#039;&#039;&#039;run_mode&#039;&#039;&#039; &amp;amp; &#039;&#039;&#039;nb_core&#039;&#039;&#039;.&amp;lt;br&amp;gt;&lt;br /&gt;
The run mode should be set to 0, single machine, via:&lt;br /&gt;
 run_mode = 0&lt;br /&gt;
&lt;br /&gt;
If MadGraph uses more than 1 core, this needs to be requested from the job scheduler with the following directive added to your HTCondor submit file:&lt;br /&gt;
&amp;lt;pre&amp;gt; request_cpus = 2 &amp;lt;/pre&amp;gt;&lt;br /&gt;
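For context, a minimal submit file with this directive might look as follows (a sketch; the executable and log file names are placeholders, only request_cpus comes from this FAQ):&lt;br /&gt;

```
executable   = job.sh
output       = job.out
error        = job.err
log          = job.log
request_cpus = 2
queue
```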
&lt;br /&gt;
To tell MadGraph the number of cores it can use per job, use the following recipe:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./bin/mg5_aMC &lt;br /&gt;
set nb_core 1  #or 2 or whatever you want&lt;br /&gt;
save options&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
or in the &#039;&#039;&#039;mg5_configuration.txt&#039;&#039;&#039;:&lt;br /&gt;
 nb_core = 1&lt;br /&gt;
Note &#039;nb_core&#039; and &#039;request_cpus&#039; must always be the same value! &amp;lt;br&amp;gt;&lt;br /&gt;
Note also that if you ask for more than one core, your time in the queue will probably be longer, as the scheduler needs to find the required number of free slots on a single machine. &amp;lt;br&amp;gt;&lt;br /&gt;
We advise against putting this number higher than one unless you really need it for parallel jobs.&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=Faq_t2b&amp;diff=1438</id>
		<title>Faq t2b</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=Faq_t2b&amp;diff=1438"/>
		<updated>2026-03-30T11:34:28Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* List of the UIs / mX machines: */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
=== List of the UIs / mX machines: ===&lt;br /&gt;
- mshort: m10 , m11 =&amp;gt; 20 minutes of CPU time per process &amp;lt;br&amp;gt;&lt;br /&gt;
- mlong: m4 to m9 =&amp;gt; no limit of CPU time per process&lt;br /&gt;
&lt;br /&gt;
=== Keep ssh connection to UI open: ===&lt;br /&gt;
Add option  &#039; &#039;&#039;&#039;-o ServerAliveInterval=100&#039;&#039;&#039; &#039; to your ssh command&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Debugging SSH connection to mX machines: ===&lt;br /&gt;
# Check permissions on ssh keys on your laptop:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ll $HOME/.ssh&lt;br /&gt;
-rw------- 1 rougny rougny    411 avr 29  2019 id_ed25519&lt;br /&gt;
-rw-r--r-- 1 rougny rougny    102 avr 29  2019 id_ed25519.pub&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: To have the correct permissions:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
chmod 600 $HOME/.ssh/id_ed25519&lt;br /&gt;
chmod 644 $HOME/.ssh/id_ed25519.pub&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: 2a. If that does not fix it, send us the output of those commands via chat/email, as well as the content of your public key to crosscheck with what is in our system:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ll $HOME/.ssh&lt;br /&gt;
&amp;gt; date &amp;amp;&amp;amp; ssh -vvv MYUSERNAME@m3.iihe.ac.be     &amp;lt;-- it needs to be a specific machine (not mshort/mlong) so that we can read the logs!&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: 2b. Also add your public IPv4 so that we can track your connection in the logs, via visiting for instance https://www.whatismyip.com/&lt;br /&gt;
: 2c. Just in case something went wrong: send us your public ssh key (the one ending in .pub!)&lt;br /&gt;
&lt;br /&gt;
=== MadGraph taking all the cores of a workernode ===&lt;br /&gt;
By default, MadGraph uses all the available cores. This kills the site.&lt;br /&gt;
&lt;br /&gt;
That is why you need to uncomment and set 2 variables in the &#039;&#039;&#039;mg5_configuration.txt&#039;&#039;&#039; file (not the &#039;&#039;&#039;dat&#039;&#039;&#039; file): &#039;&#039;&#039;run_mode&#039;&#039;&#039; &amp;amp; &#039;&#039;&#039;nb_core&#039;&#039;&#039;.&amp;lt;br&amp;gt;&lt;br /&gt;
The run mode should be set to 0, single machine, via:&lt;br /&gt;
 run_mode = 0&lt;br /&gt;
&lt;br /&gt;
If MadGraph uses more than 1 core, this needs to be requested from the job scheduler with the following directive added to your HTCondor submit file:&lt;br /&gt;
&amp;lt;pre&amp;gt; request_cpus = 2 &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To tell MadGraph the number of cores it can use per job, use the following recipe:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./bin/mg5_aMC &lt;br /&gt;
set nb_core 1  #or 2 or whatever you want&lt;br /&gt;
save options&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
or in the &#039;&#039;&#039;mg5_configuration.txt&#039;&#039;&#039;:&lt;br /&gt;
 nb_core = 1&lt;br /&gt;
Note &#039;nb_core&#039; and &#039;request_cpus&#039; must always be the same value! &amp;lt;br&amp;gt;&lt;br /&gt;
Note also that if you ask for more than one core, your time in the queue will probably be longer, as the scheduler needs to find the required number of free slots on a single machine. &amp;lt;br&amp;gt;&lt;br /&gt;
We advise against putting this number higher than one unless you really need it for parallel jobs.&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=Getting_a_certificate_for_the_T2&amp;diff=1437</id>
		<title>Getting a certificate for the T2</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=Getting_a_certificate_for_the_T2&amp;diff=1437"/>
		<updated>2025-09-29T08:39:39Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;If you need grid access on the T2, please follow all the steps below: &amp;lt;br&amp;gt; &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# [[Obtaining_a_certificate | Get a Grid certificate (new)]].&lt;br /&gt;
#:*NB: for now, this does not work for users from UGent. Get a CERN certificate for that.&lt;br /&gt;
#:*If you already have a CERN grid certificate and are associated with a Belgian university, you can temporarily use the CERN certificate. But, for accounting reasons, we need you to get a Belgian certificate.&lt;br /&gt;
#:*If you are not from a Belgian institute, request a certificate through your national institute.&lt;br /&gt;
# [[certificate_to_UI | Put your certificate on the UIs]]&lt;br /&gt;
# Register to the VO&lt;br /&gt;
#::&#039;&#039;&#039;!!! In case you already have an old certificate registered to the VO, read [[What if your DN has changed?|this page]] on how to deal with it !!!&#039;&#039;&#039;&lt;br /&gt;
#:* &#039;&#039;&#039;CMS&#039;&#039;&#039;&lt;br /&gt;
#:*# [[Register_to_the_CMS_VO|Register to the CMS VO]]&lt;br /&gt;
#:*# [[SiteDB | Check if your certificate is ok on SiteDB]]. Note the DN.&lt;br /&gt;
#:*# [[CERN_certificate_management | Check that your certificate is the only one registered on the CERN website.]]&lt;br /&gt;
#:* &#039;&#039;&#039;IceCube&#039;&#039;&#039;&lt;br /&gt;
#:** [[Register_to_the_IceCube_VO|Register to the IceCube VO]]&lt;br /&gt;
#:* &#039;&#039;&#039;Solid&#039;&#039;&#039;&lt;br /&gt;
#:** [[Register_to_the_Solid_VO|Register to the Solid VO ]]&lt;br /&gt;
#:* &#039;&#039;&#039;Others&#039;&#039;&#039;&lt;br /&gt;
#:** [[Register_to_the_Beapps_VO|Register to the Belgian VO (beapps) ]]&lt;br /&gt;
# Send a mail to the T2B support (grid_adminATlistserv.vub.be) with your DN in order to have write access on the T2.&lt;br /&gt;
#:* You can find it via the following commands on the cluster, once your certificate has been installed there:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
voms-proxy-init&lt;br /&gt;
voms-proxy-info --identity&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
::Send us the result of the last command&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
:5. [[Check_Certificate_UIs | Check if everything works fine on the mX machines]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=Obtaining_a_certificate&amp;diff=1436</id>
		<title>Obtaining a certificate</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=Obtaining_a_certificate&amp;diff=1436"/>
		<updated>2025-09-26T14:06:10Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== Quick documentation ==&lt;br /&gt;
&lt;br /&gt;
If you are familiar with the certificate request procedure and feel confident, follow these steps; otherwise go to section &#039;&#039;&#039;More extensive documentation&#039;&#039;&#039;:&lt;br /&gt;
&lt;br /&gt;
  1. Go to https://cm.harica.gr/;&lt;br /&gt;
  2. Choose &amp;quot;Academic Login&amp;quot;;&lt;br /&gt;
  3. Choose your institution, and login;&lt;br /&gt;
  4. Request an &amp;quot;IGTF Client Auth&amp;quot; and choose &amp;quot;GEANT Personal Authentication&amp;quot;;&lt;br /&gt;
  5. Enroll your certificate using &#039;&#039;&#039;RSA - 4096&#039;&#039;&#039; to generate it and give a password to secure the P12 file that contains the certificate. Agree on the policy and &amp;quot;Enroll Certificate&amp;quot;; &lt;br /&gt;
  6. This last step will show a pop-up that allows the download of your certificate in a P12 file;&lt;br /&gt;
  7. You can now import the certificate into your browser.&lt;br /&gt;
&lt;br /&gt;
== More extensive documentation ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span style=&amp;quot;color: red;&amp;quot;&amp;gt;&#039;&#039;&#039;Go to https://cm.harica.gr/&#039;&#039;&#039;&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Cert_Manager_Harica.png|600px|thumb|left|Choose &amp;quot;Academic Login&amp;quot; and you will be redirected to the IdP federation login page]]&lt;br /&gt;
[[File:Choose_IdP_Harica.png|600px|thumb|left|Choose your institution, e.g., by typing &#039;ULB&#039; or &#039;VUB&#039;]]&lt;br /&gt;
&lt;br /&gt;
[[File:Institution_IdP_Harica.png|600px|thumb|left|Login with your institution credentials. This page may look different depending on the institution.]]&lt;br /&gt;
&lt;br /&gt;
[[File:Dashboard_Harica.png|1200px|thumb|left|When logged in you will arrive at the Dashboard page. At this point, you need to go to &amp;quot;IGTF Client Auth&amp;quot;.]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:IGTF_Client_Auth_1_Harica.png|1200px|thumb|left|Here choose &amp;quot;GEANT Personal Authentication&amp;quot;]]&lt;br /&gt;
[[File:IGTF_Client_Auth_2_Harica.png|1200px|thumb|left|Here just press &amp;quot;Next&amp;quot;]]&lt;br /&gt;
[[File:IGTF_Client_Auth_3_Harica.png|1200px|thumb|left|Check the box to agree with the policy. And press &amp;quot;Submit Request&amp;quot;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Enroll_Certificate_Harica.png|1400px|thumb|left|From the Dashboard page you can now &amp;quot;Enroll your Certificate&amp;quot;]]&lt;br /&gt;
[[File:Enroll_Certificate_request_form_Harica.png|800px|thumb|left|Select &#039;&#039;&#039;RSA (Default)&#039;&#039;&#039; and &#039;&#039;&#039;4096&#039;&#039;&#039;. Give a password to secure the P12 file that contains the certificate. Agree on the policy and &amp;quot;Enroll Certificate&amp;quot;]]&lt;br /&gt;
[[File:Download_Certificate_Harica.png|800px|thumb|left|&#039;&#039;&#039;Download&#039;&#039;&#039; your certificate in a P12 file]]&lt;br /&gt;
&lt;br /&gt;
[[File:Firefox_Certificate_page.png|600px|thumb|left|On your browser, go to &amp;lt;br&amp;gt; &#039;&#039;&#039;[Firefox]&#039;&#039;&#039; Preferences -&amp;gt; Certificates -&amp;gt; View Certificates &amp;lt;br&amp;gt; &#039;&#039;&#039;[chrome]&#039;&#039;&#039; Settings &amp;gt; Privacy and Security &amp;gt; Security &amp;gt; Manage Certificates &amp;gt; Your Certificates ]]&lt;br /&gt;
[[File:Firefox_Import.png|600px|thumb|left|Press the Import button and select the cert.p12 files you just downloaded]]&lt;br /&gt;
[[File:Firefox_Import_done.png|600px|thumb|left|The certificate is now listed in the Certificate Manager]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=Obtaining_a_certificate&amp;diff=1435</id>
		<title>Obtaining a certificate</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=Obtaining_a_certificate&amp;diff=1435"/>
		<updated>2025-09-26T13:54:58Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== Quick documentation ==&lt;br /&gt;
&lt;br /&gt;
If you are familiar with the certificate request procedure and feel confident, follow these steps; otherwise go to section &#039;&#039;&#039;More extensive documentation&#039;&#039;&#039;:&lt;br /&gt;
&lt;br /&gt;
  1. Go to https://cm.harica.gr/;&lt;br /&gt;
  2. Choose &amp;quot;Academic Login&amp;quot;;&lt;br /&gt;
  3. Choose your institution, and login;&lt;br /&gt;
  4. Request an &amp;quot;IGTF Client Auth&amp;quot; and choose &amp;quot;GEANT Personal Authentication&amp;quot;;&lt;br /&gt;
  5. Enroll your certificate using &#039;&#039;&#039;RSA - 4096&#039;&#039;&#039; to generate it and give a password to secure the P12 file that contains the certificate. Agree on the policy and &amp;quot;Enroll Certificate&amp;quot;; &lt;br /&gt;
  6. This last step will show a pop-up that allows the download of your certificate in a P12 file;&lt;br /&gt;
  7. You can now import the certificate into your browser.&lt;br /&gt;
&lt;br /&gt;
== More extensive documentation ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span style=&amp;quot;color: red;&amp;quot;&amp;gt;&#039;&#039;&#039;Go to https://cm.harica.gr/&#039;&#039;&#039;&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Cert_Manager_Harica.png|600px|thumb|left|Choose &amp;quot;Academic Login&amp;quot; and you will be redirected to the IdP federation login page]]&lt;br /&gt;
[[File:Choose_IdP_Harica.png|600px|thumb|left|Choose your institution, e.g., by typing &#039;ULB&#039; or &#039;VUB&#039;]]&lt;br /&gt;
&lt;br /&gt;
[[File:Institution_IdP_Harica.png|600px|thumb|left|Login with your institution credentials. This page may look different depending on the institution.]]&lt;br /&gt;
&lt;br /&gt;
[[File:Dashboard_Harica.png|1200px|thumb|left|When logged in you will arrive at the Dashboard page. At this point, you need to go to &amp;quot;IGTF Client Auth&amp;quot;.]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:IGTF_Client_Auth_1_Harica.png|1200px|thumb|left|Here choose &amp;quot;GEANT Personal Authentication&amp;quot;]]&lt;br /&gt;
[[File:IGTF_Client_Auth_2_Harica.png|1200px|thumb|left|Here just press &amp;quot;Next&amp;quot;]]&lt;br /&gt;
[[File:IGTF_Client_Auth_3_Harica.png|1200px|thumb|left|Check the box to agree with the policy. And press &amp;quot;Submit Request&amp;quot;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Enroll_Certificate_Harica.png|1400px|thumb|left|From the Dashboard page you can now &amp;quot;Enroll your Certificate&amp;quot;]]&lt;br /&gt;
[[File:Enroll_Certificate_request_form_Harica.png|800px|thumb|left|Select &#039;&#039;&#039;RSA (Default)&#039;&#039;&#039; and &#039;&#039;&#039;4096&#039;&#039;&#039;. Give a password to secure the P12 file that contains the certificate. Agree on the policy and &amp;quot;Enroll Certificate&amp;quot;]]&lt;br /&gt;
[[File:Download_Certificate_Harica.png|800px|thumb|left|&#039;&#039;&#039;Download&#039;&#039;&#039; your certificate in a P12 file]]&lt;br /&gt;
&lt;br /&gt;
[[File:Firefox_Certificate_page.png|600px|thumb|left|On your browser, go to &amp;lt;br&amp;gt; &#039;&#039;&#039;[Firefox]&#039;&#039;&#039; Preferences -&amp;gt; Certificates -&amp;gt; View Certificates &amp;lt;br&amp;gt; &#039;&#039;&#039;[chrome]&#039;&#039;&#039; Settings &amp;gt; Privacy and Security &amp;gt; Security &amp;gt; Manage Certificates &amp;gt; Your Certificates ]]&lt;br /&gt;
[[File:Firefox_Import.png|600px|thumb|left|Press the Import button and select the cert.p12 files you just downloaded]]&lt;br /&gt;
[[File:Firefox_Import_done.png|600px|thumb|left|The certificate is now listed in the Certificate Manager]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=File:Download_Certificate_Harica.png&amp;diff=1434</id>
		<title>File:Download Certificate Harica.png</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=File:Download_Certificate_Harica.png&amp;diff=1434"/>
		<updated>2025-09-26T13:40:00Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=File:Enroll_Certificate_request_form_Harica.png&amp;diff=1433</id>
		<title>File:Enroll Certificate request form Harica.png</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=File:Enroll_Certificate_request_form_Harica.png&amp;diff=1433"/>
		<updated>2025-09-26T13:39:38Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=File:Enroll_Certificate_Harica.png&amp;diff=1432</id>
		<title>File:Enroll Certificate Harica.png</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=File:Enroll_Certificate_Harica.png&amp;diff=1432"/>
		<updated>2025-09-26T13:38:51Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=Obtaining_a_certificate&amp;diff=1431</id>
		<title>Obtaining a certificate</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=Obtaining_a_certificate&amp;diff=1431"/>
		<updated>2025-09-26T13:32:34Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== Quick documentation ==&lt;br /&gt;
&lt;br /&gt;
If you are familiar with the certificate request procedure and feel confident, follow these steps; otherwise go to section &#039;&#039;&#039;More extensive documentation&#039;&#039;&#039;:&lt;br /&gt;
&lt;br /&gt;
  1. Go to https://cm.harica.gr/;&lt;br /&gt;
  2. Choose &amp;quot;Academic Login&amp;quot;;&lt;br /&gt;
  3. Choose your institution, and login;&lt;br /&gt;
  4. Request an &amp;quot;IGTF Client Auth&amp;quot; and choose &amp;quot;GEANT Personal Authentication&amp;quot;;&lt;br /&gt;
  5. Enroll your certificate using &#039;&#039;&#039;RSA - 4096&#039;&#039;&#039; to generate it and give a password to secure the P12 file that contains the certificate. Agree on the policy and &amp;quot;Enroll Certificate&amp;quot;; &lt;br /&gt;
  6. This last step will show a pop-up that allows the download of your certificate in a P12 file;&lt;br /&gt;
  7. You can now import the certificate into your browser.&lt;br /&gt;
&lt;br /&gt;
== More extensive documentation ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span style=&amp;quot;color: red;&amp;quot;&amp;gt;&#039;&#039;&#039;Go to https://cm.harica.gr/&#039;&#039;&#039;&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Cert_Manager_Harica.png|600px|thumb|left|Choose &amp;quot;Academic Login&amp;quot; and you will be redirected to the IdP federation login page]]&lt;br /&gt;
[[File:Choose_IdP_Harica.png|600px|thumb|left|Choose your institution, e.g., by typing &#039;ULB&#039; or &#039;VUB&#039;]]&lt;br /&gt;
&lt;br /&gt;
[[File:Institution_IdP_Harica.png|600px|thumb|left|Login with your institution credentials. This page may look different depending on the institution.]]&lt;br /&gt;
&lt;br /&gt;
[[File:Dashboard_Harica.png|1200px|thumb|left|When logged in you will arrive at the Dashboard page. At this point, you need to go to &amp;quot;IGTF Client Auth&amp;quot;.]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:IGTF_Client_Auth_1_Harica.png|1200px|thumb|left|Here choose &amp;quot;GEANT Personal Authentication&amp;quot;]]&lt;br /&gt;
[[File:IGTF_Client_Auth_2_Harica.png|1200px|thumb|left|Here just press &amp;quot;Next&amp;quot;]]&lt;br /&gt;
[[File:IGTF_Client_Auth_3_Harica.png|1200px|thumb|left|Check the box to agree with the policy. And press &amp;quot;Submit Request&amp;quot;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:New_certificate_type.png|600px|thumb|left|Request a &#039;&#039;&#039;GEANT Personal Authentication&#039;&#039;&#039; certificate and use &#039;&#039;&#039;RSA - 8192&#039;&#039;&#039; as Key type for Key Generation.]]&lt;br /&gt;
[[File:Download_Certificate.png|600px|thumb|left|At this point the certificate is generated and you can just save it on your computer]]&lt;br /&gt;
[[File:Firefox_Certificate_page.png|600px|thumb|left|On your browser, go to &amp;lt;br&amp;gt; &#039;&#039;&#039;[Firefox]&#039;&#039;&#039; Preferences -&amp;gt; Certificates -&amp;gt; View Certificates &amp;lt;br&amp;gt; &#039;&#039;&#039;[chrome]&#039;&#039;&#039; Settings &amp;gt; Privacy and Security &amp;gt; Security &amp;gt; Manage Certificates &amp;gt; Your Certificates ]]&lt;br /&gt;
[[File:Firefox_Import.png|600px|thumb|left|Press the Import button and select the cert.p12 files you just downloaded]]&lt;br /&gt;
[[File:Firefox_Import_done.png|600px|thumb|left|The certificate is now listed in the Certificate Manager]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=Obtaining_a_certificate&amp;diff=1430</id>
		<title>Obtaining a certificate</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=Obtaining_a_certificate&amp;diff=1430"/>
		<updated>2025-09-26T12:49:15Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== Quick documentation ==&lt;br /&gt;
&lt;br /&gt;
If you are familiar with the certificate request procedure and feel confident, follow these steps; otherwise go to section &#039;&#039;&#039;More extensive documentation&#039;&#039;&#039;:&lt;br /&gt;
&lt;br /&gt;
  1. Go to https://cert-manager.com/customer/belnet/idp/clientgeant;&lt;br /&gt;
  2. Choose your institution, and login;&lt;br /&gt;
  3. Request a &#039;&#039;&#039;GEANT Personal Authentication&#039;&#039;&#039; certificate and use &#039;&#039;&#039;RSA - 8192&#039;&#039;&#039; as Key type for Key Generation;&lt;br /&gt;
  4. Give a Password to secure the P12 file that contains the certificate and &#039;&#039;&#039;Submit&#039;&#039;&#039; and &#039;&#039;&#039;Agree&#039;&#039;&#039; with the EULA; &lt;br /&gt;
  5. This last step will trigger automatically the download of your certificate in a P12 file;&lt;br /&gt;
  6. You now have to import the certificate into your browser.&lt;br /&gt;
&lt;br /&gt;
== More extensive documentation ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span style=&amp;quot;color: red;&amp;quot;&amp;gt;&#039;&#039;&#039;Go to https://cm.harica.gr/&#039;&#039;&#039;&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Cert_Manager_Harica.png|600px|thumb|left|Choose &amp;quot;Academic Login&amp;quot; and you will be redirected to the IdP federation login page]]&lt;br /&gt;
[[File:Choose_IdP_Harica.png|600px|thumb|left|Choose your institution, e.g., by typing &#039;ULB&#039; or &#039;VUB&#039;]]&lt;br /&gt;
&lt;br /&gt;
[[File:Institution_IdP_Harica.png|600px|thumb|left|Login with your institution credentials. This page may look different depending on the institution.]]&lt;br /&gt;
&lt;br /&gt;
[[File:Dashboard_Harica.png|1200px|thumb|left|When logged in you will arrive at the Dashboard page. At this point, you need to go to &amp;quot;IGTF Client Auth&amp;quot;.]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:IGTF_Client_Auth_1_Harica.png|1200px|thumb|left|Here choose &amp;quot;GEANT Personal Authentication&amp;quot;]]&lt;br /&gt;
[[File:IGTF_Client_Auth_2_Harica.png|1200px|thumb|left|Here just press &amp;quot;Next&amp;quot;]]&lt;br /&gt;
[[File:IGTF_Client_Auth_3_Harica.png|1200px|thumb|left|Check the box to agree with the policy. And press &amp;quot;Submit Request&amp;quot;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:New_certificate_type.png|600px|thumb|left|Request a &#039;&#039;&#039;GEANT Personal Authentication&#039;&#039;&#039; certificate and use &#039;&#039;&#039;RSA - 8192&#039;&#039;&#039; as Key type for Key Generation.]]&lt;br /&gt;
[[File:Download_Certificate.png|600px|thumb|left|At this point the certificate is generated and you can just save it on your computer]]&lt;br /&gt;
[[File:Firefox_Certificate_page.png|600px|thumb|left|On your browser, go to &amp;lt;br&amp;gt; &#039;&#039;&#039;[Firefox]&#039;&#039;&#039; Preferences -&amp;gt; Certificates -&amp;gt; View Certificates &amp;lt;br&amp;gt; &#039;&#039;&#039;[chrome]&#039;&#039;&#039; Settings &amp;gt; Privacy and Security &amp;gt; Security &amp;gt; Manage Certificates &amp;gt; Your Certificates ]]&lt;br /&gt;
[[File:Firefox_Import.png|600px|thumb|left|Press the Import button and select the cert.p12 files you just downloaded]]&lt;br /&gt;
[[File:Firefox_Import_done.png|600px|thumb|left|The certificate is now listed in the Certificate Manager]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=File:IGTF_Client_Auth_3_Harica.png&amp;diff=1429</id>
		<title>File:IGTF Client Auth 3 Harica.png</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=File:IGTF_Client_Auth_3_Harica.png&amp;diff=1429"/>
		<updated>2025-09-26T12:23:24Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=File:IGTF_Client_Auth_2_Harica.png&amp;diff=1428</id>
		<title>File:IGTF Client Auth 2 Harica.png</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=File:IGTF_Client_Auth_2_Harica.png&amp;diff=1428"/>
		<updated>2025-09-26T12:23:04Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=File:IGTF_Client_Auth_1_Harica.png&amp;diff=1427</id>
		<title>File:IGTF Client Auth 1 Harica.png</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=File:IGTF_Client_Auth_1_Harica.png&amp;diff=1427"/>
		<updated>2025-09-26T12:22:42Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=File:Institution_IdP_Harica.png&amp;diff=1426</id>
		<title>File:Institution IdP Harica.png</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=File:Institution_IdP_Harica.png&amp;diff=1426"/>
		<updated>2025-09-26T12:22:09Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=File:Choose_IdP_Harica.png&amp;diff=1425</id>
		<title>File:Choose IdP Harica.png</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=File:Choose_IdP_Harica.png&amp;diff=1425"/>
		<updated>2025-09-26T12:21:40Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=File:Dashboard_Harica.png&amp;diff=1424</id>
		<title>File:Dashboard Harica.png</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=File:Dashboard_Harica.png&amp;diff=1424"/>
		<updated>2025-09-26T12:20:56Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=File:Cert_Manager_Harica.png&amp;diff=1423</id>
		<title>File:Cert Manager Harica.png</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=File:Cert_Manager_Harica.png&amp;diff=1423"/>
		<updated>2025-09-26T12:20:13Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=SingularityContainers&amp;diff=1422</id>
		<title>SingularityContainers</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=SingularityContainers&amp;diff=1422"/>
		<updated>2025-09-04T13:07:59Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;To make SL7/CC8 flavours available for everyone while the cluster is using another version, we make use of containers.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== Prerequisites ===&lt;br /&gt;
Helper scripts are in &#039;&#039;&#039;/group/userscripts/&#039;&#039;&#039;, so make sure it is in your $PATH variable:&lt;br /&gt;
 export PATH=$PATH:/group/userscripts/&lt;br /&gt;
or just call the full path:&lt;br /&gt;
 /group/userscripts/sl7&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== To test code on the mX UIs ===&lt;br /&gt;
Simply go into an EL7 environment:&lt;br /&gt;
 sl7 bash&lt;br /&gt;
This should give you a prompt, where your SL7 code should work.&lt;br /&gt;
&lt;br /&gt;
Just to convince yourself, you can cross-check the OS release:&lt;br /&gt;
&amp;lt;pre&amp;gt;cat /etc/redhat-release&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== To use SL7 inside a cluster job ===&lt;br /&gt;
You can simply ask your script to be run inside the sl7 container:&lt;br /&gt;
 sl7 /user/$USER/MYSUPERSCRIPT.sh&lt;br /&gt;
&lt;br /&gt;
To send it to the cluster, the job.sh file that your .sub file passes to condor_submit should contain:&lt;br /&gt;
 /group/userscripts/sl7 /user/$USER/MYSUPERSCRIPT.sh&lt;br /&gt;
&lt;br /&gt;
Please note that your script then needs to change directory to $TMPDIR.&lt;br /&gt;
&amp;lt;br&amp;gt;It is &#039;&#039;&#039;HIGHLY&#039;&#039;&#039; recommended to experiment with this first, printing paths and environment variables, to check everything inside the container.&lt;br /&gt;
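&amp;lt;br&amp;gt;As a sketch, a minimal .sub file for such a job could look like this (the output/log file names are hypothetical; $ENV(USER) expands an environment variable inside a submit file):&lt;br /&gt;
&amp;lt;pre&amp;gt;executable = /group/userscripts/sl7&lt;br /&gt;
arguments  = /user/$ENV(USER)/MYSUPERSCRIPT.sh&lt;br /&gt;
output     = job.out&lt;br /&gt;
error      = job.err&lt;br /&gt;
log        = job.log&lt;br /&gt;
queue&amp;lt;/pre&amp;gt;&lt;br /&gt;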
&lt;br /&gt;
To be more exhaustive, what the &#039;&#039;&#039;sl7&#039;&#039;&#039; script does is first pass your PATH &amp;amp; LD_LIBRARY_PATH to singularity, and then launch your script with the following line:&lt;br /&gt;
 singularity exec -B /cvmfs -B /pnfs -B /user -B /scratch /cvmfs/singularity.opensciencegrid.org/opensciencegrid/osgvo-el7:latest MYSCRIPT.sh&lt;br /&gt;
&lt;br /&gt;
where: &lt;br /&gt;
* &#039;&#039;&#039;exec&#039;&#039;&#039;: the singularity action; here it simply executes your command in the specified container&lt;br /&gt;
* &#039;&#039;&#039;-B /mountpoint&#039;&#039;&#039;: bind-mounts the given path so that it is available inside the container if needed.&lt;br /&gt;
* &#039;&#039;&#039;/cvmfs/.../osgvo-el7:latest&#039;&#039;&#039;: the path on /cvmfs of the container image used.&lt;br /&gt;
* &#039;&#039;&#039;MYSCRIPT.sh&#039;&#039;&#039;: script or command to execute&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&#039;&#039;&#039;N.B.&#039;&#039;&#039; You can look at more containers inside &#039;&#039;&#039;/cvmfs/singularity.opensciencegrid.org/opensciencegrid/&#039;&#039;&#039;.&lt;br /&gt;
&amp;lt;br&amp;gt;You can also create your own container, for instance with specific software versions; it is quite simple.&lt;br /&gt;
&amp;lt;br&amp;gt;You can use the guide [[SingularityContainerCreation|here]], or google &#039;&#039;singularity container&#039;&#039; in your favorite search engine :)&lt;br /&gt;
&lt;br /&gt;
=== If you need grid commands ===&lt;br /&gt;
The above containers work perfectly on our cluster. However, they have the drawback of not being updated often. Unfortunately, to be able to work in a grid environment, the CA (certificate authority) files need to be kept up-to-date.&amp;lt;br&amp;gt;&lt;br /&gt;
For this reason, CMS also provides some containers. Unfortunately, not all software is inside (python in el9 is lacking at the moment), but the grid commands do work. To use them, run the script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/cvmfs/cms.cern.ch/common/cmssw-el7&lt;br /&gt;
or&lt;br /&gt;
/cvmfs/cms.cern.ch/common/cmssw-el8&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Some more info on these containers can be found [http://cms-sw.github.io/singularity.html here]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=OtherSoftware&amp;diff=1421</id>
		<title>OtherSoftware</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=OtherSoftware&amp;diff=1421"/>
		<updated>2025-09-01T06:18:15Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Software available from the T2 UI machines ==&lt;br /&gt;
The T2 offers much more software than what is available directly on the mX machines. All of it is provided by CERN&#039;s centrally managed file system, CVMFS. &lt;br /&gt;
It is divided into a CMS-specific area and a more general area aimed at high-energy physics. &amp;lt;br&amp;gt;&lt;br /&gt;
The CMS-specific area comes with a home-brewed script that makes it easy to search for specific versions of the software you need. However, its set-up is a lot more complicated than the general-purpose area. Therefore we recommend the general-purpose area unless you cannot find the specific version you need in it.&lt;br /&gt;
&lt;br /&gt;
=== General purpose area ===&lt;br /&gt;
&lt;br /&gt;
!!! N.B.: For now, the recommended release is [https://lcginfo.cern.ch/release/107/ &#039;&#039;&#039;LCG_107&#039;&#039;&#039;] with platform &#039;&#039;&#039;x86_64-el9-gcc11-opt&#039;&#039;&#039;, with a list of all packages included [https://lcginfo.cern.ch/release_packages/x86_64-el9-gcc11-opt/107/ here] !!! Just source the software area like this:&lt;br /&gt;
 source /cvmfs/sft.cern.ch/lcg/views/setupViews.sh  LCG_107 x86_64-el9-gcc11-opt&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====The LCG environments, a group of packages neatly built into a coherent environment ====&lt;br /&gt;
Instead of trying to explain in detail how the /cvmfs/sft.cern.ch sub-directories are organized, we will limit ourselves in this wiki to giving some relevant links and commands for the general-purpose area.&lt;br /&gt;
&lt;br /&gt;
A high-level description of the software packages and how they are grouped in LCG Configurations can be found here: [http://lcginfo.cern.ch lcginfo.cern.ch].&amp;lt;br&amp;gt;&lt;br /&gt;
You can search by LCG release, platforms, but more importantly by package name.&amp;lt;br&amp;gt;&lt;br /&gt;
Then from the package, you can select the specific version you want, and find which LCG release contains it.&amp;lt;br&amp;gt;&lt;br /&gt;
Finally, just find an arch-OS-compiler combo that suits you (see below), and voilà! &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Load a specific software repository ====&lt;br /&gt;
&lt;br /&gt;
The software packages are grouped in LCG Configurations. In each of these configurations, you will normally find a set of software versions that are compatible with each other. To get the list of LCG Configurations:&lt;br /&gt;
&amp;lt;pre&amp;gt;/cvmfs/sft.cern.ch/lcg/views/checkSetupViews.sh&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now, let&#039;s say you want to know more about the &#039;&#039;&#039;&amp;quot;LCG_107&amp;quot;&#039;&#039;&#039; configuration, and find out which architecture/OS/compiler combinations are available for it:&lt;br /&gt;
&amp;lt;pre&amp;gt;/cvmfs/sft.cern.ch/lcg/views/checkSetupViews.sh LCG_107&lt;br /&gt;
Available &amp;lt;arch-os-complier&amp;gt; for LCG_107 : &lt;br /&gt;
  aarch64-el9-gcc13-dbg&lt;br /&gt;
  aarch64-el9-gcc13-opt&lt;br /&gt;
  aarch64-el9-gcc14-dbg&lt;br /&gt;
  aarch64-el9-gcc14-opt&lt;br /&gt;
  arm64-mac13-clang150-opt&lt;br /&gt;
  arm64-mac14-clang160-opt&lt;br /&gt;
  arm64-mac15-clang160-opt&lt;br /&gt;
  x86_64-el8-gcc11-opt&lt;br /&gt;
  x86_64-el9-clang16-dbg&lt;br /&gt;
  x86_64-el9-clang16-opt&lt;br /&gt;
  x86_64-el9-clang19-dbg&lt;br /&gt;
  x86_64-el9-clang19-opt&lt;br /&gt;
  x86_64-el9-gcc11-opt&lt;br /&gt;
  x86_64-el9-gcc13-dbg&lt;br /&gt;
  x86_64-el9-gcc13-opt&lt;br /&gt;
  x86_64-el9-gcc14-dbg&lt;br /&gt;
  x86_64-el9-gcc14-opt&lt;br /&gt;
  x86_64-el9-gcc14fp-opt&lt;br /&gt;
  x86_64-ubuntu2004-gcc9-opt&lt;br /&gt;
  x86_64-ubuntu2204-gcc11-opt&lt;br /&gt;
  x86_64-ubuntu2404-gcc13-opt&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you are on an Alma 9 UI (like the T2B cluster) and want the gcc11 compiler (prefer gcc over other compilers on the cluster unless you know what you are doing), issue the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;source /cvmfs/sft.cern.ch/lcg/views/setupViews.sh  LCG_107 x86_64-el9-gcc11-opt&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Check the version of gcc :&lt;br /&gt;
&amp;lt;pre&amp;gt;gcc -v&lt;br /&gt;
gcc version 11.2.0 (GCC)&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At this point your environment has been changed to use all the versions in this specific software area; just check by issuing &#039;&#039;&#039;env&#039;&#039;&#039;.&lt;br /&gt;
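For instance, a quick way to confirm that the tools now come from the LCG view (the exact paths depend on the release and platform chosen):&lt;br /&gt;
&amp;lt;pre&amp;gt;which gcc python3 root&lt;br /&gt;
# each should resolve under /cvmfs/sft.cern.ch/lcg/views/LCG_107/x86_64-el9-gcc11-opt/...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;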
&lt;br /&gt;
=== GFAL issue when loading a different version of python ===&lt;br /&gt;
When the software repository you want to use contains a version of python that is higher than the one installed on the cluster, you might encounter a problem with the gfal libraries:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&#039;import site&#039; failed; use -v for traceback&lt;br /&gt;
Traceback (most recent call last):&lt;br /&gt;
  File &amp;quot;/usr/bin/gfal-mkdir&amp;quot;, line 24, in &amp;lt;module&amp;gt;&lt;br /&gt;
    from gfal2_util.shell import Gfal2Shell&lt;br /&gt;
ImportError: No module named gfal2_util.shell&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This happens because of a mismatch between the software used and the libraries installed. The easiest solution is to use python 3 and a specific version of the UI (user interface) software like in the following example:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    source /cvmfs/sft.cern.ch/lcg/views/setupViews.sh LCG_93python3 x86_64-centos7-gcc7-opt&lt;br /&gt;
    source /cvmfs/grid.cern.ch/emi3ui-latest/etc/profile.d/setup-ui-example.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
=== CMS specific area ===&lt;br /&gt;
!! Prefer loading a complete LCG environment (see above); this way you are sure your environment and packages are coherent with each other !!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The T2 hosts much software that is packaged with specific versions of CMSSW but can be used in stand-alone mode. A tool was designed to easily find what is available and which versions can be used.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To list all the available software:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/swmgrs/cmss/soft.pl --list&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The most commonly needed packages are specific versions of gcc, or a ROOT or Geant4 version. To find all the available gcc versions:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/swmgrs/cmss/soft.pl --versions gcc&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Now, to get a given version of gcc (4.6.2 in this example) :&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/swmgrs/cmss/soft.pl --load gcc/4.6.2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This prints the full path of the init.sh that you have to source to get the desired version of gcc :&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
source /cvmfs/cms.cern.ch/slc6_amd64_gcc462/external/gcc/4.6.2/etc/profile.d/init.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Let&#039;s now check the version of gfortran :&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
gfortran --version&lt;br /&gt;
GNU Fortran (GCC) 4.6.2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{TracNotice|{{PAGENAME}}}}&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=Rucio&amp;diff=1420</id>
		<title>Rucio</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=Rucio&amp;diff=1420"/>
		<updated>2025-08-12T11:36:12Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Make a rucio request on T2B as a user */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Rucio instructions at T2B ==&lt;br /&gt;
&lt;br /&gt;
=== Introduction ===&lt;br /&gt;
&lt;br /&gt;
Rucio web requests are handled via https://cms-rucio-webui.cern.ch/r2d2/request&lt;br /&gt;
&lt;br /&gt;
Some Rucio vocabulary:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
RSE: Rucio Storage Element. In our case this is T2_BE_IIHE (and not the name of the SE)&lt;br /&gt;
DID: Data identifier; used to represent any file, dataset or container identifier. Data identifiers are unique over time, so they are never reused.&lt;br /&gt;
rule: a data transfer request. This includes the DID, the RSE and a possible end time at which the files are deleted again.&lt;br /&gt;
lock: a transferred file&lt;br /&gt;
scope: in our case the scope is cms.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Find the full info on: https://twiki.cern.ch/twiki/bin/view/CMSPublic/Rucio&lt;br /&gt;
&lt;br /&gt;
=== Make a rucio request on T2B as a user ===&lt;br /&gt;
&lt;br /&gt;
- initialise the env:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
source /cvmfs/cms.cern.ch/cmsset_default.sh&lt;br /&gt;
source /cvmfs/cms.cern.ch/rucio/setup-py3.sh&lt;br /&gt;
voms-proxy-init -voms cms -rfc -valid 192:00&lt;br /&gt;
export RUCIO_ACCOUNT=`whoami`&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
- in rucio, a transfer is called a rule. In order to create a rule:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rucio rule add cms:/CMS/DATA/SET/NAME 1 T2_MY_SITE  &lt;br /&gt;
rucio rule add cms:/CMS/DATA/SET/NAME#BLOCK-NAME 1 T2_MY_SITE&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
- if there is a lot of data, add &amp;lt;pre&amp;gt;--asynchronous&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
- rules created by a normal user need to be approved by the admins. This is done with 2 extra modifiers:&lt;br /&gt;
&amp;lt;pre&amp;gt;--ask-approval&amp;lt;/pre&amp;gt; and &amp;lt;pre&amp;gt;--lifetime (in seconds; for reference, 30 days is 2592000 seconds)&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Summary to request a rule as a user ====&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rucio rule add cms:/CMS/DATA/SET/NAME 1 T2_BE_IIHE --asynchronous --ask-approval --lifetime 7776000&lt;br /&gt;
(3 months)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
- now wait for the admins to approve your rule (usually within the day)&lt;br /&gt;
&lt;br /&gt;
==== Getting more info about your request ====&lt;br /&gt;
- you can check your rules:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rucio list-rules --account $RUCIO_ACCOUNT   # list all the rules you have and their state&lt;br /&gt;
rucio rule-info [RULE_HASH]                 # monitor the progress of your rule and any transfers it may have initiated&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Example: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rucio list-rules --account $RUCIO_ACCOUNT&lt;br /&gt;
ID                                ACCOUNT    SCOPE:NAME                                                                                                                                        STATE[OK/REPL/STUCK]    RSE_EXPRESSION      COPIES  EXPIRES (UTC)        CREATED (UTC)&lt;br /&gt;
--------------------------------  ---------  ------------------------------------------------------------------------------------------------------------------------------------------------  ----------------------  ----------------  --------  -------------------  -------------------&lt;br /&gt;
eafda759f4ae4128b49fede980db5622  odevroed   cms:/Neutrino_E-10_gun/RunIISummer17PrePremix-PUAutumn18_102X_upgrade2018_realistic_v15-v1/GEN-SIM-DIGI-RAW#0149acf0-6b06-43c9-b99f-dfc531b6eecb  OK[235/0/0]             T2_BE_IIHE               1  2021-02-19 15:11:47  2021-01-20 15:11:47&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
rucio rule-info eafda759f4ae4128b49fede980db5622&lt;br /&gt;
Id:                         eafda759f4ae4128b49fede980db5622&lt;br /&gt;
Account:                    odevroed&lt;br /&gt;
Scope:                      cms&lt;br /&gt;
Name:                       /Neutrino_E-10_gun/RunIISummer17PrePremix-PUAutumn18_102X_upgrade2018_realistic_v15-v1/GEN-SIM-DIGI-RAW#0149acf0-6b06-43c9-b99f-dfc531b6eecb&lt;br /&gt;
RSE Expression:             T2_BE_IIHE&lt;br /&gt;
Copies:                     1&lt;br /&gt;
State:                      OK&lt;br /&gt;
Locks OK/REPLICATING/STUCK: 235/0/0&lt;br /&gt;
Grouping:                   DATASET&lt;br /&gt;
Expires at:                 2021-02-19 15:11:47&lt;br /&gt;
Locked:                     False&lt;br /&gt;
Weight:                     None&lt;br /&gt;
Created at:                 2021-01-20 15:11:47&lt;br /&gt;
Updated at:                 2021-01-20 15:36:07&lt;br /&gt;
Error:                      None&lt;br /&gt;
Subscription Id:            None&lt;br /&gt;
Source replica expression:  None&lt;br /&gt;
Activity:                   User Subscriptions&lt;br /&gt;
Comment:                    None&lt;br /&gt;
Ignore Quota:               True&lt;br /&gt;
Ignore Availability:        False&lt;br /&gt;
Purge replicas:             False&lt;br /&gt;
Notification:               NO&lt;br /&gt;
End of life:                None&lt;br /&gt;
Child Rule Id:              None&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
More info [https://twiki.cern.ch/twiki/bin/view/CMSPublic/RucioUserDocsRules here]&lt;br /&gt;
&lt;br /&gt;
=== Information for the T2B admins ===&lt;br /&gt;
&lt;br /&gt;
- requests can also be done for users, but I have not tested this.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- users can also be given [https://twiki.cern.ch/twiki/bin/view/CMSPublic/RucioSiteDocsQuotas quotas]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- datasets can be grouped into containers. This can be handy if several datasets are needed for a specific analysis. All of them can then be removed together when the analysis is finished. This is done via rucio containers. More info [https://twiki.cern.ch/twiki/bin/view/CMSPublic/RucioUserDocsContainers here].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At the moment of writing this document, these were the settings for our site:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rucio list-rse-attributes T2_BE_IIHE&lt;br /&gt;
T2_BE_IIHE:              True&lt;br /&gt;
cms_type:                real&lt;br /&gt;
country:                 BE&lt;br /&gt;
ddm_quota:               3355000000000000&lt;br /&gt;
fts:                     https://fts3-cms.cern.ch:8446&lt;br /&gt;
lfn2pfn_algorithm:       cmstfc&lt;br /&gt;
pnn:                     T2_BE_IIHE&lt;br /&gt;
quota_approvers:         odevroed,rougny,srugovac&lt;br /&gt;
reaper:                  True&lt;br /&gt;
region:                  C&lt;br /&gt;
rule_approvers:          odevroed,rougny,srugovac&lt;br /&gt;
source_for_total_space:  static&lt;br /&gt;
source_for_used_space:   rucio&lt;br /&gt;
tier:                    2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can see the data stored on a T2 by issuing the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rucio list-rse-usage T2_BE_IIHE --show-accounts&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=Main_Page&amp;diff=1419</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=Main_Page&amp;diff=1419"/>
		<updated>2025-06-23T09:05:46Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Welcome to the CMS Belgian T2 Wiki */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Welcome to the CMS Belgian T2 Wiki ==&lt;br /&gt;
&amp;lt;center&amp;gt;&amp;lt;span style=&amp;quot;font-size: 300%;&amp;quot;&amp;gt; [[first_access_to_t2b|=&amp;gt; FIRST ACCESS TO T2B &amp;lt;=]] &amp;lt;/span&amp;gt;&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== General information for users ===&lt;br /&gt;
&lt;br /&gt;
*[[First_access_to_t2b|Getting access to T2B]]&lt;br /&gt;
&lt;br /&gt;
==== Information for new users ====&lt;br /&gt;
&lt;br /&gt;
*[[ Cluster_Overview | Overview of the cluster with all relevant information ]]&lt;br /&gt;
*[[Faq_t2b | FAQ]]&lt;br /&gt;
*[[Getting_a_certificate_for_the_T2|Certificates and VOs]]&lt;br /&gt;
*[[Introduction_to_Linux|Introduction to Linux]]&lt;br /&gt;
*[[CorrectWorkflow|Correct workflow, or how to use T2B resources efficiently]]&lt;br /&gt;
&lt;br /&gt;
==== Using the Tier2 computing resources ====&lt;br /&gt;
*[[HTCondor|Using the new HTCondor cluster]]&lt;br /&gt;
*[[GridStorageAccess| How to handle data on the &#039;&#039;&#039;/pnfs&#039;&#039;&#039; Grid Storage]]&lt;br /&gt;
*[[SingularityContainers|How to use EL7/EL8 on the cluster with containers]]&lt;br /&gt;
*[[Rucio|How to use rucio at T2B]]&lt;br /&gt;
*[[OtherSoftware| Other software available at the T2]]&lt;br /&gt;
*[[PublicWebpages|Having Public Webpages]]&lt;br /&gt;
*[[Crontab|Executing scripts regularly on the UIs using crontab]]&lt;br /&gt;
*[[Backup| Backups of /user , /group , /data , /ice3]]&lt;br /&gt;
*[[GPUs| About GPUs]]&lt;br /&gt;
&lt;br /&gt;
==== Other topics ====&lt;br /&gt;
*[[Basic_computing_skills| Basic computing skills]]&lt;br /&gt;
*[[Using_Git| Using Git]]&lt;br /&gt;
*[[CernLxplus| Useful info on use of lxplus.cern.ch and CERN facilities]]&lt;br /&gt;
*[[Using_htmap| Using htmap (HTCondor python library)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Dedicated experiment pages ===&lt;br /&gt;
!! Beware some pages might be obsolete !!&lt;br /&gt;
&lt;br /&gt;
==== Jupyter Notebook ====&lt;br /&gt;
*[[jupyterlabt2b|Using Jupyter Lab on T2B cluster]]&lt;br /&gt;
&lt;br /&gt;
==== CMS ====&lt;br /&gt;
*[[gridSubmission_withCrab| Submitting jobs with CRAB to the worldwide grid]]&lt;br /&gt;
*[[Getting_started_with_the_CMSSW_software| Getting started with the CMSSW software]]&lt;br /&gt;
*[[FAQ_CMSSW_on_the_Grid| FAQ CMSSW on the Grid on proxy and more!]]&lt;br /&gt;
*[[Getting_started_with_the_MadGraph_software| Getting started with the MadGraph software]]&lt;br /&gt;
*[[TtBar_Analysis_Framework| TtBar Analysis Framework (old)]]&lt;br /&gt;
*[[TopQuarkGroup| Top Quark Group wiki]]&lt;br /&gt;
*[[HEEP_Analysis_Framework| HEEP Analysis Framework]]&lt;br /&gt;
*[[V0_Analysis_wiki| V0 Analysis wiki]]&lt;br /&gt;
*[[Info_exchange| Higgs analysis]]&lt;br /&gt;
==== IceCube ====&lt;br /&gt;
* [[Transfer_from_Madison|Transfer files from Madison]]&lt;br /&gt;
* [[Register_to_the_IceCube_VO|Register to the IceCube VO]]&lt;br /&gt;
* [[IceCube_software|IceCube software]]&lt;br /&gt;
* [[Build_IceCube_software|Build IceCube software]]&lt;br /&gt;
* [[ipython_notebook|iPython notebook]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*[[ObsoletePages|Obsolete twiki pages]]&lt;br /&gt;
&lt;br /&gt;
== Admin section ==&lt;br /&gt;
*[[AdminPage| Pages for administrators]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=HTCFirstSubmissionGuide&amp;diff=1418</id>
		<title>HTCFirstSubmissionGuide</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=HTCFirstSubmissionGuide&amp;diff=1418"/>
		<updated>2025-04-10T12:39:43Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* T2B Specifics */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== First time submitting a job ==&lt;br /&gt;
For new users, we recommend following [https://indico.cern.ch/event/936993/contributions/4022073/attachments/2105538/3540926/2020-Koch-User-Tutorial.pdf this presentation], that should give you an idea of how to submit jobs.&amp;lt;br&amp;gt;&lt;br /&gt;
Then, to practice the basics of job submission on an HTCondor cluster, some exercises are proposed on this [https://en.wikitolearn.org/Course:HTCondor Wiki].&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;T2B Specifics&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
- &#039;&#039;&#039;File Transfers:&#039;&#039;&#039;&lt;br /&gt;
: Please note that, contrary to what is usually shown in documentation and examples, we recommend not using the HTCondor file transfer mechanisms (&#039;&#039;&#039;should_transfer_files = NO&#039;&#039;&#039;) and copying files yourself within your script.&lt;br /&gt;
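: As a sketch, the relevant part of the .sub file then just disables the transfers, and the copying happens inside your script (the script name here is hypothetical):&lt;br /&gt;
&amp;lt;pre&amp;gt;should_transfer_files = NO&lt;br /&gt;
executable = job.sh&lt;br /&gt;
queue&amp;lt;/pre&amp;gt;&lt;br /&gt;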
&lt;br /&gt;
&lt;br /&gt;
- &#039;&#039;&#039;Sending many jobs:&#039;&#039;&#039;&lt;br /&gt;
:If you plan on sending &amp;gt; 1k jobs at the same time (not a problem), please add &#039;&#039;&#039;max_idle = 200&#039;&#039;&#039; to your .sub file.&amp;lt;br&amp;gt;&lt;br /&gt;
:This will make sure only 200 jobs materialize in the queue at one time, not stressing the scheduler machine (even if you send &amp;gt;50k jobs in total).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- &#039;&#039;&#039;Always adapt requested resources to your job&#039;&#039;&#039;&lt;br /&gt;
: You need to adapt the resources you request to what you estimate your job will need. Requesting more than what you really need is wasteful and deprives your fellow users of resources they might require.&lt;br /&gt;
: To do so, just add to your submit file the following lines:&lt;br /&gt;
&amp;lt;pre&amp;gt;request_cpus = 1&lt;br /&gt;
request_memory = 200MB&lt;br /&gt;
request_disk = 1GB&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: Note that if a job needs more than 1 CPU / 4GB of memory / 10GB of disk, please be careful and sure of what you are doing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- &#039;&#039;&#039;Where am I when I start a job ?&#039;&#039;&#039;&lt;br /&gt;
:You should always prefer using $HOME, which in a job points to a unique directory on the /scratch area of the local disk, e.g. /scratch/condor/dir_275928.&lt;br /&gt;
: $TMPDIR = $HOME/tmp, so it is also on the local disk and unique&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- &#039;&#039;&#039;Efficient use of local disks on worker nodes&#039;&#039;&#039;&lt;br /&gt;
:Note that local disks are now exclusively NVMEs which are much -much- faster than network protocols used in writing to /pnfs or /user.&amp;lt;br&amp;gt;&lt;br /&gt;
:So for repeated reads (like cycling through events in files of O(1GB)) it is more efficient to copy the file locally first and then open it.&amp;lt;br&amp;gt;&lt;br /&gt;
:The same goes for writes: prefer writing locally and then copying the file to /pnfs.&lt;br /&gt;
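:As a sketch, such a copy-local pattern could look like this in a job script (all paths here are hypothetical):&lt;br /&gt;
&amp;lt;pre&amp;gt;#!/bin/bash&lt;br /&gt;
# read side: copy the input file to the fast local disk first&lt;br /&gt;
cp /pnfs/iihe/cms/store/user/$USER/input.root $TMPDIR/&lt;br /&gt;
cd $TMPDIR&lt;br /&gt;
# ... process input.root here, producing output.root ...&lt;br /&gt;
# write side: write locally, then copy the result out at the end&lt;br /&gt;
cp output.root /user/$USER/results/&amp;lt;/pre&amp;gt;&lt;br /&gt;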
&lt;br /&gt;
&lt;br /&gt;
- &#039;&#039;&#039;Can I have a shell/interactive job on the batch system ?&#039;&#039;&#039;&lt;br /&gt;
:Yes! If you want to make tests, or run things interactively, with dedicated core/memory for you, just run:&lt;br /&gt;
 condor_submit -i&lt;br /&gt;
:Note that if you want to reserve more than the standard 1 core / 600MB of memory, simply add your request_* [specified just above] like this:&lt;br /&gt;
 condor_submit -interactive request_cpus=2&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- &#039;&#039;&#039;Sending DAG jobs&#039;&#039;&#039;&lt;br /&gt;
:For now, sending DAG jobs only works when you are directly on the scheduler, so first add your ssh key to your local key agent and then connect to any mX machine with the -A (agent forwarding) option:&lt;br /&gt;
 ssh-add&lt;br /&gt;
 ssh -A mshort.iihe.ac.be&lt;br /&gt;
:then connect to the scheduler:&lt;br /&gt;
 ssh schedd02.wn.iihe.ac.be&lt;br /&gt;
:there you can run the condor_submit_dag commands and it will work. Note that condor_q commands to follow the progress of your DAG jobs can still be performed on the mX machines.&lt;br /&gt;
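:As a sketch, a minimal DAG description file (the submit file names are hypothetical) looks like:&lt;br /&gt;
&amp;lt;pre&amp;gt;# my.dag&lt;br /&gt;
JOB  stepA  stepA.sub&lt;br /&gt;
JOB  stepB  stepB.sub&lt;br /&gt;
PARENT stepA CHILD stepB&amp;lt;/pre&amp;gt;&lt;br /&gt;
:and is submitted from the scheduler with:&lt;br /&gt;
 condor_submit_dag my.dag&lt;br /&gt;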
&lt;br /&gt;
:&#039;&#039;&#039;NB:&#039;&#039;&#039; as we now have 2 OSes, some requirements - &#039;&#039;to be put in your .sub file&#039;&#039; - are necessary to ensure your jobs arrive on the right cluster:&lt;br /&gt;
&amp;lt;pre&amp;gt;For EL7 workflows:&lt;br /&gt;
requirements = TARGET.OpSysAndVer == &amp;quot;CentOS7&amp;quot;&lt;br /&gt;
&lt;br /&gt;
For EL9 workflows:&lt;br /&gt;
requirements = TARGET.OpSysAndVer == &amp;quot;AlmaLinux9&amp;quot;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== What is my job doing ? Is something wrong ? ===&lt;br /&gt;
For debugging queued jobs or seeing what running jobs are doing, please follow our [[HTCondorDebug|HTCondor Debug]] section.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== HTCondor Official Documentation ===&lt;br /&gt;
Have a look at the [https://htcondor.readthedocs.io/en/latest/users-manual/index.html official User Manual] on the HTCondor website.&amp;lt;br&amp;gt;&lt;br /&gt;
It is very well done and explains all available features.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== HTCondor Workshop Presentation ===&lt;br /&gt;
Every 6 months, there is an HTCondor workshop. Presentations are usually very helpful, especially if you want to go into details of HTCondor (API, DAGMan, ...). You can find the agenda of the latest one [https://indico.cern.ch/event/936993/timetable/#20200921.detailed here]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=Main_Page&amp;diff=1417</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=Main_Page&amp;diff=1417"/>
		<updated>2025-04-08T11:50:00Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Using the Tier2 computing resources */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Welcome to the CMS Belgian T2 Wiki ==&lt;br /&gt;
&amp;lt;center&amp;gt;&amp;lt;span style=&amp;quot;font-size: 300%;&amp;quot;&amp;gt; [[first_access_to_t2b|=&amp;gt; FIRST ACCESS TO T2B &amp;lt;=]] &amp;lt;/span&amp;gt;&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== General information for users ===&lt;br /&gt;
&lt;br /&gt;
*[[First_access_to_t2b|Getting access to T2B]]&lt;br /&gt;
&lt;br /&gt;
==== Information for new users ====&lt;br /&gt;
&lt;br /&gt;
*[[ Cluster_Overview | Overview of the cluster with all relevant information ]]&lt;br /&gt;
*[[Faq_t2b | FAQ]]&lt;br /&gt;
*[[Getting_a_certificate_for_the_T2|Certificates and VOs]]&lt;br /&gt;
*[[Introduction_to_Linux|Introduction to Linux]]&lt;br /&gt;
*[[CorrectWorkflow|Correct workflow, or how to use T2B resources efficiently]]&lt;br /&gt;
&lt;br /&gt;
==== Using the Tier2 computing resources ====&lt;br /&gt;
*[[HTCondor|Using the new HTCondor cluster]]&lt;br /&gt;
*[[GridStorageAccess| How to handle data on the &#039;&#039;&#039;/pnfs&#039;&#039;&#039; Grid Storage]]&lt;br /&gt;
*[[SingularityContainers|How to use EL7/EL8 on the cluster]]&lt;br /&gt;
*[[Rucio|How to use rucio at T2B]]&lt;br /&gt;
*[[OtherSoftware| Other software available at the T2]]&lt;br /&gt;
*[[PublicWebpages|Having Public Webpages]]&lt;br /&gt;
*[[Crontab|Executing scripts regularly on the UIs using crontab]]&lt;br /&gt;
*[[Backup| Backups of /user , /group , /data , /ice3]]&lt;br /&gt;
*[[GPUs| About GPUs]]&lt;br /&gt;
&lt;br /&gt;
==== Other topics ====&lt;br /&gt;
*[[Basic_computing_skills| Basic computing skills]]&lt;br /&gt;
*[[Using_Git| Using Git]]&lt;br /&gt;
*[[CernLxplus| Useful info on use of lxplus.cern.ch and CERN facilities]]&lt;br /&gt;
*[[Using_htmap| Using htmap (HTCondor python library)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Dedicated experiment pages ===&lt;br /&gt;
!! Beware some pages might be obsolete !!&lt;br /&gt;
&lt;br /&gt;
==== Jupyter Notebook ====&lt;br /&gt;
*[[jupyterlabt2b|Using Jupyter Lab on T2B cluster]]&lt;br /&gt;
&lt;br /&gt;
==== CMS ====&lt;br /&gt;
*[[gridSubmission_withCrab| Submitting jobs with CRAB to the worldwide grid]]&lt;br /&gt;
*[[Getting_started_with_the_CMSSW_software| Getting started with the CMSSW software]]&lt;br /&gt;
*[[FAQ_CMSSW_on_the_Grid| FAQ CMSSW on the Grid on proxy and more!]]&lt;br /&gt;
*[[Getting_started_with_the_MadGraph_software| Getting started with the MadGraph software]]&lt;br /&gt;
*[[TtBar_Analysis_Framework| TtBar Analysis Framework (old)]]&lt;br /&gt;
*[[TopQuarkGroup| Top Quark Group wiki]]&lt;br /&gt;
*[[HEEP_Analysis_Framework| HEEP Analysis Framework]]&lt;br /&gt;
*[[V0_Analysis_wiki| V0 Analysis wiki]]&lt;br /&gt;
*[[Info_exchange| Higgs analysis]]&lt;br /&gt;
==== IceCube ====&lt;br /&gt;
* [[Transfer_from_Madison|Transfer files from Madison]]&lt;br /&gt;
* [[Register_to_the_IceCube_VO|Register to the IceCube VO]]&lt;br /&gt;
* [[IceCube_software|IceCube software]]&lt;br /&gt;
* [[Build_IceCube_software|Build IceCube software]]&lt;br /&gt;
* [[ipython_notebook|iPython notebook]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*[[ObsoletePages|Obsolete twiki pages]]&lt;br /&gt;
&lt;br /&gt;
== Admin section ==&lt;br /&gt;
*[[AdminPage| Pages for administrators]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=OpenID&amp;diff=1416</id>
		<title>OpenID</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=OpenID&amp;diff=1416"/>
		<updated>2025-03-31T08:47:45Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Register your OpenID identity at T2B */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Grid facilities in general and CMS in particular are slowly moving from a certificate/proxy based authentication towards an OpenID/token based authentication.&lt;br /&gt;
&lt;br /&gt;
This page explains how to use them at T2B.&lt;br /&gt;
IMPORTANT: for the time being, these instructions only work on M19!&lt;br /&gt;
&lt;br /&gt;
== Getting an OpenID identity ==&lt;br /&gt;
* Go to the CMS IAM service at https://cms-auth.cern.ch/ and log in&lt;br /&gt;
* At the left, click &#039;Manage Active Tokens&#039;. At the moment, you have none.&lt;br /&gt;
* Go to the M machines and issue the following commands:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
eval `oidc-agent-service use`&lt;br /&gt;
oidc-gen --iss https://cms-auth.cern.ch/ --scope openid -w device cms-id&lt;br /&gt;
&lt;br /&gt;
Options:&lt;br /&gt;
iss:    IAM site of your virtual organisation&lt;br /&gt;
scope:  Set of pre-defined rules that limit what you can do with your tokens. The most encompassing is --scope max&lt;br /&gt;
w:      The way to connect the local &#039;id&#039; to the IAM one. &lt;br /&gt;
        In this case, &#039;device&#039; means you will need to go to a specific webpage and enter a code that oidc-gen will specify. &lt;br /&gt;
        The other options do not work well at T2B&lt;br /&gt;
cms-id: Your local id name.  &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Follow the onscreen instructions.&lt;br /&gt;
* &#039;cms-id&#039; now contains the reference to your online identity and will be used in subsequent commands. Feel free to use your imagination here.&lt;br /&gt;
* You can now go back to the IAM site and see that you have just created this identity with the &#039;openid&#039; scope.&lt;br /&gt;
* It has also created two tokens for you. More on this in a later section.&lt;br /&gt;
* For more detailed information about the available options, you can see [https://indigo-dc.gitbook.io/oidc-agent/user/oidc-gen this page] &lt;br /&gt;
* You can make as many IDs as you wish. Each can have a different scope and thus different use cases.&lt;br /&gt;
* If you want basic information about your identity, issue the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
oidc-gen -p cms-id | jq .&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Register your OpenID identity at T2B ==&lt;br /&gt;
&lt;br /&gt;
Go to your [https://cms-auth.cern.ch/manage/user/profile user page] on the IAM page and send us the value of the field labelled &#039;sub&#039;.&lt;br /&gt;
&lt;br /&gt;
It should be of this form:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Creating a token ==&lt;br /&gt;
On the CMS IAM page, you saw that you have a short-lived and a long-lived token. One is for direct use, the other to renew it easily without needing a password. This is especially useful when running longer jobs.&lt;br /&gt;
&lt;br /&gt;
The renewal token can also be recreated, but this does require you to enter a password. &lt;br /&gt;
&lt;br /&gt;
A token is created via the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
oidc-token cms-id&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Notice the use of &#039;cms-id&#039; that must be the same name as the ID you created in the previous step.&lt;br /&gt;
&lt;br /&gt;
If you forgot the name of your ID, you can always find it back via this command:&lt;br /&gt;
&amp;lt;pre&amp;gt;oidc-add --list&amp;lt;/pre&amp;gt;&lt;br /&gt;
Details about your configuration can be retrieved via:&lt;br /&gt;
&amp;lt;pre&amp;gt;oidc-add --print cms-id&amp;lt;/pre&amp;gt;&lt;br /&gt;
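As a small illustrative sketch (assuming &#039;oidc-add --list&#039; prints a header line followed by one account name per line, as oidc-agent does), a script can pick up the first configured ID name; &#039;first_id&#039; is a hypothetical helper:

```shell
# Sketch only: grab the first configured oidc-agent account name.
# Assumes 'oidc-add --list' prints a header line, then one name per line.
first_id() {
  oidc-add --list | tail -n +2 | head -n 1
}
# Usage (hypothetical): TOKEN=$(oidc-token "$(first_id)")
```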
&lt;br /&gt;
&lt;br /&gt;
You can also limit your token to specific tasks. If you want to give read access to your files to a colleague, you can send her/him a token created like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
oidc-token -s storage.read:/ cms-id&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
When your long lived renewal token expires, it can be recreated via:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
oidc-gen --reauthenticate --flow=device cms-id&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
More information about the use of oidc-token can be found [https://indigo-dc.gitbook.io/oidc-agent/user/oidc-token here].&lt;br /&gt;
&lt;br /&gt;
== Using your token at T2B ==&lt;br /&gt;
&lt;br /&gt;
If your ID is registered at T2B and you made a new token, you can now use it easily via the usual &#039;gfal&#039; commands.&lt;br /&gt;
&lt;br /&gt;
The gfal commands can take either a proxy or a token, depending on an environment variable. In the case of tokens, the variable can be set in the following way:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
export BEARER_TOKEN=$(oidc-token cms-id)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
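For long sessions it can help to wrap the export above in a small helper that fetches a fresh token before each batch of gfal commands. This is only a sketch; &#039;refresh_token&#039; is a hypothetical name and &#039;cms-id&#039; is the ID created earlier:

```shell
# Sketch: re-export a fresh bearer token before running storage commands,
# so that long sessions do not fail on an expired token.
refresh_token() {
  BEARER_TOKEN=$(oidc-token cms-id) || return 1
  export BEARER_TOKEN
}
# Usage (hypothetical):
#   refresh_token
#   gfal-ls https://dcache6-shadow.iihe.ac.be:2880/pnfs/iihe/cms/ph/sc4/
```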
Now all the gfal commands will use your token for authentication. However, only the webdav protocol is supported as of now. This command will get you started:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
gfal-ls https://dcache6-shadow.iihe.ac.be:2880/pnfs/iihe/cms/ph/sc4/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=OpenID&amp;diff=1415</id>
		<title>OpenID</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=OpenID&amp;diff=1415"/>
		<updated>2025-03-31T08:32:12Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Grid facilities in general and CMS in particular are slowly moving from a certificate/proxy based authentication towards an OpenID/token based authentication.&lt;br /&gt;
&lt;br /&gt;
This page explains how to use them at T2B.&lt;br /&gt;
IMPORTANT: for the time being, these instructions only work on M19!&lt;br /&gt;
&lt;br /&gt;
== Getting an OpenID identity ==&lt;br /&gt;
* Go to the CMS IAM service at https://cms-auth.cern.ch/ and log in&lt;br /&gt;
* At the left, click &#039;Manage Active Tokens&#039;. At the moment, you have none.&lt;br /&gt;
* Go to the M machines and issue the following commands:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
eval `oidc-agent-service use`&lt;br /&gt;
oidc-gen --iss https://cms-auth.cern.ch/ --scope openid -w device cms-id&lt;br /&gt;
&lt;br /&gt;
Options:&lt;br /&gt;
iss:    IAM site of your virtual organisation&lt;br /&gt;
scope:  Set of pre-defined rules that limit what you can do with your tokens. The most encompassing is --scope max&lt;br /&gt;
w:      The way to connect the local &#039;id&#039; to the IAM one. &lt;br /&gt;
        In this case, &#039;device&#039; means you will need to go to a specific webpage and enter a code that oidc-gen will specify. &lt;br /&gt;
        The other options do not work well at T2B&lt;br /&gt;
cms-id: Your local id name.  &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Follow the onscreen instructions.&lt;br /&gt;
* &#039;cms-id&#039; now contains the reference to your online identity and will be used in subsequent commands. Feel free to use your imagination here.&lt;br /&gt;
* You can now go back to the IAM site and see that you have just created this identity with the &#039;openid&#039; scope.&lt;br /&gt;
* It has also created two tokens for you. More on this in a later section.&lt;br /&gt;
* For more detailed information about the available options, you can see [https://indigo-dc.gitbook.io/oidc-agent/user/oidc-gen this page] &lt;br /&gt;
* You can make as many IDs as you wish. Each can have a different scope and thus different use cases.&lt;br /&gt;
* If you want basic information about your identity, issue the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
oidc-gen -p cms-id | jq .&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Register your OpenID identity at T2B ==&lt;br /&gt;
&lt;br /&gt;
Go to your [https://cms-auth.web.cern.ch/manage/user/profile user page] on the IAM page and send us the value of the field labelled &#039;sub&#039;.&lt;br /&gt;
&lt;br /&gt;
It should be of this form:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Creating a token ==&lt;br /&gt;
On the CMS IAM page, you saw that you have a short-lived and a long-lived token. One is for direct use, the other to renew it easily without needing a password. This is especially useful when running longer jobs.&lt;br /&gt;
&lt;br /&gt;
The renewal token can also be recreated, but this does require you to enter a password. &lt;br /&gt;
&lt;br /&gt;
A token is created via the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
oidc-token cms-id&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Notice the use of &#039;cms-id&#039; that must be the same name as the ID you created in the previous step.&lt;br /&gt;
&lt;br /&gt;
If you forgot the name of your ID, you can always find it back via this command:&lt;br /&gt;
&amp;lt;pre&amp;gt;oidc-add --list&amp;lt;/pre&amp;gt;&lt;br /&gt;
Details about your configuration can be retrieved via:&lt;br /&gt;
&amp;lt;pre&amp;gt;oidc-add --print cms-id&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You can also limit your token to specific tasks. If you want to give read access to your files to a colleague, you can send her/him a token created like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
oidc-token -s storage.read:/ cms-id&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
When your long lived renewal token expires, it can be recreated via:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
oidc-gen --reauthenticate --flow=device cms-id&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
More information about the use of oidc-token can be found [https://indigo-dc.gitbook.io/oidc-agent/user/oidc-token here].&lt;br /&gt;
&lt;br /&gt;
== Using your token at T2B ==&lt;br /&gt;
&lt;br /&gt;
If your ID is registered at T2B and you made a new token, you can now use it easily via the usual &#039;gfal&#039; commands.&lt;br /&gt;
&lt;br /&gt;
The gfal commands can take either a proxy or a token, depending on an environment variable. In the case of tokens, the variable can be set in the following way:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
export BEARER_TOKEN=$(oidc-token cms-id)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now all the gfal commands will use your token for authentication. However, only the webdav protocol is supported as of now. This command will get you started:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
gfal-ls https://dcache6-shadow.iihe.ac.be:2880/pnfs/iihe/cms/ph/sc4/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=OpenID&amp;diff=1414</id>
		<title>OpenID</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=OpenID&amp;diff=1414"/>
		<updated>2025-03-31T08:31:46Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Grid facilities in general and CMS in particular are slowly moving from a certificate/proxy based authentication towards an OpenID/token based authentication.&lt;br /&gt;
&lt;br /&gt;
This page explains how to use them at T2B.&lt;br /&gt;
IMPORTANT: for the time being, these instructions only work on M19!&lt;br /&gt;
&lt;br /&gt;
== Getting an OpenID identity ==&lt;br /&gt;
* Go to the CMS IAM service at https://cms-auth.web.cern.ch/ and log in&lt;br /&gt;
* At the left, click &#039;Manage Active Tokens&#039;. At the moment, you have none.&lt;br /&gt;
* Go to the M machines and issue the following commands:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
eval `oidc-agent-service use`&lt;br /&gt;
oidc-gen --iss https://cms-auth.cern.ch/ --scope openid -w device cms-id&lt;br /&gt;
&lt;br /&gt;
Options:&lt;br /&gt;
iss:    IAM site of your virtual organisation&lt;br /&gt;
scope:  Set of pre-defined rules that limit what you can do with your tokens. The most encompassing is --scope max&lt;br /&gt;
w:      The way to connect the local &#039;id&#039; to the IAM one. &lt;br /&gt;
        In this case, &#039;device&#039; means you will need to go to a specific webpage and enter a code that oidc-gen will specify. &lt;br /&gt;
        The other options do not work well at T2B&lt;br /&gt;
cms-id: Your local id name.  &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Follow the onscreen instructions.&lt;br /&gt;
* &#039;cms-id&#039; now contains the reference to your online identity and will be used in subsequent commands. Feel free to use your imagination here.&lt;br /&gt;
* You can now go back to the IAM site and see that you have just created this identity with the &#039;openid&#039; scope.&lt;br /&gt;
* It has also created two tokens for you. More on this in a later section.&lt;br /&gt;
* For more detailed information about the available options, you can see [https://indigo-dc.gitbook.io/oidc-agent/user/oidc-gen this page] &lt;br /&gt;
* You can make as many IDs as you wish. Each can have a different scope and thus different use cases.&lt;br /&gt;
* If you want basic information about your identity, issue the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
oidc-gen -p cms-id | jq .&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Register your OpenID identity at T2B ==&lt;br /&gt;
&lt;br /&gt;
Go to your [https://cms-auth.web.cern.ch/manage/user/profile user page] on the IAM page and send us the value of the field labelled &#039;sub&#039;.&lt;br /&gt;
&lt;br /&gt;
It should be of this form:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Creating a token ==&lt;br /&gt;
On the CMS IAM page, you saw that you have a short-lived and a long-lived token. One is for direct use, the other to renew it easily without needing a password. This is especially useful when running longer jobs.&lt;br /&gt;
&lt;br /&gt;
The renewal token can also be recreated, but this does require you to enter a password. &lt;br /&gt;
&lt;br /&gt;
A token is created via the following command:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
oidc-token cms-id&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Notice the use of &#039;cms-id&#039; that must be the same name as the ID you created in the previous step.&lt;br /&gt;
&lt;br /&gt;
If you forgot the name of your ID, you can always find it back via this command:&lt;br /&gt;
&amp;lt;pre&amp;gt;oidc-add --list&amp;lt;/pre&amp;gt;&lt;br /&gt;
Details about your configuration can be retrieved via:&lt;br /&gt;
&amp;lt;pre&amp;gt;oidc-add --print cms-id&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You can also limit your token to specific tasks. If you want to give read access to your files to a colleague, you can send her/him a token created like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
oidc-token -s storage.read:/ cms-id&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
When your long lived renewal token expires, it can be recreated via:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
oidc-gen --reauthenticate --flow=device cms-id&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
More information about the use of oidc-token can be found [https://indigo-dc.gitbook.io/oidc-agent/user/oidc-token here].&lt;br /&gt;
&lt;br /&gt;
== Using your token at T2B ==&lt;br /&gt;
&lt;br /&gt;
If your ID is registered at T2B and you made a new token, you can now use it easily via the usual &#039;gfal&#039; commands.&lt;br /&gt;
&lt;br /&gt;
The gfal commands can take either a proxy or a token, depending on an environment variable. In the case of tokens, the variable can be set in the following way:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
export BEARER_TOKEN=$(oidc-token cms-id)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now all the gfal commands will use your token for authentication. However, only the webdav protocol is supported as of now. This command will get you started:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
gfal-ls https://dcache6-shadow.iihe.ac.be:2880/pnfs/iihe/cms/ph/sc4/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=GridStorageAccess&amp;diff=1413</id>
		<title>GridStorageAccess</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=GridStorageAccess&amp;diff=1413"/>
		<updated>2025-03-17T15:18:10Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Before starting */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
&lt;br /&gt;
This page describes how to handle data stored on our mass storage system.&lt;br /&gt;
== Introduction ==&lt;br /&gt;
T2B has an ever increasing amount of mass storage available. This system hosts both the centrally produced datasets as well as the user produced data. &amp;lt;br&amp;gt;&lt;br /&gt;
The mass storage is managed by a software called [https://www.dcache.org/ dCache]. As this software is in full development, new features are added continuously. This page contains an overview of the most used features of the software. &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== General info ==&lt;br /&gt;
As dCache was designed for precious data, the files are immutable. This means that once they are written, they cannot be changed any more. So, if you want to make changes to a file on dCache, you need to first erase it and then write it anew. &lt;br /&gt;
Our dCache instance is mounted on all the M machines and can be browsed via the /pnfs directory. If you want to find your personal directory, the structure is the following:&lt;br /&gt;
&amp;lt;pre&amp;gt;/pnfs/iihe/&amp;lt;Experiment&amp;gt;/store/user/&amp;lt;Username&amp;gt;     &amp;lt;-- Replace &amp;lt;Username&amp;gt; and &amp;lt;Experiment&amp;gt; accordingly.&amp;lt;/pre&amp;gt;&lt;br /&gt;
On the M machines as well as the whole cluster, /pnfs is mounted read-write via a protocol called &#039;nfs&#039;. Please be aware that you can now inadvertently remove a large portion of your files. As it is a mass storage system, you can easily delete several TBs of data. &lt;br /&gt;
&lt;br /&gt;
Writing and deleting files can also be done via grid-enabled commands (see next section). These still provide the best read/write speeds compared to nfs access. As these commands are mostly run from scripts, the probability of an accidental error is reduced. &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Before starting ==&lt;br /&gt;
In what follows, most of the commands will require some type of authentication to access /pnfs. This is because these commands can be executed over WAN and your location is irrelevant. &amp;lt;br&amp;gt;&lt;br /&gt;
The way authentication is done on our mass storage instance is via an x509 proxy. This proxy is made through your grid certificate. If you do not have a grid certificate, see [https://t2bwiki.iihe.ac.be/Getting_a_certificate_for_the_T2 this page] on how to get one.&amp;lt;br&amp;gt;&lt;br /&gt;
The command to make a grid proxy is:&lt;br /&gt;
&amp;lt;pre&amp;gt;voms-proxy-init --voms &amp;lt;MYEXPERIMENT&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Where &#039;&#039;&amp;lt;MYEXPERIMENT&amp;gt;&#039;&#039; is one of &#039;cms, icecube, beapps&#039;&lt;br /&gt;
&lt;br /&gt;
NOTE: the proxy is created in /tmp, so it is local to the machine. For batch jobs to execute grid commands too, set in your &#039;&#039;&#039;.bashrc&#039;&#039;&#039; file:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
export X509_USER_PROXY=/user/$USER/x509up_u$UID&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
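In scripts, it can be convenient to only create a new proxy when the current one is about to expire. The sketch below assumes &#039;voms-proxy-info --timeleft&#039; prints the remaining lifetime in seconds (as standard VOMS clients do); &#039;ensure_proxy&#039; is a hypothetical helper and &#039;cms&#039; should be replaced by your experiment:

```shell
# Sketch: regenerate the proxy only when less than one hour of lifetime is left.
ensure_proxy() {
  local left
  left=$(voms-proxy-info --timeleft 2>/dev/null || echo 0)
  if [ "${left:-0}" -lt 3600 ]; then
    voms-proxy-init --voms cms
  fi
}
```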
&lt;br /&gt;
== Browser access ==&lt;br /&gt;
dCache now exposes files using the WebDav protocol. This means that the files are accessible to browse over https.&amp;lt;br&amp;gt;&lt;br /&gt;
For this, you need to have your certificate in your browser (to import your .p12 certificate, google is your friend).&amp;lt;br&amp;gt;&lt;br /&gt;
Then just point your browser to:&lt;br /&gt;
 https://maite.iihe.ac.be:2880/pnfs/iihe/cms/store/user/&lt;br /&gt;
&lt;br /&gt;
There you will be able to see and download your files. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
dCache has an even more powerful web interface. It is called dCache View and can be accessed via:&lt;br /&gt;
 https://maite.iihe.ac.be:3880/&lt;br /&gt;
Work is still in progress to make all actions work (12/2021).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Access via GFAL ==&lt;br /&gt;
GFAL is a wrapper around the latest grid commands. Learning to use it means that whichever middleware is required in the future, you will not need to learn new commands (like srm, lcg, etc.)&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== gfal-commands ===&lt;br /&gt;
If you want more information on the available options, please use the gfal man pages!&lt;br /&gt;
&lt;br /&gt;
Here are all the commands that can be used:&lt;br /&gt;
*&#039;&#039;gfal-ls&#039;&#039;: list a directory or get information on a file&lt;br /&gt;
*&#039;&#039;gfal-mkdir&#039;&#039;: create a directory&lt;br /&gt;
*&#039;&#039;gfal-rm&#039;&#039;: remove a file. To remove an entire directory, use -r&lt;br /&gt;
*&#039;&#039;gfal-copy&#039;&#039;: copy files.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Usage ===&lt;br /&gt;
There are 2 types of file url:&lt;br /&gt;
* &#039;&#039;&#039;Distant files&#039;&#039;&#039;: their url is of the type &amp;lt;protocol&amp;gt;://&amp;lt;name_of_server&amp;gt;:&amp;lt;port&amp;gt;/some/path, eg for IIHE:&lt;br /&gt;
:&amp;lt;pre&amp;gt;davs://maite.iihe.ac.be:2880/pnfs/iihe/&amp;lt;/pre&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Local files&#039;&#039;&#039;: their url is of the type file://path_of_the_file, eg for IIHE:&lt;br /&gt;
:&amp;lt;pre&amp;gt;file:///user/$USER/MyFile.root&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Exclamation-mark.jpg|left|40x30px|line=1|]] Be careful, the number of &#039;&#039;&#039;/&#039;&#039;&#039; is very -very- important [[File:Exclamation-mark.jpg|40x30px|line=1|]]&lt;br /&gt;
&lt;br /&gt;
*To get a list of all distant urls for all the Storage Elements, one can do:&lt;br /&gt;
:&amp;lt;pre&amp;gt;lcg-infosites --is grid-bdii.desy.de --vo cms se &amp;lt;/pre&amp;gt;&lt;br /&gt;
:or read more in the [[CERNGridUrls | CERN EOS urls page]].&lt;br /&gt;
&lt;br /&gt;
=== Protocols ===&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;https/WebDavs&#039;&#039;&#039;  [preferred] [try this one first]&lt;br /&gt;
::&amp;lt;pre&amp;gt;gfal-ls davs://maite.iihe.ac.be:2880/pnfs/iihe/cms/store/user/&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;xrootd&#039;&#039;&#039;  [preferred]&lt;br /&gt;
::&amp;lt;pre&amp;gt;gfal-ls root://maite.iihe.ac.be:1094/pnfs/iihe/cms/store/user/&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;srm&#039;&#039;&#039;  [deprecated]&lt;br /&gt;
::&amp;lt;pre&amp;gt;gfal-ls srm://maite.iihe.ac.be:8443/pnfs/iihe/cms/store/user/&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;nfs&#039;&#039;&#039;  &#039;&#039;{local-only  |  no-cert}&#039;&#039;&lt;br /&gt;
::&amp;lt;pre&amp;gt;gfal-ls /pnfs/iihe/cms/store/user/&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;dcap&#039;&#039;&#039;  &#039;&#039;{local-only  |  no-cert}&#039;&#039;&lt;br /&gt;
::&amp;lt;pre&amp;gt;gfal-ls dcap://maite.iihe.ac.be/pnfs/iihe/cms/store/user&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Examples ===&lt;br /&gt;
*To list the contents of a directory &#039;&#039;/pnfs/iihe/cms&#039;&#039; :&lt;br /&gt;
::&amp;lt;pre&amp;gt; gfal-ls davs://maite.iihe.ac.be:2880/pnfs/iihe/cms &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* To create a directory:&lt;br /&gt;
::&amp;lt;pre&amp;gt; gfal-mkdir davs://maite.iihe.ac.be:2880/pnfs/iihe/cms/store/user/$USER/NewDir &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
*copy file from local disk to remote server &lt;br /&gt;
::&amp;lt;pre&amp;gt; gfal-copy file:///user/$USER/MyFile.root davs://maite.iihe.ac.be:2880/pnfs/iihe/cms/store/user/$USER/ &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* To copy a file from remote server to our Storage Element:&lt;br /&gt;
::&amp;lt;pre&amp;gt; gfal-copy gsiftp://eosuserftp.cern.ch/eos/user/r/rougny/Documents/desktop.ini davs://maite.iihe.ac.be:2880/pnfs/iihe/cms/store/user/rougny/ &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* To delete a file on remote server&lt;br /&gt;
::&amp;lt;pre&amp;gt; gfal-rm srm://maite.iihe.ac.be:8443/pnfs/iihe/cms/store/user/$USER/MyFile.root &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* To remove a directory and its entire contents on a remote server (this may not work at present):&lt;br /&gt;
::&amp;lt;pre&amp;gt; gfal-rm -r srm://maite.iihe.ac.be:8443/pnfs/iihe/cms/store/user/$USER/NewDir &amp;lt;/pre&amp;gt;&lt;br /&gt;
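The examples above can be combined in a short script that copies a file and then verifies it arrived. This is a sketch only; &#039;copy_and_check&#039; is a hypothetical helper and the paths are the illustrative ones from this page:

```shell
# Sketch: copy a local file to the storage element, then list the destination
# to confirm the transfer.
SRC=file:///user/$USER/MyFile.root
DST=davs://maite.iihe.ac.be:2880/pnfs/iihe/cms/store/user/$USER/
copy_and_check() {
  gfal-copy "$1" "$2" || return 1
  gfal-ls "$2"
}
# Usage (hypothetical): copy_and_check "$SRC" "$DST"
```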
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Copy more than 1 file ==&lt;br /&gt;
&lt;br /&gt;
==== Copy Directories ====&lt;br /&gt;
&lt;br /&gt;
You can easily copy whole directories to/from our site using gfal commands. &amp;lt;br&amp;gt;&lt;br /&gt;
It is usually much faster than using scp or rsync commands.&lt;br /&gt;
 gfal-copy -r [--dry-run] gsiftp://eosuserftp.cern.ch/eos/user/r/rougny/Documents/ davs://maite.iihe.ac.be:2880/pnfs/iihe/cms/store/user/rougny/&lt;br /&gt;
&lt;br /&gt;
The magic option is &#039;&#039;&#039;-r&#039;&#039;&#039; for recursive copying.&amp;lt;br&amp;gt;&lt;br /&gt;
When you are sure you get what you want, remove the --dry-run option.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that by default, gfal-copy will not overwrite files already present at the destination. This means it is usually safe to run the command several times.&amp;lt;br&amp;gt;&lt;br /&gt;
If you want to force the copy over files already there, add the &#039;&#039;&#039;-f&#039;&#039;&#039; option to your gfal-copy command.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
==== Bulk copy from a list of files ====&lt;br /&gt;
&lt;br /&gt;
There is an elegant way to run gfal-copy through several files. This is done using the &#039;&#039;&#039;--from-file&#039;&#039;&#039; option.&lt;br /&gt;
&amp;lt;pre&amp;gt; gfal-copy -f [--dry-run] --from-file files.txt file://location/to/store/ &amp;lt;/pre&amp;gt;&lt;br /&gt;
where files.txt is a file where every line is a source like:&lt;br /&gt;
&amp;lt;pre&amp;gt;davs://maite.iihe.ac.be:2880/pnfs/iihe/cms/store/user/odevroed/eosTransfer-1.root&lt;br /&gt;
davs://maite.iihe.ac.be:2880/pnfs/iihe/cms/store/user/odevroed/eosTransfer-2.root&lt;br /&gt;
... &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Run a test with a single line in files.txt and make sure the url is OK for both source and destination before running over several files.&amp;lt;br&amp;gt;&lt;br /&gt;
When you are sure you get what you want, remove the --dry-run option.&lt;br /&gt;
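One way to build such a files.txt is from a gfal-ls listing of the source directory. The sketch below only does the text manipulation; the base url and username are purely illustrative:

```shell
# Sketch: prefix each bare filename from a listing with the full davs:// url,
# producing one source per line as --from-file expects.
BASE=davs://maite.iihe.ac.be:2880/pnfs/iihe/cms/store/user/odevroed
make_list() {
  sed "s|^|$BASE/|"
}
# Usage (hypothetical): gfal-ls $BASE | make_list > files.txt
```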
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
==== WebFTS web interface ====&lt;br /&gt;
If you prefer using a web interface to copy files to/from our /pnfs, you can use CERN&#039;s [https://webfts.cern.ch/ WebFTS feature].&amp;lt;br&amp;gt;&lt;br /&gt;
Not all experiments are allowed to use it, but you can always make a request to have it included.&lt;br /&gt;
&lt;br /&gt;
== Other ways to access the mass storage system ==&lt;br /&gt;
=== Read and copy access ===&lt;br /&gt;
&lt;br /&gt;
As stated in the introduction, dCache is an immutable file system, therefore files cannot be changed once they are written.&lt;br /&gt;
Files can be accessed from pnfs in several ways without the requirement of a grid certificate and grid tools.&lt;br /&gt;
&lt;br /&gt;
* Via the regular &#039;cp&#039; command (prefer the rsync command below):&lt;br /&gt;
::&amp;lt;pre&amp;gt;cp /pnfs/iihe/cms/store/user/odevroed/DQMfile_83_1_hF2.root /user/odevroed &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Via the regular &#039;rsync&#039; command:&lt;br /&gt;
::&amp;lt;pre&amp;gt;rsync -aP /pnfs/iihe/cms/store/user/odevroed/*.root /user/odevroed/&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Via the dcache copy command (dccp):&lt;br /&gt;
::&amp;lt;pre&amp;gt;dccp dcap://maite.iihe.ac.be/pnfs/iihe/cms/store/user/odevroed/DQMfile_83_1_hF2.root ./ &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* To open files directly using root, use eg&lt;br /&gt;
::&amp;lt;pre&amp;gt;root dcap://maite.iihe.ac.be/pnfs/iihe/some/file.root &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
::When reading root files, if access is rather slow or does not work at all, and nothing is wrong with the root file itself (e.g. in an interactive analysis on the mX machines), you can increase your dCache readahead buffer. Don&#039;t make the buffer larger than 50MB!&lt;br /&gt;
::To enlarge the buffer, set this in your environment:&amp;lt;br&amp;gt;&lt;br /&gt;
::&#039;&#039;&#039;For csh&#039;&#039;&#039;&lt;br /&gt;
:::&amp;lt;pre&amp;gt;setenv DCACHE_RAHEAD 1&amp;lt;/pre&amp;gt;&lt;br /&gt;
:::&amp;lt;pre&amp;gt;setenv DCACHE_RA_BUFFER 50000000&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
::&#039;&#039;&#039;For bash&#039;&#039;&#039;&lt;br /&gt;
:::&amp;lt;pre&amp;gt;export DCACHE_RAHEAD=true&amp;lt;/pre&amp;gt;&lt;br /&gt;
:::&amp;lt;pre&amp;gt;export DCACHE_RA_BUFFER=50000000&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Via the &#039;curl&#039; command over https&lt;br /&gt;
:&#039;&#039;&#039;Copy from /pnfs:&#039;&#039;&#039;&lt;br /&gt;
::&amp;lt;pre&amp;gt;curl -L --cert $X509_USER_PROXY --key $X509_USER_PROXY --cacert $X509_USER_PROXY --capath $X509_CERT_DIR  -O https://maite.iihe.ac.be:2880/pnfs/iihe/cms/store/user/odevroed/testing_transfer&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:&#039;&#039;&#039;Copy to /pnfs:&#039;&#039;&#039;&lt;br /&gt;
::&amp;lt;pre&amp;gt;curl -L --cert $X509_USER_PROXY --key $X509_USER_PROXY --cacert $X509_USER_PROXY --capath $X509_CERT_DIR  -T testing_transfer  https://maite.iihe.ac.be:2880/pnfs/iihe/cms/store/user/odevroed/testing_transfer_2&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
::This is equivalent to issuing the gfal-cp command via the https protocol:&lt;br /&gt;
::&amp;lt;pre&amp;gt;gfal-copy testing_transfer https://maite.iihe.ac.be:2880/pnfs/iihe/cms/store/user/odevroed/testing_transfer2&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=Cluster_Overview&amp;diff=1412</id>
		<title>Cluster Overview</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=Cluster_Overview&amp;diff=1412"/>
		<updated>2025-03-03T08:17:39Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Data Storage &amp;amp; Directory Structure */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
&lt;br /&gt;
The cluster is composed of 3 groups of machines :&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
* The &#039;&#039;&#039;User Interfaces (UI)&#039;&#039;&#039;&lt;br /&gt;
::This is the cluster front-end: to use the cluster, you need to log into these machines&lt;br /&gt;
::::Servers : mshort [ m2 , m3 ] , mlong [ m0, m1 ]&lt;br /&gt;
:: The &#039;&#039;&#039;File Server&#039;&#039;&#039; provides the user home on the UIs. It is a highly efficient &amp;amp; redundant storage node of ~120 TB capacity with regular backups.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
* The &#039;&#039;&#039;Computing Machines&#039;&#039;&#039;&lt;br /&gt;
** The &#039;&#039;&#039;Computing Element (CE):&#039;&#039;&#039; This is the gateway between the World and the T2B cluster: it receives all Grid jobs and submits them to the local batch system.&lt;br /&gt;
::::Servers : testumd-htcondorce (temporary)&lt;br /&gt;
&lt;br /&gt;
:* The &#039;&#039;&#039;HTCondor Schedulers:&#039;&#039;&#039; This is the brain of the batch system: they manage all the submitted jobs, and send them to the worker nodes.&lt;br /&gt;
::::Servers : scheddXX&lt;br /&gt;
&lt;br /&gt;
:* The &#039;&#039;&#039;Worker Nodes (WN): &#039;&#039;&#039; This is the power of the cluster : they run multiple jobs in parallel and send the results &amp;amp; status back to the CE.&lt;br /&gt;
::::Servers : nodeXX-YY&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
* The &#039;&#039;&#039;Mass Storage&#039;&#039;&#039;&lt;br /&gt;
** The &#039;&#039;&#039;Storage Element&#039;&#039;&#039;: it is the brain of the cluster storage. Grid accessible, it knows where all the files are, and manages all the storage nodes.&lt;br /&gt;
::::Server : maite&lt;br /&gt;
:* The &#039;&#039;&#039;Storage Nodes&#039;&#039;&#039;: This is the memory of the cluster : they contain big data files. In total, they provide ~8400 TB of grid-accessible storage.&lt;br /&gt;
::::Servers : beharXXX&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== How to Connect ==&lt;br /&gt;
&lt;br /&gt;
To connect to the cluster, you need to have sent us your public ssh key.&lt;br /&gt;
In a terminal, type the following (adapt &amp;lt;MYLOGIN&amp;gt; accordingly WITHOUT the brackets &amp;lt;&amp;gt;):&lt;br /&gt;
 ssh -X -o ServerAliveInterval=100 &amp;lt;MYLOGIN&amp;gt;@mshort.iihe.ac.be&lt;br /&gt;
:&#039;&#039;Tip: the &#039;&#039;-o ServerAliveInterval=100&#039;&#039; option is used to keep your session alive for a long period of time ! You should not be disconnected during a whole day of work.&#039;&#039;&lt;br /&gt;
:&#039;&#039;Tip: use aliases to connect easily! e.g. add to your &#039;&#039;~/.bashrc&#039;&#039; file the following: &#039;&#039;alias mshort=&#039;ssh -X -o ServerAliveInterval=100 &amp;lt;MYLOGIN&amp;gt;@mshort.iihe.ac.be&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
If connecting does not work, please follow the help [[Faq_t2b#Debugging_SSH_connection_to_mX_machines:|here]]. After a successful login, you&#039;ll see this message :&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;span style=&#039;color:green&#039;&amp;gt; 			(: Welcome to the T2B Cluster :) &amp;lt;/span&amp;gt;&lt;br /&gt;
 &amp;lt;span style=&#039;color:green&#039;&amp;gt;			________________________________&amp;lt;/span&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
 &amp;lt;span style=&#039;color:green&#039;&amp;gt;			The cluster is working properly&amp;lt;/span&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
 ___________________________________________________________________________&amp;lt;br&amp;gt;&lt;br /&gt;
  Mail: &amp;lt;span style=&#039;color:blue&#039;&amp;gt;  grid_admin@listserv.vub.be&amp;lt;/span&amp;gt;   |  Chat: &amp;lt;span style=&#039;color:blue&#039;&amp;gt;  https://chat.iihe.ac.be&amp;lt;/span&amp;gt;&lt;br /&gt;
  Wiki: &amp;lt;span style=&#039;color:blue&#039;&amp;gt;  https://t2bwiki.iihe.ac.be&amp;lt;/span&amp;gt; |  Status: &amp;lt;span style=&#039;color:blue&#039;&amp;gt;https://status.iihe.ac.be&amp;lt;/span&amp;gt; &lt;br /&gt;
 ___________________________________________________________________________&amp;lt;br&amp;gt;&lt;br /&gt;
  &amp;lt;span style=&#039;color:cyan&#039;&amp;gt;[/user]&amp;lt;/span&amp;gt;  =&amp;gt;  224 / 500 GB  &amp;lt;span style=&#039;color:green&#039;&amp;gt;[44%]&amp;lt;/span&amp;gt;  --|--  &amp;lt;span style=&#039;color:cyan&#039;&amp;gt;[/pnfs]&amp;lt;/span&amp;gt;  =&amp;gt;  &amp;lt;span style=&#039;color:green&#039;&amp;gt;101 GB&amp;lt;/span&amp;gt;  [01/12/2023]&lt;br /&gt;
 ___________________________________________________________________________&amp;lt;br&amp;gt;&lt;br /&gt;
  &amp;lt;span style=&#039;color:blue&#039;&amp;gt;Welcome on [m7]&amp;lt;/span&amp;gt; ! You have &amp;lt;span style=&#039;color:purple&#039;&amp;gt;3600s (1 hours)&amp;lt;/span&amp;gt; of cpu time per processes.&lt;br /&gt;
  There are &amp;lt;span style=&#039;color:green&#039;&amp;gt;2 users&amp;lt;/span&amp;gt; here   |   Load: &amp;lt;span style=&#039;color:red&#039;&amp;gt;7.56 /4 CPUs (189%)&amp;lt;/span&amp;gt;  |   Mem: &amp;lt;span style=&#039;color:green&#039;&amp;gt;16% used&amp;lt;/span&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please observe all the information in this message:&lt;br /&gt;
* The header, telling you the health of the cluster. When there is an issue, the header of the welcome message will transform to:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;span style=&#039;color:red&#039;&amp;gt; 			:( Welcome to the T2B Cluster ): &amp;lt;/span&amp;gt;&lt;br /&gt;
 &amp;lt;span style=&#039;color:red&#039;&amp;gt;			________________________________&amp;lt;/span&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
 &amp;lt;span style=&#039;color:red&#039;&amp;gt;			THERE ARE ISSUES ON THE CLUSTER&amp;lt;/span&amp;gt;&lt;br /&gt;
 &amp;lt;span style=&#039;color:red&#039;&amp;gt;			More details at &amp;lt;/span&amp;gt;&amp;lt;span style=&#039;color:magenta&#039;&amp;gt;status.iihe.ac.be&amp;lt;/span&amp;gt;&lt;br /&gt;
 &amp;lt;span style=&#039;color:red&#039;&amp;gt;			(Register to receive updates)&amp;lt;/span&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The email used for cluster support (please use this one rather than a personal address, so that everyone on the support team can answer and track progress).&lt;br /&gt;
* The wiki link, where you should go first to find information&lt;br /&gt;
* The chat link, where you can easily contact us for fast exchanges. IIHE users can use their intranet account, others can just create an account.&lt;br /&gt;
* The status link, where you can see if the cluster has any problems reported. Please make sure you are registered to receive updates.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
* The space used on the mass storage /pnfs, where storing a few TB is no problem. No hard limits are applied, but please contact us if you plan to go over 20 TB!&lt;br /&gt;
* The quota used on /user (and /group). Here a hard limit is applied, so if you are at 100%, you will have many problems. Clean your space, and if you really need more contact us.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
* The cpu time limit imposed per process, as we divided our UIs into 2 groups. Please note &#039;&#039;&#039;processes will be killed&#039;&#039;&#039; if they go over their CPU-time limit!&lt;br /&gt;
:: &#039;&#039;&#039;The light task&#039;&#039;&#039; UIs &amp;lt;span style=&#039;color:red&#039;&amp;gt;(max &#039;&#039;&#039;CPU&#039;&#039;&#039; time = 20 minutes)&amp;lt;/span&amp;gt; : they are used for crab/local job submission, writing code,  debugging ...&lt;br /&gt;
::&amp;lt;pre&amp;gt;mshort.iihe.ac.be :  m2.iihe.ac.be, m3.iihe.ac.be &amp;lt;/pre&amp;gt;&lt;br /&gt;
:: &#039;&#039;&#039;The CPU-intensive&#039;&#039;&#039; UIs &amp;lt;span style=&#039;color:red&#039;&amp;gt;(max &#039;&#039;&#039;CPU&#039;&#039;&#039; time = 5 hours)&amp;lt;/span&amp;gt; : they are available for CPU-intensive and testing tasks/workflows, although you should prefer using local job submission ...&lt;br /&gt;
::&amp;lt;pre&amp;gt;mlong.iihe.ac.be : m0.iihe.ac.be, m1.iihe.ac.be&amp;lt;/pre&amp;gt;&lt;br /&gt;
* Information about how heavily this UI is used. If any of these indicators is red (i.e. above optimal usage), please consider using another UI. Please be mindful of other users and don&#039;t start too many processes, especially if the UI is already under load.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
* Sometimes announcements are printed at the end. Please make sure you read those.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Data Storage &amp;amp; Directory Structure ==&lt;br /&gt;
&lt;br /&gt;
There are 2 main directories to store your work and data:&lt;br /&gt;
* &#039;&#039;&#039;/user [/$USER]&#039;&#039;&#039; : this is your home directory. You have an enforced quota there, as it is an expensive storage with redundancy and daily backups (see below).&lt;br /&gt;
* &#039;&#039;&#039;/pnfs [/iihe/MYEXP/store/user/$USER]&#039;&#039;&#039; : this is where you can store a large amount of data, and is also [[GridStorageAccess|grid-accessible]]. If you need more than a few TB, please contact us. There are no backups there, so be careful with what you do !&lt;br /&gt;
** &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; /pnfs is an immutable file system. This means that once data is written, it cannot be changed anymore. Therefore you should &#039;&#039;&#039;not&#039;&#039;&#039; put your scripts in this area.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
There are other directories that you might want to take notice of:&lt;br /&gt;
* &#039;&#039;&#039;/group&#039;&#039;&#039; : same as /user , but if you need to share/produce in a group.&lt;br /&gt;
* &#039;&#039;&#039;/scratch&#039;&#039;&#039; : a temporary scratch space for your job. Use $TMPDIR on the WNs, it is cleaned after each job :)&lt;br /&gt;
* &#039;&#039;&#039;/cvmfs&#039;&#039;&#039; : Centralised CVMFS software repository. It should contain most of the software you will need for your experiment. Find [[OtherSoftware|here]] how to get a coherent environment for most tools you will need.&lt;br /&gt;
* &#039;&#039;&#039;/software&#039;&#039;&#039; : local area for shared software not in /cvmfs . You can use a [[OtherSoftware|nice tool]] to find the software and versions available.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Batch System ==&lt;br /&gt;
&lt;br /&gt;
The cluster is based on HTCondor (also used at CERN or Wisconsin for instance).&lt;br /&gt;
Please follow [[HTCondor|this page]] for details on how to use it.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| width=&amp;quot;1064&amp;quot; cellspacing=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; border=&amp;quot;1&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;row&amp;quot; | Description&lt;br /&gt;
| nowrap=&amp;quot;nowrap&amp;quot; align=&amp;quot;center&amp;quot; | HTCondor batch resources&amp;lt;br&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;row&amp;quot; | # CPU&#039;s (Jobs)&lt;br /&gt;
| nowrap=&amp;quot;nowrap&amp;quot; align=&amp;quot;center&amp;quot; | 10700&amp;lt;br&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;row&amp;quot; | Walltime limit&lt;br /&gt;
| nowrap=&amp;quot;nowrap&amp;quot; align=&amp;quot;center&amp;quot; | 168 hours = 1 week &lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;row&amp;quot; | Preferred Memory per job&lt;br /&gt;
| nowrap=&amp;quot;nowrap&amp;quot; align=&amp;quot;center&amp;quot; | 4 GB&amp;lt;br&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;row&amp;quot; | $TMPDIR/scratch max usable space&lt;br /&gt;
| nowrap=&amp;quot;nowrap&amp;quot; align=&amp;quot;center&amp;quot; | 10-20 GB&amp;lt;br&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;row&amp;quot; | Max # jobs sent to the batch system / User&lt;br /&gt;
| nowrap=&amp;quot;nowrap&amp;quot; align=&amp;quot;center&amp;quot; | theoretically none (contact us if you plan on sending more than 10 000) &amp;lt;br&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Backup ==&lt;br /&gt;
There are several areas that we regularly back up: &#039;&#039;&#039;/user&#039;&#039;&#039; , &#039;&#039;&#039;/group&#039;&#039;&#039; , &#039;&#039;&#039;/ice3&#039;&#039;&#039;.&amp;lt;br&amp;gt;&lt;br /&gt;
You can find more information on the backup frequency and how to access them [[Backup|here]].&lt;br /&gt;
&lt;br /&gt;
== Useful links ==&lt;br /&gt;
[http://ganglia.iihe.ac.be/ganglia/  Ganglia Monitoring] : stats on all our servers.&amp;lt;br&amp;gt;&lt;br /&gt;
[http://status.iihe.ac.be Cluster Status] : current status of all T2B services. Check here before sending us an email. Please also consider registering so you receive notifications of T2B issues and are informed when they are resolved.&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=HTCFirstSubmissionGuide&amp;diff=1411</id>
		<title>HTCFirstSubmissionGuide</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=HTCFirstSubmissionGuide&amp;diff=1411"/>
		<updated>2025-02-18T09:58:24Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== First time submitting a job ==&lt;br /&gt;
For new users, we recommend following [https://indico.cern.ch/event/936993/contributions/4022073/attachments/2105538/3540926/2020-Koch-User-Tutorial.pdf this presentation], that should give you an idea of how to submit jobs.&amp;lt;br&amp;gt;&lt;br /&gt;
Then to practice the basics of job submission on an HTCondor cluster, some exercises are proposed on this [https://en.wikitolearn.org/Course:HTCondor Wiki].&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;T2B Specifics&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
- &#039;&#039;&#039;File Transfers:&#039;&#039;&#039;&lt;br /&gt;
: Please note that contrary to what is usually shown in documentation and examples, we recommend not using HTCondor file transfer mechanisms (&#039;&#039;&#039;should_transfer_files = NO&#039;&#039;&#039;) and copying files yourself within your script.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- &#039;&#039;&#039;Always adapt requested resources to your job&#039;&#039;&#039;&lt;br /&gt;
: You need to adapt the resources you request to what you estimate your job will need. Requesting more than what you really need is wasteful and deprives your fellow users of resources they might require.&lt;br /&gt;
: To do so, just add to your submit file the following lines:&lt;br /&gt;
&amp;lt;pre&amp;gt;request_cpus = 1&lt;br /&gt;
request_memory = 200MB&lt;br /&gt;
request_disk = 1GB&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: Note that if a job needs more than 1 CPU / 4 GB of memory / 10 GB of disk, please be careful and make sure you know what you are doing.&lt;br /&gt;
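Putting the recommendations above together, a minimal submit file could look like the sketch below. The executable and file names are placeholder examples, not official T2B values:

```
# Minimal HTCondor submit file sketch (hypothetical names)
universe              = vanilla
executable            = myjob.sh
should_transfer_files = NO

# Adapt these to what your job actually needs:
request_cpus   = 1
request_memory = 200MB
request_disk   = 1GB

output = myjob.out
error  = myjob.err
log    = myjob.log
queue
```

With should_transfer_files = NO, your script is expected to do its own copying to/from the shared areas (/user, /pnfs).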
&lt;br /&gt;
&lt;br /&gt;
- &#039;&#039;&#039;Where am I when I start a job ?&#039;&#039;&#039;&lt;br /&gt;
:You should always prefer using $HOME, which in a job equates to a unique local directory in the /scratch of the local disk, e.g. /scratch/condor/dir_275928.&lt;br /&gt;
: $TMPDIR = $HOME/tmp, so it is also on the local disk and unique&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- &#039;&#039;&#039;Efficient use of local disks on worker nodes&#039;&#039;&#039;&lt;br /&gt;
:Note that local disks are now exclusively NVMes, which are much, much faster than the network protocols used to write to /pnfs or /user.&amp;lt;br&amp;gt;&lt;br /&gt;
:So for repeated reads (like cycling through events in files of O(1GB)) it is more efficient to copy the file locally first and then open it.&amp;lt;br&amp;gt;&lt;br /&gt;
:The same goes for writes: prefer writing locally, then copying the file to /pnfs.&lt;br /&gt;
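As an illustration, the stage-in/stage-out pattern described above could be sketched as follows in a job script. The /pnfs paths are placeholders; here a temporary directory and a dummy file stand in for the worker-node scratch and the real input:

```shell
#!/bin/sh
# Copy-locally pattern: stage input to fast local scratch, work on it
# there, and copy results back once at the end.
scratch=$(mktemp -d)                 # stands in for $TMPDIR on a worker node
# 1. Stage the input once (real job: cp /pnfs/iihe/.../input.root "$scratch/"):
printf 'dummy payload' > "$scratch/input.root"
# 2. All repeated reads hit the local copy, not the network:
size=$(wc -c < "$scratch/input.root")
echo "local copy is $size bytes"
# 3. Write output locally, then copy it back once (real job: cp ... /pnfs/...):
cp "$scratch/input.root" "$scratch/output.root"
```

The point of the pattern is that the network transfer happens exactly twice (one stage-in, one stage-out), no matter how many times the file is read or rewritten in between.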
&lt;br /&gt;
&lt;br /&gt;
- &#039;&#039;&#039;Can I have a shell/interactive job on the batch system ?&#039;&#039;&#039;&lt;br /&gt;
:Yes! If you want to make tests, or run things interactively, with dedicated core/memory for you, just run:&lt;br /&gt;
 condor_submit -i&lt;br /&gt;
:Note that if you want to reserve more than the standard 1 core / 600MB of memory, simply add your request_* [specified just above] like this:&lt;br /&gt;
 condor_submit -interactive request_cpus=2&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- &#039;&#039;&#039;Sending DAG jobs&#039;&#039;&#039;&lt;br /&gt;
:For now, sending DAG jobs only works when you are directly on the scheduler, so first add your ssh key to your local ssh agent and then connect to any mX machine with the -A (agent forwarding) option:&lt;br /&gt;
 ssh-add&lt;br /&gt;
 ssh -A mshort.iihe.ac.be&lt;br /&gt;
:then connect to the scheduler:&lt;br /&gt;
 ssh schedd02.wn.iihe.ac.be&lt;br /&gt;
:there you can run the condor_submit_dag commands and it will work. Note that condor_q commands to check the progress of your DAG jobs can still be run on the mX machines.&lt;br /&gt;
&lt;br /&gt;
:&#039;&#039;&#039;NB:&#039;&#039;&#039; as we now have 2 OSes, some requirements - &#039;&#039;to put in your .sub file&#039;&#039; - are necessary to ensure your jobs arrive on the right cluster:&lt;br /&gt;
&amp;lt;pre&amp;gt;For EL7 workflows:&lt;br /&gt;
requirements = TARGET.OpSysAndVer == &amp;quot;CentOS7&amp;quot;&lt;br /&gt;
&lt;br /&gt;
For EL9 workflows:&lt;br /&gt;
requirements = TARGET.OpSysAndVer == &amp;quot;AlmaLinux9&amp;quot;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== What is my job doing ? Is something wrong ? ===&lt;br /&gt;
For debugging jobs in queue or what jobs running are doing, please follow our [[HTCondorDebug|HTCondor Debug]] section&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== HTCondor Official Documentation ===&lt;br /&gt;
Have a look at the [https://htcondor.readthedocs.io/en/latest/users-manual/index.html official User Manual] on the HTCondor website.&amp;lt;br&amp;gt;&lt;br /&gt;
It is very well done and explains all available features.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== HTCondor Workshop Presentation ===&lt;br /&gt;
Every 6 months, there is an HTCondor workshop. Presentations are usually very helpful, especially if you want to go into details of HTCondor (API, DAGMan, ...). You can find the agenda of the latest one [https://indico.cern.ch/event/936993/timetable/#20200921.detailed here]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=Faq_t2b&amp;diff=1410</id>
		<title>Faq t2b</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=Faq_t2b&amp;diff=1410"/>
		<updated>2024-11-19T14:19:37Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Debugging SSH connection to mX machines: */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
=== List of the UIs / mX machines: ===&lt;br /&gt;
- mshort: m2 , m3 =&amp;gt; 20 minutes of CPU time per process &amp;lt;br&amp;gt;&lt;br /&gt;
- mlong: m0, m1 =&amp;gt; 5 hours of CPU time per process&lt;br /&gt;
&lt;br /&gt;
=== Keep ssh connection to UI open: ===&lt;br /&gt;
Add option  &#039; &#039;&#039;&#039;-o ServerAliveInterval=100&#039;&#039;&#039; &#039; to your ssh command&lt;br /&gt;
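Alternatively, the option can be made permanent with a host entry in your ~/.ssh/config; the alias name and login below are placeholders:

```
Host mshort
    HostName mshort.iihe.ac.be
    User MYLOGIN
    ServerAliveInterval 100
```

With such an entry, plain &#039;ssh mshort&#039; picks up the keep-alive setting automatically.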
&lt;br /&gt;
&lt;br /&gt;
=== Debugging SSH connection to mX machines: ===&lt;br /&gt;
# Check permissions on ssh keys on your laptop:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ll $HOME/.ssh&lt;br /&gt;
-rw------- 1 rougny rougny    411 avr 29  2019 id_ed25519&lt;br /&gt;
-rw-r--r-- 1 rougny rougny    102 avr 29  2019 id_ed25519.pub&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: To have the correct permissions:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
chmod 600 $HOME/.ssh/id_ed25519&lt;br /&gt;
chmod 644 $HOME/.ssh/id_ed25519.pub&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: 2a. If that does not fix it, send us the output of those commands via chat/email, as well as the content of your public key to crosscheck with what is in our system:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt; ll $HOME/.ssh&lt;br /&gt;
&amp;gt; date &amp;amp;&amp;amp; ssh -vvv MYUSERNAME@m3.iihe.ac.be     &amp;lt;-- it needs to be a specific machine (not mshort/mlong) so that we can read the logs!&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: 2b. Also send us your public IPv4 address (found e.g. by visiting https://www.whatismyip.com/) so that we can track your connection in the logs.&lt;br /&gt;
: 2c. Just in case something went wrong: send us your public ssh key (the one ending in .pub)!&lt;br /&gt;
&lt;br /&gt;
=== MadGraph taking all the cores of a workernode ===&lt;br /&gt;
The default setting for MadGraph is to use all available cores. This kills the site.&lt;br /&gt;
&lt;br /&gt;
That is why you need to uncomment and set 2 variables in the &#039;&#039;&#039;mg5_configuration.txt&#039;&#039;&#039; file (not the &#039;&#039;&#039;dat&#039;&#039;&#039; file), &#039;&#039;&#039;run_mode&#039;&#039;&#039; &amp;amp; &#039;&#039;&#039;nb_core&#039;&#039;&#039;.&amp;lt;br&amp;gt;&lt;br /&gt;
The run mode should be set to 0, single machine, via:&lt;br /&gt;
 run_mode = 0&lt;br /&gt;
&lt;br /&gt;
If the number of cores used by MadGraph is higher than 1, this needs to be requested from the job scheduler with the following directive added to your HTCondor submit file:&lt;br /&gt;
&amp;lt;pre&amp;gt; request_cpus = 2 &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To tell MadGraph the number of cores it can use per job, use the following recipe:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
./bin/mg5_aMC &lt;br /&gt;
set nb_core 1  #or 2 or whatever you want&lt;br /&gt;
save options&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
or in the &#039;&#039;&#039;mg5_configuration.txt&#039;&#039;&#039;:&lt;br /&gt;
 nb_core = 1&lt;br /&gt;
Note &#039;nb_core&#039; and &#039;request_cpus&#039; must always be set to the same value! &amp;lt;br&amp;gt;&lt;br /&gt;
Note also that if you ask for more than one core, your time in the queue will probably be longer, as the scheduler needs to find the right number of free slots on a single machine. &amp;lt;br&amp;gt;&lt;br /&gt;
We advise against putting this number higher than one unless you really need it for parallel jobs.&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=T2B_News&amp;diff=1409</id>
		<title>T2B News</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=T2B_News&amp;diff=1409"/>
		<updated>2024-11-18T12:50:51Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== [14/10/2024] New OS supported: AlmaLinux 9 (EL9) ==&lt;br /&gt;
&lt;br /&gt;
As of June 2024, the current OS on the cluster, CentOS 7 (EL7), is not supported anymore, i.e. it does not receive any new security patches.&amp;lt;br&amp;gt;&lt;br /&gt;
We have therefore started the migration to an EL9 variant, AlmaLinux 9.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;!! We ask that you start migrating your workflows to EL9, as at some point we will decommission all EL7-related services !!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will have a transition period where you can access both versions of the OS:&lt;br /&gt;
* 4 mX machines in &#039;&#039;&#039;EL7&#039;&#039;&#039;:&lt;br /&gt;
** m0 &amp;amp; m1 (mlong), big hardware machines with 5h of CPU time and 56 cores&lt;br /&gt;
** m2 &amp;amp; m3 (mshort), virtual machines with 20min of CPU time, only meant for coding&lt;br /&gt;
&lt;br /&gt;
* 3 mX machines in &#039;&#039;&#039;EL9&#039;&#039;&#039;:&lt;br /&gt;
** m9, a big hardware machine with 5h of CPU time and 56 cores&lt;br /&gt;
** m10 &amp;amp; m11, virtual machines with 20min of CPU time, only meant for coding&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note that this is emphasized in the &amp;quot;message of the day&amp;quot; when connecting via ssh.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The batch system also has now ~3000 slots in EL9, and we will continue migrating compute nodes.&amp;lt;br&amp;gt;&lt;br /&gt;
Please be patient as your jobs might spend more time in queue until transition is over.&amp;lt;br&amp;gt;&lt;br /&gt;
We will monitor and slowly give EL9 more resources with time.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Note that to submit jobs to EL9 compute nodes, all you have to do is do so from the EL9 mX machines.&amp;lt;br&amp;gt;&lt;br /&gt;
All mX machines automatically append their OS version to your job as requirement for both EL7 &amp;amp; EL9, eg:&lt;br /&gt;
 TARGET.OpSysMajorVer == 9&lt;br /&gt;
and talk to a different scheduler, schedd02 for EL7 mX and schedd03 for EL9.&lt;br /&gt;
&lt;br /&gt;
You just have to be careful and adapt what you source for your environment.&amp;lt;br&amp;gt;&lt;br /&gt;
Note that containers will still be available if you need to run EL7 workflows, even on EL9 machines.&lt;br /&gt;
&lt;br /&gt;
Do not hesitate to contact us if you have any questions or encounter problems !&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== [23/10/2024] new policies on mX machines: daily reboots and VSCode usage ==&lt;br /&gt;
&lt;br /&gt;
In view of the recurrent slowness of some mX machines, along with their heavy usage, we will set some new policies.&lt;br /&gt;
&lt;br /&gt;
1/ ALL mX machines will be rebooted &#039;&#039;&#039;every day at 5AM Brussels Time&#039;&#039;&#039;.&lt;br /&gt;
If you need to have processes that last longer, use the batch system.&lt;br /&gt;
&lt;br /&gt;
2/ VSCode has seen an increase in usage. Unfortunately, it seems to be heavy on resources, making some mX machines (e.g. m2 / m3) nearly unusable when several users open VSCode sessions.&lt;br /&gt;
Therefore any VSCode connection now has to go through only m0, m1 or m9. Starting next week, we will just kill any instance found on m2, m3, m10, m11.&lt;br /&gt;
&lt;br /&gt;
3/ mX machines are being reshuffled, with more coming to provide the new EL9 OS.&lt;br /&gt;
For now, m10 &amp;amp; m11 can be used to test EL9 code deployment and edit code.&lt;br /&gt;
No EL9 resources are deployed yet in the batch system, so jobs sent from m10 or m11 will  just stay idle.&lt;br /&gt;
More news on this will be coming in the following week(s)&lt;br /&gt;
&lt;br /&gt;
4/ Those policies are on a trial phase, and might be adapted in the future.&lt;br /&gt;
We welcome any comments or ideas on the situation !&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== [22/05/2017] Retirement of the mon.iihe.ac.be service ==&lt;br /&gt;
&lt;br /&gt;
We are retiring the server mon.iihe.ac.be . To get access to the services previously hosted on mon, please use the following new links:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Jobview&#039;&#039;&#039; : http://mon.iihe.ac.be/jobview/ ==&amp;gt; http://jobview.iihe.ac.be&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Your user dir&#039;&#039;&#039; : http://mon.iihe.ac.be/~USERNAME  ==&amp;gt; http://homepage.iihe.ac.be/~USERNAME , (http://mon.iihe.ac.be/~USERNAME will continue working for a while and redirect to the new server)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If something is missing or you have issues, please contact us !&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=T2B_News&amp;diff=1408</id>
		<title>T2B News</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=T2B_News&amp;diff=1408"/>
		<updated>2024-11-18T12:49:09Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== [14/10/2024] New OS supported: AlmaLinux 9 (EL9) ==&lt;br /&gt;
&lt;br /&gt;
As of June 2024, the current OS on the cluster, CentOS 7 (EL7), is not supported anymore, i.e. it does not receive any new security patches.&amp;lt;br&amp;gt;&lt;br /&gt;
We have therefore started the migration to an EL9 variant, AlmaLinux 9.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;!! We ask that you start migrating your workflows to EL9, as at some point we will decommission all EL7-related services !!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will have a transition period where you can access both versions of the OS:&lt;br /&gt;
* 4 mX machines in &#039;&#039;&#039;EL7&#039;&#039;&#039;:&lt;br /&gt;
** m0 &amp;amp; m1 (mlong), big hardware machines with 5h of CPU time and 56 cores&lt;br /&gt;
** m2 &amp;amp; m3 (mshort), virtual machines with 20min of CPU time, only meant for coding&lt;br /&gt;
&lt;br /&gt;
* 3 mX machines in &#039;&#039;&#039;EL9&#039;&#039;&#039;:&lt;br /&gt;
** m9, a big hardware machine with 5h of CPU time and 56 cores&lt;br /&gt;
** m10 &amp;amp; m11, virtual machines with 20min of CPU time, only meant for coding&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The batch system also has now ~3000 slots in EL9, and we will continue migrating compute nodes.&amp;lt;br&amp;gt;&lt;br /&gt;
Please be patient as your jobs might spend more time in queue until transition is over.&amp;lt;br&amp;gt;&lt;br /&gt;
We will monitor and slowly give EL9 more resources with time.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Note that to submit jobs to EL9 compute nodes, all you have to do is do so from the EL9 mX machines.&amp;lt;br&amp;gt;&lt;br /&gt;
All mX machines automatically append their OS version to your job as requirement for both EL7 &amp;amp; EL9, eg:&lt;br /&gt;
 TARGET.OpSysMajorVer == 9&lt;br /&gt;
and talk to a different scheduler, schedd02 for EL7 mX and schedd03 for EL9.&lt;br /&gt;
&lt;br /&gt;
You just have to be careful and adapt what you source for your environment.&amp;lt;br&amp;gt;&lt;br /&gt;
Note that containers will still be available if you need to run EL7 workflows, even on EL9 machines.&lt;br /&gt;
&lt;br /&gt;
Do not hesitate to contact us if you have any questions or encounter problems !&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== [23/10/2024] new policies on mX machines: daily reboots and VSCode usage ==&lt;br /&gt;
&lt;br /&gt;
In view of the recurrent slowness of some mX machines, along with their heavy usage, we will set some new policies.&lt;br /&gt;
&lt;br /&gt;
1/ ALL mX machines will be rebooted &#039;&#039;&#039;every day at 5AM Brussels Time&#039;&#039;&#039;.&lt;br /&gt;
If you need to have processes that last longer, use the batch system.&lt;br /&gt;
&lt;br /&gt;
2/ VSCode has seen an increase in usage. Unfortunately, it seems to be heavy on resources, making some mX machines (e.g. m2 / m3) nearly unusable when several users open VSCode sessions.&lt;br /&gt;
Therefore any VSCode connection now has to go through only m0, m1 or m9. Starting next week, we will just kill any instance found on m2, m3, m10, m11.&lt;br /&gt;
&lt;br /&gt;
3/ mX machines are being reshuffled, with more coming to provide the new EL9 OS.&lt;br /&gt;
For now, m10 &amp;amp; m11 can be used to test EL9 code deployment and edit code.&lt;br /&gt;
No EL9 resources are deployed yet in the batch system, so jobs sent from m10 or m11 will  just stay idle.&lt;br /&gt;
More news on this will be coming in the following week(s)&lt;br /&gt;
&lt;br /&gt;
4/ Those policies are on a trial phase, and might be adapted in the future.&lt;br /&gt;
We welcome any comments or ideas on the situation !&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== [22/05/2017] Retirement of the mon.iihe.ac.be service ==&lt;br /&gt;
&lt;br /&gt;
We are retiring the server mon.iihe.ac.be . To get access to the services previously hosted on mon, please use the following new links:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Jobview&#039;&#039;&#039; : http://mon.iihe.ac.be/jobview/ ==&amp;gt; http://jobview.iihe.ac.be&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Your user dir&#039;&#039;&#039; : http://mon.iihe.ac.be/~USERNAME  ==&amp;gt; http://homepage.iihe.ac.be/~USERNAME , (http://mon.iihe.ac.be/~USERNAME will continue working for a while and redirect to the new server)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If something is missing or you have issues, please contact us !&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=T2B_News&amp;diff=1407</id>
		<title>T2B News</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=T2B_News&amp;diff=1407"/>
		<updated>2024-11-14T12:37:17Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== [14/10/2024] New OS supported: AlmaLinux 9 (EL9) ==&lt;br /&gt;
&lt;br /&gt;
As of June 2024, the current OS on the cluster, CentOS 7 (EL7), is no longer supported, i.e. it does not receive any new security patches.&amp;lt;br&amp;gt;&lt;br /&gt;
We have therefore started the migration to an EL9 variant, AlmaLinux 9.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;!! We ask that you start migrating your workflows to EL9, as at some point we will decommission all EL7-related services !!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We will have a transition period where you can access both versions of the OS:&lt;br /&gt;
* 4 mX machines in &#039;&#039;&#039;EL7&#039;&#039;&#039;:&lt;br /&gt;
** m0 &amp;amp; m1 (mlong), big hardware machines with 5h of CPU time and 56 cores&lt;br /&gt;
** m2 &amp;amp; m3 (mshort), virtual machines with 20min of CPU time, only meant for coding&lt;br /&gt;
&lt;br /&gt;
* 3 mX machines in &#039;&#039;&#039;EL9&#039;&#039;&#039;:&lt;br /&gt;
** m9, a big hardware machine with 5h of CPU time and 56 cores&lt;br /&gt;
** m10 &amp;amp; m11, virtual machines with 20min of CPU time, only meant for coding&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The batch system also now has more than 1000 slots in EL9, and we will continue migrating compute nodes.&amp;lt;br&amp;gt;&lt;br /&gt;
Please be patient, as your jobs might spend more time in the queue until the transition is over.&amp;lt;br&amp;gt;&lt;br /&gt;
We will monitor and slowly give EL9 more resources with time.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Note that to submit jobs to EL9 compute nodes, all you have to do is submit them from the EL9 mX machines.&amp;lt;br&amp;gt;&lt;br /&gt;
All mX machines automatically append their OS version to your job as a requirement, for both EL7 &amp;amp; EL9, e.g.:&lt;br /&gt;
 TARGET.OpSysMajorVer == 9&lt;br /&gt;
They also talk to different schedulers: schedd02 for the EL7 mX machines and schedd03 for EL9.&lt;br /&gt;
&lt;br /&gt;
You just have to be careful to adapt what you source for your environment.&amp;lt;br&amp;gt;&lt;br /&gt;
Note that containers will still be available if you need to run EL7 workflows, even on EL9 machines.&lt;br /&gt;
&lt;br /&gt;
Do not hesitate to contact us if you have any questions or encounter problems !&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== [23/10/2024] new policies on mX machines: daily reboots and VSCode usage ==&lt;br /&gt;
&lt;br /&gt;
In view of the recurrent slowness of some mX machines, along with their heavy usage, we are putting some new policies in place.&lt;br /&gt;
&lt;br /&gt;
1/ ALL mX machines will be rebooted &#039;&#039;&#039;every day at 5AM Brussels Time&#039;&#039;&#039;.&lt;br /&gt;
If you need to have processes that last longer, use the batch system.&lt;br /&gt;
&lt;br /&gt;
2/ VSCode has seen an increase in usage. Unfortunately, it is heavy on resources, making some mX machines (e.g. m2 / m3) nearly unusable when several users open VSCode sessions.&lt;br /&gt;
Therefore, any VSCode connection now has to go through m0, m1 or m9 only. Starting next week, we will simply kill any instance found on m2, m3, m10 or m11.&lt;br /&gt;
&lt;br /&gt;
3/ mX machines are being reshuffled, with more coming to provide the new EL9 OS.&lt;br /&gt;
For now, m10 &amp;amp; m11 can be used to test EL9 code deployment and edit code.&lt;br /&gt;
No EL9 resources are deployed yet in the batch system, so jobs sent from m10 or m11 will just stay idle.&lt;br /&gt;
More news on this will be coming in the following week(s).&lt;br /&gt;
&lt;br /&gt;
4/ These policies are in a trial phase and might be adapted in the future.&lt;br /&gt;
We welcome any comments or ideas on the situation !&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== [22/05/2017] Retirement of the mon.iihe.ac.be service ==&lt;br /&gt;
&lt;br /&gt;
We are retiring the server mon.iihe.ac.be. To get access to the services previously hosted on mon, please use the following new links:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Jobview&#039;&#039;&#039; : http://mon.iihe.ac.be/jobview/ ==&amp;gt; http://jobview.iihe.ac.be&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Your user dir&#039;&#039;&#039; : http://mon.iihe.ac.be/~USERNAME  ==&amp;gt; http://homepage.iihe.ac.be/~USERNAME , (http://mon.iihe.ac.be/~USERNAME will continue working for a while and redirect to the new server)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If something is missing or you have issues, please contact us !&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=SingularityContainers&amp;diff=1406</id>
		<title>SingularityContainers</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=SingularityContainers&amp;diff=1406"/>
		<updated>2024-11-05T14:44:07Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;To make SL6/SL7/CC8/EL9 flavours available for everyone while the cluster is running another version, we make use of containers. This also works for SL8 and EL9 on the new HTCondor CC7 cluster; just replace SL7 with the version you want.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== Prerequisites ===&lt;br /&gt;
Helper scripts are in &#039;&#039;&#039;/group/userscripts/&#039;&#039;&#039;, so either make sure it is in your $PATH variable with:&lt;br /&gt;
 export PATH=$PATH:/group/userscripts/&lt;br /&gt;
or just call the full path:&lt;br /&gt;
 /group/userscripts/sl6&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== To test code on the mX UIs ===&lt;br /&gt;
Simply go into an EL7 environment:&lt;br /&gt;
 sl7 bash&lt;br /&gt;
This should give you a prompt, where your SL7 code should work.&lt;br /&gt;
&lt;br /&gt;
Just to convince yourself, you can cross-check the OS release:&lt;br /&gt;
&amp;lt;pre&amp;gt;cat /etc/redhat-release&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== To use SL7 inside a cluster job ===&lt;br /&gt;
You can simply ask your script to be run inside the sl7 container:&lt;br /&gt;
 sl7 /user/$USER/MYSUPERSCRIPT.sh&lt;br /&gt;
&lt;br /&gt;
To send it to the cluster, the job.sh file you will submit with qsub should contain:&lt;br /&gt;
 /group/userscripts/sl7 /user/$USER/MYSUPERSCRIPT.sh&lt;br /&gt;
&lt;br /&gt;
Please note that your script then needs to change directory to TMPDIR.&lt;br /&gt;
&amp;lt;br&amp;gt;It is &#039;&#039;&#039;HIGHLY&#039;&#039;&#039; recommended to experiment with this first, printing path and environment variables, to check everything inside the container.&lt;br /&gt;
&lt;br /&gt;
To be more exhaustive, the &#039;&#039;&#039;sl7&#039;&#039;&#039; script first passes your PATH &amp;amp; LD_LIBRARY_PATH to singularity, and then launches your script using the following line:&lt;br /&gt;
 singularity exec -B /cvmfs -B /pnfs -B /user -B /scratch /cvmfs/singularity.opensciencegrid.org/opensciencegrid/osgvo-el7:latest MYSCRIPT.sh&lt;br /&gt;
&lt;br /&gt;
where: &lt;br /&gt;
* &#039;&#039;&#039;exec&#039;&#039;&#039;: the singularity action; here it just executes your command in the specified container&lt;br /&gt;
* &#039;&#039;&#039;-B /mountpoint&#039;&#039;&#039;: bind-mounts the path so it is present in the container if needed.&lt;br /&gt;
* &#039;&#039;&#039;/cvmfs/.../osgvo-el7:latest&#039;&#039;&#039;: the path on /cvmfs of the container image used.&lt;br /&gt;
* &#039;&#039;&#039;MYSCRIPT.sh&#039;&#039;&#039;: script or command to execute&lt;br /&gt;
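As a hedged sketch (this is not the real /group/userscripts/sl7 script, and the function name is illustrative), the assembly described above can be mimicked by a small wrapper; the image path and bind mounts are the ones quoted above:

```shell
# Hypothetical sketch of the command a helper like sl7 assembles.
# build_sl7_cmd is an illustrative name, not the real script; it only
# prints the singularity command instead of running it.
build_sl7_cmd() {
    image=/cvmfs/singularity.opensciencegrid.org/opensciencegrid/osgvo-el7:latest
    binds="-B /cvmfs -B /pnfs -B /user -B /scratch"
    printf 'singularity exec %s %s %s\n' "$binds" "$image" "$*"
}

build_sl7_cmd /user/someuser/MYSUPERSCRIPT.sh
```

Since it only prints the command, you can inspect the result before executing anything for real.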
&lt;br /&gt;
&amp;lt;br&amp;gt;&#039;&#039;&#039;N.B.&#039;&#039;&#039; You can look at more containers inside &#039;&#039;&#039;/cvmfs/singularity.opensciencegrid.org/opensciencegrid/&#039;&#039;&#039;.&lt;br /&gt;
&amp;lt;br&amp;gt;You can also create your own container, for instance with specific software versions; it is quite simple.&lt;br /&gt;
&amp;lt;br&amp;gt;You can use the guide [[SingularityContainerCreation|here]], or google &#039;&#039;singularity container&#039;&#039; in your favorite search engine :)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== To use EL9 inside a cluster job ===&lt;br /&gt;
&lt;br /&gt;
As CentOS 7 is soon to be EOL (30/06/2024), we plan on changing the OS of the cluster to an EL9 flavour. Until then, you can use the following to get access to an EL9 container:&lt;br /&gt;
 singularity exec -B /cvmfs -B /pnfs -B /user -B /scratch /cvmfs/singularity.opensciencegrid.org/opensciencegrid/osgvo-el9:latest MYSCRIPT.sh&lt;br /&gt;
&lt;br /&gt;
=== If you need grid commands ===&lt;br /&gt;
The above containers work perfectly on our cluster. However, they have the drawback of not being updated often. Unfortunately, to be able to work in a grid environment, the CA (certificate authority) files need to be kept up to date.&amp;lt;br&amp;gt;&lt;br /&gt;
For this reason, CMS also provides some containers. Unfortunately, not all software is included (Python is currently missing in the EL9 one), but the grid commands do work. To use them, run the script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/cvmfs/cms.cern.ch/common/cmssw-el8&lt;br /&gt;
or&lt;br /&gt;
/cvmfs/cms.cern.ch/common/cmssw-el9&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Some more info on these containers can be found [http://cms-sw.github.io/singularity.html here]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=SingularityContainers&amp;diff=1405</id>
		<title>SingularityContainers</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=SingularityContainers&amp;diff=1405"/>
		<updated>2024-11-05T14:43:50Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;To make SL6/SL7/CC8/EL9 flavours available for everyone while the cluster is running another version, we make use of containers. This also works for SL8 and EL9 on the new HTCondor CC7 cluster; just replace SL7 with the version you want.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== Prerequisites ===&lt;br /&gt;
Helper scripts are in &#039;&#039;&#039;/group/userscripts/&#039;&#039;&#039;, so either make sure it is in your $PATH variable with:&lt;br /&gt;
 export PATH=$PATH:/group/userscripts/&lt;br /&gt;
or just call the full path:&lt;br /&gt;
 /group/userscripts/sl6&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== To test code on the mX UIs ===&lt;br /&gt;
Simply go into an EL7 environment:&lt;br /&gt;
 sl7 bash&lt;br /&gt;
This should give you a prompt, where your SL7 code should work.&lt;br /&gt;
&lt;br /&gt;
Just to convince yourself, you can cross-check the OS release:&lt;br /&gt;
&amp;lt;pre&amp;gt;cat /etc/redhat-release&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== To use SL7 inside a cluster job ===&lt;br /&gt;
You can simply ask your script to be run inside the sl7 container:&lt;br /&gt;
 sl7 /user/$USER/MYSUPERSCRIPT.sh&lt;br /&gt;
&lt;br /&gt;
To send it to the cluster, the job.sh file you will submit with qsub should contain:&lt;br /&gt;
 /group/userscripts/sl7 /user/$USER/MYSUPERSCRIPT.sh&lt;br /&gt;
&lt;br /&gt;
Please note that your script then needs to change directory to TMPDIR.&lt;br /&gt;
&amp;lt;br&amp;gt;It is &#039;&#039;&#039;HIGHLY&#039;&#039;&#039; recommended to experiment with this first, printing path and environment variables, to check everything inside the container.&lt;br /&gt;
&lt;br /&gt;
To be more exhaustive, the &#039;&#039;&#039;sl7&#039;&#039;&#039; script first passes your PATH &amp;amp; LD_LIBRARY_PATH to singularity, and then launches your script using the following line:&lt;br /&gt;
 singularity exec -B /cvmfs -B /pnfs -B /user -B /scratch /cvmfs/singularity.opensciencegrid.org/opensciencegrid/osgvo-el7:latest MYSCRIPT.sh&lt;br /&gt;
&lt;br /&gt;
where: &lt;br /&gt;
* &#039;&#039;&#039;exec&#039;&#039;&#039;: the singularity action; here it just executes your command in the specified container&lt;br /&gt;
* &#039;&#039;&#039;-B /mountpoint&#039;&#039;&#039;: bind-mounts the path so it is present in the container if needed.&lt;br /&gt;
* &#039;&#039;&#039;/cvmfs/.../osgvo-el7:latest&#039;&#039;&#039;: the path on /cvmfs of the container image used.&lt;br /&gt;
* &#039;&#039;&#039;MYSCRIPT.sh&#039;&#039;&#039;: script or command to execute&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&#039;&#039;&#039;N.B.&#039;&#039;&#039; You can look at more containers inside &#039;&#039;&#039;/cvmfs/singularity.opensciencegrid.org/opensciencegrid/&#039;&#039;&#039;.&lt;br /&gt;
&amp;lt;br&amp;gt;You can also create your own container, for instance with specific software versions; it is quite simple.&lt;br /&gt;
&amp;lt;br&amp;gt;You can use the guide [[SingularityContainerCreation|here]], or google &#039;&#039;singularity container&#039;&#039; in your favorite search engine :)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== To use EL9 inside a cluster job ===&lt;br /&gt;
&lt;br /&gt;
As CentOS 7 is soon to be EOL (30/06/2024), we plan on changing the OS of the cluster to an EL9 flavour. Until then, you can use the following to get access to an EL9 container:&lt;br /&gt;
 singularity exec -B /cvmfs -B /pnfs -B /user -B /scratch /cvmfs/singularity.opensciencegrid.org/opensciencegrid/osgvo-el9:latest MYSCRIPT.sh&lt;br /&gt;
&lt;br /&gt;
=== If you need grid commands ===&lt;br /&gt;
The above containers work perfectly on our cluster. However, they have the drawback of not being updated often. Unfortunately, to be able to work in a grid environment, the CA (certificate authority) files need to be kept up to date.&amp;lt;br&amp;gt;&lt;br /&gt;
For this reason, CMS also provides some containers. Unfortunately, not all software is included (Python is currently missing in the EL9 one), but the grid commands do work. To use them, run the script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/cvmfs/cms.cern.ch/common/cmssw-el8&lt;br /&gt;
or&lt;br /&gt;
/cvmfs/cms.cern.ch/common/cmssw-el9&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Some more info on these containers can be found [http://cms-sw.github.io/singularity.html here]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=SingularityContainers&amp;diff=1404</id>
		<title>SingularityContainers</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=SingularityContainers&amp;diff=1404"/>
		<updated>2024-11-05T14:41:15Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;To make SL6/SL7/CC8/EL9 flavours available for everyone while the cluster is running another version, we make use of containers. This also works for SL8 and EL9 on the new HTCondor CC7 cluster; just replace SL7 with the version you want.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== Prerequisites ===&lt;br /&gt;
Helper scripts are in &#039;&#039;&#039;/group/userscripts/&#039;&#039;&#039;, so either make sure it is in your $PATH variable with:&lt;br /&gt;
 export PATH=$PATH:/group/userscripts/&lt;br /&gt;
or just call the full path:&lt;br /&gt;
 /group/userscripts/sl6&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== To test code on the mX UIs ===&lt;br /&gt;
Simply go into an EL7 environment:&lt;br /&gt;
 sl7 bash&lt;br /&gt;
This should give you a prompt, where your SL7 code should work.&lt;br /&gt;
&lt;br /&gt;
Just to convince yourself, you can cross-check the OS release:&lt;br /&gt;
&amp;lt;pre&amp;gt;cat /etc/redhat-release&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== To use SL7 inside a cluster job ===&lt;br /&gt;
You can simply ask your script to be run inside the sl7 container:&lt;br /&gt;
 sl7 /user/$USER/MYSUPERSCRIPT.sh&lt;br /&gt;
&lt;br /&gt;
To send it to the cluster, the job.sh file you will submit with qsub should contain:&lt;br /&gt;
 /group/userscripts/sl7 /user/$USER/MYSUPERSCRIPT.sh&lt;br /&gt;
&lt;br /&gt;
Please note that your script then needs to change directory to TMPDIR.&lt;br /&gt;
&amp;lt;br&amp;gt;It is &#039;&#039;&#039;HIGHLY&#039;&#039;&#039; recommended to experiment with this first, printing path and environment variables, to check everything inside the container.&lt;br /&gt;
&lt;br /&gt;
To be more exhaustive, the &#039;&#039;&#039;sl7&#039;&#039;&#039; script first passes your PATH &amp;amp; LD_LIBRARY_PATH to singularity, and then launches your script using the following line:&lt;br /&gt;
 singularity exec -B /cvmfs -B /pnfs -B /user -B /scratch /cvmfs/singularity.opensciencegrid.org/opensciencegrid/osgvo-el7:latest MYSCRIPT.sh&lt;br /&gt;
&lt;br /&gt;
where: &lt;br /&gt;
* &#039;&#039;&#039;exec&#039;&#039;&#039;: the singularity action; here it just executes your command in the specified container&lt;br /&gt;
* &#039;&#039;&#039;-B /mountpoint&#039;&#039;&#039;: bind-mounts the path so it is present in the container if needed.&lt;br /&gt;
* &#039;&#039;&#039;/cvmfs/.../osgvo-el7:latest&#039;&#039;&#039;: the path on /cvmfs of the container image used.&lt;br /&gt;
* &#039;&#039;&#039;MYSCRIPT.sh&#039;&#039;&#039;: script or command to execute&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&#039;&#039;&#039;N.B.&#039;&#039;&#039; You can look at more containers inside &#039;&#039;&#039;/cvmfs/singularity.opensciencegrid.org/opensciencegrid/&#039;&#039;&#039;.&lt;br /&gt;
&amp;lt;br&amp;gt;You can also create your own container, for instance with specific software versions; it is quite simple.&lt;br /&gt;
&amp;lt;br&amp;gt;You can use the guide [[SingularityContainerCreation|here]], or google &#039;&#039;singularity container&#039;&#039; in your favorite search engine :)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== To use EL9 inside a cluster job ===&lt;br /&gt;
&lt;br /&gt;
As CentOS 7 is soon to be EOL (30/06/2024), we plan on changing the OS of the cluster to an EL9 flavour. Until then, you can use the following to get access to an EL9 container:&lt;br /&gt;
 singularity exec -B /cvmfs -B /pnfs -B /user -B /scratch /cvmfs/singularity.opensciencegrid.org/opensciencegrid/osgvo-el9:latest MYSCRIPT.sh&lt;br /&gt;
&lt;br /&gt;
=== If you need grid commands ===&lt;br /&gt;
The above containers work perfectly on our cluster. However, they have the drawback of not being updated often. Unfortunately, to be able to work in a grid environment, the CA (certificate authority) files need to be kept up to date.&amp;lt;br&amp;gt;&lt;br /&gt;
For this reason, CMS also provides some containers. Unfortunately, not all software is included (Python is currently missing in the EL9 one), but the grid commands do work. To use them, run the script:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/cvmfs/cms.cern.ch/common/cmssw-el8&lt;br /&gt;
or&lt;br /&gt;
/cvmfs/cms.cern.ch/common/cmssw-el9&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=HTCFirstSubmissionGuide&amp;diff=1403</id>
		<title>HTCFirstSubmissionGuide</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=HTCFirstSubmissionGuide&amp;diff=1403"/>
		<updated>2024-11-05T13:14:29Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* T2B Specifics */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== First time submitting a job ==&lt;br /&gt;
For new users, we recommend following [https://indico.cern.ch/event/936993/contributions/4022073/attachments/2105538/3540926/2020-Koch-User-Tutorial.pdf this presentation], which should give you an idea of how to submit jobs.&amp;lt;br&amp;gt;&lt;br /&gt;
Then, to practice the basics of job submission on an HTCondor cluster, some exercises are proposed on this [https://en.wikitolearn.org/Course:HTCondor Wiki].&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;T2B Specifics&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
- &#039;&#039;&#039;File Transfers:&#039;&#039;&#039;&lt;br /&gt;
: Please note that, contrary to what is usually shown in documentation and examples, we recommend not using the HTCondor file transfer mechanisms (&#039;&#039;&#039;should_transfer_files = NO&#039;&#039;&#039;) and copying files yourself within your script.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- &#039;&#039;&#039;Always adapt requested resources to your job&#039;&#039;&#039;&lt;br /&gt;
: You need to adapt the resources you request to what you estimate your job will need. Requesting more than what you really need is wasteful and deprives your fellow users of resources they might require.&lt;br /&gt;
: To do so, just add to your submit file the following lines:&lt;br /&gt;
&amp;lt;pre&amp;gt;request_cpus = 1&lt;br /&gt;
request_memory = 200MB&lt;br /&gt;
request_disk = 1GB&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: Note that if a job needs more than 1 CPU / 4GB of memory / 10GB of disk, please be careful and make sure you know what you are doing.&lt;br /&gt;
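Putting the recommendations above together, a minimal complete submit file could look like this sketch (the executable and log file names are illustrative):

```
executable            = job.sh
output                = job.out
error                 = job.err
log                   = job.log
should_transfer_files = NO
request_cpus          = 1
request_memory        = 200MB
request_disk          = 1GB
queue
```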
&lt;br /&gt;
&lt;br /&gt;
- &#039;&#039;&#039;Where am I when I start a job ?&#039;&#039;&#039;&lt;br /&gt;
:You should always prefer using $HOME, which in a job equates to a unique local directory in the /scratch of the local disk, e.g. /scratch/condor/dir_275928.&lt;br /&gt;
: $TMPDIR = $HOME/tmp, so it is also on the local disk and unique.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- &#039;&#039;&#039;Efficient use of local disks on worker nodes&#039;&#039;&#039;&lt;br /&gt;
:Note that local disks are now exclusively NVMe drives, which are much, much faster than the network protocols used when writing to /pnfs or /user.&amp;lt;br&amp;gt;&lt;br /&gt;
:So for repeated reads (like cycling through events in files of O(1GB)), it is more efficient to copy the file locally first, then open it.&amp;lt;br&amp;gt;&lt;br /&gt;
:The same goes for writes: prefer writing locally, then copying the file to /pnfs.&lt;br /&gt;
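The copy-local pattern above can be sketched as a runnable snippet; mktemp directories stand in for /pnfs and the job's local scratch, which you would use on the cluster, and the file names are illustrative:

```shell
# Runnable sketch of the copy-local pattern. The mktemp directories are
# stand-ins: on the cluster you would read from /pnfs and work in $TMPDIR.
storage=$(mktemp -d)   # stands in for /pnfs (mass storage)
scratch=$(mktemp -d)   # stands in for the job's local scratch dir
printf 'event1\nevent2\n' > "$storage/input.txt"

# 1. copy the input to the fast local disk first
cp "$storage/input.txt" "$scratch/"
# 2. do the repeated reads / the writes locally
wc -l "$scratch/input.txt" | awk '{print $1}' > "$scratch/output.txt"
# 3. copy the result back to mass storage at the end
cp "$scratch/output.txt" "$storage/"

cat "$storage/output.txt"
```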
&lt;br /&gt;
&lt;br /&gt;
- &#039;&#039;&#039;Can I have a shell/interactive job on the batch system ?&#039;&#039;&#039;&lt;br /&gt;
:Yes! If you want to make tests, or run things interactively, with dedicated core/memory for you, just run:&lt;br /&gt;
 condor_submit -i&lt;br /&gt;
:Note that if you want to reserve more than the standard 1 core / 600MB of memory, simply add your request_* settings [specified just above] like this:&lt;br /&gt;
 condor_submit -interactive request_cpus=2&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- &#039;&#039;&#039;Sending DAG jobs&#039;&#039;&#039;&lt;br /&gt;
:For now, sending DAG jobs only works when you are directly on the scheduler, so first add your ssh key to the local keyring agent and then connect to any mX machine with the -A (agent forwarding) option:&lt;br /&gt;
 ssh-add&lt;br /&gt;
 ssh -A mshort.iihe.ac.be&lt;br /&gt;
:then connect to the scheduler:&lt;br /&gt;
 ssh schedd02.wn.iihe.ac.be&lt;br /&gt;
:there you can run the condor_submit_dag commands and it will work. Note that condor_q commands to see the progress of your DAG jobs can still be performed on the mX machines.&lt;br /&gt;
&lt;br /&gt;
=== What is my job doing ? Is something wrong ? ===&lt;br /&gt;
For debugging jobs in the queue, or seeing what your running jobs are doing, please follow our [[HTCondorDebug|HTCondor Debug]] section.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== HTCondor Official Documentation ===&lt;br /&gt;
Have a look at the [https://htcondor.readthedocs.io/en/latest/users-manual/index.html official User Manual] on the HTCondor website.&amp;lt;br&amp;gt;&lt;br /&gt;
It is very well done and explains all available features.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== HTCondor Workshop Presentation ===&lt;br /&gt;
Every 6 months, there is an HTCondor workshop. Presentations are usually very helpful, especially if you want to go into details of HTCondor (API, DAGMan, ...). You can find the agenda of the latest one [https://indico.cern.ch/event/936993/timetable/#20200921.detailed here]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=HTCFirstSubmissionGuide&amp;diff=1402</id>
		<title>HTCFirstSubmissionGuide</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=HTCFirstSubmissionGuide&amp;diff=1402"/>
		<updated>2024-11-05T10:19:06Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* First time submitting a job = */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== First time submitting a job ==&lt;br /&gt;
For new users, we recommend following [https://indico.cern.ch/event/936993/contributions/4022073/attachments/2105538/3540926/2020-Koch-User-Tutorial.pdf this presentation], which should give you an idea of how to submit jobs.&amp;lt;br&amp;gt;&lt;br /&gt;
Then, to practice the basics of job submission on an HTCondor cluster, some exercises are proposed on this [https://en.wikitolearn.org/Course:HTCondor Wiki].&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;T2B Specifics&#039;&#039; ===&lt;br /&gt;
&lt;br /&gt;
- &#039;&#039;&#039;File Transfers:&#039;&#039;&#039;&lt;br /&gt;
: Please note that, contrary to what is usually shown in documentation and examples, we recommend not using the HTCondor file transfer mechanisms (&#039;&#039;&#039;should_transfer_files = NO&#039;&#039;&#039;) and copying files yourself within your script.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- &#039;&#039;&#039;Always adapt requested resources to your job&#039;&#039;&#039;&lt;br /&gt;
: You need to adapt the resources you request to what you estimate your job will need. Requesting more than what you really need is wasteful and deprives your fellow users of resources they might require.&lt;br /&gt;
: To do so, just add to your submit file the following lines:&lt;br /&gt;
&amp;lt;pre&amp;gt;request_cpus = 1&lt;br /&gt;
request_memory = 200MB&lt;br /&gt;
request_disk = 1GB&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: Note that if a job needs more than 1 CPU / 4GB of memory / 10GB of disk, please be careful and make sure you know what you are doing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- &#039;&#039;&#039;Where am I when I start a job ?&#039;&#039;&#039;&lt;br /&gt;
:You should always prefer using $HOME, which in a job equates to a unique local directory in the /scratch of the local disk, e.g. /scratch/condor/dir_275928.&lt;br /&gt;
: $TMPDIR = $HOME/tmp, so it is also on the local disk and unique.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- &#039;&#039;&#039;Efficient use of local disks on worker nodes&#039;&#039;&#039;&lt;br /&gt;
:Note that local disks are now exclusively NVMe drives, which are much, much faster than the network protocols used when writing to /pnfs or /user.&amp;lt;br&amp;gt;&lt;br /&gt;
:So for repeated reads (like cycling through events in files of O(1GB)), it is more efficient to copy the file locally first, then open it.&amp;lt;br&amp;gt;&lt;br /&gt;
:The same goes for writes: prefer writing locally, then copying the file to /pnfs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- &#039;&#039;&#039;Can I have a shell/interactive job on the batch system ?&#039;&#039;&#039;&lt;br /&gt;
:Yes! If you want to make tests, or run things interactively, with dedicated core/memory for you, just run:&lt;br /&gt;
 condor_submit -i&lt;br /&gt;
:Note that if you want to reserve more than the standard 1 core / 600MB of memory, simply add your request_* settings [specified just above] like this:&lt;br /&gt;
 condor_submit -interactive request_cpus=2&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- &#039;&#039;&#039;Sending DAG jobs&#039;&#039;&#039;&lt;br /&gt;
:For now, sending DAG jobs only works when you are directly on the scheduler, so first connect to any mX machine with the -A (agent forwarding) option:&lt;br /&gt;
 ssh -A mshort.iihe.ac.be&lt;br /&gt;
:then connect to the scheduler:&lt;br /&gt;
 ssh schedd02.wn.iihe.ac.be&lt;br /&gt;
:there you can run the condor_submit_dag commands and it will work. Note that condor_q commands to see the progress of your DAG jobs can still be performed on the mX machines.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== What is my job doing ? Is something wrong ? ===&lt;br /&gt;
For debugging jobs in the queue, or seeing what your running jobs are doing, please follow our [[HTCondorDebug|HTCondor Debug]] section.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== HTCondor Official Documentation ===&lt;br /&gt;
Have a look at the [https://htcondor.readthedocs.io/en/latest/users-manual/index.html official User Manual] on the HTCondor website.&amp;lt;br&amp;gt;&lt;br /&gt;
It is very well done and explains all available features.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
=== HTCondor Workshop Presentation ===&lt;br /&gt;
Every 6 months, there is an HTCondor workshop. Presentations are usually very helpful, especially if you want to go into details of HTCondor (API, DAGMan, ...). You can find the agenda of the latest one [https://indico.cern.ch/event/936993/timetable/#20200921.detailed here]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=Backup&amp;diff=1401</id>
		<title>Backup</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=Backup&amp;diff=1401"/>
		<updated>2024-10-24T12:02:12Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Backups ==&lt;br /&gt;
&lt;br /&gt;
=== User space on T2B ===&lt;br /&gt;
The &#039;&#039;&#039;/user, /group, /software &amp;amp; /ice3&#039;&#039;&#039; are backed up every day to a secondary ceph storage cluster, in case our production cluster goes down.&amp;lt;br&amp;gt;&lt;br /&gt;
Backups can be found in &#039;&#039;&#039;/backup/$DATE/{user,group,software,ice3}&#039;&#039;&#039;:&lt;br /&gt;
* One backup is made every day at 8.30am; we keep the last seven (so a complete week), e.g.:&lt;br /&gt;
:: &amp;lt;pre&amp;gt;scheduled-2024-02-04-08_30_00_UTC&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* We also keep the last 4 Sunday snapshots, equivalent to a month.&lt;br /&gt;
* All backups are READ-ONLY, so extract the files you want from them into your personal directory.&lt;br /&gt;
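Since the snapshots are read-only, restoring just means copying a file back out. As a hedged convenience sketch (the helper and file names are illustrative; the /backup layout is the one described above), a small function can print the copy command to run:

```shell
# Hypothetical helper: print the command that restores one file from a
# read-only snapshot back into /user. restore_cmd is an illustrative
# name; it prints rather than runs the copy so you can check it first.
restore_cmd() {
    snap=$1      # e.g. scheduled-2024-02-04-08_30_00_UTC
    relpath=$2   # path relative to /user
    printf 'cp /backup/%s/user/%s /user/%s\n' "$snap" "$relpath" "$relpath"
}

restore_cmd scheduled-2024-02-04-08_30_00_UTC someuser/notes.txt
```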
&lt;br /&gt;
=== Mass Storage ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;/pnfs&#039;&#039;&#039; is &#039;&#039;&#039;&#039;&#039;NOT&#039;&#039;&#039;&#039;&#039; backed up, as it is mass storage.&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=Register_to_the_CMS_VO&amp;diff=1400</id>
		<title>Register to the CMS VO</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=Register_to_the_CMS_VO&amp;diff=1400"/>
		<updated>2024-10-24T11:55:48Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
Since July 2024, there is no need to register to the CMS VO anymore. Anyone who is a member of CMS is automatically added.&amp;lt;br&amp;gt;&lt;br /&gt;
You still need to add your DN to the DB. The instructions can be found [https://twiki.cern.ch/twiki/bin/view/CMSPublic/UsernameForCRAB#Adding_your_DN_to_your_profile  here].&amp;lt;br&amp;gt;&lt;br /&gt;
Note that a few entries may already be present on your CERN profile. You can safely ignore those.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;OBSOLETE&#039;&#039;&#039;&lt;br /&gt;
* Go to the [https://voms2.cern.ch:8443/voms/cms/register/start.action  VOMS page]. On the possible certificate prompt, select the one you just created. &amp;lt;br&amp;gt;&lt;br /&gt;
** [ If you don&#039;t arrive in the page below, then you might already be registered to the CMS VO. Make the [[SiteDB|SiteDB check]] to be sure. ]&lt;br /&gt;
** Enter the email address registered at cern, then click submit.&amp;lt;br&amp;gt;&lt;br /&gt;
** You should appear just below. If it&#039;s you, click on the corresponding button!&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:: [[ File:vocms1.png|center]]&amp;lt;br&amp;gt;&lt;br /&gt;
:* If it doesn&#039;t find you with the email you entered after clicking on submit, then look in the [https://phonebook.cern.ch/phonebook/ CERN phonebook] for your email. If you cannot find yourself, then make sure you are registered to CERN or CMS at least.&lt;br /&gt;
:: [[File:cern_phonebook.png|center]]&lt;br /&gt;
:* Fill in all fields, accept the policy, then submit.&lt;br /&gt;
:: [[File:vocms_form.png|center]]&lt;br /&gt;
:* The procedure is nearly finished, look at your inbox corresponding to the CERN email.&lt;br /&gt;
[[File:vocms_email.png|center]]&lt;br /&gt;
:* Just click on the confirmation link in the email received.&lt;br /&gt;
[[File:vocms_end.png|center]]&lt;br /&gt;
&lt;br /&gt;
* Now you only need to wait &#039;&#039;&#039;a few hours&#039;&#039;&#039; for your membership to be approved !&lt;br /&gt;
&lt;br /&gt;
* You can [[SiteDB | follow the wiki]] to check in SiteDB that both your certificate and your membership have been approved.&lt;br /&gt;
&lt;br /&gt;
=== IMPORTANT!!! ===&lt;br /&gt;
Once you are a member of the CMS VO, you should also request to be part of the group becms. This is also done on the [https://voms2.cern.ch:8443/voms/cms/register/start.action  VOMS page]. &amp;lt;br&amp;gt;&lt;br /&gt;
This attribute will grant you higher priority when your CRAB job lands at T2B, so it is best to always create your proxy this way:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
voms-proxy-init --voms cms:/cms/becms&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=Register_to_the_CMS_VO&amp;diff=1399</id>
		<title>Register to the CMS VO</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=Register_to_the_CMS_VO&amp;diff=1399"/>
		<updated>2024-10-24T11:53:30Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
Since July 2024, there is no need to register to the VO CMS anymore.&amp;lt;br&amp;gt;&lt;br /&gt;
Anyone who is a member of CMS is already added. You still need to add your DN to the DB. The instructions can be found [https://twiki.cern.ch/twiki/bin/view/CMSPublic/UsernameForCRAB#Adding_your_DN_to_your_profile  here].&amp;lt;br&amp;gt;&lt;br /&gt;
Notice that there are a few entries already present on your CERN profile. You can safely ignore those.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
OBSOLETE&lt;br /&gt;
* Go to the [https://voms2.cern.ch:8443/voms/cms/register/start.action  VOMS page]. On the possible certificate prompt, select the one you just created. &amp;lt;br&amp;gt;&lt;br /&gt;
** [ If you don&#039;t arrive in the page below, then you might already be registered to the CMS VO. Make the [[SiteDB|SiteDB check]] to be sure. ]&lt;br /&gt;
** Enter the email address registered at cern, then click submit.&amp;lt;br&amp;gt;&lt;br /&gt;
** You should appear just below. If it&#039;s you, click on the corresponding button!&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:: [[ File:vocms1.png|center]]&amp;lt;br&amp;gt;&lt;br /&gt;
:* If it doesn&#039;t find you with the email you entered after clicking on submit, then look in the [https://phonebook.cern.ch/phonebook/ CERN phonebook] for your email. If you cannot find yourself, then make sure you are registered to CERN or CMS at least.&lt;br /&gt;
:: [[File:cern_phonebook.png|center]]&lt;br /&gt;
:* Fill in all fields, accept the policy, then submit.&lt;br /&gt;
:: [[File:vocms_form.png|center]]&lt;br /&gt;
:* The procedure is nearly finished, look at your inbox corresponding to the CERN email.&lt;br /&gt;
[[File:vocms_email.png|center]]&lt;br /&gt;
:* Just click on the confirmation link in the email received.&lt;br /&gt;
[[File:vocms_end.png|center]]&lt;br /&gt;
&lt;br /&gt;
* Now you only need to wait &#039;&#039;&#039;a few hours&#039;&#039;&#039; for your membership to be approved !&lt;br /&gt;
&lt;br /&gt;
* You can [[SiteDB | follow the wiki]] to check in SiteDB that both your certificate and your membership have been approved.&lt;br /&gt;
&lt;br /&gt;
=== IMPORTANT!!! ===&lt;br /&gt;
Once you are a member of the CMS VO, you should also request to be part of the group becms. This is also done on the [https://voms2.cern.ch:8443/voms/cms/register/start.action  VOMS page]. &amp;lt;br&amp;gt;&lt;br /&gt;
This attribute will grant you higher priority when your CRAB job lands at T2B, so it is best to always create your proxy this way:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
voms-proxy-init --voms cms:/cms/becms&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=Register_to_the_CMS_VO&amp;diff=1398</id>
		<title>Register to the CMS VO</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=Register_to_the_CMS_VO&amp;diff=1398"/>
		<updated>2024-10-24T11:52:13Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
Since July 2024, there is no need to register to the VO CMS anymore.&amp;lt;br&amp;gt;&lt;br /&gt;
Anyone who is a member of CMS is already added. You still need to add your DN to the DB. The instructions can be found [https://twiki.cern.ch/twiki/bin/view/CMSPublic/UsernameForCRAB#Adding_your_DN_to_your_profile here].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
OBSOLETE&lt;br /&gt;
* Go to the [https://voms2.cern.ch:8443/voms/cms/register/start.action  VOMS page]. On the possible certificate prompt, select the one you just created. &amp;lt;br&amp;gt;&lt;br /&gt;
** [ If you don&#039;t arrive in the page below, then you might already be registered to the CMS VO. Make the [[SiteDB|SiteDB check]] to be sure. ]&lt;br /&gt;
** Enter the email address registered at cern, then click submit.&amp;lt;br&amp;gt;&lt;br /&gt;
** You should appear just below. If it&#039;s you, click on the corresponding button!&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:: [[ File:vocms1.png|center]]&amp;lt;br&amp;gt;&lt;br /&gt;
:* If it doesn&#039;t find you with the email you entered after clicking on submit, then look in the [https://phonebook.cern.ch/phonebook/ CERN phonebook] for your email. If you cannot find yourself, then make sure you are registered to CERN or CMS at least.&lt;br /&gt;
:: [[File:cern_phonebook.png|center]]&lt;br /&gt;
:* Fill in all fields, accept the policy, then submit.&lt;br /&gt;
:: [[File:vocms_form.png|center]]&lt;br /&gt;
:* The procedure is nearly finished, look at your inbox corresponding to the CERN email.&lt;br /&gt;
[[File:vocms_email.png|center]]&lt;br /&gt;
:* Just click on the confirmation link in the email received.&lt;br /&gt;
[[File:vocms_end.png|center]]&lt;br /&gt;
&lt;br /&gt;
* Now you only need to wait &#039;&#039;&#039;a few hours&#039;&#039;&#039; for your membership to be approved !&lt;br /&gt;
&lt;br /&gt;
* You can [[SiteDB | follow the wiki]] to check in SiteDB that both your certificate and your membership have been approved.&lt;br /&gt;
&lt;br /&gt;
=== IMPORTANT!!! ===&lt;br /&gt;
Once you are a member of the CMS VO, you should also request to be part of the group becms. This is also done on the [https://voms2.cern.ch:8443/voms/cms/register/start.action  VOMS page]. &amp;lt;br&amp;gt;&lt;br /&gt;
This attribute will grant you higher priority when your CRAB job lands at T2B, so it is best to always create your proxy this way:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
voms-proxy-init --voms cms:/cms/becms&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=Backup&amp;diff=1397</id>
		<title>Backup</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=Backup&amp;diff=1397"/>
		<updated>2024-10-21T08:27:41Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Backups */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Backups ==&lt;br /&gt;
&lt;br /&gt;
All backups can be found in the &#039;&#039;&#039;/backup&#039;&#039;&#039; directory on the UIs (M Machines).&amp;lt;br&amp;gt;&lt;br /&gt;
The /user, /group, /software &amp;amp; /ice3 are backed up every day to a secondary ceph storage cluster, in case our production cluster goes down.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
!! NOTE: &#039;&#039;&#039;/pnfs&#039;&#039;&#039; is &#039;&#039;&#039;&#039;&#039;NOT&#039;&#039;&#039;&#039;&#039; backed-up, as it is a massive storage !!&lt;br /&gt;
&lt;br /&gt;
!! NOTE2: Those backups are READ-ONLY, so extract the files you want from them into your normal directory !!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== /user, /group, /software &amp;amp; /ice3 ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Our ceph cluster allows us to do regular snapshots, they can be found in &#039;&#039;&#039;/backup/$DATE/{user,group,software,ice3}&#039;&#039;&#039;:&lt;br /&gt;
* One is taken every day at 8.30am, and we keep the last seven (a complete week), e.g.: &lt;br /&gt;
:: &amp;lt;pre&amp;gt;scheduled-2024-02-04-08_30_00_UTC&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* We also keep the last 4 Sunday snapshots, equivalent to a month.&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=Cluster_Overview&amp;diff=1396</id>
		<title>Cluster Overview</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=Cluster_Overview&amp;diff=1396"/>
		<updated>2024-10-16T11:07:48Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* How to Connect */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
&lt;br /&gt;
The cluster is composed of 3 groups of machines:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
* The &#039;&#039;&#039;User Interfaces (UI)&#039;&#039;&#039;&lt;br /&gt;
::This is the cluster front-end; to use the cluster, you need to log into these machines&lt;br /&gt;
::::Servers : mshort [ m2 , m3 ] , mlong [ m0, m1 ]&lt;br /&gt;
:: The &#039;&#039;&#039;File Server&#039;&#039;&#039; provides the user home on the UIs. It is a highly efficient &amp;amp; redundant storage node of ~120 TB capacity with regular backups.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
* The &#039;&#039;&#039;Computing Machines&#039;&#039;&#039;&lt;br /&gt;
** The &#039;&#039;&#039;Computing Element (CE):&#039;&#039;&#039; This is the gateway between the World and the T2B cluster: it receives all Grid jobs and submits them to the local batch system.&lt;br /&gt;
::::Servers : testumd-htcondorce (temporary)&lt;br /&gt;
&lt;br /&gt;
:* The &#039;&#039;&#039;HTCondor Schedulers:&#039;&#039;&#039; This is the brain of the batch system: they manage all the submitted jobs, and send them to the worker nodes.&lt;br /&gt;
::::Servers : scheddXX&lt;br /&gt;
&lt;br /&gt;
:* The &#039;&#039;&#039;Worker Nodes (WN): &#039;&#039;&#039; This is the power of the cluster : they run multiple jobs in parallel and send the results &amp;amp; status back to the CE.&lt;br /&gt;
::::Servers : nodeXX-YY&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
* The &#039;&#039;&#039;Mass Storage&#039;&#039;&#039;&lt;br /&gt;
** The &#039;&#039;&#039;Storage Element&#039;&#039;&#039;: it is the brain of the cluster storage. Grid accessible, it knows where all the files are, and manages all the storage nodes.&lt;br /&gt;
::::Server : maite&lt;br /&gt;
:* The &#039;&#039;&#039;Storage Nodes&#039;&#039;&#039;: This is the memory of the cluster : they contain big data files. In total, they provide ~8400 TB of grid-accessible storage.&lt;br /&gt;
::::Servers : beharXXX&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== How to Connect ==&lt;br /&gt;
&lt;br /&gt;
To connect to the cluster, you need to have sent us your public ssh key.&lt;br /&gt;
In a terminal, type the following (adapt &amp;lt;MYLOGIN&amp;gt; accordingly WITHOUT the brackets &amp;lt;&amp;gt;):&lt;br /&gt;
 ssh -X -o ServerAliveInterval=100 &amp;lt;MYLOGIN&amp;gt;@mshort.iihe.ac.be&lt;br /&gt;
:&#039;&#039;Tip: the &#039;&#039;-o ServerAliveInterval=100&#039;&#039; option is used to keep your session alive for a long period of time! You should not be disconnected during a whole day of work.&#039;&#039;&lt;br /&gt;
:&#039;&#039;Tip: use aliases to connect easily! E.g. add the following to your &#039;&#039;~/.bashrc&#039;&#039; file: &#039;&#039;alias mshort=&#039;ssh -X -o ServerAliveInterval=100 &amp;lt;MYLOGIN&amp;gt;@mshort.iihe.ac.be&#039;&lt;br /&gt;
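In the same spirit (a sketch; the Host alias name is an arbitrary choice of ours), the same options can be kept in your ~/.ssh/config so that a plain ssh mshort picks them up:

```
Host mshort
    HostName mshort.iihe.ac.be
    # replace with your login, without the brackets
    User &amp;lt;MYLOGIN&amp;gt;
    ServerAliveInterval 100
    ForwardX11 yes
```

Here ForwardX11 is the config-file equivalent of the -X flag used above.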
&lt;br /&gt;
If connecting does not work, please follow the help [[Faq_t2b#Debugging_SSH_connection_to_mX_machines:|here]]. After a successful login, you&#039;ll see this message :&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;span style=&#039;color:green&#039;&amp;gt; 			(: Welcome to the T2B Cluster :) &amp;lt;/span&amp;gt;&lt;br /&gt;
 &amp;lt;span style=&#039;color:green&#039;&amp;gt;			________________________________&amp;lt;/span&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
 &amp;lt;span style=&#039;color:green&#039;&amp;gt;			The cluster is working properly&amp;lt;/span&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
 ___________________________________________________________________________&amp;lt;br&amp;gt;&lt;br /&gt;
  Mail: &amp;lt;span style=&#039;color:blue&#039;&amp;gt;  grid_admin@listserv.vub.be&amp;lt;/span&amp;gt;   |  Chat: &amp;lt;span style=&#039;color:blue&#039;&amp;gt;  https://chat.iihe.ac.be&amp;lt;/span&amp;gt;&lt;br /&gt;
  Wiki: &amp;lt;span style=&#039;color:blue&#039;&amp;gt;  https://t2bwiki.iihe.ac.be&amp;lt;/span&amp;gt; |  Status: &amp;lt;span style=&#039;color:blue&#039;&amp;gt;https://status.iihe.ac.be&amp;lt;/span&amp;gt; &lt;br /&gt;
 ___________________________________________________________________________&amp;lt;br&amp;gt;&lt;br /&gt;
  &amp;lt;span style=&#039;color:cyan&#039;&amp;gt;[/user]&amp;lt;/span&amp;gt;  =&amp;gt;  224 / 500 GB  &amp;lt;span style=&#039;color:green&#039;&amp;gt;[44%]&amp;lt;/span&amp;gt;  --|--  &amp;lt;span style=&#039;color:cyan&#039;&amp;gt;[/pnfs]&amp;lt;/span&amp;gt;  =&amp;gt;  &amp;lt;span style=&#039;color:green&#039;&amp;gt;101 GB&amp;lt;/span&amp;gt;  [01/12/2023]&lt;br /&gt;
 ___________________________________________________________________________&amp;lt;br&amp;gt;&lt;br /&gt;
  &amp;lt;span style=&#039;color:blue&#039;&amp;gt;Welcome on [m7]&amp;lt;/span&amp;gt; ! You have &amp;lt;span style=&#039;color:purple&#039;&amp;gt;3600s (1 hours)&amp;lt;/span&amp;gt; of cpu time per processes.&lt;br /&gt;
  There are &amp;lt;span style=&#039;color:green&#039;&amp;gt;2 users&amp;lt;/span&amp;gt; here   |   Load: &amp;lt;span style=&#039;color:red&#039;&amp;gt;7.56 /4 CPUs (189%)&amp;lt;/span&amp;gt;  |   Mem: &amp;lt;span style=&#039;color:green&#039;&amp;gt;16% used&amp;lt;/span&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please observe all the information in this message:&lt;br /&gt;
* The header, telling you the health of the cluster. When there is an issue, the header of the welcome message will transform to:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;span style=&#039;color:red&#039;&amp;gt; 			:( Welcome to the T2B Cluster ): &amp;lt;/span&amp;gt;&lt;br /&gt;
 &amp;lt;span style=&#039;color:red&#039;&amp;gt;			________________________________&amp;lt;/span&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
 &amp;lt;span style=&#039;color:red&#039;&amp;gt;			THERE ARE ISSUES ON THE CLUSTER&amp;lt;/span&amp;gt;&lt;br /&gt;
 &amp;lt;span style=&#039;color:red&#039;&amp;gt;			More details at &amp;lt;/span&amp;gt;&amp;lt;span style=&#039;color:magenta&#039;&amp;gt;status.iihe.ac.be&amp;lt;/span&amp;gt;&lt;br /&gt;
 &amp;lt;span style=&#039;color:red&#039;&amp;gt;			(Register to receive updates)&amp;lt;/span&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The email used for cluster support (please use this address rather than our personal emails; this way, everyone on the support team can answer and track progress).&lt;br /&gt;
* The wiki link, where you should go first to find the information&lt;br /&gt;
* The chat link, where you can easily contact us for fast exchanges. IIHE users can use their intranet account, others can just create an account.&lt;br /&gt;
* The status link, where you can see if the cluster has any problems reported. Please make sure you are registered to receive updates.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
* The space used on the mass storage /pnfs, where storing a few TB is no problem. No hard limits are applied, but please contact us if you plan to go over 20 TB!&lt;br /&gt;
* The quota used on /user (and /group). Here a hard limit is applied, so if you are at 100%, you will have many problems. Clean your space, and if you really need more contact us.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
* The cpu time limit imposed per process, as we divided our UIs into 2 groups. Please note &#039;&#039;&#039;processes will be killed&#039;&#039;&#039; if they go over their CPU-time limit!&lt;br /&gt;
:: &#039;&#039;&#039;The light task&#039;&#039;&#039; UIs &amp;lt;span style=&#039;color:red&#039;&amp;gt;(max &#039;&#039;&#039;CPU&#039;&#039;&#039; time = 20 minutes)&amp;lt;/span&amp;gt; : they are used for crab/local job submission, writing code,  debugging ...&lt;br /&gt;
::&amp;lt;pre&amp;gt;mshort.iihe.ac.be :  m2.iihe.ac.be, m3.iihe.ac.be &amp;lt;/pre&amp;gt;&lt;br /&gt;
:: &#039;&#039;&#039;The CPU-intensive&#039;&#039;&#039; UIs &amp;lt;span style=&#039;color:red&#039;&amp;gt;(max &#039;&#039;&#039;CPU&#039;&#039;&#039; time = 5 hours)&amp;lt;/span&amp;gt; : they are available for CPU-intensive and testing tasks/workflows, although you should prefer using local job submission ...&lt;br /&gt;
::&amp;lt;pre&amp;gt;mlong.iihe.ac.be : m0.iihe.ac.be, m1.iihe.ac.be&amp;lt;/pre&amp;gt;&lt;br /&gt;
* Information about how heavily this UI is used. If any value is red (i.e. above optimal usage), please consider using another UI. Be mindful of other users and don&#039;t start too many processes, especially if the UI is already under load.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
* Sometimes announcements are printed at the end. Please make sure you read those.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Data Storage &amp;amp; Directory Structure ==&lt;br /&gt;
&lt;br /&gt;
There are 2 main directories to store your work and data:&lt;br /&gt;
* &#039;&#039;&#039;/user [/$USER]&#039;&#039;&#039; : this is your home directory. You have an enforced quota there, as it is an expensive storage with redundancy and daily backups (see below).&lt;br /&gt;
* &#039;&#039;&#039;/pnfs [/iihe/MYEXP/store/user/$USER]&#039;&#039;&#039; : this is where you can store a large amount of data, and is also [[GridStorageAccess|grid-accessible]]. If you need more than a few TB, please contact us. There are no backups there, so be careful with what you do!&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
There are other directories that you might want to take note of:&lt;br /&gt;
* &#039;&#039;&#039;/group&#039;&#039;&#039; : same as /user , but if you need to share/produce in a group.&lt;br /&gt;
* &#039;&#039;&#039;/scratch&#039;&#039;&#039; : a temporary scratch space for your job. Use $TMPDIR on the WNs; it is cleaned after each job :)&lt;br /&gt;
* &#039;&#039;&#039;/cvmfs&#039;&#039;&#039; : Centralised CVMFS software repository. It should contain most of the software you will need for your experiment. Find [[OtherSoftware|here]] how to get a coherent environment for most tools you will need.&lt;br /&gt;
* &#039;&#039;&#039;/software&#039;&#039;&#039; : local area for shared software not in /cvmfs . You can use a [[OtherSoftware|nice tool]] to find the software and versions available.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Batch System ==&lt;br /&gt;
&lt;br /&gt;
The cluster is based on HTCondor (also used at CERN or Wisconsin for instance).&lt;br /&gt;
Please follow [[HTCondor|this page]] for details on how to use it.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| width=&amp;quot;1064&amp;quot; cellspacing=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; border=&amp;quot;1&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;row&amp;quot; | Description&lt;br /&gt;
| nowrap=&amp;quot;nowrap&amp;quot; align=&amp;quot;center&amp;quot; | HTCondor batch resources&amp;lt;br&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;row&amp;quot; | # CPU&#039;s (Jobs)&lt;br /&gt;
| nowrap=&amp;quot;nowrap&amp;quot; align=&amp;quot;center&amp;quot; | 10700&amp;lt;br&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;row&amp;quot; | Walltime limit&lt;br /&gt;
| nowrap=&amp;quot;nowrap&amp;quot; align=&amp;quot;center&amp;quot; | 168 hours = 1 week &lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;row&amp;quot; | Preferred Memory per job&lt;br /&gt;
| nowrap=&amp;quot;nowrap&amp;quot; align=&amp;quot;center&amp;quot; | 4 GB&amp;lt;br&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;row&amp;quot; | $TMPDIR/scratch max usable space&lt;br /&gt;
| nowrap=&amp;quot;nowrap&amp;quot; align=&amp;quot;center&amp;quot; | 10-20 GB&amp;lt;br&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;row&amp;quot; | Max # jobs sent to the batch system / User&lt;br /&gt;
| nowrap=&amp;quot;nowrap&amp;quot; align=&amp;quot;center&amp;quot; | theoretically none (contact us if you plan on sending more than 10 000) &amp;lt;br&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Backup ==&lt;br /&gt;
There are several areas that we regularly back up: &#039;&#039;&#039;/user&#039;&#039;&#039; , &#039;&#039;&#039;/group&#039;&#039;&#039; , &#039;&#039;&#039;/ice3&#039;&#039;&#039;.&amp;lt;br&amp;gt;&lt;br /&gt;
You can find more information on the backup frequency and how to access them [[Backup|here]].&lt;br /&gt;
&lt;br /&gt;
== Useful links ==&lt;br /&gt;
[http://ganglia.iihe.ac.be/ganglia/  Ganglia Monitoring] : stats on all our servers.&amp;lt;br&amp;gt;&lt;br /&gt;
[http://status.iihe.ac.be Cluster Status] : current status of all T2B services. Check here before sending us an email. Please also consider registering to receive T2B issues and be informed when things are resolved.&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=RestoringCloudFrontendFromBackup&amp;diff=1395</id>
		<title>RestoringCloudFrontendFromBackup</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=RestoringCloudFrontendFromBackup&amp;diff=1395"/>
		<updated>2024-07-25T13:24:41Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Context =&lt;br /&gt;
In July 2024, we lost our OpenNebula frontend VM (cloud2) after an attempt to reboot it. It was hosted on domz02, a basic QEMU/KVM standalone hypervisor managed with libvirt. The problem seemed to be that the system could not find the partition table in the qcow2 image. As there was no backup of the OpenNebula data and config files, the only way to recover them was to attach the image to a new VM and recreate the partition table in the mounted image using the tool gpart. It eventually worked, but things would have been far easier if we had had a simple backup of the directories containing the important ONE data and configuration files of our cloud system. Above all, the procedure we followed to restore the machine was sketched out in an emergency, with no guarantee that it would succeed.&lt;br /&gt;
&lt;br /&gt;
= Data and configuration items to backup =&lt;br /&gt;
Here is a list of the important files/directories to backup:&lt;br /&gt;
* /var/lib/one&lt;br /&gt;
* /var/lib/mysql&lt;br /&gt;
* /etc/one&lt;br /&gt;
* /etc/my.cnf&lt;br /&gt;
* /etc/my.cnf.d&lt;br /&gt;
&lt;br /&gt;
You will find all of them in &#039;&#039;&#039;/backup_mnt/backup/BACKUPS/cloud2/&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
= Procedure to restore the frontend =&lt;br /&gt;
On the standalone hypervisor, create a new VM with the same hardware characteristics as the previous one (MAC address, disk size, memory, CPU, ...). The easiest way is to copy-paste the XML of the previous VM. Be aware that libvirt will complain that you are reusing the MAC address. The solution is simple: remove the NIC of the previous VM to free the MAC address. Here are some commands that might be useful for this step:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
virsh list&lt;br /&gt;
virsh edit &amp;lt;machine_name&amp;gt;&lt;br /&gt;
virsh create &amp;lt;xml_description_of_vm&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Of course, you can also use the GUI &#039;&#039;&#039;virt-manager&#039;&#039;&#039; for most tasks. Be especially careful with the drivers (it should be &#039;&#039;&#039;virtio&#039;&#039;&#039; for the NIC and the drive). Also double-check that the drive and the memory have the same sizes as on the previous VM. And of course, don&#039;t reuse the disk of the previous VM, you have to create a new one (it&#039;s easy to do from the &#039;&#039;&#039;virt-manager&#039;&#039;&#039; GUI).&lt;br /&gt;
&lt;br /&gt;
Once the VM is running, you&#039;ll have to reinstall the frontend on it with Quattor and Puppet. However, the VM must initially be reinstalled with machine-type &#039;puppet_node&#039;, with the Puppet app set to &#039;servers&#039; and the role to &#039;none&#039;. Why? Because if you directly reinstall the VM as a frontend, the initialization scripts that come with the ONE packages and with Puppet will generate new settings that might be tricky to overwrite with the backup. In other words, at the beginning of the restore process, the machine must be a vanilla one. Here is the vanilla profile used to reinstall the VM:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
object template cloud2.wn.iihe.ac.be;&lt;br /&gt;
include &#039;machine-types/puppet_node&#039;;&lt;br /&gt;
&lt;br /&gt;
# Mounting backup&lt;br /&gt;
include &#039;config/nfs/common&#039;;&lt;br /&gt;
include &#039;config/ceph/cephfs.backup&#039;;&lt;br /&gt;
&lt;br /&gt;
# software repositories (should be last)&lt;br /&gt;
include PKG_REPOSITORY_CONFIG;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
And in the Quattor file &#039;&#039;&#039;site/puppet/database&#039;&#039;&#039;, here is the setting for the hiera &#039;&#039;&#039;app&#039;&#039;&#039; and &#039;&#039;&#039;role&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &#039;cloud2.wn.iihe.ac.be&#039;, dict(&lt;br /&gt;
        &#039;environment&#039;, &#039;prod&#039;,&lt;br /&gt;
        &#039;app&#039;, &#039;servers&#039;,&lt;br /&gt;
        &#039;role&#039;, &#039;none&#039;,&lt;br /&gt;
        &#039;cloud&#039;, &#039;cloud2&#039;,&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Once these changes have been pushed to the Quattor repo, and before doing the aii-shellfe configure and install on the aii server, there are two things to do to avoid problems (note that those 2 steps are done when using &#039;&#039;&#039;quat -ri&#039;&#039;&#039;):&lt;br /&gt;
* revoke the SinDES certificate of the machine on the aii server (if you don&#039;t do that, no SinDES ACL will be created since there is already a valid certificate for the machine);&lt;br /&gt;
* revoke the Puppet certificate on the Puppet master machine.&lt;br /&gt;
&lt;br /&gt;
When the Quattor installation is finished, mount the &#039;&#039;&#039;backup&#039;&#039;&#039; share on the VM, then restore the files and directories with the command of your choice (in my case, I just used &#039;&#039;&#039;cp -ap&#039;&#039;&#039;). Of course, it is really important to preserve the permissions and ownership of the source files.&lt;br /&gt;
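As a minimal sketch (the helper name is ours, not part of the original procedure), the restore could look like this, with cp -ap preserving permissions and ownership:

```shell
# Restore backed-up items from the mounted backup share into a
# destination root, preserving permissions and ownership (cp -ap).
restore_from_backup() {
  local src="$1" dest="$2"
  shift 2
  for item in "$@"; do
    mkdir -p "$dest/$(dirname "$item")"
    cp -ap "$src/$item" "$dest/$(dirname "$item")/"
  done
}

# Usage on the frontend (items from the list above):
# restore_from_backup /backup_mnt/backup/BACKUPS/cloud2 / \
#     var/lib/one var/lib/mysql etc/one etc/my.cnf etc/my.cnf.d
```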
&lt;br /&gt;
With the data and configuration files restored, it is now time to turn the VM back into a good old OpenNebula frontend. Revert the profile to something like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
object template cloud2.wn.iihe.ac.be;&lt;br /&gt;
&lt;br /&gt;
include &#039;machine-types/puppet_node&#039;;&lt;br /&gt;
&lt;br /&gt;
variable ONE_RELEASE = &#039;6.6&#039;;&lt;br /&gt;
include &#039;features/one_frontend/light_config&#039;;&lt;br /&gt;
include &#039;features/one_frontend/one6.X/sunstone_apache_ssl&#039;;&lt;br /&gt;
&lt;br /&gt;
# Mounting backup&lt;br /&gt;
include &#039;config/nfs/common&#039;;&lt;br /&gt;
include &#039;config/ceph/cephfs.backup&#039;;&lt;br /&gt;
&lt;br /&gt;
# Making backup of everything that is needed&lt;br /&gt;
include &#039;components/cron/config&#039;;&lt;br /&gt;
&#039;/software/components/cron/entries&#039; = push(&lt;br /&gt;
    dict(&lt;br /&gt;
        &#039;name&#039;,       &#039;backup_config_db_one&#039;,&lt;br /&gt;
        &#039;user&#039;,       &#039;root&#039;,&lt;br /&gt;
        &#039;frequency&#039;,  &#039;30 */2 * * *&#039;,&lt;br /&gt;
        &#039;log&#039;, dict(&#039;disabled&#039;, false),&lt;br /&gt;
        &#039;command&#039;,    &#039;/usr/bin/rsync -avR /var/lib/one /var/lib/mysql /etc/one /etc/my.cnf /etc/my.cnf.d /backup_mnt/backup/BACKUPS/cloud2/.&#039;&lt;br /&gt;
    )&lt;br /&gt;
);&lt;br /&gt;
&lt;br /&gt;
# software repositories (should be last)&lt;br /&gt;
include PKG_REPOSITORY_CONFIG;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
and don&#039;t forget to update the Puppet database so that the correct hiera description is generated in the file &#039;&#039;&#039;/etc/puppetlabs/facter/facts.d/provisionning.yaml&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &#039;cloud2.wn.iihe.ac.be&#039;, dict(&lt;br /&gt;
        &#039;environment&#039;, &#039;prod&#039;,&lt;br /&gt;
        &#039;app&#039;, &#039;opennebula&#039;,&lt;br /&gt;
        &#039;role&#039;, &#039;frontend&#039;,&lt;br /&gt;
        &#039;cloud&#039;, &#039;cloud2&#039;,&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=RestoringCloudFrontendFromBackup&amp;diff=1394</id>
		<title>RestoringCloudFrontendFromBackup</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=RestoringCloudFrontendFromBackup&amp;diff=1394"/>
		<updated>2024-07-25T13:23:41Z</updated>

		<summary type="html">&lt;p&gt;Admin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Context =&lt;br /&gt;
In July 2024, we lost our OpenNebula frontend VM (cloud2) after an attempt to reboot it. It was hosted on domz02, a basic QEMU/KVM standalone hypervisor managed with libvirt. The problem seemed to be that the system could not find the partition table in the qcow2 image. As there was no backup of the OpenNebula data and config files, the only way to recover them was to attach the image to a new VM and recreate the partition table in the mounted image using the tool gpart. It eventually worked, but things would have been far easier if we had had a simple backup of the directories containing the important ONE data and configuration files of our cloud system. Above all, the procedure we followed to restore the machine was sketched out in an emergency, with no guarantee that it would succeed.&lt;br /&gt;
&lt;br /&gt;
= Data and configuration items to backup =&lt;br /&gt;
Here is a list of the important files/directories to backup:&lt;br /&gt;
* /var/lib/one&lt;br /&gt;
* /var/lib/mysql&lt;br /&gt;
* /etc/one&lt;br /&gt;
* /etc/my.cnf&lt;br /&gt;
* /etc/my.cnf.d&lt;br /&gt;
&lt;br /&gt;
= Procedure to restore the frontend =&lt;br /&gt;
On the standalone hypervisor, create a new VM with the same hardware characteristics as the previous one (MAC address, disk size, memory, CPU, ...). The easiest way is to copy-paste the XML of the previous VM. Note that libvirt will complain that you are reusing the MAC address. The solution is simple: remove the NIC from the previous VM to free the MAC address. Here are some commands that might be useful for this step:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
virsh list&lt;br /&gt;
virsh edit &amp;lt;machine_name&amp;gt;&lt;br /&gt;
virsh create &amp;lt;xml_description_of_vm&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
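To free the MAC address, the NIC of the defunct VM can also be removed with virsh; a minimal sketch (the domain name and MAC below are placeholders, not values from our setup):

```shell
# Remove the NIC holding the old MAC from the persistent config of the
# defunct VM, so the new VM can reuse the address (names are examples).
virsh detach-interface cloud2-old bridge --mac 52:54:00:aa:bb:cc --config
```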
You can also use the GUI &#039;&#039;&#039;virt-manager&#039;&#039;&#039; for most tasks. Be especially careful with the drivers (they should be &#039;&#039;&#039;virtio&#039;&#039;&#039; for both the NIC and the drive). Also double-check that the drive and the memory have the same sizes as on the previous VM. And of course, don&#039;t reuse the disk of the previous VM: you have to create a new one (which is easy to do from the &#039;&#039;&#039;virt-manager&#039;&#039;&#039; GUI).&lt;br /&gt;
&lt;br /&gt;
Once the VM is running, you&#039;ll have to reinstall the frontend on it with Quattor and Puppet. However, the VM must initially be reinstalled with machine-type &#039;puppet_node&#039;, the Puppet app set to &#039;servers&#039; and the role set to &#039;none&#039;. Why? Because if you directly reinstall the VM as a frontend, the initialization scripts that come with the ONE packages and with Puppet will generate new settings that might be tricky to overwrite with the backup. In other words, at the beginning of the restore process, the machine must be a vanilla one. Here is the vanilla profile used to reinstall the VM:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
object template cloud2.wn.iihe.ac.be;&lt;br /&gt;
include &#039;machine-types/puppet_node&#039;;&lt;br /&gt;
&lt;br /&gt;
# Mounting backup&lt;br /&gt;
include &#039;config/nfs/common&#039;;&lt;br /&gt;
include &#039;config/ceph/cephfs.backup&#039;;&lt;br /&gt;
&lt;br /&gt;
# software repositories (should be last)&lt;br /&gt;
include PKG_REPOSITORY_CONFIG;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
And in the Quattor file &#039;&#039;&#039;site/puppet/database&#039;&#039;&#039;, here is the setting for the hiera &#039;&#039;&#039;app&#039;&#039;&#039; and &#039;&#039;&#039;role&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &#039;cloud2.wn.iihe.ac.be&#039;, dict(&lt;br /&gt;
        &#039;environment&#039;, &#039;prod&#039;,&lt;br /&gt;
        &#039;app&#039;, &#039;servers&#039;,&lt;br /&gt;
        &#039;role&#039;, &#039;none&#039;,&lt;br /&gt;
        &#039;cloud&#039;, &#039;cloud2&#039;,&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Once these changes have been pushed to the Quattor repo, and before running the aii-shellfe configure and install on the aii server, there are two things to do to avoid problems (note that these two steps are done when using &#039;&#039;&#039;quat -ri&#039;&#039;&#039;):&lt;br /&gt;
* revoke the SinDES certificate of the machine on the aii server (if you don&#039;t do that, no SinDES ACL will be created since there is already a valid certificate for the machine);&lt;br /&gt;
* revoke the Puppet certificate on the Puppet master machine.&lt;br /&gt;
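On a recent Puppet server, the certificate revocation can be sketched with the built-in CA tooling (assuming the puppetserver CLI is in use; adapt to your Puppet version):

```shell
# On the Puppet master: revoke and remove the old certificate of the node,
# so that a fresh certificate is signed during the reinstall.
puppetserver ca clean --certname cloud2.wn.iihe.ac.be
```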
&lt;br /&gt;
When the Quattor installation is finished, you have to mount the &#039;&#039;&#039;backup&#039;&#039;&#039; share on the VM and then restore the files and directories with the command of your choice (in my case, I just used &#039;&#039;&#039;cp -ap&#039;&#039;&#039;). It is really important to preserve the permissions and ownership of the source files.&lt;br /&gt;
&lt;br /&gt;
With the data and configuration files restored, it is now time to switch the VM back to being a good old OpenNebula frontend. Revert the profile to something like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
object template cloud2.wn.iihe.ac.be;&lt;br /&gt;
&lt;br /&gt;
include &#039;machine-types/puppet_node&#039;;&lt;br /&gt;
&lt;br /&gt;
variable ONE_RELEASE = &#039;6.6&#039;;&lt;br /&gt;
include &#039;features/one_frontend/light_config&#039;;&lt;br /&gt;
include &#039;features/one_frontend/one6.X/sunstone_apache_ssl&#039;;&lt;br /&gt;
&lt;br /&gt;
# Mounting backup&lt;br /&gt;
include &#039;config/nfs/common&#039;;&lt;br /&gt;
include &#039;config/ceph/cephfs.backup&#039;;&lt;br /&gt;
&lt;br /&gt;
# Making backup of everything that is needed&lt;br /&gt;
include &#039;components/cron/config&#039;;&lt;br /&gt;
&#039;/software/components/cron/entries&#039; = push(&lt;br /&gt;
    dict(&lt;br /&gt;
        &#039;name&#039;,       &#039;backup_config_db_one&#039;,&lt;br /&gt;
        &#039;user&#039;,       &#039;root&#039;,&lt;br /&gt;
        &#039;frequency&#039;,  &#039;30 */2 * * *&#039;,&lt;br /&gt;
        &#039;log&#039;, dict(&#039;disabled&#039;, false),&lt;br /&gt;
        &#039;command&#039;,    &#039;/usr/bin/rsync -avR /var/lib/one /var/lib/mysql /etc/one /etc/my.cnf /etc/my.cnf.d /backup_mnt/backup/BACKUPS/cloud2/.&#039;&lt;br /&gt;
    )&lt;br /&gt;
);&lt;br /&gt;
&lt;br /&gt;
# software repositories (should be last)&lt;br /&gt;
include PKG_REPOSITORY_CONFIG;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
and don&#039;t forget to update the Puppet database so that the correct hiera description is generated in the file &#039;&#039;&#039;/etc/puppetlabs/facter/facts.d/provisionning.yaml&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &#039;cloud2.wn.iihe.ac.be&#039;, dict(&lt;br /&gt;
        &#039;environment&#039;, &#039;prod&#039;,&lt;br /&gt;
        &#039;app&#039;, &#039;opennebula&#039;,&lt;br /&gt;
        &#039;role&#039;, &#039;frontend&#039;,&lt;br /&gt;
        &#039;cloud&#039;, &#039;cloud2&#039;,&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=RestoringCloudFrontendFromBackup&amp;diff=1393</id>
		<title>RestoringCloudFrontendFromBackup</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=RestoringCloudFrontendFromBackup&amp;diff=1393"/>
		<updated>2024-07-25T13:22:50Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Procedure to restore the frontend */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Context =&lt;br /&gt;
In July 2024, we lost our OpenNebula frontend VM (cloud2) after an attempt to reboot it. It was hosted on domz02, a basic standalone QEMU/KVM hypervisor managed with libvirt. The problem seemed to be that the system could not find the partition table in the qcow2 image. As there was no backup of the OpenNebula data and config files, the only way to recover them was to attach the image to a new VM and recreate the partition table in the mounted image using the tool gpart. It eventually worked, but things would have been far easier if we had had a simple backup of the directories containing the important ONE data and configuration files of our cloud system. Above all, the procedure we followed to restore the machine was improvised in an emergency, with no guarantee that it would succeed.&lt;br /&gt;
&lt;br /&gt;
= Data and configuration items to backup =&lt;br /&gt;
Here is a list of the important files/directories to backup:&lt;br /&gt;
* /var/lib/one&lt;br /&gt;
* /var/lib/mysql&lt;br /&gt;
* /etc/one&lt;br /&gt;
* /etc/my.cnf&lt;br /&gt;
* /etc/my.cnf.d&lt;br /&gt;
&lt;br /&gt;
= Procedure to restore the frontend =&lt;br /&gt;
On the standalone hypervisor, create a new VM with the same hardware characteristics as the previous one (MAC address, disk size, memory, CPU, ...). The easiest way is to copy-paste the XML of the previous VM. Note that libvirt will complain that you are reusing the MAC address. The solution is simple: remove the NIC from the previous VM to free the MAC address. Here are some commands that might be useful for this step:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
virsh list&lt;br /&gt;
virsh edit &amp;lt;machine_name&amp;gt;&lt;br /&gt;
virsh create &amp;lt;xml_description_of_vm&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
You can also use the GUI &#039;&#039;&#039;virt-manager&#039;&#039;&#039; for most tasks. Be especially careful with the drivers (they should be &#039;&#039;&#039;virtio&#039;&#039;&#039; for both the NIC and the drive). Also double-check that the drive and the memory have the same sizes as on the previous VM. And of course, don&#039;t reuse the disk of the previous VM: you have to create a new one (which is easy to do from the &#039;&#039;&#039;virt-manager&#039;&#039;&#039; GUI).&lt;br /&gt;
&lt;br /&gt;
Once the VM is running, you&#039;ll have to reinstall the frontend on it with Quattor and Puppet. However, the VM must initially be reinstalled with machine-type &#039;puppet_node&#039;, the Puppet app set to &#039;servers&#039; and the role set to &#039;none&#039;. Why? Because if you directly reinstall the VM as a frontend, the initialization scripts that come with the ONE packages and with Puppet will generate new settings that might be tricky to overwrite with the backup. In other words, at the beginning of the restore process, the machine must be a vanilla one. Here is the vanilla profile used to reinstall the VM:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
object template cloud2.wn.iihe.ac.be;&lt;br /&gt;
include &#039;machine-types/puppet_node&#039;;&lt;br /&gt;
&lt;br /&gt;
# Mounting backup&lt;br /&gt;
include &#039;config/nfs/common&#039;;&lt;br /&gt;
include &#039;config/ceph/cephfs.backup&#039;;&lt;br /&gt;
&lt;br /&gt;
# software repositories (should be last)&lt;br /&gt;
include PKG_REPOSITORY_CONFIG;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
And in the Quattor file &#039;&#039;&#039;site/puppet/database&#039;&#039;&#039;, here is the setting for the hiera &#039;&#039;&#039;app&#039;&#039;&#039; and &#039;&#039;&#039;role&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &#039;cloud2.wn.iihe.ac.be&#039;, dict(&lt;br /&gt;
        &#039;environment&#039;, &#039;prod&#039;,&lt;br /&gt;
        &#039;app&#039;, &#039;servers&#039;,&lt;br /&gt;
        &#039;role&#039;, &#039;none&#039;,&lt;br /&gt;
        &#039;cloud&#039;, &#039;cloud2&#039;,&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Once these changes have been pushed to the Quattor repo, and before running the aii-shellfe configure and install on the aii server, there are two things to do to avoid problems (note that these two steps are done when using quat -ri):&lt;br /&gt;
* revoke the SinDES certificate of the machine on the aii server (if you don&#039;t do that, no SinDES ACL will be created since there is already a valid certificate for the machine);&lt;br /&gt;
* revoke the Puppet certificate on the Puppet master machine.&lt;br /&gt;
&lt;br /&gt;
When the Quattor installation is finished, you have to mount the &#039;&#039;&#039;backup&#039;&#039;&#039; share on the VM and then restore the files and directories with the command of your choice (in my case, I just used &#039;&#039;&#039;cp -ap&#039;&#039;&#039;). It is really important to preserve the permissions and ownership of the source files.&lt;br /&gt;
&lt;br /&gt;
With the data and configuration files restored, it is now time to switch the VM back to being a good old OpenNebula frontend. Revert the profile to something like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
object template cloud2.wn.iihe.ac.be;&lt;br /&gt;
&lt;br /&gt;
include &#039;machine-types/puppet_node&#039;;&lt;br /&gt;
&lt;br /&gt;
variable ONE_RELEASE = &#039;6.6&#039;;&lt;br /&gt;
include &#039;features/one_frontend/light_config&#039;;&lt;br /&gt;
include &#039;features/one_frontend/one6.X/sunstone_apache_ssl&#039;;&lt;br /&gt;
&lt;br /&gt;
# Mounting backup&lt;br /&gt;
include &#039;config/nfs/common&#039;;&lt;br /&gt;
include &#039;config/ceph/cephfs.backup&#039;;&lt;br /&gt;
&lt;br /&gt;
# Making backup of everything that is needed&lt;br /&gt;
include &#039;components/cron/config&#039;;&lt;br /&gt;
&#039;/software/components/cron/entries&#039; = push(&lt;br /&gt;
    dict(&lt;br /&gt;
        &#039;name&#039;,       &#039;backup_config_db_one&#039;,&lt;br /&gt;
        &#039;user&#039;,       &#039;root&#039;,&lt;br /&gt;
        &#039;frequency&#039;,  &#039;30 */2 * * *&#039;,&lt;br /&gt;
        &#039;log&#039;, dict(&#039;disabled&#039;, false),&lt;br /&gt;
        &#039;command&#039;,    &#039;/usr/bin/rsync -avR /var/lib/one /var/lib/mysql /etc/one /etc/my.cnf /etc/my.cnf.d /backup_mnt/backup/BACKUPS/cloud2/.&#039;&lt;br /&gt;
    )&lt;br /&gt;
);&lt;br /&gt;
&lt;br /&gt;
# software repositories (should be last)&lt;br /&gt;
include PKG_REPOSITORY_CONFIG;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
and don&#039;t forget to update the Puppet database so that the correct hiera description is generated in the file &#039;&#039;&#039;/etc/puppetlabs/facter/facts.d/provisionning.yaml&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &#039;cloud2.wn.iihe.ac.be&#039;, dict(&lt;br /&gt;
        &#039;environment&#039;, &#039;prod&#039;,&lt;br /&gt;
        &#039;app&#039;, &#039;opennebula&#039;,&lt;br /&gt;
        &#039;role&#039;, &#039;frontend&#039;,&lt;br /&gt;
        &#039;cloud&#039;, &#039;cloud2&#039;,&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=RestoringCloudFrontendFromBackup&amp;diff=1392</id>
		<title>RestoringCloudFrontendFromBackup</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=RestoringCloudFrontendFromBackup&amp;diff=1392"/>
		<updated>2024-07-25T13:18:07Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Procedure to restore the frontend */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Context =&lt;br /&gt;
In July 2024, we lost our OpenNebula frontend VM (cloud2) after an attempt to reboot it. It was hosted on domz02, a basic standalone QEMU/KVM hypervisor managed with libvirt. The problem seemed to be that the system could not find the partition table in the qcow2 image. As there was no backup of the OpenNebula data and config files, the only way to recover them was to attach the image to a new VM and recreate the partition table in the mounted image using the tool gpart. It eventually worked, but things would have been far easier if we had had a simple backup of the directories containing the important ONE data and configuration files of our cloud system. Above all, the procedure we followed to restore the machine was improvised in an emergency, with no guarantee that it would succeed.&lt;br /&gt;
&lt;br /&gt;
= Data and configuration items to backup =&lt;br /&gt;
Here is a list of the important files/directories to backup:&lt;br /&gt;
* /var/lib/one&lt;br /&gt;
* /var/lib/mysql&lt;br /&gt;
* /etc/one&lt;br /&gt;
* /etc/my.cnf&lt;br /&gt;
* /etc/my.cnf.d&lt;br /&gt;
&lt;br /&gt;
= Procedure to restore the frontend =&lt;br /&gt;
On the standalone hypervisor, create a new VM with the same hardware characteristics as the previous one (MAC address, disk size, memory, CPU, ...). The easiest way is to copy-paste the XML of the previous VM. Note that libvirt will complain that you are reusing the MAC address. The solution is simple: remove the NIC from the previous VM to free the MAC address. Here are some commands that might be useful for this step:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
virsh list&lt;br /&gt;
virsh edit &amp;lt;machine_name&amp;gt;&lt;br /&gt;
virsh create &amp;lt;xml_description_of_vm&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
You can also use the GUI &#039;&#039;&#039;virt-manager&#039;&#039;&#039; for most tasks. Be especially careful with the drivers (they should be &#039;&#039;&#039;virtio&#039;&#039;&#039; for both the NIC and the drive). Also double-check that the drive and the memory have the same sizes as on the previous VM. And of course, don&#039;t reuse the disk of the previous VM: you have to create a new one (which is easy to do from the &#039;&#039;&#039;virt-manager&#039;&#039;&#039; GUI).&lt;br /&gt;
&lt;br /&gt;
Once the VM is running, you&#039;ll have to reinstall the frontend on it with Quattor and Puppet. However, the VM must initially be reinstalled with machine-type &#039;puppet_node&#039;, the Puppet app set to &#039;servers&#039; and the role set to &#039;none&#039;. Why? Because if you directly reinstall the VM as a frontend, the initialization scripts that come with the ONE packages and with Puppet will generate new settings that might be tricky to overwrite with the backup. In other words, at the beginning of the restore process, the machine must be a vanilla one. Here is the vanilla profile used to reinstall the VM:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
object template cloud2.wn.iihe.ac.be;&lt;br /&gt;
include &#039;machine-types/puppet_node&#039;;&lt;br /&gt;
# software repositories (should be last)&lt;br /&gt;
include PKG_REPOSITORY_CONFIG;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
And in the Quattor file &#039;&#039;&#039;site/puppet/database&#039;&#039;&#039;, here is the setting for the hiera &#039;&#039;&#039;app&#039;&#039;&#039; and &#039;&#039;&#039;role&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &#039;cloud2.wn.iihe.ac.be&#039;, dict(&lt;br /&gt;
        &#039;environment&#039;, &#039;prod&#039;,&lt;br /&gt;
        &#039;app&#039;, &#039;servers&#039;,&lt;br /&gt;
        &#039;role&#039;, &#039;none&#039;,&lt;br /&gt;
        &#039;cloud&#039;, &#039;cloud2&#039;,&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Once these changes have been pushed to the Quattor repo, and before running the aii-shellfe configure and install on the aii server, there are two things to do to avoid problems:&lt;br /&gt;
* revoke the SinDES certificate of the machine on the aii server (if you don&#039;t do that, no SinDES ACL will be created since there is already a valid certificate for the machine);&lt;br /&gt;
* revoke the Puppet certificate on the Puppet master machine.&lt;br /&gt;
&lt;br /&gt;
When the Quattor installation is finished, you have to mount the &#039;&#039;&#039;backup&#039;&#039;&#039; share on the VM and then restore the files and directories with the command of your choice (in my case, I just used &#039;&#039;&#039;cp -ap&#039;&#039;&#039;). It is really important to preserve the permissions and ownership of the source files.&lt;br /&gt;
&lt;br /&gt;
With the data and configuration files restored, it is now time to switch the VM back to being a good old OpenNebula frontend. Revert the profile to something like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
object template cloud2.wn.iihe.ac.be;&lt;br /&gt;
include &#039;machine-types/puppet_node&#039;;&lt;br /&gt;
variable ONE_RELEASE = &#039;6.6&#039;;&lt;br /&gt;
include &#039;features/one_frontend/light_config&#039;;&lt;br /&gt;
include &#039;features/one_frontend/one6.X/sunstone_apache_ssl&#039;;&lt;br /&gt;
# software repositories (should be last)&lt;br /&gt;
include PKG_REPOSITORY_CONFIG;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
and don&#039;t forget to update the Puppet database so that the correct hiera description is generated in the file &#039;&#039;&#039;/etc/puppetlabs/facter/facts.d/provisionning.yaml&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &#039;cloud2.wn.iihe.ac.be&#039;, dict(&lt;br /&gt;
        &#039;environment&#039;, &#039;prod&#039;,&lt;br /&gt;
        &#039;app&#039;, &#039;opennebula&#039;,&lt;br /&gt;
        &#039;role&#039;, &#039;frontend&#039;,&lt;br /&gt;
        &#039;cloud&#039;, &#039;cloud2&#039;,&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=RestoringCloudFrontendFromBackup&amp;diff=1391</id>
		<title>RestoringCloudFrontendFromBackup</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=RestoringCloudFrontendFromBackup&amp;diff=1391"/>
		<updated>2024-07-25T13:11:19Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Procedure to restore the backup */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Context =&lt;br /&gt;
In July 2024, we lost our OpenNebula frontend VM (cloud2) after an attempt to reboot it. It was hosted on domz02, a basic standalone QEMU/KVM hypervisor managed with libvirt. The problem seemed to be that the system could not find the partition table in the qcow2 image. As there was no backup of the OpenNebula data and config files, the only way to recover them was to attach the image to a new VM and recreate the partition table in the mounted image using the tool gpart. It eventually worked, but things would have been far easier if we had had a simple backup of the directories containing the important ONE data and configuration files of our cloud system. Above all, the procedure we followed to restore the machine was improvised in an emergency, with no guarantee that it would succeed.&lt;br /&gt;
&lt;br /&gt;
= Data and configuration items to backup =&lt;br /&gt;
Here is a list of the important files/directories to backup:&lt;br /&gt;
* /var/lib/one&lt;br /&gt;
* /var/lib/mysql&lt;br /&gt;
* /etc/one&lt;br /&gt;
* /etc/my.cnf&lt;br /&gt;
* /etc/my.cnf.d&lt;br /&gt;
&lt;br /&gt;
= Procedure to restore the frontend =&lt;br /&gt;
On the standalone hypervisor, create a new VM with the same hardware characteristics as the previous one (MAC address, disk size, memory, CPU, ...). The easiest way is to copy-paste the XML of the previous VM. Note that libvirt will complain that you are reusing the MAC address. The solution is simple: remove the NIC from the previous VM to free the MAC address. Here are some commands that might be useful for this step:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
virsh list&lt;br /&gt;
virsh edit &amp;lt;machine_name&amp;gt;&lt;br /&gt;
virsh create &amp;lt;xml_description_of_vm&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
You can also use the GUI &#039;&#039;&#039;virt-manager&#039;&#039;&#039; for most tasks. Be especially careful with the drivers (they should be &#039;&#039;&#039;virtio&#039;&#039;&#039; for both the NIC and the drive). Also double-check that the drive and the memory have the same sizes as on the previous VM. And of course, don&#039;t reuse the disk of the previous VM: you have to create a new one (which is easy to do from the &#039;&#039;&#039;virt-manager&#039;&#039;&#039; GUI).&lt;br /&gt;
&lt;br /&gt;
Once the VM is running, you&#039;ll have to reinstall the frontend on it with Quattor and Puppet. However, the VM must initially be reinstalled with machine-type &#039;puppet_node&#039;, the Puppet app set to &#039;servers&#039; and the role set to &#039;none&#039;. Why? Because if you directly reinstall the VM as a frontend, the initialization scripts that come with the ONE packages and with Puppet will generate new settings that might be tricky to overwrite with the backup. In other words, at the beginning of the restore process, the machine must be a vanilla one. Here is the vanilla profile used to reinstall the VM:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
object template cloud2.wn.iihe.ac.be;&lt;br /&gt;
include &#039;machine-types/puppet_node&#039;;&lt;br /&gt;
# software repositories (should be last)&lt;br /&gt;
include PKG_REPOSITORY_CONFIG;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
And in the Quattor file &#039;&#039;&#039;site/puppet/database&#039;&#039;&#039;, here is the setting for the hiera &#039;&#039;&#039;app&#039;&#039;&#039; and &#039;&#039;&#039;role&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &#039;cloud2.wn.iihe.ac.be&#039;, dict(&lt;br /&gt;
        &#039;environment&#039;, &#039;prod&#039;,&lt;br /&gt;
        &#039;app&#039;, &#039;servers&#039;,&lt;br /&gt;
        &#039;role&#039;, &#039;none&#039;,&lt;br /&gt;
        &#039;cloud&#039;, &#039;cloud2&#039;,&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Once these changes have been pushed to the Quattor repo, and before running the aii-shellfe configure and install on the aii server, there are two things to do to avoid problems:&lt;br /&gt;
* revoke the SinDES certificate of the machine on the aii server (if you don&#039;t do that, no SinDES ACL will be created since there is already a valid certificate for the machine);&lt;br /&gt;
* revoke the Puppet certificate on the Puppet master machine.&lt;br /&gt;
&lt;br /&gt;
When the Quattor installation is finished, you have to mount the &#039;&#039;&#039;backup&#039;&#039;&#039; share on the VM and then restore the files and directories with the command of your choice (in my case, I just used &#039;&#039;&#039;cp -ap&#039;&#039;&#039;). It is really important to preserve the permissions and ownership of the source files.&lt;br /&gt;
&lt;br /&gt;
With the data and configuration files restored, it is now time to switch the VM back to being a good old OpenNebula frontend. Revert the profile to something like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
object template cloud2.wn.iihe.ac.be;&lt;br /&gt;
include &#039;machine-types/puppet_node&#039;;&lt;br /&gt;
variable ONE_RELEASE = &#039;6.6&#039;;&lt;br /&gt;
include &#039;features/one_frontend/light_config&#039;;&lt;br /&gt;
include &#039;features/one_frontend/one6.X/sunstone_apache_ssl&#039;;&lt;br /&gt;
# software repositories (should be last)&lt;br /&gt;
include PKG_REPOSITORY_CONFIG;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
and don&#039;t forget to update the Puppet database so that the correct hiera description is generated in the file &#039;&#039;&#039;/etc/puppetlabs/facter/facts.d/provisionning.yaml&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &#039;cloud2.wn.iihe.ac.be&#039;, dict(&lt;br /&gt;
        &#039;environment&#039;, &#039;prod&#039;,&lt;br /&gt;
        &#039;app&#039;, &#039;opennebula&#039;,&lt;br /&gt;
        &#039;role&#039;, &#039;frontend&#039;,&lt;br /&gt;
        &#039;cloud&#039;, &#039;cloud2&#039;,&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://t2bwiki.iihe.ac.be/index.php?title=RestoringCloudFrontendFromBackup&amp;diff=1390</id>
		<title>RestoringCloudFrontendFromBackup</title>
		<link rel="alternate" type="text/html" href="https://t2bwiki.iihe.ac.be/index.php?title=RestoringCloudFrontendFromBackup&amp;diff=1390"/>
		<updated>2024-07-25T13:00:40Z</updated>

		<summary type="html">&lt;p&gt;Admin: /* Procedure to restore the backup */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Context =&lt;br /&gt;
In July 2024, we lost our OpenNebula frontend VM (cloud2) after an attempt to reboot it. It was hosted on domz02, a basic standalone QEMU/KVM hypervisor managed with libvirt. The problem seemed to be that the system could not find the partition table in the qcow2 image. As there was no backup of the OpenNebula data and config files, the only way to recover them was to attach the image to a new VM and recreate the partition table in the mounted image using the tool gpart. It eventually worked, but things would have been far easier if we had had a simple backup of the directories containing the important ONE data and configuration files of our cloud system. Above all, the procedure we followed to restore the machine was improvised in an emergency, with no guarantee that it would succeed.&lt;br /&gt;
&lt;br /&gt;
= Data and configuration items to backup =&lt;br /&gt;
Here is a list of the important files/directories to backup:&lt;br /&gt;
* /var/lib/one&lt;br /&gt;
* /var/lib/mysql&lt;br /&gt;
* /etc/one&lt;br /&gt;
* /etc/my.cnf&lt;br /&gt;
* /etc/my.cnf.d&lt;br /&gt;
&lt;br /&gt;
= Procedure to restore the backup =&lt;br /&gt;
On the standalone hypervisor, create a new VM with the same hardware characteristics as the previous one (MAC address, disk size, memory, CPU, ...). The easiest way is to copy-paste the XML of the previous VM. Note that libvirt will complain that you are reusing the MAC address. The solution is simple: remove the NIC from the previous VM to free the MAC address. Here are some commands that might be useful for this step:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
virsh list&lt;br /&gt;
virsh edit &amp;lt;machine_name&amp;gt;&lt;br /&gt;
virsh create &amp;lt;xml_description_of_vm&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
You can also use the GUI &#039;&#039;&#039;virt-manager&#039;&#039;&#039; for most tasks. Be especially careful with the drivers (they should be &#039;&#039;&#039;virtio&#039;&#039;&#039; for both the NIC and the drive). Also double-check that the drive and the memory have the same sizes as on the previous VM. And of course, don&#039;t reuse the disk of the previous VM: you have to create a new one (which is easy to do from the &#039;&#039;&#039;virt-manager&#039;&#039;&#039; GUI).&lt;br /&gt;
&lt;br /&gt;
Once the VM is running, you&#039;ll have to reinstall the frontend on it with Quattor and Puppet. However, the VM must initially be reinstalled with machine-type &#039;puppet_node&#039;, the Puppet app set to &#039;servers&#039; and the role set to &#039;none&#039;. Why? Because if you directly reinstall the VM as a frontend, the initialization scripts that come with the ONE packages and with Puppet will generate new settings that might be tricky to overwrite with the backup. In other words, at the beginning of the restore process, the machine must be a vanilla one. Here is the vanilla profile used to reinstall the VM:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
object template cloud2.wn.iihe.ac.be;&lt;br /&gt;
include &#039;machine-types/puppet_node&#039;;&lt;br /&gt;
# software repositories (should be last)&lt;br /&gt;
include PKG_REPOSITORY_CONFIG;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
And in the Quattor file &#039;&#039;&#039;site/puppet/database&#039;&#039;&#039;, here is the setting for the hiera &#039;&#039;&#039;app&#039;&#039;&#039; and &#039;&#039;&#039;role&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
    &#039;cloud2.wn.iihe.ac.be&#039;, dict(&lt;br /&gt;
        &#039;environment&#039;, &#039;prod&#039;,&lt;br /&gt;
        &#039;app&#039;, &#039;servers&#039;,&lt;br /&gt;
        &#039;role&#039;, &#039;none&#039;,&lt;br /&gt;
        &#039;cloud&#039;, &#039;cloud2&#039;,&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Once these changes have been pushed to the Quattor repo, and before running the aii-shellfe configure and install on the aii server, there are two things to do to avoid problems:&lt;br /&gt;
* revoke the SinDES certificate of the machine on the aii server (if you don&#039;t do that, no SinDES ACL will be created since there is already a valid certificate for the machine);&lt;br /&gt;
* revoke the Puppet certificate on the Puppet master machine.&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
</feed>