Cluster Overview
Overview
The cluster is composed of 3 groups of machines:
- The User Interfaces (UI)
- These are the cluster front-end: to use the cluster, you need to log into one of these machines.
- Servers : mshort [ m2 , m3 ] , mlong [ m6 , m7 ]
- The File Server provides the user home on the UIs. It is a highly efficient & redundant storage node of ~120 TB capacity with regular backups.
- The Computing Machines :
- The Computing Element (CE): This server is the brain of the batch system: it manages all the submitted jobs and sends them to the worker nodes.
- Servers : testumd-htcondorce (temporary)
- The Worker Nodes (WN): This is the power of the cluster : they run multiple jobs in parallel and send the results & status back to the CE.
- Servers : nodeXX-YY
- The Mass Storage
- The Storage Element: it is the brain of the cluster storage. Grid accessible, it knows where all the files are, and manages all the storage nodes.
- Server : maite
- The Storage Nodes: This is the memory of the cluster : they contain big data files. In total, they provide ~5100 TB of grid-accessible storage.
- Servers : beharXXX
How to Connect
To connect to the cluster, you need to have sent us your public ssh key. In a terminal, type the following:
ssh -X -o ServerAliveInterval=100 username@mshort.iihe.ac.be
- Tip: the -o ServerAliveInterval=100 option makes your SSH client send a keep-alive message to the server after every 100 seconds of inactivity, which keeps your session alive for a long time. You should not be disconnected during a whole day of work.
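If you connect often, the same options can live in your SSH client configuration instead of on the command line. The sketch below prints a `~/.ssh/config` stanza equivalent to the command above; the helper name `t2b_ssh_stanza` and the `mshort` host alias are only illustrative, and `username` stands for your own account.

```shell
# Hypothetical helper: prints an ~/.ssh/config stanza matching
#   ssh -X -o ServerAliveInterval=100 username@mshort.iihe.ac.be
# ForwardX11 is the config-file equivalent of the -X flag.
t2b_ssh_stanza() {
  cat <<EOF
Host mshort
    HostName mshort.iihe.ac.be
    User $1
    ForwardX11 yes
    ServerAliveInterval 100
EOF
}

# Review the output first, then append it yourself, for example:
#   t2b_ssh_stanza username >> ~/.ssh/config
#   ssh mshort
```

With the stanza in place, `ssh mshort` should behave like the full command line.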
After a successful login, you'll see this message :
! Welcome to the T2B Cluster !
___________________________________________________________________________
Wiki: https://t2bwiki.iihe.ac.be Chat: https://chat.iihe.ac.be Mail: grid_admin@listserv.vub.ac.be
___________________________________________________________________________
[/pnfs ] => 101 GB [08/12/2021] [/user + /group ] => 45% used (271G left)
___________________________________________________________________________
Welcome on [m7] ! You have 3600s (1 hours) of cpu time per processes. There are 2 users here | Load: 7.56 /4 CPUs (189%) | Mem: 16% used
! BEWARE, YOU ARE ON A NEW UI, USING EL7 & HTCONDOR !
Please observe all the information in this message:
- The wiki link, where you should go first to find information.
- The chat link, where you can easily contact us for fast exchanges. IIHE users can use their intranet account, others can just create an account.
- The email used for the cluster support (please use this one rather than personal mail; this way everyone on the support team can answer and track the progress).
- The space used on the mass storage /pnfs, where storing a few TB is no problem. No hard limits are applied, but please contact us if you plan to go over 20 TB!
- The quota used on /user (and /group). Here a hard limit is applied, so if you are at 100%, you will have many problems. Clean your space, and if you really need more, contact us.
- The CPU time limit imposed per process, as we divided our UIs into 2 groups.
- The light task UIs (max CPU time = 20 minutes): they are used for crab/local job submission, writing code, building, debugging, etc.
mshort.iihe.ac.be : m2.iihe.ac.be, m3.iihe.ac.be
- The CPU-intensive UIs (max CPU time = 1 hour) : they are available for CPU-intensive and testing tasks/workflows, although you should prefer using local job submission ...
mlong.iihe.ac.be : m6.iihe.ac.be, m7.iihe.ac.be
- Information about how heavily this UI is used. If any of the values is red (i.e. above optimal usage), please consider using another UI. Please be mindful of other users and don't start too many processes, especially if the UI is already under heavy load.
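The load percentage in the banner appears to be the load average divided by the number of CPUs (7.56 / 4 gives 189%). A minimal sketch of the same check from the command line, assuming a Linux UI where /proc/loadavg and nproc are available; the function name `load_percent` is only illustrative:

```shell
# load_percent LOAD NCPUS: prints the load as a rounded percentage of the CPU count.
load_percent() {
  awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.0f\n", 100 * load / cpus }'
}

# On Linux, the 1-minute load average is the first field of /proc/loadavg,
# and nproc (GNU coreutils) reports the number of CPUs available to you.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
  read -r load1 _ < /proc/loadavg
  echo "Load: ${load1} / $(nproc) CPUs ($(load_percent "$load1" "$(nproc)")%)"
fi
```

For instance, `load_percent 7.56 4` prints 189, the value shown in the example banner. Anything above 100 means more runnable processes than CPUs, which is a good hint to pick another UI.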
Data Storage & Directory Structure
There are 2 main directories to store your work and data:
- /user [/$USER] : this is your home directory. You have an enforced quota there, as it is an expensive storage with redundancy and daily backups (see below).
- /pnfs [/iihe/MYEXP/store/user/$USER] : this is where you can store a large amount of data, and it is also grid-accessible. If you need more than a few TB, please contact us. There are no backups there, so be careful what you do!
There are other directories that you might want to take notice of:
- /group : same as /user , but for sharing or producing data within a group.
- /scratch : a temporary scratch space for your job. Use $TMPDIR on the WNs, it is cleaned after each job :)
- /cvmfs : Centralised CVMFS software repository. It should contain most of the software you will need for your experiment. Find here how to get a coherent environment for most tools you will need.
- /software : local area for shared software not in /cvmfs . You can use a nice tool to find the software and versions available.
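As an illustration of how these areas fit together, a job can do its work in $TMPDIR and copy only the final results to permanent storage. The sketch below is a generic pattern, not a site-provided script: the function name `run_in_scratch`, the `result.txt` file and the fallback to /tmp outside of a job are all assumptions, and whether a plain `cp` is the right way to write to your destination (for instance under /pnfs) should be checked on the wiki.

```shell
# run_in_scratch OUTPUT_DIR: work inside $TMPDIR, then copy the result to OUTPUT_DIR.
run_in_scratch() {
  outdir="$1"
  # Assumption: fall back to /tmp when TMPDIR is not set (e.g. outside of a job).
  workdir="${TMPDIR:-/tmp}/job_work.$$"
  mkdir -p "$workdir" "$outdir" || return 1
  (
    cd "$workdir" || exit 1
    # Placeholder payload: replace this line with your real computation.
    echo "result produced on $(uname -n)" > result.txt
  ) || return 1
  # Keep only what you need: copy the final output, then clean the scratch area.
  cp "$workdir/result.txt" "$outdir/" || return 1
  rm -rf "$workdir"
}

# Example call (the destination is a placeholder):
#   run_in_scratch "/user/$USER/my_results"
```

Writing intermediate files to scratch rather than to /user keeps your quota for the files you actually want to keep.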
Batch System
The cluster is based on HTCondor (also used at CERN or Wisconsin for instance). Please follow this page for details on how to use it.
Queues
Description | HTCondor batch resources |
---|---|
# CPUs (Jobs) | 10700 |
Walltime limit | 168 hours = 1 week |
Preferred Memory per job | 2 GB |
$TMPDIR/scratch max usable space | 10-20 GB |
Max # jobs sent to the batch system / User | theoretically none (contact us if you plan on sending more than 10 000) |
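To make the numbers above concrete, here is a hedged sketch of a generic HTCondor submit description that requests the preferred 2 GB of memory. It is not taken from the T2B documentation: the helper name `write_submit_file`, the script name `my_job.sh` and the output file names are placeholders, and the site may require additional settings, so treat the dedicated HTCondor page as the authority.

```shell
# write_submit_file PATH: writes a minimal, generic HTCondor submit description to PATH.
write_submit_file() {
  cat > "$1" <<'EOF'
# my_job.sh is a placeholder for your own executable script.
executable     = my_job.sh
arguments      = $(ProcId)
output         = job.$(ClusterId).$(ProcId).out
error          = job.$(ClusterId).$(ProcId).err
log            = job.$(ClusterId).log
# request_memory is expressed in MiB when no unit is given: 2048 MiB = 2 GiB.
request_memory = 2048
queue 1
EOF
}

# Typical usage from a UI (condor_submit and condor_q are standard HTCondor commands):
#   write_submit_file job.sub
#   condor_submit job.sub
#   condor_q
```

Increasing the number after `queue` submits several copies of the job, each with its own `$(ProcId)`.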
Backup
There are several areas that we regularly back up: /user , /group , /data , /ice3.
You can find more information on the backup frequency and how to access them here.
Useful links
Ganglia Monitoring : status of all our servers.
Cluster Status : current status of all T2B services. Check here before sending us an email. Please also consider registering to receive T2B issues and be informed when things are resolved.