Cluster Overview: Difference between revisions

Revision as of 15:09, 16 November 2017

Overview

The cluster is composed of 3 groups of machines :

The User Interfaces (UI)

This is the cluster front-end : to use the cluster, you need to log into these machines.

Servers : mshort [ m0 , m1 , m2 , m3 ] , mlong [ m5 , m6 , m7 , m8 , m9 ]

The File Server provides the user home on the UIs. It is a highly efficient & redundant storage node of ~70 TB capacity with regular backups.

The Computing Machines :
- The Computing Element (CE): This server is the brain of the batch system : it manages all the submitted jobs and sends them to the worker nodes.

Servers : cream02

The Worker Nodes (WN): This is the power of the cluster : they run multiple jobs in parallel and send the results & status back to the CE.

Servers : nodeXX-YY

The Mass Storage
- The Storage Element: it is the brain of the cluster storage. Grid accessible, it knows where all the files are, and manages all the storage nodes.

Server : maite

The Storage Nodes: This is the memory of the cluster : they contain big data files. In total, they provide ~2300 TB of grid-accessible storage.

Servers : beharXXX

How to Connect

To connect to the cluster, you need to have sent us your public ssh key. In a terminal, type the following:

ssh -X -o ServerAliveInterval=100 username@mshort.iihe.ac.be

Tip: with the -o ServerAliveInterval=100 option, ssh sends a keep-alive message to the server whenever nothing has been received from it for 100 seconds, so an idle session is not dropped. You should be able to stay connected for a whole day of work.
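The same options can be stored once in the OpenSSH client configuration so that they do not have to be retyped. This is a minimal sketch, assuming a standard OpenSSH client: the alias t2b and the helper add_t2b_host are hypothetical names chosen for illustration, and username stands for your own login.

```shell
#!/bin/sh
# Append a host alias for the mshort UIs to an OpenSSH client config file.
# The alias "t2b" and this helper are illustrative, not part of the cluster setup.
add_t2b_host() {
    user="$1"                          # your cluster login
    config="${2:-$HOME/.ssh/config}"   # target file (default: your ssh config)
    cat >> "$config" <<EOF
Host t2b
    HostName mshort.iihe.ac.be
    User $user
    ForwardX11 yes
    ServerAliveInterval 100
EOF
}

# Example (commented out so that sourcing this file changes nothing):
# add_t2b_host username && ssh t2b
```

ForwardX11 yes is the configuration-file equivalent of the -X flag, so ssh t2b should then behave like the full command above.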

After a successful login, you'll see this message :


         @@@@@@@@     @@@@             @@@@@     @@@@@@@
            @@       @    @            @@   @    @@
            @@            @    @@@@    @@@@@     @@@@
            @@         @@              @@    @   @@@@
            @@       @                 @@    @   @@
            @@       @@@@@@            @@@@@@    @@@@@@@
                              @ IIHE   

  Welcome to the t2b cluster ! You are on the following UI: m2 

  You can find more info on our wiki page: http://t2bwiki.iihe.ac.be
           To contact us: grid_admin@listserv.vub.ac.be

  Please remember this machine will allow you only 600s (10 minutes)
     of cpu time per processes.
 ________________________________________________________________________
                  Your Quota on /user: 43% used (282G left) 
There are 2 users here   |   Load: 7.51 /8 CPUs (2%)  |   Mem: 80% used

Please observe all the information in this message:

The wiki link, where you should go first to find the information
The email used for the cluster support (please use this one rather than personal mail, this way everyone on the support team can answer and track the progress.)
The cpu time limit imposed per process, as we divided our UIs into 2 groups.

The light task UIs (max CPU time = 10 minutes) : they are used for crab/local job submission, writing code, building, debugging ...

mshort.iihe.ac.be :  m0.iihe.ac.be, m1.iihe.ac.be, m2.iihe.ac.be, m3.iihe.ac.be

The CPU-intensive UIs (max CPU time = 5 hours) : they are available for CPU-intensive and long tasks, although you should prefer using local job submission ...

mlong.iihe.ac.be : m5.iihe.ac.be, m6.iihe.ac.be, m7.iihe.ac.be, m8.iihe.ac.be and m9.iihe.ac.be

The quota you have left on /user
Information about how heavily this UI is used. If any of these values is shown in red (i.e. above optimal usage), please consider using another UI. Please be mindful of other users and don't start too many processes, especially if the UI is already under load.
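The load can also be checked from the shell at any time after login. This is a minimal sketch, assuming a Linux UI where /proc/loadavg and nproc are available. The helper name ui_is_busy is hypothetical, and treating "load above the number of CPUs" as busy is an assumption, not necessarily the rule behind the colours in the banner.

```shell
#!/bin/sh
# Return success (0) when the given 1-minute load exceeds the number of CPUs.
ui_is_busy() {
    load="$1"   # 1-minute load average, e.g. 7.51
    cpus="$2"   # number of CPUs, e.g. 8
    # awk compares the two values numerically; exit status 0 means "busy"
    awk -v l="$load" -v c="$cpus" 'BEGIN { exit !(l > c) }'
}

# Example on the current machine (Linux only):
# if ui_is_busy "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"; then
#     echo "This UI is busy, consider logging into another one"
# fi
```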

Data Storage & Directory Structure

There are 2 main directories to store your work and data:

/user [/$USER] : this is your home directory. You have an enforced quota there, as it is an expensive storage with redundancy and daily backups.
/pnfs [/iihe/cms/store/user/$USER] : this is where you can store a large amount of data; it is also grid-accessible. If you need more than 2 TB, please contact us. There are no backups there, so be careful with what you do !

There are other directories that you might want to take notice of:

/group : same as /user , but if you need to share/produce in a group.
/scratch : a temporary scratch space for your job. Use $TMPDIR on the WNs, it is cleaned after each job :)
/cvmfs : Centralised CVMFS software repository from CERN. It should contain most of the software you will need.
/swmgrs : local area for shared software not in /cvmfs . You can use a nice tool to find the software and versions available.
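A common pattern that follows from this layout is to do the heavy I/O in $TMPDIR and copy only the final results to a permanent area. The sketch below illustrates the idea; the helper name run_in_scratch and the paths in the example are hypothetical, and it assumes the workload writes its result file into the current directory.

```shell
#!/bin/sh
# Run a command inside the scratch area, then copy one result file back.
# Usage: run_in_scratch <absolute_output_dir> <result_file> <command> [args...]
run_in_scratch() {
    outdir="$1"; result="$2"; shift 2
    # On the WNs $TMPDIR is set for the job; elsewhere fall back to a fresh
    # temporary directory (which this sketch does not clean up).
    workdir="${TMPDIR:-$(mktemp -d)}"
    mkdir -p "$outdir" || return 1
    (
        cd "$workdir" || exit 1
        "$@" || exit 1               # the actual workload
        cp "$result" "$outdir/"      # keep only what is worth keeping
    )
}

# Example inside a job script (paths are placeholders):
# run_in_scratch "/user/$USER/results" output.txt "/user/$USER/my_analysis"
```

Both the output directory and the command should be given as absolute paths, since the function changes directory before running the workload.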

Batch System

Queues

The cluster is divided into queues :

	localgrid	highmem	highbw	express
Description	Default queue, all available nodes except express	Subgroup of localgrid with WNs having 4 GB memory per slot	Subgroup of localgrid and of highmem, with WNs having 10 Gb/s bandwidth access to storage	Limited walltime
# CPUs (Jobs)	6566	4576	2464	16
Walltime limit	168 hours	168 hours	168 hours	30 minutes
Memory limit	2 GB	4 GB	4 GB	2 GB
$TMPDIR/scratch max usable space	10 GB	10 GB	16 GB	10 GB
Max # jobs running / User	1000 jobs	1000 jobs	1000 jobs	8 jobs
Max # jobs sent to the batch system / User (see here if you want to send more)	2500 jobs	2500 jobs	2500 jobs	100 jobs

Job submission

To submit a job, you just have to use the qsub command :

qsub myjob.sh

OPTIONS

-q queueName : choose the queue you want [mandatory]

-N jobName : name of the job

-I : (capital i) run the job in interactive mode

-M mailaddress : set mail address (use in conjunction with -m) : MUST be @ulb.ac.be or @vub.ac.be

-m [a|b|e] : send mail on job status change (a = aborted , b = begin, e = end)

-l : resources options

For instance, if you want to use 2 cores: -lnodes=1:ppn=2
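Torque also reads these options from #PBS comment lines at the top of the job script, which keeps the settings together with the job. The following is a minimal sketch of such a script; the job name example_job and the function job_summary are placeholders, and the body should be replaced by your own workload.

```shell
#!/bin/bash
#PBS -q localgrid
#PBS -N example_job
#PBS -l nodes=1:ppn=2

# Print where the job runs. PBS_JOBID is set by the batch system;
# outside a batch job the placeholder text is printed instead.
job_summary() {
    echo "job ${PBS_JOBID:-not-in-batch} on $(hostname)"
}

job_summary
# ... your actual commands go here ...
```

Saved as myjob.sh, it can be submitted with qsub myjob.sh. Options given on the qsub command line take precedence over the #PBS lines.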

If you want to send more than 2500 jobs to the cluster, write all qsub commands in a text file, and use the script big-submission (more info here).
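The text file of qsub commands can be generated rather than typed by hand. This sketch only produces the file, one command per line; how big-submission is then invoked is not covered here. The helper name write_qsub_list and the directory and file names are hypothetical.

```shell
#!/bin/sh
# Write one qsub command per line for every *.sh job script in a directory.
# Usage: write_qsub_list <jobs_dir> <queue> <list_file>
write_qsub_list() {
    jobs_dir="$1"; queue="$2"; list_file="$3"
    : > "$list_file"                         # start from an empty file
    for script in "$jobs_dir"/*.sh; do
        [ -e "$script" ] || continue         # skip the literal pattern if nothing matches
        echo "qsub -q $queue $script" >> "$list_file"
    done
}

# Example (directory and file names are placeholders):
# write_qsub_list "$HOME/jobs" localgrid submit_list.txt
```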

If you use MadGraph, read this section first or you risk crashing the cluster.

Job management

To see all jobs (running / queued), you can use the qstat command, or go to the JobView page to have a summary of what's running.

qstat

OPTIONS

-u username : list only jobs submitted by username

-n : show nodes where jobs are running

-q : show how jobs are distributed over the queues
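Before a large submission it can be useful to know how many jobs you already have in the batch system. The sketch below counts job IDs read from standard input and compares them with a per-user limit; the default of 2500 is the limit listed in the Queues table for the non-express queues, and the helper name jobs_left is hypothetical.

```shell
#!/bin/sh
# Read job IDs on stdin (one per line) and print how many more jobs
# fit under the given limit (default: 2500).
jobs_left() {
    limit="${1:-2500}"
    count=$(awk 'NF { n++ } END { print n + 0 }')   # count non-empty lines
    echo $(( limit - count ))
}

# Example on a UI, using qselect to list your own jobs:
# qselect -u "$USER" | jobs_left
```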

Job Statistics

All the log files from the batch system are synced every 30 minutes to:

/group/log/torque/

A simple script to analyze the logs and provide some statistics for the user is provided:

/group/log/torque/torque-user-info.py

Just execute it as is. It will print information like the following:

ID: 6077555  ExCode:   0 Mem:    0M cpuT:      0s wallT:      3s eff:  0.0%   STDIN
ID: 6077602  ExCode:   0 Mem:   50M cpuT:      0s wallT:      2s eff:  0.0%   STDIN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
   user[G]	# Jobs	 <MEM> +- RMS        #HiMem  MAX Mem      <CPU time>    <walltime>    <Eff>   % WT/WT_TOT    # Jobs with Error code (% of user job)
----------------------------------------------------------------------------------------------------------------------------------------------------------------
    rougny[l]	    12	    13 +- 22    MB      0      52 MB  |  00:00:00:00  00:00:00:24  ( 0.0%) (-1.00% of tot) | # EC:
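The per-job lines of this output are easy to post-process. The following sketch counts the jobs, the jobs with a non-zero exit code, and the summed walltime. It assumes the "ID: ... ExCode: ... wallT: ..." layout shown above, so the field positions may need adjusting if the script output differs; the helper name summarise_jobs is hypothetical.

```shell
#!/bin/sh
# Summarise the per-job "ID: ..." lines read from stdin: total jobs, jobs
# with a non-zero exit code, and the summed walltime in seconds.
summarise_jobs() {
    awk '$1 == "ID:" {
             n++
             if ($4 != 0) failed++                 # field 4 is the exit code
             wall = $10; sub(/s$/, "", wall)       # field 10 is e.g. "3s"
             total += wall
         }
         END { printf "jobs=%d failed=%d wall=%ds\n", n, failed, total }'
}

# Example on a UI:
# /group/log/torque/torque-user-info.py | summarise_jobs
```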

If you want to test the batch system, you can follow the workbook here.

Job Deletion

Use the following command:

qdel <JOBID>

To delete all your jobs, use the following line and be patient: it deletes the jobs one at a time, allowing at most 3 seconds per qdel call.

for j in $(qselect -u "$USER"); do timeout 3 qdel -a "$j"; done

Backup

There are several areas that we regularly back up: /user , /group , /data , /ice3.
You can find more information on the backup frequency and how to access them here.

Useful links

Ganglia Monitoring : status of all our servers.
JobView Monitoring : summary of the cluster usage.

@@ Line 120: / Line 120: @@
 |-
 ! scope="row" | Max # jobs running / User
-| nowrap="nowrap" align="center" colspan="3"| 400 jobs<br>
+| nowrap="nowrap" align="center" colspan="3"| 1000 jobs<br>
 | nowrap="nowrap" align="center" | 8 jobs<br>
 |-
