HTCFirstSubmissionGuide
This is only a quick first draft to help you learn how to use HTCondor.
== First time submitting a job ==
For new users, we recommend following [https://indico.cern.ch/event/936993/contributions/4022073/attachments/2105538/3540926/2020-Koch-User-Tutorial.pdf this presentation], which should give you an idea of how to submit jobs.<br>
Then, to practice the basics of job submission on an HTCondor cluster, some exercises are proposed on this [https://en.wikitolearn.org/Course:HTCondor Wiki].
 
=== ''T2B Specifics'' ===
 
- '''File Transfers:'''
: Please note that, contrary to what is usually shown in documentation and examples, we recommend not using the HTCondor file transfer mechanism ('''should_transfer_files = NO''') and copying files yourself within your script, as sketched below.
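: As a rough sketch (the file names and paths here are hypothetical placeholders, not T2B conventions), the submit file disables the transfer mechanism and the job script does its own copying:
<pre># job.sub -- minimal sketch of a submit file without HTCondor file transfer
executable = job.sh
should_transfer_files = NO
output     = job.out
error      = job.err
log        = job.log
queue</pre>
<pre>#!/bin/bash
# job.sh -- copy the input in yourself, work locally, copy the output out
cp /user/$USER/input.root $TMPDIR/
# ... run your analysis on $TMPDIR/input.root ...
cp $TMPDIR/output.root /user/$USER/</pre>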
 
 
- '''Always adapt requested resources to your job'''
: You need to adapt the resources you request to what you estimate your job will need. Requesting more than what you really need is wasteful and deprives your fellow users of resources they might require.
: To do so, just add the following lines to your submit file:
<pre>request_cpus = 1
request_memory = 200MB
request_disk = 1GB</pre>
 
: Note that if a job needs more than 1 CPU / 4GB of memory / 10GB of disk, please be careful and sure of what you are doing.
 
 
- '''Where am I when I start a job ?'''
: You should always prefer using $HOME, which in a job points to a unique local directory in /scratch on the local disk, e.g. /scratch/condor/dir_275928.
: $TMPDIR = $HOME/tmp, so it is also on the local disk and unique to the job. A quick way to check is sketched below.
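: A minimal sketch to see this from inside a job (the actual directory name differs for every job):
<pre>#!/bin/bash
# print where the job landed on the worker node
echo "HOME   = $HOME"      # e.g. /scratch/condor/dir_275928
echo "TMPDIR = $TMPDIR"    # e.g. /scratch/condor/dir_275928/tmp
pwd                        # the job starts in this local scratch directory</pre>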
 
 
- '''Efficient use of local disks on worker nodes'''
: Note that local disks are now exclusively NVMe drives, which are much, much faster than the network protocols used when writing to /pnfs or /user.<br>
: So for repeated reads (like cycling through events in files of O(1GB)), it is more efficient to copy the file locally first and then open it.<br>
: The same goes for writes: prefer writing locally, then copying the file to /pnfs. A sketch of this pattern follows below.
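: A minimal sketch of this pattern (the paths and the analysis command are hypothetical placeholders; use whatever copy command you normally use for /pnfs):
<pre>#!/bin/bash
# copy the input from mass storage to the fast local NVMe disk first
cp /pnfs/iihe/some/dataset/events.root $TMPDIR/

# repeated reads now hit the local disk instead of the network
my_analysis $TMPDIR/events.root -o $TMPDIR/histos.root

# write locally, then copy the result back to /pnfs in one go
cp $TMPDIR/histos.root /pnfs/iihe/some/output/</pre>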
 
 
- '''Can I have a shell/interactive job on the batch system ?'''
: Yes! If you want to run tests or run things interactively, with a dedicated core and memory for you, just run:
<pre>condor_submit -i</pre>
: Note that if you want to reserve more than the standard 1 core / 600MB of memory, simply add your request_* settings [as specified just above] like this:
<pre>condor_submit -interactive request_cpus=2</pre>
 


- '''Sending DAG jobs'''
: For now, sending DAG jobs only works when you are directly on the scheduler. So first add your ssh key to the local keyring agent, then connect to any mX machine with the -A (agent forwarding) option:
<pre>ssh-add
ssh -A mshort.iihe.ac.be</pre>
: then connect to the scheduler:
<pre>ssh schedd02.wn.iihe.ac.be</pre>
: There you can run your condor_submit_dag commands and they will work; a minimal example follows below. Note that the condor_q commands to follow the progress of your DAG jobs can still be performed on the mX machines.
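: As a reminder of what submitting a DAG looks like (the file names here are hypothetical), a minimal DAG file chains two submit files, and you send it from the scheduler with condor_submit_dag:
<pre># workflow.dag -- minimal sketch: run step A, then step B
JOB A step_a.sub
JOB B step_b.sub
PARENT A CHILD B</pre>
<pre>condor_submit_dag workflow.dag</pre>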


=== What is my job doing ? Is something wrong ? ===
For debugging jobs in the queue or seeing what your running jobs are doing, please follow our [[HTCondorDebug|HTCondor Debug]] section.


<br>
=== HTCondor Official Documentation ===
Have a look at the official User Manual on the HTCondor website.<br>
It is very well done and explains all available features.

<br>
=== HTCondor Workshop Presentation ===
Every 6 months, there is an HTCondor workshop. The presentations are usually very helpful, especially if you want to go into the details of HTCondor (API, DAGMan, ...). You can find the agenda of the latest one [https://indico.cern.ch/event/936993/timetable/#20200921.detailed here].
 
