HTCondorDebug

From T2B Wiki
Revision as of 14:09, 3 January 2024 by Admin (talk | contribs) (Created page with "== Why is my job still in queue ? == To see what is happening and why your jobs are not starting, use: <pre>> condor_q -better-analyze 2993001.0 -- Schedd: schedd03.wn.iihe.ac.be : <192.168.10.147:9618?... The Requirements expression for job 2993001.000 is (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.HasFileTransfer) Job 2993001.000 defines the following attributes: Disk...")
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Why is my job still in queue ?

To see what is happening and why your jobs are not starting, use:

> condor_q -better-analyze 2993001.0

-- Schedd: schedd03.wn.iihe.ac.be : <192.168.10.147:9618?...
The Requirements expression for job 2993001.000 is

    (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.HasFileTransfer)

Job 2993001.000 defines the following attributes:

    DiskUsage = 1
    ImageSize = 1
    RequestDisk = DiskUsage
    RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)

The Requirements expression for job 2993001.000 reduces to these conditions:

     Slots
Step    Matched  Condition
 ----  --------  ---------
[0]        4547  TARGET.Arch == "X86_64"
[1]        4547  TARGET.OpSys == "LINUX"
[3]        4547  TARGET.Disk >= RequestDisk
[5]        3017  TARGET.Memory >= RequestMemory


2993001.000:  Job is running.

Last successful match: Wed Jan  3 14:46:54 2024


2993001.000:  Run analysis summary ignoring user priority.  Of 152 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
    152 are able to run your job


There are 2 important sections here:

  • the reduced requirements expression:
You can see each requirements, and how many job slots are ok with your requirement. If for one requirement you have 0 matched slots then this one is causing the issue and preventing your jobs to run. It probably does not exist. Contact us if you are not sure.
  • the run analysis summary:
you have a summary of how many machines could accept your job. Again, if it states no machine is able to run your job, that should be confirmed with how many slots match each requirement.


What is happening to my running Jobs ?

To help answer this, try connecting to where your job is with:

condor_ssh_to_job 2993001.0

Then first look at what your jobs are doing using htop.
To see only your jobs, press u (for users), then select your username from the left, finally enter Tree Mode with F5