ProdAgentFailures

From T2B Wiki

A list of known failures, with some details on how to try to solve them.

  • VO_CMS_SW_DIR is not set: CMS jobs need this environment variable to know where to look for the software. It has to be set on the worker nodes.
    • Send an email to the admins and, if possible, tell them on which node this happened.
    • The error appears in stdout:
  ERROR: VO_CMS_SW_DIR is not set
  prodAgentFailure Invoked with code 10030 
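
  Before mailing the admins, the condition can be confirmed on the node itself. The following is only a minimal sketch: check_cms_sw_dir is a hypothetical helper name, not part of ProdAgent, and the messages it prints are illustrative.

```shell
#!/bin/sh
# Hypothetical helper (not part of ProdAgent): report whether VO_CMS_SW_DIR
# is set and points at an existing directory on the current node.
check_cms_sw_dir() {
    if [ -z "${VO_CMS_SW_DIR:-}" ]; then
        echo "VO_CMS_SW_DIR is not set on $(uname -n)"
        return 1
    fi
    if [ ! -d "$VO_CMS_SW_DIR" ]; then
        echo "VO_CMS_SW_DIR=$VO_CMS_SW_DIR is not a directory on $(uname -n)"
        return 1
    fi
    echo "VO_CMS_SW_DIR=$VO_CMS_SW_DIR"
    return 0
}

# Usage: check_cms_sw_dir || echo "report this node to the admins"
```

  The node name printed on failure is the piece of information the admins will want in the email.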
  • no site match
    • The resource broker could not find a site matching the job requirements.
    • The jobs will not produce a stdout file, only a file like JobTracking/Failed/Submission_1/log/edgLoggingInfo.log with grid details (similar to crab -postMortem):
   Event: Abort
   - host                    =    laranja.iihe.ac.be
   - level                   =    SYSTEM
   - priority                =    asynchronous
   - reason                  =    Cannot plan: BrokerHelper: no compatible resources
 Find the job requirements and query the grid information system to see what is available. For example:
    • edgLoggingInfo.log has a section with the requirements:
  - job            =

       [
        requirements = ( Member("VO-cms-CMSSW_1_2_0",other.GlueHostApplicationSoftwareRunTimeEnvironment) && anyMatch(other.storage.CloseSEs,( target.GlueSEUniqueID == "polgrid2.in2p3.fr.in2p3.fr" )) ) && ( other.GlueCEStateStatus == "Production" );
        RetryCount = 3;

    • The job tries to find a site matching these requirements:
      • Is there a site with a close SE named polgrid2.in2p3.fr.in2p3.fr?
  lcg-infosites --vo cms closeSE|grep -C 3 polgrid2.in2p3.fr.in2p3.fr
   The result of this query is empty, very probably because of a typo in the SE name (note the repeated in2p3.fr).
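
   Copying the SE name and software tag out of the log by hand is error-prone, so they can be pulled out with sed. This is a sketch only: required_tag and required_se are hypothetical helper names, and they assume the requirements expression sits on a single line, as in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helpers: read log text on stdin and print the value found in
# the requirements expression. Assumes the expression is on one line.

# Software tag, i.e. the first argument of Member("...").
required_tag() {
    sed -n 's/.*Member("\([^"]*\)".*/\1/p'
}

# Close SE, i.e. the value compared against target.GlueSEUniqueID.
required_se() {
    sed -n 's/.*GlueSEUniqueID == "\([^"]*\)".*/\1/p'
}

# Usage (needs the grid UI tools, so it is left commented out here):
#   se=$(required_se < JobTracking/Failed/Submission_1/log/edgLoggingInfo.log)
#   lcg-infosites --vo cms closeSE | grep -C 3 "$se"
```

   If the log contains several requirements lines, each helper prints one value per matching line.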
 Another example:
 
  requirements = ( Member("VO-cms-CMSSW_1_2_0",other.GlueHostApplicationSoftwareRunTimeEnvironment) && anyMatch(other.storage.CloseSEs,( target.GlueSEUniqueID == "grid11.lal.in2p3.fr" )) ) && ( other.GlueCEStateStatus == "Production" );
    • Checking for the close SE returns a matching CE:
  -sh-2.05b$ lcg-infosites --vo cms closeSE|grep -C 3 grid11.lal.in2p3.fr
        node12.datagrid.cea.fr

  Name of the CE: grid10.lal.in2p3.fr:2119/jobmanager-pbs-cms
        grid11.lal.in2p3.fr
        grid05.lal.in2p3.fr
        grid03.lal.in2p3.fr
    • Is the software available on that CE? The following command shows all available tags at all available sites for CMS. Look for the corresponding CE and check whether the required software tag is listed.
  lcg-infosites --vo cms tag|less
  ...
  Name of the CE: grid10.lal.in2p3.fr
        VO-cms-CMSSW_0_6_0
        VO-cms-CMSSW_0_6_1
        VO-cms-CMSSW_0_7_0
        VO-cms-CMSSW_0_8_1
        VO-cms-CMSSW_0_8_3
        VO-cms-CMSSW_1_0_1
        VO-cms-CMSSW_1_0_4
        VO-cms-CMSSW_1_2_0_install-failed-with-25600-on-2007/01/06_15:08:10
        VO-cms-slc3_ia32_gcc323
  ...
  Apparently the installation of CMSSW_1_2_0 failed on this CE, and this is the version the job requires. Contact the admins and ask them to take a look at it and/or fix it.
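
  Paging through the full tag listing with less can be replaced by a small filter. The sketch below assumes the output format shown in the listing above (a "Name of the CE:" line followed by one tag per line); tags_for_ce is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read a tag listing on stdin and print the tags of one
# CE that mention a given release.
#   $1 = CE host name, e.g. grid10.lal.in2p3.fr
#   $2 = release string, e.g. CMSSW_1_2_0
tags_for_ce() {
    awk -v ce="$1" -v rel="$2" '
        # Remember which CE the following tag lines belong to.
        /^[ \t]*Name of the CE:/ { current = $NF; next }
        # Print tags of the requested CE that contain the release string.
        current == ce && index($1, rel) { print $1 }
    '
}

# Usage (needs the grid UI tools, so it is left commented out here):
#   lcg-infosites --vo cms tag | tags_for_ce grid10.lal.in2p3.fr CMSSW_1_2_0
```

  A tag carrying a suffix such as install-failed means the release is not usable on that CE, and no output at all means the release was never published there.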

