ProdAgentFailures
A list of failures, with some details on how to try to solve them.
- VO_CMS_SW_DIR is not set: CMS jobs need this environment variable to know where to look for software. It has to be set on the worker nodes.
- Send an email to the site admins and, if possible, tell them on which node this happened.
- The error in stdout looks like this:
ERROR: VO_CMS_SW_DIR is not set prodAgentFailure Invoked with code 10030
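To make the report to the admins easier, the check can be repeated by hand on the suspect worker node. The following is only a minimal sketch: the function name check_cms_sw_dir is hypothetical (it is not part of ProdAgent), and the extra test that the variable points to an existing directory is an assumption about what a usable setting looks like.

```shell
#!/bin/sh
# Hypothetical helper, not part of ProdAgent: report whether VO_CMS_SW_DIR
# is usable on this node and print the node name for the mail to the admins.
check_cms_sw_dir() {
    node=$(uname -n)   # node name, so the admins know where it happened
    if [ -z "${VO_CMS_SW_DIR:-}" ]; then
        echo "VO_CMS_SW_DIR is not set on $node"
        return 1
    fi
    # Assumption: a valid setting points to an existing directory.
    if [ ! -d "$VO_CMS_SW_DIR" ]; then
        echo "VO_CMS_SW_DIR=$VO_CMS_SW_DIR is not a directory on $node"
        return 1
    fi
    echo "VO_CMS_SW_DIR=$VO_CMS_SW_DIR on $node"
}
```

Calling check_cms_sw_dir on the worker node prints one line that can be pasted into the mail; a non-zero return code means the variable is missing or unusable.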
- no site match
- The resource broker could not find a site matching the job requirements.
- There is no stdout file from the job, only a file such as JobTracking/Failed/Submission_1/log/edgLoggingInfo.log with grid details (like crab -postMortem):
Event: Abort - host = laranja.iihe.ac.be - level = SYSTEM - priority = asynchronous - reason = Cannot plan: BrokerHelper: no compatible resources
Look for the job requirements part and query the grid information system to see what is available, e.g.:
- the edgLoggingInfo.log has a part with the requirements
- job = [ requirements = ( Member("VO-cms-CMSSW_1_2_0",other.GlueHostApplicationSoftwareRunTimeEnvironment) && anyMatch(other.storage.CloseSEs,( target.GlueSEUniqueID == "polgrid2.in2p3.fr.in2p3.fr" )) ) && ( other.GlueCEStateStatus == "Production" ); RetryCount = 3;
- the job tries to find a site matching this:
- is there a site with a closeSE named polgrid2.in2p3.fr.in2p3.fr?
lcg-infosites --vo cms closeSE|grep -C 3 polgrid2.in2p3.fr.in2p3.fr
The result of this query is empty, very probably because there is a typo in the SE name (note the repeated in2p3.fr) ;)
Another example:
requirements = ( Member("VO-cms-CMSSW_1_2_0",other.GlueHostApplicationSoftwareRunTimeEnvironment) && anyMatch(other.storage.CloseSEs,( target.GlueSEUniqueID == "grid11.lal.in2p3.fr" )) ) && ( other.GlueCEStateStatus == "Production" );
- Checking for the closeSE returns a matching CE:
-sh-2.05b$ lcg-infosites --vo cms closeSE|grep -C 3 grid11.lal.in2p3.fr
node12.datagrid.cea.fr
Name of the CE: grid10.lal.in2p3.fr:2119/jobmanager-pbs-cms
grid11.lal.in2p3.fr
grid05.lal.in2p3.fr
grid03.lal.in2p3.fr
- Is the software available on that CE? The following command shows all available tags at all available sites for CMS. Look there for the corresponding CE and see whether the software tag is available.
lcg-infosites --vo cms tag|less
...
Name of the CE: grid10.lal.in2p3.fr
VO-cms-CMSSW_0_6_0
VO-cms-CMSSW_0_6_1
VO-cms-CMSSW_0_7_0
VO-cms-CMSSW_0_8_1
VO-cms-CMSSW_0_8_3
VO-cms-CMSSW_1_0_1
VO-cms-CMSSW_1_0_4
VO-cms-CMSSW_1_2_0_install-failed-with-25600-on-2007/01/06_15:08:10
VO-cms-slc3_ia32_gcc323
...
Apparently the installation of CMSSW_1_2_0 failed on this CE, and this is exactly the version the job requires. Contact the site admins and ask them to take a look at it and/or fix it.
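The manual steps for a "no site match" failure can be partly scripted. The sketch below is not part of ProdAgent or the LCG tools; the function names extract_close_se, extract_sw_tag and tag_status are hypothetical. It assumes the requirements expression has the form shown in the examples, and that the output of lcg-infosites --vo cms tag lists one tag per line under each "Name of the CE:" header.

```shell
#!/bin/sh
# Hypothetical helpers for the "no site match" checks described above.

# Print the SE name from a requirements expression read on stdin.
extract_close_se() {
    sed -n 's/.*GlueSEUniqueID == "\([^"]*\)".*/\1/p'
}

# Print the required software tag from a requirements expression on stdin.
extract_sw_tag() {
    sed -n 's/.*Member("\([^"]*\)".*/\1/p'
}

# Read a tag listing on stdin and print the state of tag $2 on CE $1:
# "installed", "install-failed" or "missing".
# Assumption: one tag per line below each "Name of the CE:" header.
tag_status() {
    awk -v ce="$1" -v tag="$2" '
        /Name of the CE:/ { in_ce = ($NF == ce); next }
        in_ce && $1 == tag { found = 1; print "installed"; exit }
        in_ce && index($1, tag "_install-failed") == 1 {
            found = 1; print "install-failed"; exit
        }
        END { if (!found) print "missing" }
    '
}
```

A possible use, with the file and CE names from the examples:

grep requirements edgLoggingInfo.log | extract_close_se
lcg-infosites --vo cms tag | tag_status grid10.lal.in2p3.fr VO-cms-CMSSW_1_2_0

Note that the closeSE listing shows the CE with port and queue (grid10.lal.in2p3.fr:2119/jobmanager-pbs-cms), while the tag listing shows only the host name, so tag_status expects the host name.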