ProdAgentFailures
List of failures, with some details on how to try to solve them.
- VO_CMS_SW_DIR is not set: CMS jobs need this environment variable to know where to look for software. It has to be set on the worker nodes.
- send an email to the admins, if possible also telling them on which node this happened.
- error in stdout
ERROR: VO_CMS_SW_DIR is not set
prodAgentFailure Invoked with code 10030
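A check along these lines can be run on a worker node to confirm the diagnosis and to get the node name for the mail to the admins. This is only a minimal sketch and not part of ProdAgent: the function name `check_cms_sw_dir` is made up for illustration, and only the variable name VO_CMS_SW_DIR comes from the error message above.

```python
import os
import socket


def check_cms_sw_dir(environ=None):
    """Return (ok, message) describing the state of VO_CMS_SW_DIR.

    `environ` defaults to os.environ; a plain dict can be passed instead,
    which is handy for trying the function out without touching the shell.
    """
    env = os.environ if environ is None else environ
    node = socket.gethostname()  # the admins will want to know which node it was
    value = env.get("VO_CMS_SW_DIR", "")
    if not value:
        return False, "VO_CMS_SW_DIR is not set on %s" % node
    if not os.path.isdir(value):
        return False, "VO_CMS_SW_DIR=%s is not a directory on %s" % (value, node)
    return True, "VO_CMS_SW_DIR=%s on %s" % (value, node)


if __name__ == "__main__":
    ok, message = check_cms_sw_dir()
    print(message)
```

The directory test is an extra assumption: a variable that is set but points nowhere would probably break the job in a similar way.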
- no site match
- the resource broker could not find a site matching the job requirements
- there will be no stdout file from the jobs, but only a file like JobTracking/Failed/Submission_1/log/edgLoggingInfo.log with grid details (like crab -postMortem)
Event: Abort - host = laranja.iihe.ac.be - level = SYSTEM - priority = asynchronous - reason = Cannot plan: BrokerHelper: no compatible resources
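When many jobs fail, it can help to pull the abort reason out of each edgLoggingInfo.log automatically. The sketch below assumes the events appear on one line in the form shown above ("Event: <name>" followed by " - key = value" pairs); the real log may be laid out differently, and the function names are hypothetical.

```python
def parse_event_line(line):
    """Split 'Event: <name> - key = value - key = value ...' into a dict."""
    head, _, rest = line.strip().partition(" - ")
    fields = {"event": head.replace("Event:", "", 1).strip()}
    last_key = None
    for part in rest.split(" - "):
        key, sep, value = part.partition(" = ")
        if sep:
            last_key = key.strip()
            fields[last_key] = value.strip()
        elif last_key is not None:
            # No ' = ' in this piece: it belongs to the previous value,
            # which itself contained ' - '.
            fields[last_key] += " - " + part
    return fields


def abort_reasons(log_text):
    """Return the 'reason' field of every Abort event found in log_text."""
    reasons = []
    for line in log_text.splitlines():
        if line.strip().startswith("Event:"):
            fields = parse_event_line(line)
            if fields["event"] == "Abort":
                reasons.append(fields.get("reason", ""))
    return reasons
```

A reason containing "no compatible resources" points to the no site match case described here.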
Look for the job requirements part and query the grid information system to see what is available, e.g.:
- the edgLoggingInfo.log has a part with the requirements
- job =
[
requirements = ( Member("VO-cms-CMSSW_1_2_0",other.GlueHostApplicationSoftwareRunTimeEnvironment) && anyMatch(other.storage.CloseSEs,( target.GlueSEUniqueID == "polgrid2.in2p3.fr.in2p3.fr" )) ) && ( other.GlueCEStateStatus == "Production" );
RetryCount = 3;
- the job tries to find a site matching this:
- is there a site with a closeSE named polgrid2.in2p3.fr.in2p3.fr?
- query the information system to check:
lcg-infosites --vo cms closeSE|grep -C 3 polgrid2.in2p3.fr.in2p3.fr
The result of this query is empty, very probably because there is a typo in the SE name ;)
Other example:
requirements = ( Member("VO-cms-CMSSW_1_2_0",other.GlueHostApplicationSoftwareRunTimeEnvironment) && anyMatch(other.storage.CloseSEs,( target.GlueSEUniqueID == "grid11.lal.in2p3.fr" )) ) && ( other.GlueCEStateStatus == "Production" );
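The two values worth extracting from such a requirements expression are the software tag and the close SE name. A rough sketch with regular expressions follows; the patterns only cover the shape of the two examples on this page, and `looks_doubled` is a guess at one kind of typo (a repeated domain suffix), not a general hostname check.

```python
import re

# Patterns written against the two example expressions on this page only.
TAG_RE = re.compile(
    r'Member\("([^"]+)"\s*,\s*other\.GlueHostApplicationSoftwareRunTimeEnvironment\)'
)
SE_RE = re.compile(r'target\.GlueSEUniqueID\s*==\s*"([^"]+)"')


def parse_requirements(text):
    """Return (software_tags, close_ses) found in a requirements expression."""
    return TAG_RE.findall(text), SE_RE.findall(text)


def looks_doubled(hostname):
    """Heuristic: True if the trailing domain labels are repeated (a.b.c.b.c)."""
    labels = hostname.split(".")
    for size in range(1, len(labels) // 2 + 1):
        if labels[-size:] == labels[-2 * size:-size]:
            return True
    return False
```

The extracted SE name and tag are the strings to look for in the lcg-infosites output.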
- check for the closeSE; this time the query returns a matching CE.
-sh-2.05b$ lcg-infosites --vo cms closeSE|grep -C 3 grid11.lal.in2p3.fr
node12.datagrid.cea.fr
Name of the CE: grid10.lal.in2p3.fr:2119/jobmanager-pbs-cms
grid11.lal.in2p3.fr
grid05.lal.in2p3.fr
grid03.lal.in2p3.fr
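Because grep -C 3 only shows three lines of context, the "Name of the CE" line can fall outside the window when a CE has many close SEs. A small parser over the full output avoids that. The sketch assumes the layout seen in the excerpt above (a "Name of the CE:" line followed by one SE hostname per line); `ces_close_to` is a hypothetical helper name.

```python
CE_PREFIX = "Name of the CE:"


def ces_close_to(output, se_name):
    """Return the CEs whose close-SE list contains se_name.

    `output` is the text printed by `lcg-infosites --vo cms closeSE`.
    """
    matches = []
    current_ce = None
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith(CE_PREFIX):
            # A new CE block starts; the following lines are its close SEs.
            current_ce = line[len(CE_PREFIX):].strip()
        elif line == se_name and current_ce is not None:
            if current_ce not in matches:
                matches.append(current_ce)
    return matches
```

An empty list corresponds to the empty grep result in the first example.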
- is the software available on that CE? The following command shows all available tags at all available sites for CMS. Look there for the corresponding CE and see if the software tag is available.
lcg-infosites --vo cms tag|less
...
Name of the CE: grid10.lal.in2p3.fr
VO-cms-CMSSW_0_6_0
VO-cms-CMSSW_0_6_1
VO-cms-CMSSW_0_7_0
VO-cms-CMSSW_0_8_1
VO-cms-CMSSW_0_8_3
VO-cms-CMSSW_1_0_1
VO-cms-CMSSW_1_0_4
VO-cms-CMSSW_1_2_0_install-failed-with-25600-on-2007/01/06_15:08:10
VO-cms-slc3_ia32_gcc323
...
Apparently the installation of CMSSW_1_2_0 failed, and this is the version the job requires. Contact the admins and ask them to take a look at it and/or fix it.
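The tag lookup can be scripted in the same spirit. The sketch below assumes the layout of the tag listing shown above and treats a tag of the form <tag>_install-failed... as a failed installation, which is inferred from this single excerpt; both function names are hypothetical.

```python
CE_PREFIX = "Name of the CE:"


def tags_by_ce(output):
    """Map each CE name to the list of tags printed under it.

    `output` is the text printed by `lcg-infosites --vo cms tag`.
    """
    result = {}
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith(CE_PREFIX):
            current = line[len(CE_PREFIX):].strip()
            result.setdefault(current, [])
        elif line and line != "..." and current is not None:
            result[current].append(line)
    return result


def tag_status(tags, wanted):
    """Return 'ok', 'install-failed' or 'missing' for the wanted tag."""
    if wanted in tags:
        return "ok"
    if any(t.startswith(wanted + "_install-failed") for t in tags):
        return "install-failed"
    return "missing"
```

Note that the closeSE listing names the CE with port and queue (grid10.lal.in2p3.fr:2119/jobmanager-pbs-cms) while the tag listing uses the bare host name, so compare on the part before the colon.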