ExplainingApel
Latest revision as of 09:46, 22 July 2024
Context
Our CE is an HTCondor CE, and the underlying batch system is HTCondor.
These CEs are equipped with an APEL software stack. Its role is to extract information about jobs from log files and load it into a local database. Every day, the database is read and the records extracted from it are sent to a remote APEL accounting server.
From HTCondor job history files to batch and blah files
Each time a job finishes, a job record is created in the directory /var/lib/condor/history.
Thanks to a systemd timer (see /usr/lib/systemd/system/condor-ce-apel.timer), a script (/usr/share/condor-ce/condor_ce_apel.sh) is run every hour to parse these history records and generate the blah and batch files from them via the script /usr/share/condor-ce/condor_batch_blah.py. The blah and batch files are created in the directory /var/lib/condor-ce/apel. If, for some reason, the script fails to parse a history file, this file is moved to the subdirectory quarantine. Otherwise, it is removed.
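The "parse, then remove or quarantine" behaviour described above can be sketched as follows. This is only an illustration, not the code of condor_batch_blah.py: the function names, the toy parser and the location of the quarantine subdirectory (placed under the history directory here) are assumptions.

```python
import shutil
import tempfile
from pathlib import Path


def process_history_dir(history_dir, parse_record):
    """Parse every file in history_dir; quarantine failures, delete successes."""
    history_dir = Path(history_dir)
    quarantine = history_dir / "quarantine"  # assumed location
    parsed, failed = [], []
    # sorted() materialises the listing before any file is moved or removed
    for path in sorted(history_dir.iterdir()):
        if not path.is_file():
            continue  # skip the quarantine subdirectory itself
        try:
            parse_record(path.read_text())
        except ValueError:
            quarantine.mkdir(exist_ok=True)
            shutil.move(str(path), str(quarantine / path.name))
            failed.append(path.name)
        else:
            path.unlink()
            parsed.append(path.name)
    return parsed, failed


def toy_parser(text):
    """Hypothetical parser: only checks that a GlobalJobId attribute is present."""
    if "GlobalJobId" not in text:
        raise ValueError("missing GlobalJobId")
    return text


def demo():
    """Run the sketch on two invented history files in a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "history.1").write_text('GlobalJobId = "ce03#1.0"\n')
        (Path(tmp) / "history.2").write_text("garbage\n")
        parsed, failed = process_history_dir(tmp, toy_parser)
        quarantined = sorted(p.name for p in (Path(tmp) / "quarantine").iterdir())
        return parsed, failed, quarantined
```

Looking into the quarantine subdirectory from time to time is a cheap way to notice history files that never made it into the accounting.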
The blah files contain information provided by the CE layer (the user DN, for example), while the batch files contain low-level information coming from the underlying batch system (which can be SLURM, LSF, PBS, HTCondor, ...).
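To give an idea of what a blah record looks like, here is a minimal sketch that splits one line into its fields and checks that userFQAN is present and non-empty. It assumes that a record is a single line of double-quoted key=value pairs, as suggested by the "userFQAN=%s" format string used in condor_batch_blah.py. The sample line and its other field names and values are invented for the example.

```python
import re

# One field looks like "key=value"; the value may itself contain '=' signs.
FIELD_RE = re.compile(r'"(\w+)=([^"]*)"')


def parse_blah_line(line):
    """Return the fields of one blah record as a dict."""
    return dict(FIELD_RE.findall(line))


def has_user_fqan(record):
    """True if the record carries a non-empty userFQAN field."""
    return bool(record.get("userFQAN"))


# Invented sample record (field names other than userFQAN are assumptions)
SAMPLE = (
    '"timestamp=2024-07-22 09:46:00" '
    '"userDN=/DC=org/CN=Jane Doe" '
    '"userFQAN=/vo/Role=NULL/Capability=NULL" '
    '"lrmsID=1234.0"'
)
```

For instance, has_user_fqan(parse_blah_line(SAMPLE)) is true, while a line without any userFQAN pair is reported as incomplete.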
After the blah and batch files have been generated, the script condor_ce_apel.sh calls /usr/bin/apelparser, a Python script whose role is to update the local APEL MySQL database with the content of the blah and batch files.
The APEL database
In our case, it's a MySQL database named apelclient. It is created and initialized automatically by Puppet when the CE is deployed. It contains tables such as BlahdRecords, which stores the records from blah files, EventRecords, which stores the records from batch files, and JobRecords, which stores the records generated by the join performed by the apelclient script (see next section). In this database, you'll also find views (their names begin with a 'V') that give fast access to the data. Many operations on these tables are actually performed through MySQL stored procedures.
A very important table for sysadmins is ProcessedFiles. It is used by the apelparser script to check whether a file has already been processed. When the processing of a file has failed, the file is recorded with its Parsed field set to 0. Otherwise, this field indicates the number of records processed in the file.
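The way the Parsed field can be used to spot failed files is sketched below. To keep the example self-contained, it runs against an in-memory SQLite table standing in for the MySQL ProcessedFiles table. Only the Parsed column is taken from the description above; the FileName column, the simplified schema and the sample rows are assumptions.

```python
import sqlite3


def failed_files(conn):
    """Names of the files whose parsing failed (Parsed = 0)."""
    rows = conn.execute(
        "SELECT FileName FROM ProcessedFiles WHERE Parsed = 0 ORDER BY FileName"
    )
    return [row[0] for row in rows]


def forget_failed_files(conn):
    """Delete the Parsed = 0 rows so that the parser picks these files up again."""
    cur = conn.execute("DELETE FROM ProcessedFiles WHERE Parsed = 0")
    conn.commit()
    return cur.rowcount


def demo_connection():
    """Stand-in database with a simplified, assumed schema and invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ProcessedFiles (FileName TEXT, Parsed INTEGER)")
    conn.executemany(
        "INSERT INTO ProcessedFiles VALUES (?, ?)",
        [("blah-20240720", 152), ("blah-20240721", 0), ("batch-20240721", 148)],
    )
    return conn
```

On the real database the same two statements would be issued against apelclient with a MySQL client, after checking the actual column names of the table.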
Generation of job and summary records
Once a day, job records are generated by the apelclient Python script. More precisely, this script accomplishes the following tasks:
- fetch benchmark information from LDAP database;
- join EventRecords and BlahdRecords into JobRecords;
- summarise jobs;
- unload JobRecords or SummaryRecords into filesystem.
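The join and summarise steps can be pictured with the toy sketch below. It is not the apelclient implementation (which works inside the database): the dictionary keys, the choice of the local batch job identifier as join key and the grouping by FQAN are assumptions made for the illustration.

```python
def join_records(event_records, blahd_records):
    """Pair each batch (event) record with the blah record of the same job."""
    blah_by_id = {blah["lrms_id"]: blah for blah in blahd_records}
    jobs = []
    for event in event_records:
        blah = blah_by_id.get(event["job_name"])
        if blah is None:
            continue  # no CE-level information for this job: cannot be joined
        jobs.append(
            {
                "job_name": event["job_name"],
                "wall_seconds": event["wall_seconds"],
                "user_fqan": blah["user_fqan"],
            }
        )
    return jobs


def summarise(job_records):
    """Aggregate job count and wall time per FQAN."""
    summary = {}
    for job in job_records:
        entry = summary.setdefault(job["user_fqan"], {"jobs": 0, "wall_seconds": 0})
        entry["jobs"] += 1
        entry["wall_seconds"] += job["wall_seconds"]
    return summary
```

The sketch also shows why a blah record without usable content matters: a batch record that finds no matching blah record never becomes a job record.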
Sending the job and summary records to the remote accounting server
During this step, the records that were previously unloaded to the filesystem are sent by SSM to a remote accounting server. This step can be performed by the apelclient script if the variable enabled in the section SSM of the config file /etc/apel/client.cfg is set to true.
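A quick way to check this setting is sketched below with Python's standard configparser module. It assumes that client.cfg is a plain INI file. Because the exact case of the section name is not guaranteed here, the section is looked up case-insensitively; the function name is hypothetical.

```python
import configparser


def ssm_enabled(cfg_text):
    """Return True if the SSM section of the given config text enables sending."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    for section in parser.sections():
        # Section names are case-sensitive in configparser, so compare lowercased
        if section.lower() == "ssm":
            return parser.getboolean(section, "enabled", fallback=False)
    return False
```

On the CE, this could be called as ssm_enabled(open("/etc/apel/client.cfg").read()).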
The script that is run to send these records is /usr/bin/ssmsend. It is configured by the file /etc/apel/sender.cfg. The protocol we use to send the records is AMS (instead of STOMP).
Please note that for this step to work, you need to declare the apel service of the CE in GOCDB.
Configuration files
They are located in /etc/apel. Here are the config files relevant to us:
- /etc/apel/client.cfg
- /etc/apel/parser.cfg
- /etc/apel/sender.cfg.
Problems encountered
- Missing userFQAN field in blah records -> run the script fix_blah_file.py from iihe-scripts/ce/correct_blah_files and, in the table ProcessedFiles, remove the records of the blah files whose Parsed field is 0; otherwise you will get duplicates (if there is no record for a batch file in the ProcessedFiles table, the parser will reparse the file). To avoid the problem of the missing userFQAN field in blah records in the future, the script that generates the blah files was altered in the following way:
[root@ce03 ~]# diff /usr/share/condor-ce/condor_batch_blah.py /usr/share/condor-ce/condor_batch_blah.py.orig
167c167
< '"userFQAN=/vo/Role=NULL/Capability=NULL" ', "EMPTY",
---
> '"userFQAN=%s" ', "x509UserProxyFirstFQAN",
This change will be wiped out by an update of the package htcondor-ce-apel.
- Some records in batch files are malformed (8 fields instead of 11) -> the following fields are sometimes missing in the history file: RemoteWallClockTime, JobStartDate, ResidentSetSize_RAW.
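To locate the malformed batch records, a field count per line is usually enough. The sketch below reports the lines that do not have the expected 11 fields. The "|" delimiter is an assumption about the batch file format and should be checked against a real file; the function name is hypothetical.

```python
EXPECTED_FIELDS = 11


def malformed_lines(lines, delimiter="|", expected=EXPECTED_FIELDS):
    """Return (line_number, field_count) for every record with a wrong field count."""
    bad = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        count = len(line.split(delimiter))
        if count != expected:
            bad.append((number, count))
    return bad
```

It can be applied to a batch file with malformed_lines(open(path)), where path points to one of the files in /var/lib/condor-ce/apel.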