ExplainingApel
Latest revision as of 09:46, 22 July 2024
Context
Our CE is an HTCondor CE, and the underlying batch system is HTCondor.
These CEs are equipped with an APEL software stack. Its role is to extract information about jobs from log files and feed it into a local database. Every day, records are extracted from this database and sent to a remote APEL accounting server.
From HTCondor job history files to batch and blah files
Each time a job is finished, a job record is created in the directory /var/lib/condor/history.
Thanks to a systemd timer (see /usr/lib/systemd/system/condor-ce-apel.timer), a script (/usr/share/condor-ce/condor_ce_apel.sh) runs every hour. It parses these history records and generates the blah and batch files from them via the script /usr/share/condor-ce/condor_batch_blah.py. The blah and batch files are created in the directory /var/lib/condor-ce/apel. If a history file cannot be parsed for some reason, it is moved to the subdirectory quarantine. Otherwise, it is removed.
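The "quarantine on failure, remove on success" behaviour described above can be sketched as follows. This is only an illustration, not the code of condor_batch_blah.py: the function names, the simplified "Name = Value" parsing and the way the quarantine directory is passed in are assumptions made for the example.

```python
import shutil
import tempfile
from pathlib import Path


def parse_history_file(path: Path) -> dict:
    """Hypothetical parser: read 'Name = Value' lines into a dict."""
    record = {}
    for line in path.read_text().splitlines():
        if not line.strip():
            continue
        name, sep, value = line.partition(" = ")
        if not sep:
            raise ValueError(f"unparsable line in {path.name}: {line!r}")
        record[name.strip()] = value.strip()
    return record


def process_history_dir(history_dir: Path, quarantine_dir: Path) -> list:
    """Parse every history file; quarantine failures, delete successes."""
    quarantine_dir.mkdir(parents=True, exist_ok=True)
    records = []
    for path in sorted(p for p in history_dir.iterdir() if p.is_file()):
        try:
            records.append(parse_history_file(path))
        except ValueError:
            # Keep the unparsable file around so a sysadmin can inspect it.
            shutil.move(str(path), str(quarantine_dir / path.name))
        else:
            path.unlink()
    return records


def demo() -> tuple:
    """Run the sketch on a throwaway directory (all paths are made up)."""
    with tempfile.TemporaryDirectory() as tmp:
        history = Path(tmp) / "history"
        quarantine = Path(tmp) / "quarantine"
        history.mkdir()
        (history / "history.1").write_text('ClusterId = 1\nOwner = "alice"\n')
        (history / "history.2").write_text("garbage\n")
        records = process_history_dir(history, quarantine)
        remaining = sorted(p.name for p in history.iterdir())
        quarantined = sorted(p.name for p in quarantine.iterdir())
        return records, remaining, quarantined
```

Calling demo() leaves the history directory empty and the unparsable file in the quarantine directory, which is a quick way to check for stuck files on a real CE as well.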
The blah files contain information provided by the CE layer (the user DN, for example), while the batch files contain low-level information coming from the underlying batch system (which can be SLURM, LSF, PBS, HTCondor, ...).
After the blah and batch files have been generated, the script condor_ce_apel.sh calls /usr/bin/apelparser, a Python script whose role is to update the local APEL MySQL database with the content of the blah and batch files.
The APEL database
In our case, it's a MySQL database named apelclient. It is created and initialized automatically by Puppet when the CE is deployed. It contains tables such as BlahdRecords, which stores the records from blah files, EventRecords, which stores the records from batch files, and JobRecords, which stores the records generated by the join performed by the apelclient script (see next section). In this database, you'll also find views (their names begin with a 'V') that give fast access to the data. Many operations on these tables are actually performed through MySQL procedures.
A very important table for sysadmins is ProcessedFiles. The apelparser script uses it to check whether a file has already been processed. When the processing of a file has failed, the file is recorded with its Parsed field set to 0. Otherwise, this field indicates the number of records processed in the file.
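To make the role of the Parsed field concrete, here is a minimal sketch that lists the files whose parsing failed and removes their entries so that they are no longer considered processed. It uses an in-memory SQLite database as a stand-in for the real MySQL database; the column name FileName and the sample rows are assumptions for the example, only the table name ProcessedFiles and the column Parsed come from the description above.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny stand-in for the ProcessedFiles table (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ProcessedFiles (FileName TEXT, Parsed INTEGER)")
    conn.executemany(
        "INSERT INTO ProcessedFiles (FileName, Parsed) VALUES (?, ?)",
        [("blah-20240720", 152), ("blah-20240721", 0), ("batch-20240721", 97)],
    )
    return conn


def failed_files(conn: sqlite3.Connection) -> list:
    """Return the names of files recorded with Parsed = 0 (parsing failed)."""
    rows = conn.execute(
        "SELECT FileName FROM ProcessedFiles WHERE Parsed = 0 ORDER BY FileName"
    )
    return [name for (name,) in rows]


def forget_failed_files(conn: sqlite3.Connection) -> int:
    """Delete the Parsed = 0 entries and return how many rows were removed."""
    cursor = conn.execute("DELETE FROM ProcessedFiles WHERE Parsed = 0")
    conn.commit()
    return cursor.rowcount
```

On the real database the same two SQL statements would be run from a MySQL client against apelclient, after checking the actual column names with DESCRIBE ProcessedFiles.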
Generation of job and summary records
Once a day, job records are generated by the apelclient Python script. More precisely, this script accomplishes the following tasks:
- fetch benchmark information from the LDAP database;
- join EventRecords and BlahdRecords into JobRecords;
- summarise jobs;
- unload JobRecords or SummaryRecords to the filesystem.
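The join and summarise steps listed above can be pictured with the following sketch. It is a simplified model and not what apelclient actually runs (the real work happens in the database): the field names job_id, user_dn and wall_seconds, and the choice of grouping by user, are assumptions chosen for the illustration.

```python
from collections import defaultdict


def join_records(event_records: list, blahd_records: list) -> list:
    """Merge batch-level and CE-level records that share a job identifier."""
    blah_by_job = {blah["job_id"]: blah for blah in blahd_records}
    jobs = []
    for event in event_records:
        blah = blah_by_job.get(event["job_id"])
        if blah is None:
            continue  # no CE-level information for this job, so no job record
        jobs.append({**event, **blah})
    return jobs


def summarise(job_records: list) -> dict:
    """Aggregate job count and wall time per user DN."""
    summary = defaultdict(lambda: {"jobs": 0, "wall_seconds": 0})
    for job in job_records:
        entry = summary[job["user_dn"]]
        entry["jobs"] += 1
        entry["wall_seconds"] += job["wall_seconds"]
    return dict(summary)


# Made-up sample data: two batch records, only one has a matching blah record.
EVENTS = [
    {"job_id": "101.0", "wall_seconds": 3600},
    {"job_id": "102.0", "wall_seconds": 1800},
]
BLAHS = [{"job_id": "101.0", "user_dn": "/DC=org/CN=alice"}]
```

The sketch also shows why a missing or broken blah record matters: a batch record without its CE-level counterpart does not produce a job record.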
Sending the job and summary records to the remote accounting server
During this step, the records that were previously unloaded to the filesystem are sent by SSM to a remote accounting server. This step can be performed by the apelclient script if the variable enabled in the section SSM of the config file /etc/apel/client.cfg is set to true.
The script that is run to send these records is /usr/bin/ssmsend. It is configured by the file /etc/apel/sender.cfg. The protocol we use to send the records is AMS (instead of STOMP).
Please note that for this step to work, you need to declare the APEL service of the CE in GOCDB.
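To check quickly whether sending is switched on, the enabled variable mentioned above can be read with Python's standard configparser. This is a minimal sketch: the sample text and the lowercase section name ssm are assumptions, so adapt the section name to what your client.cfg really contains.

```python
import configparser

# Made-up excerpt standing in for /etc/apel/client.cfg.
SAMPLE_CLIENT_CFG = """
[ssm]
enabled = true
"""


def ssm_enabled(cfg_text: str, section: str = "ssm") -> bool:
    """Return True if the given section sets 'enabled' to a true value."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    # fallback=False covers a missing section or a missing option.
    return parser.getboolean(section, "enabled", fallback=False)
```

On a CE, the text would come from reading the actual config file, for example ssm_enabled(open("/etc/apel/client.cfg").read()).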
Configuration files
They are located in /etc/apel. Here are the config files relevant to us:
- /etc/apel/client.cfg
- /etc/apel/parser.cfg
- /etc/apel/sender.cfg.
Problems encountered
- Missing userFQAN field in blah records -> use the script fix_blah_file.py in iihe-scripts/ce/correct_blah_files to repair the files. Then, in the table ProcessedFiles, remove the records of the blah files whose Parsed field is 0, otherwise you will get duplicates (if there is no record for a file in the ProcessedFiles table, the parser will parse the file again). To avoid the problem of the missing userFQAN field in blah records in the future, the script that generates the blah files was altered in the following way:
    [root@ce03 ~]# diff /usr/share/condor-ce/condor_batch_blah.py /usr/share/condor-ce/condor_batch_blah.py.orig
    167c167
    < '"userFQAN=/vo/Role=NULL/Capability=NULL" ', "EMPTY",
    ---
    > '"userFQAN=%s" ', "x509UserProxyFirstFQAN",
This change will be wiped out by an update of the package htcondor-ce-apel.
- Some records in batch files are malformed (8 fields instead of 11) -> the following fields are sometimes missing in the history file: RemoteWallClockTime, JobStartDate, ResidentSetSize_RAW
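To spot the history files that will lead to such malformed batch records, one can check them for the three attributes named above before they are parsed. The sketch below assumes that a history file is a plain list of "Name = Value" lines; the function name and the sample content are made up for the illustration.

```python
REQUIRED_ATTRIBUTES = ("RemoteWallClockTime", "JobStartDate", "ResidentSetSize_RAW")


def missing_attributes(history_text: str, required=REQUIRED_ATTRIBUTES) -> list:
    """Return the required attribute names that do not appear in the text."""
    present = set()
    for line in history_text.splitlines():
        name, sep, _value = line.partition("=")
        if sep:
            present.add(name.strip())
    return [attribute for attribute in required if attribute not in present]


# Made-up history record in which two of the three attributes are absent.
SAMPLE_HISTORY = 'ClusterId = 42\nRemoteWallClockTime = 3600.0\nOwner = "alice"\n'
```

An empty result means that all three attributes are present in the record.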