CleanCreamdb
Initial problem
The problem happened with the following version of CREAM :
glite-ce-cream-1.12.4-2 glite-CREAM-3.2.8-2.sl5
In March 2012, while we were investigating on cream01 instability, we saw that the creamdb was filled with thousands of old jobs from previous year. To see this, we used the following kind of MySQL request :
[root@cream01 ~]# mysql -u root -p creamdb -e "SELECT COUNT(*) FROM job_status WHERE (time_stamp < '2012-01-01' AND type = 0);"
The table "job_status" records the different states that the jobs have gone through. The field "type" in the table "job_status" is a number giving the status (0 --> REGISTERED, 1 --> PENDING, 2 --> IDLE, 3 --> RUNNING, 4 --> REALLY RUNNING, 5 --> CANCELLED, 7 --> DONE-OK, 8 --> DONE-FAILED, 9 --> ABORTED).
There is purge mechanism that is triggered each time tomcat is restarted (that is everyday at 2AM), as can be seen from the logs (/var/log/glite/glite-ce-cream.log*) :
13 Mar 2012 02:25:44,756 org.glite.ce.creamapi.jobmanagement.cmdexecutor.JobPurger - purging 0 jobs with status ABORTED <= Sat Mar 03 02:25:44 CET 2012 13 Mar 2012 02:25:44,766 org.glite.ce.creamapi.jobmanagement.cmdexecutor.JobPurger - purging 0 jobs with status CANCELLED <= Sat Mar 03 02:25:44 CET 2012 13 Mar 2012 02:25:44,847 org.glite.ce.creamapi.jobmanagement.cmdexecutor.JobPurger - purging 0 jobs with status DONE-OK <= Mon Feb 27 02:25:44 CET 2012 13 Mar 2012 02:25:44,853 org.glite.ce.creamapi.jobmanagement.cmdexecutor.JobPurger - purging 0 jobs with status DONE-FAILED <= Sat Mar 03 02:25:44 CET 2012 13 Mar 2012 02:25:47,761 org.glite.ce.creamapi.jobmanagement.cmdexecutor.JobPurger - purging 28 jobs with status REGISTERED <= Sun Mar 11 02:25:44 CET 2012
So, how can it be that we still have old jobs in creamdb ? The answer lies in the fact that only jobs in terminal status can be purged, and the old jobs we found were all stuck in REALLY-RUNNING status.
How to purge jobs in a non-terminal status ?
Following the instructions given here, we first tried the command JobDBAdminPurger.sh, but it didn't work due to bugs (https://savannah.cern.ch/bugs/?81561). We then had the idea to enforce the aborted status for the old jobs by inserting the suitable records in the table job_status. For this, we first need to get the list of all old jobs :
[root@cream01 ~]# mysql -u root -p creamdb -e "SELECT DISTINCT(jobId) FROM job_status WHERE (time_stamp < '2012-01-01' AND type = 0);" > test_old_jobs_search
We then create a small Perl script to generate the MySQL insert commands from the list of jobId's :
[root@cream01 purge_old_jobs]# cat generate_sql_command_from_creamid.pl #!/usr/bin/perl my $filename = shift; open(FILE,$filename); while(<FILE>) { chomp; print "INSERT INTO job_status (type,time_stamp,jobId) VALUES(10,'2012-03-09 16:24:36','$_');\n"; } close(FILE); [root@cream01 ~]# ./generate_sql_command_from_creamid.pl test_old_jobs_search > abort_old_jobs.sql
We then execute the list of commands generated above :
[root@cream01 ~]# mysql -u root -p creamdb < abort_old_jobs.sql
Now that the old jobs are all in aborted status, we can just wait for the automatic purge mechanism to remove them. As we are not that patient, we decide to remove them using the glite-ce-job-purge command from an UI with the cream-admin privilege (to get this privilege, add your DN in /etc/grid-security/admin-list). Here is an example showing how to purge a single job with this command :
glite-ce-job-purge https://cream01.iihe.ac.be:8443/CREAM979034579
To automate the operations, we have created a Perl script to apply this command to the whole list of jobId's :
[sgerard@m7 ~]$ cat automatic_purge.pl #!/usr/bin/perl my $file = shift; open(FILE,$file); my $compteur = 1; while(<FILE>) { chomp; my $command = "glite-ce-job-purge --noint https://cream01.iihe.ac.be:8443/$_"; print "N°$compteur\n"; print "$command\n"; my $res = system($command); $compteur += 1; } close(FILE);
To use it, don't forget to create a proxy, and then call the script with the list of jobId's generated above as argument :
./automatic_purge.pl list_jobs_to_purge