AutoTopTreeProdDiscussion

From T2B Wiki

Follow-up on the 29-11-2010 (Top@BXL) meeting


Location of the slides

  • The slides of the meeting can be found here: [http://w3.iihe.ac.b...]

Aim of the automatic production of the toptrees

  • Since the beginning of 2009, the Brussels top-quark group (O(10) analyses/people) has been using the TopTree data format on top of the PAT data format. This significantly reduces the disk space needed to store samples and greatly speeds up the turn-around of analyses, while keeping as much flexibility as possible.
  • Since the toptrees are common to most people doing analysis, it is convenient to join efforts in creating them. This prevents duplication of work and resources and keeps the samples uniform for the whole group.
  • Michael and Stijn have developed a tool to automate the production of the toptrees, which takes care of several aspects of the production and maintenance of samples. All of this can be done via a rather simple, straightforward web interface.

Discussions during the meeting

Point 1: The question was raised what the effort would be if there were no automatic TopTree production tool. The different steps of the production and maintenance of samples are briefly described here, together with the advantages/disadvantages of the automatic TopTree producer (ATTP):

  • Set up a CMSSW release. This needs to be done for the ATTP as well.
  • Generate the config file. The ATTP automatically checks in DBS whether the sample exists; without the tool this would have to be done by hand.
  • Check the config files.
  • Submit jobs to the grid. An important gain with the ATTP is the serial production of PAT and TopTree: it includes a script that performs two cmsRuns, while a regular CRAB job can only run a single cmsRun.
  • Monitoring: the ATTP takes care of the monitoring, checks the status of the jobs at a fixed time interval, and can base its actions on the outcome of crab -status. This greatly reduces the human resources needed for this tedious task.
  • Resubmit failed jobs. Done automatically by the ATTP.
  • Retrieve jobs. Done automatically by the ATTP.
  • Bookkeeping of produced samples. This is an important feature of the ATTP, which is interfaced with the TopDB. In general, bookkeeping all the relevant information of produced samples is tedious; the ATTP makes this easier by filling the TopDB with information on the produced sample.
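The monitoring, resubmission and retrieval steps above can be sketched as the following polling logic. This is a minimal illustration, not the actual ATTP code: the status-line format and the state names are assumptions based on CRAB2-style `crab -status` output, and the helper names are hypothetical.

```python
import re

def parse_crab_status(output):
    """Extract a {state: count} summary from simplified `crab -status`
    output lines such as '   12  Done' or '    3  Aborted'."""
    states = {}
    for line in output.splitlines():
        m = re.match(r"\s*(\d+)\s+(\w+)", line)
        if m:
            states[m.group(2)] = states.get(m.group(2), 0) + int(m.group(1))
    return states

def decide_actions(states):
    """Map a job-state summary onto the actions the ATTP automates:
    resubmitting failed jobs and retrieving finished ones."""
    actions = []
    if states.get("Aborted", 0):
        actions.append("resubmit")   # would invoke `crab -resubmit` on the failed jobs
    if states.get("Done", 0):
        actions.append("getoutput")  # would invoke `crab -getoutput`
    if not actions:
        actions.append("wait")       # jobs still running/scheduled; poll again later
    return actions
```

For example, `decide_actions(parse_crab_status("  10  Done\n   2  Aborted"))` returns `['resubmit', 'getoutput']`.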

Point 2: Over the last weeks (months) the ATTP has proven not to be fully reliable. The question was raised how long it took to TopTree-ify the Fall10 samples.

  • The short answer is 1.5 months.
  • The longer answer is 2 to 3 weeks: had no technical issues appeared, about 3 weeks would have been needed, given the current setup, to produce all samples.
  • Technical issues with the ATTP:
    • The GRID: the GRID is not 100% efficient, which is (partially) inevitable. It can however be optimized by not using the CRAB server, which was found to be often unreliable. CRAB does not allow stand-alone submission when a user wants to submit more than 500 jobs. For large datasets this is difficult to avoid, except by splitting the task into several sub-datasets. This last trick cannot be fully exploited, however, since the performance of the current machine is limited.
    • The software area of mTop: during the Fall10 production round, 80% of the time was lost due to problems with the installed software. Intensive debugging was done but no conclusive answer was found. Possible solutions: mount the software area of the Tier2, or reinstall the software from scratch. The latter means downtime for mTop and has to wait until all production is done.
    • Limitations of the mTop machine: The mTop machine is a rather old server.
    • Simultaneous skimming and TopTree production is hard and can crash the machine. More power would have the advantage that the skimming could be performed on mTop for all samples, so that the interface of the skims with the TopDB is guaranteed.
    • During the Fall10 production only a reduced number of workers (on average 2) were used. These workers submit the ATTP tasks and thus make sure samples can be run in parallel (Michael/Stijn, is this right?). The current machine can run at most 7-8 workers if no skimming is performed at the same time.
  • The conclusion is that, had no software problems occurred and had the machine been used at a rate of 8 workers, about 3 weeks should have been sufficient to have all samples ready. With a new machine this time would be reduced further, e.g. to 2 weeks, and the functionality could be extended towards skimming.
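The splitting workaround mentioned above, breaking a large task into sub-datasets small enough for stand-alone submission, can be sketched as follows. The helper is hypothetical (not part of the ATTP); the 500-job limit is taken from the text.

```python
def split_task(n_jobs, max_per_task=500):
    """Split a production task of n_jobs into (start, end) job-index
    ranges of at most max_per_task jobs each, so every chunk stays
    under CRAB's stand-alone submission limit."""
    return [(start, min(start + max_per_task, n_jobs))
            for start in range(0, n_jobs, max_per_task)]
```

For example, `split_task(1200)` yields three chunks: `[(0, 500), (500, 1000), (1000, 1200)]`.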

Point 3: An important point of discussion was the potential purchase of a new (additional) server for running the ATTP/skimming. A short overview of the advantages and disadvantages is listed here. The goal of the server is to extend the current server rather than to replace it.

  • Disadvantages:
    • Price: the minimally needed server costs between 2000 and 3000 euro.
    • Place in the Tier2: the server would need to be added to the racks of the Tier2. A short discussion with Shkelzen pointed out that space is not an issue. It could even be added to the back-up power system of the Tier2.
    • Human resources: a new/additional machine needs human resources to set up, maintain, monitor and perform interventions on.
  • Advantages:
    • A more powerful machine can run more workers. This results in more simultaneous CRAB jobs and allows the skimming to be performed on mTop.
    • Redundancy: currently all data on mTop (TopDB) is backed up daily, but no other machine can take over. If the current machine fails, the setup has to be rebuilt from scratch, while for an additional machine the current setup can serve as an example. Buying replacement parts for the current machine would also be more expensive than for a newer machine.

Task sharing

During the meeting it was clearly stated that the current work sharing cannot be maintained. Most (all) of the work/development was done by Michael and Stijn, while the production of samples via CRAB/GRID is the task of everybody needing these samples. The sharing of the work can be factorized as follows (from Michael's slides):

  • PAT and TopTree: test/update the PAT/TopTree configuration with new CMSSW/PAT/TopTreeProducer versions. 1 person at 10% FTE
  • TopDB: Maintain TopDB: Update LumiCalc, update TopSkim interface, implement new features in TopDB, test the production workflow with new versions of CRAB. 1 person at 5% FTE
  • Documentation: Writing documentation on the scripts and how to operate them. 2 persons at 5% FTE. (Michael and Stijn)
  • Monitoring: Monitor production workflow and add production requests. 1 person at 10% FTE in a shift-like system
  • Server maintenance: Maintenance of mTop.iihe.ac.be and the CMS Software installation. 2 persons at 5% FTE. (Michael and Stijn)

Feel free (or feel obliged) to nominate yourself for one or more tasks.


Task                 FTE   Candidate
PAT and TopTree      10%   Thierry
TopDB                 5%   Gerrit
Documentation         5%   Michael and Stijn
Monitoring           10%   nn
Server maintenance    5%   Michael and Stijn


Proposal for mTop2

When building a hardware-configuration proposal for a potential new mTop, we took into account the potential of central skimming of toptrees and the ability to run at least 10-15 workers for the production scripts. This is why we chose a machine with many cores and enough RAM.

The main specifications of the proposal are:

Chassis        Dell PowerEdge R415
CPU            2x AMD Opteron 4180, 2.6 GHz, 6 cores = 12 cores in total
RAM            8x 2 GB DDR3 1333 MHz = 16 GB memory ≈ 1.3 GB/core
Storage        Non-hot-swappable 2x 500 GB SATA in RAID1 configuration = 500 GB of disk space (*)
Extra          Dell iDRAC6 Enterprise card (**)
Support        3 years Dell ProSupport and 4-hour mission critical (onsite service)
Normal price   3232.12 EUR
VUB price      ?


(*) We will switch to using the software area of the Tier2, so we do not need more disk space.
(**) This card will allow us to connect to the machine at the hardware level: rebooting/shutting down the machine remotely when Linux has crashed. It also provides detailed hardware monitoring and protection.
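As a quick sanity check of the memory figures in the spec table above (the variable names are just for illustration):

```python
dimms, gb_per_dimm = 8, 2
cpus, cores_per_cpu = 2, 6        # two Opteron 4180 CPUs, 6 cores each

total_gb = dimms * gb_per_dimm    # 8 x 2 GB = 16 GB
cores = cpus * cores_per_cpu      # 12 cores in total
gb_per_core = total_gb / cores    # 16 / 12 ≈ 1.33 GB per core
```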

The idea is that the two machines would co-exist and the old mTop would help with the skimming.

If you have any questions, doubts or feedback, please send them to us before the end of the week.

