CERN/ATLAS data processing on grids and GridFactory

In this post I’ll report on running the application “mc09_7TeV.107691….” from the GridPilot app store. In the case of NorduGrid and WLCG, the ATLAS software is preinstalled on the resources. In the case of GridFactory, the jobs run inside a CernVM appliance with ATLAS software loaded through the AFS network file system. The input dataset consisted of 26 files totaling 36 GB, i.e. each input file was rather large. The files were physically located at atlassrm-fzk.gridka.de.
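For a rough sense of scale (a back-of-the-envelope sketch, not a measurement of individual files), the quoted dataset numbers put the average input file well above a gigabyte:

```python
# Rough average input file size from the dataset figures quoted above.
# Assumes the 36 GB total is decimal gigabytes; actual per-file sizes vary.
n_files = 26
total_gb = 36.0

print(f"Average input file size: {total_gb / n_files:.1f} GB")  # ~1.4 GB per file
```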

Timing results are summarized below. There is no timing for NorduGrid in general because I did not manage to run on clusters other than our own tier-3.

Summary of runs

Resources compared (column labels used in the table below):

  (A) NorduGrid tier-3 cluster
  (B) WLCG
  (C) GridFactory / virt. – 4 nodes – run 1
  (D) GridFactory / virt. – 4 nodes – run 2
  (E) GridFactory – 4 nodes – run 1
  (F) GridFactory – 4 nodes – run 2
  (G) GridFactory – 1 node

                                      (A)     (B)     (C)     (D)     (E)     (F)     (G)
Average submission time per job (s)   1.34    3.69    0.538   0.731   0.731   0.731   0.692
Summed CPU time (s)                   3259    7691    4445    4390    2244    1926    1887
Summed download time (s)              -       72041   8755    11478   8998    7170    6295
User real waiting time (s)*           6356    37528   1595    1702    1354    962     2515
Number of available cores             ~100    -       16      16      16      16      4

* User real waiting time: submission, processing and data transfer time as experienced by the user.


Processing time

Contrary to the simulation runs and more in line with the boildown runs, this time our tier-3 cluster performed substantially better than the average WLCG resource, but still substantially worse than a new desktop PC. In fact, a user is better off running 26 such jobs on a single desktop PC with 4 Intel i7 cores: they finish 2.5 times faster than on the tier-3 cluster with 160 cores (of which a few were busy with other jobs). This is partly because each job runs 1.7 times faster on the desktop PC; the remaining gap is presumably due to the desktop PC having a faster internet connection and/or overhead incurred by the grid system.
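For reference, the speedup factors quoted above can be reproduced from the summary table; the sketch below assumes the desktop PC corresponds to the "GridFactory – 1 node" column (G) and compares it with the tier-3 column (A).

```python
# Reproduce the speedup factors quoted above from the summary table.
# Column A = NorduGrid tier-3 cluster; column G = GridFactory, 1 node
# (assumed here to be the 4-core Intel i7 desktop PC).

tier3_wait_s = 6356      # user real waiting time, column A
desktop_wait_s = 2515    # user real waiting time, column G
tier3_cpu_s = 3259       # summed CPU time, column A
desktop_cpu_s = 1887     # summed CPU time, column G

print(f"End-to-end speedup on the desktop: {tier3_wait_s / desktop_wait_s:.1f}x")    # ~2.5x
print(f"Summed CPU time speedup on the desktop: {tier3_cpu_s / desktop_cpu_s:.1f}x")  # ~1.7x
```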

On WLCG, 16 out of 26 jobs failed and had to be resubmitted, with various reasons reported: “user timeout” (.es, .it, .tw), “server responded with an error – transfer aborted” (.tw) and ATLAS misconfiguration (.ru, .za).

On the GridFactory cluster, I ran the same production twice with VirtualBox virtualization and twice without. In both cases, the ATLAS software was read from AFS. In the non-virtualized case, a noticeable speedup was observed in the second run, presumably because AFS had cached the ATLAS software. In the virtualized case, the speedup was much smaller; probably the cache in the virtual machine is too small to make a difference. Download times varied quite a lot, presumably for reasons outside of our control (the file server, the network). Interestingly, the performance penalty incurred by virtualization appears to be almost a factor of 2, much higher than e.g. for the simulation runs. We ascribe this to the I/O penalty of reading rather large input files through a VirtualBox shared folder.
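The factor-of-2 penalty can likewise be read off the table; a minimal sketch comparing the summed CPU times of the virtualized (C, D) and non-virtualized (E, F) 4-node runs:

```python
# Estimate the virtualization penalty from the summary table above by
# comparing summed CPU times of the virtualized (C, D) and
# non-virtualized (E, F) 4-node GridFactory runs.

virt_cpu_s = [4445, 4390]    # columns C and D (VirtualBox)
native_cpu_s = [2244, 1926]  # columns E and F (no virtualization)

for run, (v, n) in enumerate(zip(virt_cpu_s, native_cpu_s), start=1):
    print(f"Run {run}: virtualized / native CPU time = {v / n:.2f}")  # ~2.0, ~2.3
```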
