In this post I’ll report on running the application “mc09_7TeV.107691….” from the GridPilot app store. In the case of NorduGrid and WLCG, the ATLAS software is preinstalled on the resources. In the case of GridFactory, the jobs run inside a CernVM appliance with ATLAS software loaded through the AFS network file system. The input dataset consisted of 26 files totaling 36 GB, i.e. each input file was rather large. The files were physically located at atlassrm-fzk.gridka.de.
Timing results are summarized below. There is no timing for NorduGrid in general because I did not manage to run on clusters other than our own tier-3.
| | NorduGrid tier-3 cluster (A) | WLCG | GridFactory / virt. – 4 nodes – run 1 (C) | GridFactory / virt. – 4 nodes – run 2 (D) | GridFactory – 4 nodes – run 1 (E) | GridFactory – 4 nodes – run 2 (F) | GridFactory – 1 node |
|---|---|---|---|---|---|---|---|
| Average submission time per job (s) | 1.34 | 3.69 | 0.538 | 0.731 | 0.731 | 0.731 | 0.692 |
| Summed CPU time (s) | 3259 | 7691 | 4445 | 4390 | 2244 | 1926 | 1887 |
| Summed download time (s) | – | 72041 | 8755 | 11478 | 8998 | 7170 | 6295 |
| User real waiting time (submission, processing and data transfer) (s) | 6356 | 37528 | 1595 | 1702 | 1354 | 962 | 2515 |
| Number of available cores | ~100 | – | 16 | 16 | 16 | 16 | 4 |
On WLCG, 16 out of the 26 jobs failed and had to be resubmitted, with various reasons reported: “user timeout” (.es, .it, .tw), “server responded with an error – transfer aborted” (.tw), and ATLAS misconfiguration (.ru, .za).
On the GridFactory cluster, I ran the same production twice with VirtualBox virtualization and twice without. In both cases, the ATLAS software was read from AFS. Without virtualization, a noticeable speedup was observed in the second run – presumably because AFS had cached the ATLAS software by then. With virtualization, the speedup was much smaller – probably because the cache in the virtual machine is too small to make a difference. Download times varied quite a lot – presumably for reasons outside of our control (the file server, the network). Interestingly, the performance penalty incurred by virtualization appears to be almost a factor of 2 – much higher than e.g. in the case of the simulation runs. We ascribe this to the I/O penalty of reading a rather large input file through a VirtualBox shared folder.
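The factor-of-2 claim (and the AFS caching speedup) can be checked with a few lines of arithmetic on the summed CPU times from the table above. A quick sketch – the numbers are copied from the table, and the variable names are mine:

```python
# Summed CPU times (s) on GridFactory, 4 nodes, from the table above.
cpu_virt = {"run 1": 4445, "run 2": 4390}    # with VirtualBox virtualization
cpu_native = {"run 1": 2244, "run 2": 1926}  # without virtualization

# Virtualization penalty per run: virtualized CPU time / native CPU time.
for run in ("run 1", "run 2"):
    penalty = cpu_virt[run] / cpu_native[run]
    print(f"{run}: virtualization CPU penalty = {penalty:.2f}x")

# Speedup of the second native run over the first, presumably due to the
# warm AFS cache.
afs_speedup = cpu_native["run 1"] / cpu_native["run 2"]
print(f"native run 2 vs run 1: {afs_speedup:.2f}x faster")
```

This prints penalties of about 1.98x and 2.28x for the two runs, and an AFS caching speedup of about 1.17x for the native case.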