CERN/ATLAS Monte Carlo simulation on grids and clouds

In previous posts we saw that I/O-bound jobs ran ~3× faster on standard SATA disks than on network file systems and block devices (GPFS, NFS, EBS). This post reports on CPU-bound jobs. I ran the standard ATLAS Monte Carlo simulation on both grid and cloud resources: I imported the ATLAS simulation app and ran the default batch of 100 small jobs, each generating 100 events and writing an output file of about 0.8 MB. On the two clouds (C and D – insofar as a virtualized batch system qualifies as a cloud), the jobs ran inside a CernVM appliance with the ATLAS software loaded through the CVMFS network file system. The timing results are summarized in the table and chart below.
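To make the two timing metrics in the table unambiguous, here is a minimal sketch of how they can be measured. This is not the code actually used for these runs (submission went through the NorduGrid, gLite and GridFactory clients); `submit_job()` is a hypothetical placeholder for whatever submission command a given resource provides.

```python
import time

# Minimal sketch (not the actual submission code): measure the average
# submission time per job and the overall "user real waiting time" for a
# batch of 100 small simulation jobs, 100 events each.

N_JOBS = 100          # the default batch: 100 small jobs
EVENTS_PER_JOB = 100  # each job generates 100 events (~0.8 MB output file)

def submit_job(job_index: int) -> None:
    """Placeholder: call the resource's job submission client here."""
    pass

batch_start = time.time()
submission_times = []
for i in range(N_JOBS):
    t0 = time.time()
    submit_job(i)
    submission_times.append(time.time() - t0)

# ... poll until all jobs have finished and their output files are staged back ...

avg_submission_time = sum(submission_times) / N_JOBS  # "Average submission time per job"
user_real_waiting_time = time.time() - batch_start    # "User real waiting time"
print(f"average submission time per job: {avg_submission_time:.2f} s")
print(f"user real waiting time:          {user_real_waiting_time:.0f} s")
```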

Summary of runs

| | NorduGrid tier-3 cluster (A) | WLCG (B) | GridFactory / Irigo (C) | GridFactory / virt. (D) | GridFactory (E) |
|---|---|---|---|---|---|
| Average submission time per job (s) | 0.59 | 6.43 | 0.59 | 0.56 | 0.59 |
| Average running time (s) | 393 | 130 | 83.9 | 85.6 | 75.9 |
| User real waiting time (submission, processing and data transfer) (s) | 2689 | 18000 | 1773 | 2730 | 1700 |
| Number of available cores | ~100 | – | 8 | 8 | 8 |
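Since each run consisted of 100 jobs of 100 events each, per-event simulation time and end-to-end event throughput can be derived directly from the table. The short script below does just that; the derived numbers are simple arithmetic on the values above, not additional measurements.

```python
# Derive per-event simulation time and end-to-end throughput from the table.
results = {
    "A NorduGrid tier-3":   {"running_s": 393.0, "waiting_s": 2689},
    "B WLCG":               {"running_s": 130.0, "waiting_s": 18000},
    "C GridFactory/Irigo":  {"running_s": 83.9,  "waiting_s": 1773},
    "D GridFactory/virt.":  {"running_s": 85.6,  "waiting_s": 2730},
    "E GridFactory":        {"running_s": 75.9,  "waiting_s": 1700},
}

N_JOBS = 100
EVENTS_PER_JOB = 100

for name, r in results.items():
    per_event = r["running_s"] / EVENTS_PER_JOB            # s per event, single job
    throughput = N_JOBS * EVENTS_PER_JOB / r["waiting_s"]  # events per s, end to end
    print(f"{name:22s} {per_event:5.2f} s/event, {throughput:5.2f} events/s overall")
```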


[Chart: Simulation time]

Notes

  • On the GridFactory cluster, switching on virtualization incurred an 11.3% penalty in running time and a 37.7% penalty in "user real waiting time" (see the quick calculation after this list). The latter, rather substantial penalty is primarily due to the short running time of each job, i.e. to the overhead of running jobs via SSH and staging files in and out of the virtual machines.
  • On WLCG (gLite) the “User real waiting time” was exceedingly long because of the “tail” problem mentioned in previous posts: some jobs took a very long time to start.
  • On WLCG (gLite) 28 out of 100 jobs failed for a variety of reasons – most prominently because the ATLAS setup script was not found in the standard location “$VO_ATLAS_SW_DIR/software/[release]/setup.sh”. I don’t know if I should look for the setup script somewhere else – if someone out there does, please post a comment or drop me a line.
  • Despite its many cores, the tier-3 cluster did not do too well. Presumably one reason is simply that its processors are rather old Xeons, at least compared to the Core i7s of the Irigo and GridFactory clusters. Another reason could be that other jobs – using other ATLAS software releases – were running on the cluster at the same time. If jobs using different ATLAS releases happened to land on the same (8-core) node, GPFS may have had trouble serving the software. This last hypothesis is supported by the large spread in the CPU times of the jobs: 59 s – 774 s, compared to e.g. 42 s – 132 s on the GridFactory cluster (without virtualization).
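The percentages quoted in the first note appear to be computed relative to the virtualized run (D), i.e. penalty = (virtualized − bare metal) / virtualized. A quick check under that assumption:

```python
# Check the virtualization penalties quoted above, assuming they are taken
# relative to the virtualized run (D) rather than the bare-metal run (E).
running_virt, running_bare = 85.6, 75.9  # s, columns D and E of the table
waiting_virt, waiting_bare = 2730, 1700  # s, columns D and E of the table

running_penalty = (running_virt - running_bare) / running_virt
waiting_penalty = (waiting_virt - waiting_bare) / waiting_virt

print(f"running-time penalty:      {running_penalty:.1%}")  # ~11.3%
print(f"user waiting-time penalty: {waiting_penalty:.1%}")  # ~37.7%
```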
