CERN/ATLAS boildown on clouds

In this post, I’ll take a look at some more runs of the “atlas_d3pd_boildown” application available in the GridPilot app store. The difference with respect to the runs described in a previous post is that this time I ran on cloud rather than grid resources. I fired up a GridFactory cluster on dedicated hardware and on two public clouds, Amazon’s EC2 and Cabo’s Irigo cloud, and changed my preferences to use each one in turn. On the dedicated hardware I made sure the jobs would run in virtual machines (VirtualBox with a shared folder) – 2 on each of the 4 worker nodes, each running one job – while on both clouds I ran on 8 worker nodes, each running one job. Notice that in all cases the jobs run inside a CernVM appliance, but whereas on the two clouds the ATLAS software is accessed through the CVMFS network file system, on GridFactory it is accessed over the AFS network file system. Notice also that, to avoid the notorious “tail” problem (see previous posts, e.g. this one), this time I ran a “private” cluster on EC2, allowing only my own GridWorkers.

On EC2 I chose instances of type “small” (1.7 GB of RAM, moderate I/O performance, 1 virtual core). On Irigo and on the dedicated hardware with virtualization I chose a matching setup: instances with the same amount of RAM and 1 virtual core. On the dedicated hardware without virtualization I ran 2 jobs at a time on each 4-core physical machine. The results are summarized below.

Summary of runs

|                                                                        | EC2   | Irigo | Dedicated hardware with virtualization | Dedicated hardware without virtualization |
|------------------------------------------------------------------------|-------|-------|----------------------------------------|-------------------------------------------|
| Average submission time per job (s)                                    | 1.24  | 1.09  | 0.531                                  | 0.541                                     |
| Summed running time (s)                                                | 55788 | 13715 | 9166                                   | 7755                                      |
| Summed CPU time (s)                                                    | 9501  | 8644  | 3220                                   | 2517                                      |
| User real waiting time (submission, processing and data transfer) (s)  | 9940  | 6926  | 4980                                   | 4853                                      |
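For a quick comparison of the four setups, here is a minimal Python sketch – not part of GridPilot or GridFactory, just a back-of-the-envelope calculation using only the totals from the table above – that derives the CPU efficiency (summed CPU time over summed running time) and the slowdown in summed running time relative to the fastest setup:

```python
# Back-of-the-envelope comparison of the four setups, using only the
# summary totals from the table above (all times in seconds).
runs = {
    "EC2":                     {"running": 55788, "cpu": 9501, "waiting": 9940},
    "Irigo":                   {"running": 13715, "cpu": 8644, "waiting": 6926},
    "Dedicated, with VMs":     {"running": 9166,  "cpu": 3220, "waiting": 4980},
    "Dedicated, without VMs":  {"running": 7755,  "cpu": 2517, "waiting": 4853},
}

fastest = min(r["running"] for r in runs.values())

for name, r in runs.items():
    cpu_efficiency = r["cpu"] / r["running"]  # fraction of wall time spent on CPU
    slowdown = r["running"] / fastest         # summed running time vs. fastest setup
    print(f"{name:24s} CPU efficiency {cpu_efficiency:6.1%}   slowdown x{slowdown:4.1f}")
```

The low CPU efficiency on EC2 (roughly 17%) is consistent with the download times discussed in the notes below: most of the wall time there is spent waiting for I/O.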


Processing time

Notes:

  • This was primarily an I/O exercise, with I/O referring both to network I/O (downloading the input file) and to disk I/O (reading the file).
  • Irigo and EC2 have comparable disk I/O and CPU performance, with Irigo having a ~10% edge.
  • The virtual machines on both Irigo and EC2 are booted from a shared file system: Irigo from its image store and EC2 from Amazon’s EBS. Apparently the technology underlying EBS is very similar in performance to that used by Irigo for its image store – NFS (on top of ZFS).
  • The “moderate” network I/O of the chosen “small” EC2 instances apparently really is moderate: downloading the input files takes ~9 times longer than on Irigo and ~8 times longer than on the dedicated hardware (at CERN). The faster downloads on the latter two are most likely bounded by the performance of the file server (in Germany) where the input files reside.
  • The disk images of the virtual machines on both EC2 and Irigo are hosted on a shared file system. While this was expected to limit performance, it is still a bit surprising that the raw disks of the dedicated hardware with (VirtualBox) virtualization give almost 3 times better performance.
  • The last two columns show that the virtualization layer itself incurs roughly 27% overhead, presumably mainly because the input data file is read from a VirtualBox shared folder (see the quick check after this list).
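As a quick check of the overhead figure in the last note – assuming it is derived from the summed CPU times in the last two columns of the table (the exact definition is not spelled out above) – the calculation would look like this:

```python
# Summed CPU times from the last two columns of the table (seconds).
cpu_with_vms = 3220      # dedicated hardware with VirtualBox virtualization
cpu_without_vms = 2517   # dedicated hardware without virtualization

overhead = (cpu_with_vms - cpu_without_vms) / cpu_without_vms
print(f"CPU-time overhead of the virtualization layer: {overhead:.0%}")  # ~28%
```

This comes out at roughly 28%, close to the figure quoted above.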
