This example demonstrates the use of GridPilot in data processing in high energy physics (HEP). It makes extensive use of some HEP-specific technologies, that are incapsulated in GridPilot in the form of plugins: the ATLAS DB plugin and the NG and GLite computing system plugins. The jobs chosen are so-called “boildown” jobs, each with and input file of ~120 MB, running ~30 seconds on a modern CPU and an output file of ~16 MB. In the case of NorduGrid and WLCG, the ATLAS software is preinstalled on the resources. In the case of GridFactory, the software is loaded over the AFS network file system.
The data processing was run on two of the major European academic production grids, WLCG and NorduGrid and – for comparison, also on a small GridFactory cluster. Here’s a writeup of the steps I followed:
First of all, in the preferences, under “Computing systems” I set “time between submissions” to “500″ and “lookup pfns” to “no”.
The latter is important when working with many input files – otherwise job creation takes very long. Also, under each of the used computing systems, I set “max simultaneous running” to “1000″ and “max simultaneous submissions” to “5″.
Next, I did the following:
- imported the app “atlas_d3pd_boildown” (“File” ? “Import application(s)”)
- on the tab “datasets”, searched for a suited input dataset – found one with 303 files
- back on the tab “applications/datasets”, double-clicked on the imported app and changed the input dataset to the above and the name accordingly
- also changed OUTPUTLOCATION to a gsiftp directory
- right-clicked on the app, chose “Create job definition(s)” and created the 303 job definitions
- right-clicked on the app, chose “Show job definition(s)” and ran a few test jobs
- finally, I ran over all 303 input files, generating 303 output files, first on a single NorduGrid (NG) cluster only, then on NorduGrid at large, on WLCG (GLite) at large and on a small GridFactory cluster. Each run is summarized below.
|NorduGrid cluster||NorduGrid||WLCG cluster||WLCG||GridFactory mini-cluster|
|Average submission time per job (s)||0.535||1.273||2.960||3.878||0.528|
|Summed running time (s)||10480||9176||14966*||33594*||7482*|
|User real waiting time (submission, processing and data transfer time) (s)||1908||7200||4622||15929||5268|
|Max number of simultaneous jobs seen||12||22||145||68||5|
|Number of available cores||160||-||-||-||16|
*) These times include data staging.
In general, there were many job failures on WLCG, mostly due to site misconfigurations. This is probably difficult to avoid, given the large number of sites – and since automatic resubmitting is very easy with GridPilot, it’s not a big issue. Below is an overview of the various errors reported. Another issue is that querying sites for installed runtime environments is very slow and sometimes times out – causing long startup times for GridPilot. Despite the number of involved sites, the total time before all jobs were completed was very large. This is due to the “tail” of a few last jobs that were unlucky enough to land on an old piece of hardware somewhere with a very poor Internet connection. And some of these I even killed by hand and resubmitted somewhere else. Notice that although brokering on WLCG is done centrally, one can direct jobs to specific sites and I could in principle have directed all my jobs to a few large sites.
When running on a single NorduGrid server, job submission, checking and retrieval was all quite fast and responsive. When running on NorduGrid at large, some issues appeared. First of all, the information system was quite unresponsive – as can often be observed on the monitoring page. This causes a long start-up time for GridPilot, but brokering seemed relatively unaffected and submission speed was acceptable – about 100% slower than when submitting to a single resource. Since NorduGrid in the ATLAS context is dominated by a few large sites that announce far more CPUs than are actually available to a user (at least as reported by the ARC Java library), brokering is problematic. In this case, I had disabled the NBI T3 in order to not have all jobs land there again, but instead they all landed on the large Norwegian ATLAS site – queued behind hundreds of production jobs. They all failed with no error message. Next, I disabled the three largest T1 sites plus the NBI T3 and submitted 100 jobs. They all went to the Slovenian site, “Arnes”, where they started nicely. ~15% failed on output file upload with no details in the error log. I then disabled Arnes as well and tried some more smaller sites. On most of these sites jobs either were not accepted because of the job requirements (software, CPU time) or failed because of site misconfigurations. Finally, I submitted some batches to specific clusters and eventually managed to run all jobs.
1) For the kind of jobs used for this test, if you wants to get work done in a reasonable amount of time – user time, the best way is to run on your own a small cluster. The technical reasons are not so obvious:
- On WLCG in principle you have a lot of resources at your disposal; in practise, I managed to run 68 jobs in parallel and the quality of the resources was so varying that a tail of a last few jobs that never finished made the total execution time blow up.
- On NorduGrid, it turned out that beside the 3 large tier-1 sites plus our own tier-3, only one site was able (or willing) to run my jobs and with a rather high failure rate. Thus, the “tail” problem also appeared here.
- On NorduGrid I could just have run my jobs on one or several of the 3 large tier-1 sites, but then they would have been queued behind hundreds of production jobs.
- On the WLCG cluster hosting the data I actually managed to run 145 jobs in parallel, but only for some time. Eventually I was left with a tail of a few jobs that queued for a long time – presumably because a quota system kicked in, either in the gLite stack or at the batch level.
2) More generally, this is not entirely surprising, since the production grids were designed by and for the production teams of the large CERN experiments. They were designed to optimise the total use of resources, i.e. reduce CPU idle time and thereby optimise the total time spent on very large data productions. It is, however, slightly surprising, at least to me, that a small cluster of 4 desktop machines outperforms both international production grids from a user perspective.
3) As already mentioned, the jobs used for this exercise were short-running with large input files, i.e. this was predominantly an I/O exercise. For other kinds of jobs, notably long-running jobs with little input data, the production grids will presumably shine more.
4) It’s positive to see that the storage systems (at least the one serving the data used) apparently have no problems serving users along with ongoing large-scale production.
5) From the site-configuration perspective, it’s also somewhat surprising that the fastest machines are those of a desktop cluster running ATLAS off AFS. Each CPU-core of the fastest of the grid clusters (our own tier-3 cluster) apparently spends 40% more time crunching a data file than that of one of the desktop machines. The processors of the desktop machines are 1-2 years newer than those of the tier-3, but as already noted this is predominantly an I/O exercise and actually the measured crunching time of the desktop machines includes file download while that of the tier-3 cluster machines don’t.
6) The user interface of GridPilot, which allows for rather low-level inspection of running jobs etc. seems entirely appropriate, given the state of today’s grid infrastructure. At a later point, one may consider hiding all these details and simply displaying a progress bar for each dataset the user is processing.
|Error getting signing policy file/globus_sysconfig: File does not exist: /etc/grid-security/certificates/20ce830e.signing_policy is not a valid file|
|Server failed to parse job description|
|Failed extracting LRMS ID due to some internal error|
|LRMS error: (271) =>> PBS: job killed: walltime 3 exceeded limit 1|
|Number of jobs||Failure reason|
|1||Status: ABORTED, Reason: failed (LB query failed)|
|4||Status: ABORTED, Reason: hit job retry count (0)|
|12||Submission timed out by GridPilot|
|22||lcg-cp: error while loading shared libraries: libcgsi_plugin_gsoap_2.7.so: cannot open shared object file: No such file or directory|
|1||[SE][Ls] golias100.farm.particle.cz:8446/srm/managerv2: CGSI-gSOAP running on wn11.grid.lebedev.ru reports could not open connection to golias100.farm.particle.czt|
|8||[BDII] srvslngrd001.uct.ac.za: No entries for host: golias100.farm.particle.czt|
|26||Using AtlasSetup default options 16.2.1 AtlasOffline opt gcc43 slc5 32 runtime AtlasSetup(ERROR): CMTCONFIG (i686-slc5-gcc43-opt) not supported on this platform / root-config: command not found|
|1||./….job: line 13: /software/16.2.1/setup.sh: No such file or directory [GFA, NULLL][bdii_query_send][EINVAL] Invalid BDII parameters|
|1||golias100.farm.particle.cz/dpm/farm.particle.cz/home/atlas/atlaslocalgroupdisk/group10/phys-sm/data10_7TeV/group10.phys-sm.data10_7TeV.00158269.physics_L1Calo.merge.ESD.r1647_p306.WZphys.101222.01.101224004527_D3PD/group10.phys-sm.08079_003119.D3PD._00098.root: Invalid argument
lcg_cp: Invalid argument