Given the popularity of the iPhone, an interesting use of a batch system is converting movie files from the AVI to the MP4 format. In this post I’ll explain three ways of doing this with GridPilot. Which one you prefer will likely depend on the number and size of the files you want to convert and on the power of your local desktop. In all three cases, the application “avi_to_mp4” is used. After importing this application in GridPilot, simply click “Run” and select one of the back-ends you’ve enabled. You’ll then be asked to populate the input dataset with AVI files. If you keep the defaults, you’ll convert 4 movies plus a test clip, all in the public domain, from AVI to MP4 format for viewing on your iPhone. You can of course also choose movies of your own. We’ll look at doing this conversion on three different back-end systems: virtual machine(s) on your own desktop, virtual machines on EC2, and a GridFactory cluster running on Amazon’s EC2 cloud.
VMFork: This is a somewhat experimental back-end meant to allow quick testing of a few jobs before running larger-scale productions on remote resources. To use it you must have either VirtualBox or Qemu installed and configured, and you have to enable VMFork in your GridPilot preferences. After that, if you open your monitor window (Ctrl+M or Command+M), you’ll notice a new tab, “Local virtual machines”. Click the top “Refresh” button and you’ll see a few images that you can boot by clicking “Launch”. You don’t need to do this, though, as a virtual machine will be booted automatically when you submit jobs to this back-end. In the preferences you can choose how many virtual machines may run simultaneously and how much memory to assign to each. For this example, we’ll stay with the defaults and run on a single virtual machine. When you’re all set, restart GridPilot and do the following:
- choose “Import application(s)” from the “File” menu
- navigate to “media/avi_to_mp4_conversion” and click “OK”
- select the application/dataset “mp4_files” and click “Run”
That’s it! You’ll be asked to select some input files; if you just click “OK” you’ll default to a collection of 5 public domain movies. You’ll notice that this takes a very long time: first each movie has to be downloaded to your hard disk; then, for the first job, a virtual machine has to be fired up; and finally each movie has to be copied into the virtual machine and converted, and the resulting file copied out again. All in all, you’ll probably lose patience, in which case you can click the progress bar to cancel submission, then right-click on any running jobs and kill them. Or you can simply restart GridPilot.
Still, you may want to test on just the small input file “spacemission-byrjt2005.avi”. You can do this by right-clicking on “mp4_files”, choosing “Show job definition(s)”, selecting the relevant job definition (#5) and clicking “Submit”.
EC2: Once you’re happy that things work on your local machine, you may want to try out the EC2 back-end. Note that you must first enable EC2 and enter your AWS credentials in your preferences. After that, everything is analogous to the above. Below you have a screencast of transcoding the 5 default AVI files on EC2.
Before running large batch productions, you may want to edit various settings, notably:
- “max machines” (under “EC2”): For long-running jobs you would typically set this to the number of jobs. GridPilot will then fire up one machine for each job, assuming you’ve set “Max simultaneous running” (under “Computing systems”) to the same number.
- “submit retries” (under “Computing systems”): This setting is important if you have more jobs than slots (see above). In that case, GridPilot runs its own batch queue, retrying regularly to see if a slot is available, so “submit retries” × “time between submissions” must be higher than the expected total processing time of all jobs.
- “terminate hosts on job end” (under “EC2”): This is only relevant if your number of slots is larger than or equal to your number of jobs. In that case, each job will run on a fresh machine and you may as well shut it down when the job ends. Note that in this case you should set the OUTPUTLOCATION of the app to an S3 directory (sss://); otherwise the output files will be gone with the terminated machines.
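To make the “submit retries” rule concrete, here is a minimal sketch of the arithmetic. It is not GridPilot code: the function name, the safety factor and the example numbers are my own assumptions, and I take “total processing time” to mean the wall-clock time until the last job has finished, with the retry interval given in the same unit as the job duration.

```python
import math


def min_submit_retries(n_jobs, slots, job_minutes,
                       minutes_between_submissions, safety_factor=1.5):
    """Smallest retry count whose total waiting time exceeds the estimate.

    Hypothetical helper: all parameters are illustrative, none are read
    from GridPilot itself.
    """
    # With fewer slots than jobs, the jobs run in successive "waves".
    waves = math.ceil(n_jobs / slots)
    # Pad the estimate, since job durations are rarely known exactly.
    total_minutes = waves * job_minutes * safety_factor
    # Smallest integer n with n * interval strictly greater than the total.
    return math.floor(total_minutes / minutes_between_submissions) + 1


if __name__ == "__main__":
    # Assumed example: 100 jobs, 20 slots, ~30 minutes per job,
    # one submission attempt every 5 minutes.
    print(min_submit_retries(100, 20, 30, 5))
```

If your jobs vary a lot in length, it is probably safer to plug in the duration of the longest job rather than the average.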
One thing to notice is that local input and output files will be copied via SSH from/to your hard disk to/from the EC2 instances. This is quite slow. Therefore, for larger productions, you should set the OUTPUTLOCATION of the “mp4_files” dataset to an S3 directory (sss://) and, if possible, use input files that are served by S3 or a web server (like the default files). That is, the OUTPUTLOCATION of the “avi_input_files” should be set to an HTTP (http:// or ) or S3 (sss://) directory. Notice that you should be careful not to set the OUTPUTLOCATION of “mp4_files” to an HTTP directory because the worker machine will likely not have the credentials to write there.
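The rules above can be summarised in a small check. This is only a sketch under my own assumptions: the function and its messages are hypothetical, nothing here is part of GridPilot, and any location that is neither http:// nor sss:// is simply treated as "possibly local".

```python
from urllib.parse import urlparse


def location_warnings(input_location, output_location):
    """Return a list of warnings for a pair of OUTPUTLOCATION-style values.

    input_location  -- where the AVI files live (e.g. of "avi_input_files")
    output_location -- where the MP4 files should go (e.g. of "mp4_files")
    """
    warnings = []
    in_scheme = urlparse(input_location).scheme.lower()
    out_scheme = urlparse(output_location).scheme.lower()

    # Inputs are fine if a worker can fetch them itself over HTTP or S3.
    if in_scheme not in ("http", "sss"):
        warnings.append("input is not http:// or sss://; local files "
                        "are copied over SSH, which is slow")

    # Outputs should go to S3; workers can rarely write to a web server.
    if out_scheme == "http":
        warnings.append("output is http://; the worker will likely lack "
                        "the credentials to write there")
    elif out_scheme != "sss":
        warnings.append("output is not sss://; local files are copied "
                        "over SSH, which is slow")
    return warnings


if __name__ == "__main__":
    # Bucket and host names are made up for illustration.
    for w in location_warnings("/home/me/avi/", "http://example.org/mp4/"):
        print("warning:", w)
```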
Another thing to notice is that both the “required runtime environments” and “UTIL/MpegUtils-1.0” will be installed on the fly on each freshly booted EC2 instance. This takes time, up to a few minutes, so depending on the running time of each job, it can make sense to limit the number of slots and have each machine reused for a number of jobs. The two screen shots below show suggested preferences for converting ~100 DVD-sized movies. Since the number of slots is set to 20, each machine will run ~5 jobs. Since, with the EC2 back-end, the queue of jobs is kept by GridPilot, GridPilot must keep running during the whole execution time of all 100 jobs. If you don’t want this, you should change the number of slots from 20 to 100; then all jobs will be submitted at once and you can shut down GridPilot and start it again later to check the progress of the production. Alternatively, you may want to have the queue of jobs kept by an external service instead. For this, read on.
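The trade-off between 20 and 100 slots can be estimated with a few lines of arithmetic. The sketch below is a rough model, not a measurement: the 30 minutes per job and 3 minutes of setup per machine are assumed numbers, jobs are assumed to be equally long, and boot time and billing granularity are ignored.

```python
import math


def wall_clock_minutes(n_jobs, slots, job_minutes, setup_minutes):
    """Time until the last job finishes, if each machine sets up once
    and then runs its share of the jobs back to back."""
    jobs_per_machine = math.ceil(n_jobs / slots)
    return setup_minutes + jobs_per_machine * job_minutes


def machine_minutes(n_jobs, slots, job_minutes, setup_minutes):
    """Total machine time used: every machine pays the setup cost once."""
    return slots * setup_minutes + n_jobs * job_minutes


if __name__ == "__main__":
    # Assumed example values: 100 jobs, 30 min per job, 3 min setup.
    for slots in (20, 100):
        wall = wall_clock_minutes(100, slots, 30, 3)
        used = machine_minutes(100, slots, 30, 3)
        print(f"{slots} slots: done after {wall} min, "
              f"{used} machine-minutes in total")
```

Under these assumptions, more slots finish sooner but spend a larger share of machine time on software installation; the shorter your jobs, the more that overhead matters.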
GridFactory on EC2: You can run the “avi_to_mp4” app on any GridFactory server, as long as it subscribes to a software catalog containing “UTIL/MpegUtils-1.0” and the attached worker nodes are able to install the software. This means that they must either run with administrator rights or be capable of launching virtual machines.
One possibility is gridfactory.dyndns.org. This server is running on EC2 and may not have any running worker nodes attached, but you can easily add some yourself, by simply firing up instances of the pre-configured GridWorker image available on EC2. Notice that since the attached worker nodes are contributed completely voluntarily and not centrally controlled, your jobs may land on very slow worker nodes (that are either of the micro type or doing lots of other things).
You can of course also set up your own GridFactory server on EC2 and fire up your own worker nodes with a much more predictable quality of service.
Notes: Two of the example AVI files have resolutions close to the 320 × 180 of the output MP4 files and therefore don’t get reduced in size; for one of them the size actually increases. To reduce the size of the output file you may try altering the settings in the executable script “transcode.sh”; for instance, you could reduce the number of frames per second (fps).
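As an illustration of the fps idea, here is a sketch that builds a transcoding command with a reduced frame rate. Note the assumptions: the contents of “transcode.sh” are not shown in this post, so the use of ffmpeg here is my own choice for the example, and the function name and the default of 15 fps are hypothetical. The sketch only builds and prints the command; it does not run it.

```python
def build_transcode_command(src, dst, fps=15, size="320x180"):
    """Build an ffmpeg argument list (assumed tool, see above).

    -r sets the output frame rate and -s the output frame size; both
    are placed after the input so they apply to the output file.
    """
    return ["ffmpeg", "-i", src, "-r", str(fps), "-s", size, dst]


if __name__ == "__main__":
    cmd = build_transcode_command("spacemission-byrjt2005.avi",
                                  "spacemission-byrjt2005.mp4")
    # To actually run it (requires ffmpeg on the PATH), one could pass
    # the list to subprocess.run(cmd, check=True).
    print(" ".join(cmd))
```

Halving the frame rate will not necessarily halve the file size, so it is worth trying a value on the small test clip first.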