Fast EC2 Deployment with Ruby Threads

Our distribution application is written in Ruby and runs as a daemonized process in EC2. Each EC2 instance is started to handle deliveries to one of our store partners. We can configure multiple instances to handle distribution to a single store if needed.

Our code deployment process for this application involves simply starting an EC2 instance. Our custom AMI uses upstart to configure an instance (using tags) and installs the latest code upon launch. To ‘deploy’ new code, we simply restart or launch new EC2 instances after pushing changes to our git repo.

I recently made some changes to a script we’ve been running manually to start/stop/terminate the AWS instances. The goal was to make it run faster and to be able to use it as a deployment script with Jenkins.

Deploying EC2 instances with Ruby AWS SDK

We use the AWS Ruby SDK to allow communication with our AWS infrastructure. When I started working on this change we were launching servers like so:

@ec2     = Aws::EC2::Resource.new(region: region)
instance = @ec2.create_instances({
 # passed in config options
})

@ec2.client.wait_until(:instance_running, {instance_ids: instance.map(&:id)})

instance.batch_create_tags({hash: :of_tags})

This code is slow because we process the list of servers serially: each launch blocks on wait_until before the next one starts.

After some reading and discussion I found that I do not have to perform the wait_until on the instance to apply tags to it. The AWS API will respond right away with an instance id when you invoke create_instances, so one way to speed this up would be to just remove the wait_until.

That wasn’t going to work for me, though. The end goal is to use this script in Jenkins as a way to easily “deploy” instances. I want to actually wait until they are running to ensure my Jenkins job properly reflects the actual state of the “deploy”.

Using Thread to “parallelize” the processing

In order to speed up the execution of this code I decided to try using the Ruby Thread class to parallelize the launching of instances.

As an aside, I want to point out that we cannot truly parallelize (run multiple things at the same time on multiple cores) with MRI, the standard Ruby interpreter. This is due to the good old GIL. The ‘Global Interpreter Lock’ allows only one thread to execute Ruby code at a time, which protects the interpreter’s internal state. So even if you launch 20 threads in your Ruby app, only one of them is executing Ruby code at any given moment.

That being said, the interpreter will pause a thread when its interrupt flag is set. MRI’s timer thread sets this flag every 100ms; the running thread checks it at safe points, gives up the lock, and lets the next thread run. More importantly for us, when a thread does blocking IO (like hitting the AWS API), the interpreter releases the lock while that thread waits, so other threads can run in the meantime.

So we cannot achieve true parallelism, but we can at least have Ruby work on multiple things at a time.
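To make the effect of the lock being released concrete, here is a minimal sketch that times the same blocking work run serially and then in threads. The fake_api_call and elapsed helpers are made up for this illustration, and sleep stands in for a network call on the assumption that it yields the lock in much the same way blocking IO does.

```ruby
# Hypothetical stand-in for a blocking API call. sleep gives up the GIL
# while it waits, so other threads can run in the meantime.
def fake_api_call(seconds)
  sleep(seconds)
  :done
end

# Small timing helper (an assumption of this sketch, not part of our script).
def elapsed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

# Five calls, one after another.
serial = elapsed do
  5.times { fake_api_call(0.2) }
end

# The same five calls, each in its own thread.
threaded = elapsed do
  5.times.map { Thread.new { fake_api_call(0.2) } }.each(&:join)
end

puts format('serial: %.2fs, threaded: %.2fs', serial, threaded)
```

The threaded run should take roughly as long as a single call, since all five threads spend their time waiting rather than executing Ruby code.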

Achieving this is quite trivial. I just wrapped each of the instance launch invocations in a Thread:

threads = []
servers_to_launch.each { |q| threads << Thread.new { launch_server(q, queue_type.to_s) } }

This will spin up a new thread for each server I am launching. I then need to wait on each thread to finish execution:

threads.each { |t| t.join }

The .join method blocks the calling thread until the joined thread has finished, which ensures that the script will not exit before every launch has completed. For more detailed information on this and other methods you can call on a thread, check out the Ruby Thread documentation.
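One detail that may be useful in a deploy script is that join also accepts a time limit. The sketch below is illustrative only: this launch_server is a hypothetical stand-in that just sleeps, and the delays are arbitrary. It shows join returning the thread when it finishes in time and nil when the limit expires first.

```ruby
# Hypothetical stand-in for the real launch method: it only sleeps.
def launch_server(name, delay)
  sleep(delay)
  name
end

threads = [
  Thread.new { launch_server('fast', 0.1) },
  Thread.new { launch_server('slow', 1.0) }
]

# join(limit) returns the thread if it finished within the limit,
# or nil if the limit expired first.
finished  = threads.map { |t| t.join(0.5) }
timed_out = finished.count(&:nil?)

puts "#{timed_out} thread(s) still running after the time limit"

# A plain join with no limit waits for the stragglers.
threads.each(&:join)
```

A time limit like this is one way to keep a Jenkins job from hanging forever on a launch that never completes.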

Retry logic due to API rate limits

Once I had this code working properly, I started to notice the AWS SDK raised a Rate Limit Exceeded exception periodically. To work around this, I implemented some simple retry logic in my launch_server method:

 def initialize
   # Retry count per queue; unseen queues start at 0
   @launch_retries = Hash.new(0)
 end

 def launch_server(queue, type)
   # call @ec2.create_instances
   return 0
 rescue
   # Give up after 10 retries and report failure to the caller
   return 1 if @launch_retries[queue] >= 10

   @launch_retries[queue] += 1
   sleep(5) # back off before calling the API again
   launch_server(queue, type)
 end

This code rescues any StandardError raised in the method, which covers the errors raised by the AWS library, and tracks the retry count for each of the queues I am attempting to launch. For now I just hardcoded a retry limit of 10 and a 5 second sleep before calling out to the API again.
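The same idea can be sketched with Ruby’s retry keyword and a local counter, rescuing only one error class instead of everything. This is an alternative sketch, not the code we run: RateLimitError is a hypothetical stand-in for whatever error class the SDK raises, and with_retries, MAX_RETRIES, and the wait parameter are names invented for this example.

```ruby
# Hypothetical stand-in for the SDK's rate limit error class.
class RateLimitError < StandardError; end

MAX_RETRIES = 10

# Runs the block, retrying on RateLimitError up to max_retries times.
# Returns 0 on success and 1 once the retries are exhausted.
def with_retries(max_retries: MAX_RETRIES, wait: 5)
  attempts = 0
  begin
    attempts += 1
    yield
    0
  rescue RateLimitError
    return 1 if attempts > max_retries

    sleep(wait) # back off before trying again
    retry       # re-runs the begin block from the top
  end
end

# Usage: fail twice, then succeed on the third attempt.
calls = 0
code = with_retries(wait: 0) do
  calls += 1
  raise RateLimitError if calls < 3
end
puts "exit code #{code} after #{calls} calls"

# Usage: a call that never succeeds reports failure.
failed = with_retries(max_retries: 2, wait: 0) { raise RateLimitError }
puts "exit code #{failed} when retries are exhausted"
```

Keeping the counter local to the method call means there is no shared hash for the threads to touch, and any error other than the rate limit one still surfaces immediately.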

Proper exit codes

Since my end goal is to use this script in Jenkins, I need it to return proper exit codes to the shell. When Jenkins executes some code in an “Execute shell” build step, it checks the exit code of the last command executed in the script. If it is non-zero, the build fails.

After some testing I learned that simply doing a return 1 from an executed method in a Ruby script will not affect the return code of the script. This makes sense, as returning 1 or 0 from a Ruby method is not the same as telling our script to exit with a specific code.

$ cat return_code_test.rb
def test
  return 1
end

test

$ ruby return_code_test.rb
$ echo $?
0

In order to get a non-zero exit code I need to actually do an exit 1 from the script, like so:

$ cat exit_code_test.rb
def test
  exit 1
end

test
$ ruby exit_code_test.rb
$ echo $?
1

My Jenkins script is simple:

#!/bin/bash -l
ruby launch_servers.rb $deploy_target deploy
exit $?

As long as I get the launch_servers.rb script to return 0 or non-zero properly, then I can be sure that Jenkins will properly report a successful or a failed build.

I achieved this by having my helper methods return either 1 or 0 (you can see that in the rescue above). I chose these over true and false just to stay consistent with the exit codes.

In my deploy method, I run each of the deploy actions and exit as soon as any step fails:

 def deploy
   code = 0
   deploy_actions.each do |action|
     code = send(action.to_sym)
     exit code if code == 1
   end

   exit code
 end

One of the issues with this approach was tracking the return codes of the various launch_server invocations running in each thread. I found that I could simply collect each thread’s return value with the .value method and ensure they are all 0:

 # Thread#value joins the thread and returns the result of its block
 exit_codes = threads.map(&:value)
 exit_codes.any? { |c| c == 1 } ? 1 : 0
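One thing to keep in mind is that .value re-raises any exception that killed the thread. The sketch below shows one way to fold that case into the same 0/1 scheme. The exit_code_for helper and the three toy threads are made up for this illustration.

```ruby
# Returns the thread's result, treating an uncaught exception as a failure.
# (exit_code_for is a hypothetical helper, not part of our script.)
def exit_code_for(thread)
  thread.value # waits for the thread, then returns its block's result
rescue StandardError
  1
end

threads = [
  Thread.new { 0 }, # a successful launch
  Thread.new { 1 }, # a launch that gave up after its retries
  Thread.new do
    # Silence the default stderr report for this deliberate failure.
    Thread.current.report_on_exception = false
    raise 'boom'
  end
]

exit_codes = threads.map { |t| exit_code_for(t) }
overall    = exit_codes.all?(&:zero?) ? 0 : 1

puts "exit codes: #{exit_codes.inspect}, overall: #{overall}"
```

With this in place, a thread that blows up unexpectedly still produces a failed build instead of an unhandled exception in the middle of collecting results.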

Conclusion

After finishing this work we were able to automate the deployment of our main distribution codebase across dozens of AWS instances. We can now reliably start up or tear down the entire distribution infrastructure (or the distribution infrastructure for a single store, depending on what we pass as the $deploy_target parameter) with a single click in Jenkins. The entire process takes less than 4 minutes!