Hi! I’m Ilya Sher.
This guest post will describe the deploy process at Utab, one of my clients.
Background – system architecture summary
Utab uses a services architecture. Some services are written in Java, others in NodeJS. Each server runs either one Java application or one NodeJS application. The production environment uses (with the exception of load balancing configuration) the immutable server approach.
Requirements for the deploy process
The client specified the following requirements:
- Support for staging and production environments
- Manually triggered deploy for each of the environments
- Health check before adding a server to load balancing in the production environment
- Option to easily and quickly roll back to the previous version in the production environment
- Simple custom tools are to be used (no Chef/Puppet/…)
Utab’s deploy scripts and other scripts were written specifically for the client. Such custom scripts are usually simpler than any ready-made solution, which means they break less often and are easier to modify and operate. No workarounds are needed for cases where a chosen framework is not flexible enough.
Also note that a custom solution solves the whole problem, not just parts of it, as is often the case with ready-made solutions. For example, the scripts also handle firewall rules configuration, EC2 volume attachment, tagging of EC2 instances and images, and running tests before adding a server to the load balancer. When all of this is handled by a single set of custom scripts rather than by several ready-made solutions plus your own scripts, the result is much cleaner and more consistent.
We estimate that such a custom solution has a lower TCO despite the development cost. Also note that in Utab’s case the development effort for the framework scripts was not big: about ten working days up front and another few days over three years. I attribute this at least partly to sound and simple software and system architecture.
Note that the opinion summarized above has many opponents. In my experience, using ready-made solutions for the tasks described above leads to higher TCO. This happens mostly because of the complexity of such solutions, along with other reasons that I’ll discuss in another post.
1. A developer pushes a commit to GitHub
2. A GitHub hook triggers Jenkins
3. Jenkins does the build and runs the tests: Mocha unit tests for NodeJS or JUnit tests for Java
4. A small script embeds the git branch name, git commit and Jenkins build number in pom.xml, in the description field.
5. Jenkins places the resulting artifact in the Apache Archiva repository (only if #3 succeeds)
What you see above is common practice and you can find a lot about these steps on the internet except for number 4, which was our customization.
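To illustrate the customization, here is a minimal sketch of how such an embedding step could look in Python. It is not the script used at Utab: the function name and the `branch=… commit=… build=…` format of the description text are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

POM_NS = "http://maven.apache.org/POM/4.0.0"


def embed_build_info(pom_xml: str, branch: str, commit: str, build_number: int) -> str:
    """Return the POM text with build info written into <description>.

    The "branch=... commit=... build=..." format is hypothetical.
    """
    # Serialize the POM namespace as the default namespace, without a prefix.
    ET.register_namespace("", POM_NS)
    root = ET.fromstring(pom_xml)
    tag = f"{{{POM_NS}}}description"
    description = root.find(tag)
    if description is None:
        # The POM had no description yet, so add one.
        description = ET.SubElement(root, tag)
    description.text = f"branch={branch} commit={commit} build={build_number}"
    return ET.tostring(root, encoding="unicode")
```

A deploy script can later parse the same field to find out which branch and build an artifact came from.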
Note that while the repository is Java oriented we use it for both Java and NodeJS artifacts for uniformity. We use mvn deploy:deploy-file to upload the NodeJS artifacts.
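As a rough sketch, the upload command for a NodeJS artifact could be assembled like this. The `-D` properties are standard parameters of the `deploy:deploy-file` goal; the function name, the default packaging and all values in the usage note are placeholders for this example, not Utab's actual settings.

```python
from typing import List


def deploy_file_command(path: str, group_id: str, artifact_id: str, version: str,
                        repository_id: str, repository_url: str,
                        packaging: str = "tar.gz") -> List[str]:
    """Build the argument list for `mvn deploy:deploy-file`.

    All argument values are supplied by the caller; nothing here is
    specific to a particular repository.
    """
    return [
        "mvn", "deploy:deploy-file",
        f"-Dfile={path}",
        f"-DgroupId={group_id}",
        f"-DartifactId={artifact_id}",
        f"-Dversion={version}",
        f"-Dpackaging={packaging}",
        f"-DrepositoryId={repository_id}",
        f"-Durl={repository_url}",
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, for example with a hypothetical `website-42.tar.gz` file and the URL of your own repository.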
For NodeJS artifacts, static files that go to a CDN are included in the artifact. Java services are not exposed directly to the browser so Java artifacts do not contain any static files for the CDN.
Custom scripts involved in the deployment process
Scripts that use the AWS API are in Python. The rest are in bash. I would be happy to use one language, but bash is not convenient for using the AWS API and Python is not as good as bash for systems tasks such as running programs or manipulating files.
I do hope to solve this situation by developing a new language (and a shell), NGS, which should be convenient for the system tasks, which today include talking to APIs, not just working with files and running processes.
- Creates EC2 instance (optional step)
- Runs upload-and-run.sh (see the description below).
- Runs configure-server.sh to configure the server, including the application (optional step)
- Runs tests (optional step)
- Creates an AMI
Gets destination IP or hostname and the role to deploy there.
- Pulls the required artifacts from the repository. The default is the latest build available in the repository. A git branch or a build number can be specified to pick another build. If a branch is given, the latest build on that branch is used. The information embedded in pom.xml is used for this selection.
- Runs all relevant “PRE” hook files (which for example upload static files to S3)
- Packages all required scripts and artifacts into a .tar.gz file
- Uploads the .tar.gz file
- Unpacks it to a temporary directory
- Runs setup.sh (a per role file: website/setup.sh, player/setup.sh, etc) which controls the whole installation.
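The build selection rule from the first step above can be sketched as follows. This is an illustration only: the `Build` record and the `select_build` function are hypothetical names, and the sketch assumes the branch and build number have already been read from each artifact's pom.xml.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Build:
    number: int   # Jenkins build number
    branch: str   # git branch the build was made from
    commit: str   # git commit hash


def select_build(builds: Iterable[Build],
                 branch: Optional[str] = None,
                 number: Optional[int] = None) -> Build:
    """Pick the build to deploy.

    An explicit build number wins; otherwise the latest build on the
    given branch; otherwise the latest build overall.
    """
    candidates = list(builds)
    if number is not None:
        candidates = [b for b in candidates if b.number == number]
    elif branch is not None:
        candidates = [b for b in candidates if b.branch == branch]
    if not candidates:
        raise LookupError("no build matches the request")
    return max(candidates, key=lambda b: b.number)
```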
- Starts EC2 instance from one of the AMIs created by the create-image.py script
- Runs application tests
- Updates the relevant load balancers switching from old-version-servers to new-version-servers.
The Python scripts, when successful, notify the developers using a dedicated Twitter account. This is as simple as:
api = twitter.Api(c['consumer_key'], c['consumer_secret'], c['access_token_key'], c['access_token_secret'])
Pulling artifacts from the repository before deploying
I would like to elaborate on pulling the required artifacts from the repository. The artifacts are pulled from the repository to the machine where the deploy script runs: the management machine (yet another server in the cloud, near the repository and the rest of the servers). I have not seen this implemented anywhere else.
Note that pulling artifacts to the machine where the script runs works fine when a management machine is used. You should not run the deployment script from your office, as downloading and especially uploading artifacts would take some time. This limitation is the result of a trade-off: the gained conceptual simplicity outweighs the downsides. When the setup scripts are uploaded to the destination server, the artifacts are uploaded with them, so the destination application servers never need to talk to the repository and, for example, there are no repository access security issues to handle.
Deploying to staging environment
- Artifact is ready (see “The build” section above)
- One of the developers runs create-image.py with flags that tell it not to create a new instance and not to create an AMI from the instance. This limits create-image.py to the deploy process only: running upload-and-run.sh.
Since the “ENV” environment variable is also passed, the configuration step (configure-server.sh) is also run.
Since create-image.py needs switches and environment variables that none of us would like to remember, there are several wrapper scripts named deploy-ROLE-to-staging.sh.
Deploying to production environment
It is the responsibility of the developers to make sure that the build being packaged at this step is one of the builds that were tested in the staging environment.
- Artifact is ready (see “The build” section above)
- One of the developers runs create-image.py and the script creates an AMI
The solution to (not) remembering all the right switches and environment variables is documentation in the markdown formatted readme file at the top level of the repository.
- One of the people that are authorized to deploy to production runs deploy.py
Rollback to previous version
This can be done in one of the following ways:
Run the deploy.py script giving an older version as an argument
Manual quick fix
When old servers are removed from load balancing, they are not immediately terminated. Their termination is manual, after the “no rollback” decision. Servers rotated out of load balancing are tagged with the “env” tag value “removed-from:prod”.
- Change the “env” tag of the new servers to “removed-from:prod” or anything else that is not “prod”
- Change “env” tag on the old servers to “prod”
- Run the load balancers configuration script. The arguments for this script are the environment name and the role of the server. The script updates all the relevant load balancers that point to servers with the given role.
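The tag swap and the server selection in the steps above can be modelled with plain data, as in the sketch below. It deliberately leaves out the AWS API calls. The dictionary keys `id`, `role` and `env` and both function names are assumptions for this example, not the layout used by Utab's scripts.

```python
from typing import Dict, List

Instance = Dict[str, str]  # hypothetical shape: {"id": ..., "role": ..., "env": ...}


def plan_manual_rollback(instances: List[Instance], role: str) -> Dict[str, str]:
    """Return {instance id: new "env" tag value} for a manual rollback.

    Current "prod" servers of the role are rotated out and the servers
    tagged "removed-from:prod" are brought back.
    """
    changes = {}
    for inst in instances:
        if inst["role"] != role:
            continue
        if inst["env"] == "prod":
            changes[inst["id"]] = "removed-from:prod"
        elif inst["env"] == "removed-from:prod":
            changes[inst["id"]] = "prod"
    return changes


def load_balancer_targets(instances: List[Instance], env: str, role: str) -> List[str]:
    """Select the ids of the servers a load balancer for this env and role should point to."""
    return sorted(i["id"] for i in instances if i["env"] == env and i["role"] == role)
```

After applying the planned tag changes, running the selection again for "prod" returns the old servers, which is what the load balancer configuration script would then pick up.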
We never finished it as rollbacks were a very rare occasion and the other two methods work fine.
Instances tagging and naming
Instances naming: role/build number
For instances with several artifacts that have build numbers: role/B1-B2-B3-…, where B1 is the primary artifact’s build number and the others follow in order of decreasing importance.
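The naming convention is easy to express in code. A minimal sketch, with a hypothetical function name:

```python
from typing import Sequence


def instance_name(role: str, build_numbers: Sequence[int]) -> str:
    """Format an instance name as role/B1-B2-...

    build_numbers must list the primary artifact's build number first.
    """
    if not build_numbers:
        raise ValueError("at least one build number is required")
    return f"{role}/" + "-".join(str(b) for b in build_numbers)
```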
The status tag is updated during script runs so when deploying to production it can for example look like: “Configuring” or “Self test attempt 3/18” or “Self tests failed” (rare occasion!) or “Configuring load balancing”.
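A status such as “Self test attempt 3/18” suggests a retry loop around the health check. The sketch below shows one way such a loop could report its progress. The function name, the default delay and the injected `check` and `set_status` callables are assumptions for this example; in a real script `set_status` would update the instance's status tag.

```python
import time
from typing import Callable


def wait_until_healthy(check: Callable[[], bool],
                       set_status: Callable[[str], None],
                       attempts: int = 18,
                       delay_seconds: float = 10.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run the self test until it passes or the attempts run out.

    Reports every attempt through set_status and returns True only if
    the server passed, so it is safe to add it to load balancing.
    """
    for attempt in range(1, attempts + 1):
        set_status(f"Self test attempt {attempt}/{attempts}")
        if check():
            return True
        if attempt < attempts:
            # Give the application time to start before the next attempt.
            sleep(delay_seconds)
    set_status("Self tests failed")
    return False
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.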
This post described one of the possible solutions. I think the solution described above is one of the best possible solutions for Utab. This does not mean it’s a good solution for any particular other company. Use your best judgement when adopting a similar solution or any parts of it.
You should always assume that a solution you are looking at is not a good fit for your problem or situation, and then go on to prove that it is, considering possible alternatives. Such an approach helps avoid confirmation bias.