Category Archives: Jenkins


The System Of Continuous Migration

Category : Tools


Introduction

We live in a world where a commercial organization has to be in a state of constant flux. That is  – if it wants to survive and prosper.

This statement is even more accurate for IT companies. (And  – as the popular saying goes – every company is an IT company today)

One could of course argue that I’m suffering from a consultant worldview bias. After all – consultants are mostly brought in to help with organizational and technological changes. In the last couple of years we at Otomato have been involved in dozens of projects that all had ‘migration’ or ‘transformation’ in their title.  So yes, definitely – change is all we see.

But I’ve spent more than 15 years in IT companies small and large prior to becoming a consultant – and it’s always been like this. With ever-accelerating speed. We’ve been changing languages, frameworks, architectural patterns and of course tools. Always migrating, rewriting, adapting and rethinking. Because that’s the business we’re in – the business of innovation. Because the value we provide is the promise of a brighter future. And that means we can never stand still – as yesterday’s future is tomorrow’s past.

The practical side of this exciting (and somewhat frightening) reality is that we are always on the lookout for new tools and technologies. Moreover – at any given moment we have at least one migration project planned, executed or failing. And it is stressful. Because these migrations and POCs are always full of uncertainty and risk. And because our performance is often measured by migration success. We are expected to have a grand triumph or to fail fast – to minimize the cost of failure. And the larger the migration project – the harder this becomes. The benefits of the new approach aren’t always immediately measurable. The true costs of migration only become visible after we’re neck deep. And we can’t really stop the daily grind to think it all through till the last bit.

So migrations are inevitable but stressful. And how do we make something less stressful? We practice it daily, we learn all the pitfalls and then develop a system to mitigate failures and risks. In other words – we do it continuously! And it certainly feels like we as an industry can benefit from a systemic definition of continuous migration. So let us look at various existing approaches, try to understand what works best and attempt to define a system.

The Two Approaches

In general we can say there are 2 leading approaches to migration. We can even label them as ‘the old way’ and ‘the new way’. The old way is the grand cutover approach and the new way is the start small approach. Yes, I know –  this old vs. new dichotomy is over-simplistic. Each approach has its own history, its own benefits and disadvantages. Moreover different systems require different approaches. Still there are certain trends in the industry that we can’t ignore. Sometimes these trends influence our decisions. And our goal here is to provide a system to base our decisions upon. A system that cuts through the mist of personal preferences and industry trends and provides a clearer view of the subject at hand.

But before that – let’s review the two approaches and what each of them entails. To make this more interesting we’ll start with what we previously labeled ‘new’ and then look at the ‘old’.

I must admit – I have my own biases that I’ve developed over the years. I’ll do my best to keep them out of the text when describing the existing approaches. Still – if I was perfectly sure that one of the approaches is superior  – I wouldn’t be writing this. What we’re trying to do here is to develop a superset of concepts and criteria. Something that will allow us to enjoy the best of both worlds while escaping most pitfalls on the way.

At this stage an attentive reader might object that I’m not discussing anything new here. This is just the old, beaten dichotomy of product development  – waterfall vs. agile, planned ahead vs. iterative. I do realize there are similarities. But migration projects aren’t the same as application development projects. One could argue that in migration there is no such thing as MVP. Showing that migration is viable isn’t enough to prove that it’s cost-effective. Moreover many existing business-wide systems don’t lend themselves easily to iterative migration. In a way they can be seen as life-critical systems which require meticulous testing and extensive proof of meeting the requirements prior to going live. The kind of proof that is very hard to obtain in a playground environment.

[Diagram: The Grand Cutover]

[Diagram: Start Small]

So let us start:

Start Small (the iterative approach)

This approach stems from the idea that it’s either impossible or too expensive to create a real staging environment for verifying the changes. As a matter of fact it’s not only about creating an environment. It is mainly about generating a sufficient load of real-life use cases in order to verify system readiness. The investment in such testing is seen as too high, especially if we think of migration as a one-time process. Migrate and forget. Which – as we already said in the introduction – is not the case in our modern world.

So if preparing everything on the side in one stride doesn’t look feasible, what do we do? We start small. We take a greenfield project, a small service on the side, a specific system module. Or a separate team. The innovators. The test pilots. The Kamikazes. The Shaheeds. We migrate (or start from scratch, if it’s a new project) that part of our system to the new framework. This is an experiment, an evaluation. No obligations, no commitments. Only good intentions and some bravery. In fact I think we need a new word for such migration projects – migrevaluations.

As a side note – from a small survey we’ve done – most engineers and managers today prefer to start small. With many of them not even seeing any other option. That’s why I called this ‘the new way’ – this is how many of us today feel things should be done. And it’s quite understandable. Psychologically it’s much easier and less intimidating to start something small than to try and think through all the implications of a months-long system-wide change. Additionally  –  most of us have had our brains so cleanly washed with Agile soap that we don’t see any alternatives. Scrum, Kanban et al. offer some great project management techniques – but they’re not necessarily the best framework for reasoning about a problem.

But the big question with the ‘start small’ approach is always: how (and when) do we verify that the migration is worthwhile? “Define KPIs!” – the smarter folks will say. E.g.: the migration to the new tool should shorten the build time by 30%. Or: the migration to the new orchestration framework will allow us to release twice as often with 25% fewer bugs. I certainly believe that defining these goals is important and even vital when starting a new migrevaluation. So let’s say – we’ve determined the KPI. And our small kamikaze project has consistently achieved it across a defined state matrix. Now – how do we know if this achievement will scale all across our system? After all – it’s evident that large systems require different approaches. You can’t manage a large company the same way you manage a startup. The performance and stability of a large multi-component system is based on the interactions between the multitude of its components. Testing in isolation doesn’t really prove anything.

The preachers of iteration will say: “ok then. If the sample is too small – we’ll add another component, team, service. And we’ll continue adding more – until we prove our point. Or find that the solution doesn’t scale well.” Which is a perfectly valid approach. In the world of science and experimentation. But not in the world of business and heartless financial calculations. Because if we prove ourselves wrong – we’ve already spent a lot of time and money.

In many cases what happens in such a situation is that the migration is led to completion anyway. With some KPI mangling to make it look more like a success than a wasted effort. This happens because we’re all human and we all have loss aversion hard-coded into our system. It’s much harder for us to admit failure after we’ve already envisioned success.

As we’ve seen – the ‘start small’ approach definitely has some very attractive sides, but isn’t without pitfalls. Let’s see what the alternative is.

The Grand Cutover

This approach entails an exhaustive preparation stage. First – all the migration costs are carefully evaluated. The KPIs are defined. Then – a testing or staging environment is prepared. And only after all the tests have proven that the new platform is fully functional – do we perform the grand migration!

We’ve already seen the main issues with this approach. It has high upfront costs, is perceived as hard to pull off and still – gives no promise that the migration will provide the expected benefits. The demon of loss aversion is raising its head in our psyche.

But I would argue that there are situations where investing in preparation is actually much more cost-effective than starting small and planning as we roll.

First there’s the case of life-critical systems – those systems where the cost of disruption is too high.

And second – it’s important to remember that not all migrations we perform are migrevaluations. Some of them aren’t done to improve any business metrics. Instead they are required because:

  • the old system isn’t supported anymore
  • there’s been a company-wide decision we have no influence upon
  • the migration is required by another change in a related system
  • Add your own reason here.

When this is the case – there’s no real reason to start small. Instead we want the transition to be as fast and painless as possible. With minimal downtime and no hidden hope for a rollback. And that means – we need to do everything in our power to get properly prepared for the shift. With steps being:

  • Define all the players and stakeholders affected by the change
  • Gather their inputs and expectations from the new framework
  • Based on these inputs – define the functional requirements that the new framework must implement
  • Define the test data set
  • Define and allocate the necessary resources (human, compute, storage and network)
  • Plan and implement company-wide training
  • Define the minimal time for restoring system functionality
  • Rehearse the migration until the defined KPIs are consistently achieved.
  • Set the date for migration.
  • Cut over!

This is easier said than done, of course. Anyone who’s been through such a project should realize how much detail is hidden behind each of these steps. How much virtual blood, sweat and tears have to be shed in order to bring this to completion.

But on the brighter side – this is a much better planned-out process. With defined start and end criteria and a decisive direction. As long as we’re on track – we don’t need to re-evaluate as we go. And even if obstacles prevent us from delivering on time – we can always move the dates without compromising the content of the original plan.

Note that with all the grandeur of the task at hand  –  this planned out, monolithic  (I know, I know – this is a curse word) process involves much less heroics ( and consequently – less burnout)  than the guerilla mode of the iterative innovation.

With all that said – we all realize why this approach is out of favour nowadays. Exactly for the same reason we need to be continuously migrating. The technological world is changing fast, the deadlines are pressing. Companies usually go into across-the-board migrations only when they find themselves in a near-death condition. The infamous Project Inversion at LinkedIn required the infrastructure team to freeze all changes in existing systems for a few months. Only then were they able to focus on rebuilding everything for the move to microservices they had planned. And it’s not easy to convince ourselves that we need to put everything on hold for the promise of a brighter future. It requires either trust or desperation.

Let’s Try To Define a System

So, with all that said – how do we define a global system for continuous migration?

  1. Embrace Continuous Migration

    • The first thing to do here is to accept the fact that migration is a continuous process. No matter if we start small or go all in – this is work that’s never done. We’ll always have more stuff to migrate even before the current migration is over.
  2. Define Migration Strategy

    • Be very clear about why you’re entering a migration project, what type of system you are migrating, what will be the success and failure criteria and if failure is even an option.
    • Some questions to ask at that stage:
      • Is this a ‘life-critical’ system?
      • What can be considered a representative sample?
      • Is this a migrevaluation or a migration?
      • Are there alternative frameworks you’ll want to evaluate before deciding?
  3. Involve the Stakeholders

    We’ve outlined this when describing the end-to-end migration steps. But – we do believe this to be a very important stage also when starting small. A lot of migrevaluations or side-project migrations either fail or become too costly because this stage is skipped. Take for example an infrastructure team tasked with evaluating a migration for a codebase that they have no deep understanding of. We always see much better results when developers and testers are involved from the very beginning. They have intimate knowledge of the code, its quirks and caveats, and of all the reasons for ugly hacks that are hidden all across the system. So please make sure you:

    • Define all the players and stakeholders affected by the change
    • Gather their inputs and expectations from the new framework
  4. Define the KPIs and Exit Criteria

    • The intensiveness of this stage very much depends on the type of migration we’ve defined this to be in step 2. Still – no matter if we start small or go all-in – we need to have a defined concept of where we want to arrive. Or at least what’s the next milestone we want to reach. And how we decide if this is a go or a no-go.
  5. Define the Verification Strategy

    • How do we measure the KPIs and criteria we’ve defined? Options include:
        • Defining a testing data set
        • Using A/B testing
        • Using dark launching
        • Manual verification in a sandbox environment.
        • Any combination of the above.
  6. Allocate resources

    • Who is tasked with the migration? Do we assign a special team? (Generally an anti-pattern, in our experience.) Or do we reserve some capacity of the existing teams for continuous migration activity? (The recommended approach.) What non-human resources are needed for the migration effort? How scalable do we want these resources to be?
  7. Define the Knowledge Accumulation and Distribution Patterns

    This definitely depends on the migration strategy we’ve chosen. For all-in, grand cutover migrations – we want our teams to be ready when the big day arrives. Therefore this is the time to organize training, assign change agents and start preparing a corporate knowledge base for the new framework.

    If we’re starting small, evaluating and learning as we go – this is where we define best practices for progress documentation and create a migration project Wiki. Needless to say – in evaluation projects the accumulation of knowledge should be our foremost goal.

  8. Start the process.

    We’re done with all the thinking – time to start doing. It’s important to note that our migration strategy shouldn’t directly impact our project management methods. We can perfectly well manage grand cutover projects using Kanban for splitting the work into manageable tasks, limiting WIP and verifying our progress all along the road.

  9. Plan for the next migration

We’ve already embraced the fact this was a continuous process, haven’t we?

 

Conclusion:

Migrations are an everyday part of our tech life. The stacks will continue to change and we’ll never want to be left behind. Migrations are inevitable but not easy. Different strategies and approaches can be applied. In this post we’ve presented an attempt at creating a sequence of steps to base our continuous migration effort upon. This sequence is a result of our combined four decades of industry experience. Things we’ve seen working better and worse. Following these steps won’t guarantee a successful migration (as there are a lot of other factors involved) but can definitely make your effort less stressful and more effective.

 

Would you like some help with DevOps transformation or software delivery optimization at your company? Drop us a note – we’ll be happy to help!

 



Dynamically spinning up Jenkins slaves on Docker clusters

Introduction:

Being able to dynamically spin up slave containers is great. But if we want to support significant build volumes we need more than a few Docker hosts. Defining a separate Docker cloud instance for each new host is definitely not something we want to do – especially as we’d need to redefine the slave templates for each new host. A much nicer solution is combining our Docker hosts into a cluster managed by a so-called container orchestrator (or scheduler) and then define that whole cluster as one cloud instance in Jenkins.
This way we can easily expand the cluster by adding new nodes into it without needing to update anything in Jenkins configuration.

There are 4 leading container orchestration platforms today and they are:

Kubernetes (open-sourced and maintained by Google)

Docker Swarm (from Docker Inc. – the company behind Docker)

Marathon (a part of the Mesos project)

Nomad (from Hashicorp)

A container orchestrator (or scheduler) is a software tool for the deployment and management of OS containers across a cluster of computers (physical or virtual). Besides running and auditing the containers, orchestrators provide such features as software-defined network routing, service discovery and load-balancing, secret management and more.

There are dedicated Jenkins plugins for Kubernetes and Nomad using the Cloud extension point. Which means they both provide the same ability to spin up slaves on demand. But instead of doing it on a single Docker host they talk to the Kubernetes or Nomad master API respectively in order to provision slave containers somewhere in the cluster.

Nomad

The Nomad plugin was originally developed by Ivo Verberk and further enhanced by yours truly while doing an exploratory project for Taboola. A detailed post describing our experience will be up on the Taboola engineering blog sometime next month.
Describing Nomad usage is out of the scope of this book, but in general – exactly like the YAD plugin – it allows one to define a Nomad cloud and a number of slave templates. You can also define the resource requirements for each template so Nomad will only send your slaves to nodes that can provide the necessary amount of resources.
Currently there are no dedicated Pipeline support features in the Nomad plugin.
Here’s a screenshot of the Nomad slave template configuration:

[Screenshot: Nomad slave template configuration]
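Since the plugin provisions agents by label, a Pipeline job doesn’t need any dedicated steps to use Nomad – it simply requests the label defined in the slave template. Here’s a minimal sketch (the ‘nomad-java’ label and the repository URL are assumptions for illustration, not something the plugin defines):

node('nomad-java') {
  // 'nomad-java' is assumed to be the label of a slave template in the Nomad cloud config
  stage('Checkout') {
    git 'https://github.com/your-org/your-repo.git' // hypothetical repository
  }
  stage('Build') {
    // this runs inside the container Nomad scheduled somewhere in the cluster
    sh 'mvn -B clean install'
  }
}

When no agent with the requested label is available, the Nomad cloud kicks in and provisions a new slave container – the same on-demand behaviour the YAD plugin gives you on a single Docker host.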

Kubernetes

The Kubernetes Plugin was developed and is still being maintained by Carlos Sanchez. The special thing about Kubernetes is that its basic deployment unit is a Kubernetes pod, which can consist of one or more containers. So here you get to define pod templates. Each pod template can hold multiple container templates. This is definitely great when we want predefined testing resources to be provisioned in a Kubernetes cluster as a part of the build.

The Kubernetes plugin has strong support for Jenkins pipelines with things like this available:

podTemplate(label: 'mypod', containers: [
    containerTemplate(name: 'maven', image: 'maven:3.3.9-jdk-8-alpine', ttyEnabled: true, command: 'cat'),
    containerTemplate(name: 'golang', image: 'golang:1.8.0', ttyEnabled: true, command: 'cat')])
{
  node('mypod') {
    stage('Checkout JAVA') {
      git 'https://github.com/jenkinsci/kubernetes-plugin.git'
      container('maven') {
        stage('Build Maven') {
          sh 'mvn -B clean install'
        }
      }
    }
    stage('Checkout Go') {
      git url: 'https://github.com/hashicorp/terraform.git'
      container('golang') {
        stage('Build Go') {
          sh """
          mkdir -p /go/src/github.com/hashicorp
          ln -s `pwd` /go/src/github.com/hashicorp/terraform
          cd /go/src/github.com/hashicorp/terraform && make core-dev
          """
        }
      }
    }
  }
}

There’s detailed documentation in the plugin’s README.md on GitHub.

Marathon

There is a Jenkins Marathon plugin, but instead of spinning up build slaves it simply provides support for deploying applications to a Marathon-managed cluster.
It requires a Marathon .json file to be present in the project workspace.
There’s also support for Pipeline code. Here’s an example of its usage:

 marathon(
   url: 'http://otomato-marathon',
   id: 'otoid',
   docker: 'otomato/oto-trigger')

 

Docker Swarm

I used to think there was no dedicated plugin for Swarm but then I found this. As declared in its README.md, this plugin doesn’t use the Jenkins cloud API, even though it does hook into it for label-based slave startup. This non-standard approach is probably the reason why the plugin isn’t hosted on the official Jenkins plugin repository.
The last commit on the GitHub repository dates back 9 months, so it may also be outdated – as Docker and Swarm are changing all the time and so are their APIs.

Learn More:

This post is a chapter from the ebook ‘Docker, Jenkins, Docker’ which we recently released at Jenkins User Conference TLV 2017.  Follow this link to download the full ebook: http://otomato.link/go/docker-jenkins-docker-download-the-ebook-2/



DevOps is a Myth

(Practitioner’s Reflections on The DevOps Handbook)

The Holy Wars of DevOps

Yet another argument explodes online around the ‘true nature of DevOps’, around ‘what DevOps really means’ or around ‘what DevOps is not’. At each conference I attend we talk about DevOps culture, DevOps mindset and DevOps ways. All confirming one single truth – DevOps is a myth.

Now don’t get me wrong – in no way is this a negation of its validity or importance. As Y.N. Harari shows so eloquently in his book ‘Sapiens’ – myths were the forming power in the development of humankind. It is in fact our ability to collectively believe in these non-objective, imagined realities that allows us to collaborate at large scale, to coordinate our actions, to build pyramids, temples, cities and roads.

There’s a Handbook!

I am writing this while finishing the exceptionally well written “DevOps Handbook”. If you really want to know what stands behind the all-too-often misinterpreted buzzword – you better read this cover-to-cover. It presents an almost-no-bullshit deep dive into why, how and what in DevOps. And it comes from the folks who invented the term and have been busy developing its main concepts over the last 7 years.


Now notice – I’m only saying you should read the “DevOps Handbook” if you want to understand what DevOps is about. After finishing it I’m pretty sure you won’t have any interest in participating in petty arguments along the lines of ‘is DevOps about automation or not?’. But I’m not saying you should read the handbook if you want to know how to improve and speed up your software manufacturing and delivery processes. And neither if you want to optimize your IT organization for innovation and continuous improvement.

Because the main realization that you, as a smart reader, will arrive at – is just that there is no such thing as DevOps. DevOps is a myth.

So What’s The Story?

It all basically comes down to this: some IT companies achieve better results than others. Better revenues, higher customer and employee satisfaction, faster value delivery, higher quality. There’s no one-size-fits-all formula, there is no magic bullet – but we can learn from these high performers and try to apply certain tools and practices in order to improve the way we work and achieve similar or better results. These tools and processes come from a myriad of management theories and practices. Moreover – they are constantly evolving, so we need to always be learning. But at least we have the promise of better life. That is if we get it all right: the people, the architecture, the processes, the mindset, the org structure, etc.

So it’s not about certain tools, because the tools will change. And it’s not about certain practices – because we’re creative and frameworks come and go. I don’t see too many folks using Kanban boards 10 years from now. (In the same way only the laggards use Gantt charts today.) And then the speakers at the next fancy conference will tell you it’s mainly about culture. And you know what culture is? It’s just a story, or rather a collection of stories that a group of people share. Stories that tell us something about the world and about ourselves. Stories that have only a very relative connection to the material world. Stories that can easily be proven as myths by another group of folks who believe them to be wrong.

But Isn’t It True?

Anybody who’s studied management theories knows how the approaches have changed since the beginning of the last century. From Taylor’s scientific management down to McGregor’s X&Y theory they’ve all had their followers. Managers who’ve applied them and swore they got great results thanks to them. And yet most of these theories have been proven wrong by their successors.

In the same way we see this happening with DevOps and Agile. Agile has been all the buzz since its inception in 2001. Teams were moving to Scrum, then Kanban, now SAFe and LeSS. But Agile didn’t deliver on its promise of a better life. Or rather – it became so commonplace that it lost its edge. Without the hype, we now realize it has its downsides. And we now hope that maybe this new DevOps thing will make us happy.

You may say that the world is changing fast – that’s why we now need new approaches! And I agree – the technology, the globalization, the flow of information – they all change the stories we live in. But this also means that whatever is working for someone else today won’t probably work for you tomorrow – because the world will change yet again.

Which means that the DevOps Handbook – while a great overview and historical document and a source of inspiration – should not be taken as a guide to action. It’s just another step towards establishing the DevOps myth.

And that takes us back to where we started – myths and stories aren’t bad in themselves. They help us collaborate by providing a common semantic system and shared goals. But they only work while we believe in them and until a new myth comes around – one powerful enough to grab our attention.

Your Own DevOps Story

So if we agree that DevOps is just another myth, what are we left with? What do we at Otomato and other DevOps consultants and vendors have to sell? Well, it’s the same thing we’ve been building even before the DevOps buzz: effective software delivery and IT management. Based on tools and processes, automation and effective communication. Relying on common sense and on being experts in whatever myth is currently believed to be true.

As I keep saying – culture is a story you tell. And we make sure to be experts in both the storytelling and the actual tooling and architecture. If you’re currently looking at creating a DevOps transformation or simply want to optimize your software delivery – give us a call. We’ll help to build your authentic DevOps story, to train your staff and to architect your pipeline based on practice, skills and your organization’s actual needs. Not based on myths that other people tell.



Thank you Intel Sports!

Category : Tools


Mission completed! We’ve done a full month of getting the #Intel Sports developers up to speed with git. It’s always fun to train bright folks – and the engineers at Intel are certainly among the brightest we’ve had the privilege to preach git to.

While providing the training we’ve also developed a few ideas regarding git subtree and the plan is to share these ideas in a follow-up post to this one (which compares submodules to repo)

Have a great weekend!

 



DevOps Flow Metrics – http://devopsflowmetrics.org

Category : Tools


DevOps transformation goals can be defined as:

  • Heightened Release Agility
  • Improved Software Quality

Or simply:

Delivering Better Software Faster

Therefore measurable DevOps success criteria would be:

  • Being able to release versions faster and more often.
  • Having less defects and failures.

Measurement is one of the cornerstones of DevOps. But how do we measure flow?

In order to track the flow (the amount of change getting pushed through our pipeline in a given unit of time) we’ve developed the 12 DevOps Flow Metrics.

They are based on our industry experience and ideas from other DevOps practitioners and are a result of 10 years of implementing DevOps and CI/CD in large organisations.

The metrics were initially publicly presented by Anton Weiss at a DevOpsDays TLV 2016 ignite talk. The talk got a lot of people interested and that’s why we decided to share the metrics with the community.

We’ve created a GitHub Pages based minisite where everyone can learn about the metrics, download the presentation and submit comments and pull requests.

Looking forward to your feedback!

Get the metrics here : http://devopsflowmetrics.org

 



Jenkins and the Future of Software Delivery

Are you optimistic when you look into the future and try to see what it brings? Do you believe in robot apocalypse or the utopia of singularity? Do you think the world will change to the better or to the worse? Or are you just too busy fixing bugs in production and making sure all your systems are running smoothly?

Ant Weiss of Otomato describing the bright future of software delivery at Jenkins User Conference Israel 2016.



How OpenStack is Built – the Video now Online

Watch Ant Weiss of Otomato provide an overview of the OpenStack CI – probably one of the most advanced Jenkins-based CI infrastructures in the world.



The Number One Symptom of Not Actually ‘Doing DevOps’

Category : Tools

So I recently talked to a release engineering team leader at a very well-known American software+hardware company located in California.
They contacted me looking for top-notch build infrastructure engineers and we spent a very interesting hour discussing their technological stack and people processes. On the surface – they are moving in the right direction – automating all the things, using Chef for config management, codifying the infrastructure, using Artifactory for binaries… But one thing struck me in our conversation. The team leader (clearly a very smart and talented guy) said: “I’ve been trying to hire for two years. Good professionals are so hard to find…”
I agreed at first – we all know the market is starving for technical talent.
But reflecting on our conversation afterwards I realized that he in fact showed me the number one symptom of how far their team is from ‘doing devops’.

Yes – hiring is challenging. I’ve discussed this before. It may take some time to find the person whose mindset, values and overall vibe suit your expectations. Even more so – if you’re low on budget, as the market has become highly competitive. But – for this specific team leader the budget was not an issue. He was failing to hire because of his organisation’s unwillingness to invest in mentoring and development.

He was looking for the ready-made fullstack ninja with a very specific skillset. There is no other way to explain his 2-year-long quest.

Can you see how this totally opposes the ever-important DevOps value of Sharing? So you’ve built a few things to be proud of, you’re technologically savvy, you’re on the bleeding edge. Now is the time to share your knowledge! Now is the time to go hire bright novices. They are out there, they are hungry to learn from your experience and by sharing with them you will build the real DevOps on your team.

So if you’re failing to hire for more than a couple of months – don’t blame the market. Don’t complain about lousy candidates. Go revise your hiring process and more importantly – the way your team works. You may have all the technology, but it looks like your culture is broken. And with a broken culture – even if you eventually succeed in hiring, you’ll have a hard time reaping the true benefits of the DevOps way – agility, quality, motivation and trust.

And may the devops be with you!



Custom deploy process at Utab

Hi! I’m Ilya Sher.

This guest post will describe the deploy process at Utab, one of my clients.

Background – system architecture summary

Utab uses a services architecture. Some services are written in Java, others in NodeJS. Each server has either one Java application or one NodeJS application. The production environment uses (with the exception of load balancing configuration) the immutable server approach.

Requirements for the deploy process

The client specified the following requirements:

  1. Support for staging and production environments
  2. Manually triggered deploy for each of the environments
  3. Health check before adding a server to load balancing in production environment
  4. Option to easily and quickly rollback to previous version in production environment
  5. Simple custom tools are to be used ( no Chef/Puppet/… )

 

Solution background

Utab’s deploy and other scripts were made specifically for the client. Such custom scripts are usually simpler than any ready-made solution. It means they break less and are easier to modify and operate. No workarounds are required in cases when a chosen framework is not flexible enough.

Also note that a custom solution solves the whole problem, not just parts of the problem as is often the case with ready-made solutions. For example the scripts also handle the following aspects: firewall rules configuration, EC2 volumes attachment, EC2 instances and images tagging, running tests before adding to the load balancer. When all of this is solved by a single set of custom scripts and not by using several ready-made solutions plus your own scripts, the result looks much better and more consistent.

We estimate that such a custom solution has a lower TCO despite the development effort. Also note that in the case of Utab, the development effort for the framework scripts was not big: about ten working days upfront and another few days over three years. I attribute this at least partly to sound and simple software and system architecture.

 

Note that the opinion in the summary above has many opponents. My experience shows that using ready-made solutions for the tasks described above leads to a higher TCO. This mostly happens because of the complexity of such solutions and other reasons that I’ll be discussing in another post.

The build

  1. A developer pushes a commit to GitHub
  2. A GitHub hook triggers Jenkins
  3. Jenkins does the build and runs the tests: Mocha unit tests for NodeJS or JUnit tests for Java
  4. A small script embeds git branch name, git commit and Jenkins build number in pom.xml, in the description field.
  5. Jenkins places the resulting artifact in the Apache Archiva repository (only if #3 succeeds)

 

What you see above is common practice and you can find a lot about these steps on the internet except for number 4, which was our customization.
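To make step 4 a bit more concrete, here is a rough sketch of how such a build job could look as a scripted Jenkins Pipeline. This is not Utab’s actual job definition – the sed command, the repository URLs and the Archiva deployment settings are assumptions – it only illustrates the overall shape of steps 2 to 5:

node {
  stage('Checkout') {
    // triggered by the GitHub hook; the repository URL is hypothetical
    git 'https://github.com/your-org/your-service.git'
  }
  stage('Build and test') {
    // compiles the Java service and runs the JUnit tests
    sh 'mvn -B clean verify'
  }
  stage('Embed build metadata') {
    // step 4: put branch, commit and build number into the pom.xml description field
    // (GIT_BRANCH and GIT_COMMIT are assumed to be exposed by the git plugin)
    sh "sed -i 's|<description>.*</description>|<description>branch=${env.GIT_BRANCH} commit=${env.GIT_COMMIT} build=${env.BUILD_NUMBER}</description>|' pom.xml"
  }
  stage('Publish') {
    // step 5: upload the artifact to the Apache Archiva repository (the URL is an assumption)
    sh 'mvn -B deploy -DaltDeploymentRepository=archiva::default::https://archiva.example.com/repository/internal'
  }
}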

Note that while the repository is Java oriented we use it for both Java and NodeJS artifacts for uniformity. We use mvn deploy:deploy-file to upload the NodeJS artifacts.

For NodeJS artifacts, static files that go to a CDN are included in the artifact. Java services are not exposed directly to the browser so Java artifacts do not contain any static files for the CDN.

 

Custom scripts involved in the deployment process

Scripts that use the AWS API are in Python. The rest are in bash. I would be happy to use one language but bash is not convenient for using the AWS API and Python is not as good as bash for systems tasks such as running programs or manipulating files.

I do hope to solve this situation by developing a new language (and a shell), NGS, which should be convenient for the system tasks, which today include talking to APIs, not just working with files and running processes.

create-image.py

  1. Creates EC2 instance (optional step)
  2. Runs upload-and-run.sh (see the description below).
  3. Runs configure-server.sh to configure the server, including the application (optional step)
  4. Runs tests (optional step)
  5. Creates an AMI

 

upload-and-run.sh

Gets destination IP or hostname and the role to deploy there.

  1. Pulls the required artifacts from the repository. The default is the latest build available in the repository. A git branch or a build number can be specified to pick another build. If a branch is given, the latest build on that branch is used. This relies on the information embedded in pom.xml (see step 4 of the build).
  2. Runs all relevant “PRE” hook files (which for example upload static files to S3)
  3. Packages all required scripts and artifacts into a .tar.gz file
  4. Uploads the .tar.gz file
  5. Unpacks it to a temporary directory
  6. Runs setup.sh (a per-role file: website/setup.sh, player/setup.sh, etc.) which controls the whole installation.

 

deploy.py

  1. Starts EC2 instance from one of the AMIs created by the create-image.py script
  2. Runs applications tests
  3. Updates the relevant load balancers switching from old-version-servers to new-version-servers.

 

The Python scripts, when successful, notify the developers using a dedicated Twitter account. This is as simple as:

   # requires the python-twitter package (import twitter);
   # 'c' is the credentials dict loaded from config, 's' is the status text
   api = twitter.Api(c['consumer_key'], c['consumer_secret'], c['access_token_key'], c['access_token_secret'])
   return api.PostUpdate(s)

Pulling artifacts from the repository before deploying

I would like to elaborate on pulling the required artifacts from the repository. The artifacts are pulled from the repository to the machine where the deploy script runs – the management machine (yet another server in the cloud, near the repository and the rest of the servers). I have not seen this implemented anywhere else.

Note that pulling artifacts to the machine where the script runs works fine when a management machine is used. You should not run the deployment script from your office as downloading and especially uploading artifacts would take some time. This limitation is a result of a trade-off. The gained conceptual simplicity outweighs the minuses. When the setup scripts are uploaded to the destination server, the artifacts are uploaded with them, so the destination application servers never need to talk to the repository and hence there are no security issues to handle there, for example.

 

Deploying to staging environment

  1. Artifact is ready (see “The build” section above)
  2. One of the developers runs create-image.py with flags that tell create-image.py not to create a new instance and not to create an AMI from the instance. This limits create-image.py to the deploy process only: running upload-and-run.sh.

    Since the “ENV” environment variable is also passed, the configuration step is also run (configure-server.sh)

    Since there are switches and environment variables that create-image.py needs that none of us would like to remember, there are several wrapper scripts named deploy-ROLE-to-staging.sh

 

Deploying to production environment

It’s the responsibility of the developers to make sure that the build being packaged at this step is one of the builds that were tested in the staging environment.

  1. Artifact is ready (see “The build” section above)
  2. One of the developers runs create-image.py and the script creates an AMI
    The solution to (not) remembering all the right switches and environment variables is documentation in the markdown formatted readme file at the top level of the repository.
  3. One of the people that are authorized to deploy to production runs deploy.py

 

Rollback to previous version

This can be done in one of the following ways:

 

New deploy

Run the deploy.py script giving an older version as an argument

 

Manual quick fix

When old servers are removed from the load balancing, they are not immediately terminated. Their termination is manual, after the “no rollback” decision. The servers that rotated out of the load balancing are now tagged with “env” tag value “removed-from:prod”.

  1. Change the “env” tag of the new servers to “removed-from:prod” or anything else that is not “prod”
  2. Change “env” tag on the old servers to “prod”
  3. Run the load balancers configuration script. The arguments for this script are the environment name and the role of the server. The script updates all the relevant load balancers that point to servers with the given role.

Rollback script

We never finished it, as rollbacks were a very rare occasion and the other two methods work fine.

Instances tagging and naming

Instances naming: role/build number

For instances with several artifacts that have build number: role/B1-B2-B3-…, where B1 is primary artifacts build number and others follow in order of decreasing importance.

The status tag is updated during script runs so when deploying to production it can for example look like: “Configuring” or “Self test attempt 3/18” or “Self tests failed” (rare occasion!) or “Configuring load balancing”.

 

Summary

This post described one of the possible solutions. I think the solution described above is one of the best possible solutions for Utab. This does not mean it’s a good solution for any particular other company. Use your best judgement when adopting similar solution or any parts of it.

 

You should always assume that a solution you are looking at is not a good fit for your problem/situation and then go on and prove that it is, considering possible alternatives. Such an approach helps avoid confirmation bias.