Monthly Archives: February 2010

Links for 2010-02-25

Here’s Thursday’s links from My Twitter Stream.

Justifying Continuous Integration Expenditure

Banos commented on my last post:

So why, oh why, oh why is it so difficult to get an additional server? Has anyone come up with a formula to produce some numbers for the bean counters to justify this already?

I propose this is an endemic problem that the guys on the ground give up fighting to resolve because there is no budget for more servers.

Here’s a start: by measuring all the time that developers waste waiting for Continuous Integration. Paul Julius writes about the cup of coffee metric. If you can measure that time, there’s your formula:

hours wasted per dev per day * number of developers * average hourly rate * 22 working days a month (on average) = monthly cost of not having the right hardware.

A local example: 30 minutes a day, for six developers, at an average cost of £50 an hour, gives £3,300 a month. That’s not even a high rate for developers, or much time spent. Of course, your organisation may be unable to budget any more on infrastructure at the same time as it pays developers to watch the Continuous Integration machine. That’s a different problem from getting managers to understand the true cost of something.
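The formula above is small enough to keep in a script and rerun with your own numbers. This is a minimal sketch in Python; the function and parameter names are made up for illustration, and the default of 22 working days is the average used above.

```python
def monthly_ci_wait_cost(hours_wasted_per_dev_per_day, developers,
                         hourly_rate, working_days=22):
    """Monthly cost of developers waiting on Continuous Integration."""
    return (hours_wasted_per_dev_per_day * developers
            * hourly_rate * working_days)


# The local example: 30 minutes a day, six developers, £50 an hour.
cost = monthly_ci_wait_cost(0.5, 6, 50)
print(f"£{cost:,.0f} a month")
```

Plug in the waiting time you actually measure (the cup of coffee metric) rather than a guess; the result is only as good as that measurement.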

Thanks for the interesting comment, Banos. Has anyone tried this kind of approach and won? Tell us!

(photo via See-Ming Lee)


Build Pattern: Green-lit build

Continuous Integration should be a highway, not a parking lot. But that’s what happens sometimes when developers end up competing for limited Continuous Integration capacity. Developers working on critical and time-sensitive work like production bugfixes can struggle to get their builds serviced promptly; they can be fighting a tide of checkins from their colleagues working on new functionality. Also, functional tests can jam up the available build agents while short builds queue up. This delays the feedback to the developers that they have integrated their code properly.

What to do? Dedicate some capacity to those builds that could actually take priority. How you implement this depends on your continuous integration system. On CruiseControl (the original one), I made a separate server for this. I’m implementing the same thing on TeamCity at the moment and I’ve added an environment variable to the build agents that I want to reserve for fast builds. Any build that is longer than 15 minutes is told not to use an agent if the variable exists on it.
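The routing rule itself is easy to state in code. The sketch below is plain Python, not TeamCity configuration: the agent names and the `reserved` flag (standing in for the environment variable) are hypothetical, and only the 15 minute threshold comes from the setup described above.

```python
FAST_BUILD_LIMIT_MINUTES = 15  # threshold mentioned in the post


def eligible_agents(build_minutes, agents):
    """Return the names of the agents a build is allowed to use.

    `agents` maps an agent name to True when that agent is reserved
    for fast builds (i.e. the marker variable exists on it).
    """
    if build_minutes > FAST_BUILD_LIMIT_MINUTES:
        # Long builds must stay off the reserved agents.
        return [name for name, reserved in agents.items() if not reserved]
    # Short builds may run anywhere, including the reserved agents.
    return list(agents)


# Hypothetical agent pool: one agent kept free for fast builds.
agents = {"agent-1": False, "agent-2": False, "agent-fast": True}
print(eligible_agents(40, agents))
print(eligible_agents(5, agents))
```

The reserved agent will sit idle some of the time, which is the point: a short build never has to queue behind a long functional test run.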

One might argue that I’m not making best use of the CI server by doing this. That doesn’t matter. People are more expensive than Continuous Integration servers; let’s optimise the system for them.

(image from Ted Percival)


Links for 2010-02-24

Here’s Wednesday’s links from My Twitter Stream.

Links for 2010-02-23

Here’s Tuesday’s links from My Twitter Stream.

The hidden cost of building

Thanks to EJ Ciramella for this thought-provoking post. There’ll be a Build Doctor T-Shirt on its way to him soon.

In this down economy, irrespective of size of company involved, people want to save money, limit costs and increase throughput of their systems. One area of savings is the build and continuous integration environment.

With the smattering of continuous integration servers available to said company’s release engineering staff, many offer non-acl controlled build buttons. Without this control point, anyone in any department attached to the corporate network can click off a build, regardless of the readiness of the code within source control.

Where I work, we’ve recently dropped CruiseControl for Hudson. A few colleagues have come by since the rollout asking where the build button is because they don’t have access. When pressed about why they’d like to manually spin a build, the resounding answer has been “to see what it looks like in Hudson”. This is the exact situation we’re trying to avoid and the exact subject I’ll try and illuminate within this article.

Before diving any deeper, there are a few things I think release engineering for any company must understand. Each company has a unique workflow, from project concept, to design, to implementation to release and off into support mode. What works great at one place may or may not work at another company. There are industry best practices and white papers aplenty, but if you find it difficult to follow any of these at your company, the best approach in most cases is to take what you can from these documents and carefully plan an evolutionary (not revolutionary) process to reach a tailored solution.

With all of this in mind, let’s cover the various steps and the unseen costs to performing builds.

Here is a high-level list of things that happen when we spin a release type build.

  • SCM label

Initially, we used to label first and ask questions later. The mindset was that if the build failed, a developer could sync (we use Perforce) to a label and get exactly what the build server had used to generate the failure. Since the Perforce plugin for Hudson doesn’t operate in this manner (and in a few other ways, which led us to alter the plugin to suit), we made the switch to labeling only if the build passes. Since very few developers ever did take time to sync by the labels, the failed build labels were just a waste of space. Either way, depending on the size of the codebase getting the label attached, there is disk space and memory consumption that happens on the Perforce server.

  • Build node (CPU/RAM/HD)

I’m sure that the visitors of this site are savvy enough to understand the beauty and flexibility that come with a distributed build system. I use the term “distributed” here loosely, as any given build is not farming out various parts of compilation, just individual jobs (in Hudson parlance) or in some cases, individual maven modules. By the time a build reaches this queue stage, the job has consumed an executor within the cluster. In our current cluster, we have three slaves and a master. The master and two slaves are only allowed one executor. The third slave has six. If this build is forced to run on one of the “singleton” nodes, that node becomes unavailable until the build is completed. Thankfully, our build times are short (the longest is 20 mins), but I think the speed of the builds mentally cheapens the process. Don’t misinterpret this as “let’s artificially inflate build times”, but keep this in mind when a large refactoring of the build process and associated scripts yields a massive time savings (heck, why not push back to get more testing done in that same time frame?).

  • Client spec updates (Perforce Server)

One of the great features of Perforce is that the server is where the “what you have” list is maintained. I’ve seen arguments for working offline and if a refactoring happens, yadda-yadda-yadda. But in large-scale corporate environments, many institutional services will be unavailable in this mode.

When a person syncs a project (in Perforce terms, but essentially a directory of directories and files), the server updates its files to reflect what the user ends up with. Same is true with the build process. When setting up Hudson, our SCM configuration choices left us with Hudson managing the client specs. This means, in some cases for us, there is one client spec for each node for each job. Even if only one is getting updated, that is still data being written to the server (again).

  • Artifact storage (both live and backup)

Now the build is finished and we’re going to retrieve the artifacts for storage. Where I work, the build artifacts are stored in a few ways (all on various NetApp slices). They are as follows:

  • Deployable units – These are the actual applications that you push to our various deployment environments. Because of a facilitating maven project that allows cross application dependency validation, some of these artifacts are stored on a generalized NetApp slice (to allow us to keep our artifact storage Continuous Integration agnostic – who knows when we’ll switch from Hudson to something else) and replicated to Archiva to allow people to reference certain bits as dependencies from pom files.
  • Libraries – These are effectively the building blocks of the deployable units. The final destination of libraries is Archiva (or your repository manager of choice).

Essentially, there are two places things are stored, Archiva and our “buildartifacs” mount, to reiterate, both of which are NetApp mounts. There is a backup mechanism that keeps around hourly, daily, monthly and yearly restore points. All of this takes up space but if we ever had a complete system failure, we could very quickly return to business as usual.

  • Potential deployment of said artifact

Now that the build is done and someone within the organization has chosen to deploy and test it, they may deploy it (or put a dependency on it and trigger an application build – all with no changes) to a given stack for testing. One of our typical artifacts is close to 280 MB zipped. To successfully deploy and test, this artifact is extracted on at least two servers and typically has a 139 MB web content artifact also deployed at the same time (as we have to keep these things in sync). Deployments back up the previous deployment (just a few) prior to extracting the new item.

  • Testing requirement of artifact (with no changes)

Once deployed, now comes the tax on the squishy bits (humans) if you have no auto integration testing/smoke/load testing. And if you do have all those tiers of testing, you’ll be consuming each one along the way. Couldn’t everything be doing something more productive rather than re-testing a deploy that contains no changes?

  • Other

There are other fiddly bits that happen as part of each build like sending emails, various cleanup steps, etc., not to mention the can of worms that is dependency generation (if Build-A happens, Build-B needs to be re-spun to pick up the change in Build-A, ad infinitum, ad nauseam). What if the build artifact has to actually be transferred to another country for deployment? Transferring 280 MB of data while people are trying to sync an as-yet un-proxied Perforce project or retrieve dependencies from Archiva is not a good way to spend your workday.

Everything above consumes hardware resources, time and space that could be reserved for a legitimate build or better testing, from CPU time to memory allocation to the various stopping points and distribution mechanisms. If people go ahead and deploy and try manual testing, now we’re talking about consuming one or more human resources. I cover a single large application above with the quoted sizes, but actually, this maven module generates another 120 MB artifact that is stored in Archiva, which is consumed by another deployable unit that is 338 MB zipped.
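To put a rough number on a single unnecessary build, the sizes quoted in this article can be added up. This is only a back-of-the-envelope sketch in Python: it assumes each stored artifact is written once, that the application zip is copied to each of two servers while the web content is copied once, and it ignores extracted sizes, deployment backups and NetApp restore points, so the real footprint will be larger.

```python
# Sizes in MB, as quoted in the article.
APP_ZIP_MB = 280
WEB_CONTENT_MB = 139
LIBRARY_MB = 120
DOWNSTREAM_ZIP_MB = 338
DEPLOY_SERVERS = 2  # "at least two servers"


def mb_written_per_build(deployed):
    """Rough MB written for one build, under the assumptions above."""
    stored = APP_ZIP_MB + LIBRARY_MB + DOWNSTREAM_ZIP_MB
    if not deployed:
        return stored
    # Assumption: the zip goes to each server, the web content is copied once.
    return stored + DEPLOY_SERVERS * APP_ZIP_MB + WEB_CONTENT_MB


print(mb_written_per_build(deployed=False))
print(mb_written_per_build(deployed=True))
```

Multiply the result by the number of no-change builds fired in a week to see what curiosity clicks cost in storage and network traffic alone.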

This is why it’s best to either limit or prevent people from firing builds manually. I’ve taken the tack that if we find people spinning unnecessary builds, we’ll revoke their privileges within the Hudson matrix ACL settings. I’m not opposed to taking away further permissions forcing more to rely on the polling aspect of Hudson.

(image care of swimparallel)

Continuous Integration in the cloud: good idea?

Continuous Integration can be tricky to provision. It’s IO or CPU bound at the beginning and then it has a tendency to batter your database for a long time while staying almost idle. Slava Imeshev of Viewtier kindly commented on my outsourcing continuous integration post:

My take on this is that hosted CI in a common virtualized environment such as EC2 won’t work. A CI or a build server, unlike the rest of the applications, needs all four components of a build box, CPU, RAM, disk and network I/O. The industry wisdom says applications that are subject of virtualization may demand maximum two. Sure you can run a build in EC2, but you will have to sacrifice build speed, and that’s usually the last thing you want to do. If you want fast builds, you have to run in the opposite direction, towards a dedicated, big fat box hosted locally.

Viewtier has been hosting Continuous Integration for open source projects for five years, and our experiences shows that even builds on a dedicated build box begin to slow down if the number of long-running builds exceeds a double of number of CPUs. Actually, we observe a trend towards farms of build machines hosted locally.
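Slava’s rule of thumb can be written down as a one-line check. The sketch below is an illustration in Python, not anything Viewtier publishes; the function name is made up, and the factor of two is taken directly from his comment.

```python
def likely_to_slow_down(long_running_builds, cpus):
    """True when long-running builds exceed double the number of CPUs."""
    return long_running_builds > 2 * cpus


# A hypothetical dedicated build box with four CPUs.
print(likely_to_slow_down(long_running_builds=8, cpus=4))
print(likely_to_slow_down(long_running_builds=9, cpus=4))
```

It is a heuristic from one hosting provider’s experience, so treat the threshold as a starting point for your own measurements rather than a limit.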

Hmm. He makes an interesting point. Seems that we might need to do more than throw EC2 agents at our favourite Continuous Integration servers. The great appeal of Cloud Continuous Integration is that there’s no limit to the amount of resource that we can buy. There’s an assumption that you’ll be able to make use of that. I’m wondering what patterns will emerge to deal with that. Will we fire many builds that compile and run unit tests (proper unit tests, without database calls) on EC2, and then queue them elsewhere for slow functional test runs?

Ideally we’d get more effective at writing tests; I’d think that parallelizing a test execution via the cloud could allow us to sweep test performance issues under the rug. We’ll just have to wait and see.

Links for 2010-02-22

Here’s Monday’s links from My Twitter Stream.

  • 22:07: Testing is … [Jason Sankey of Zutubi bangs the pulpit]
  • 15:30: Anti-pattern: The release vehicle. – Delivering software
  • 15:28: Continuous Integration: Servers and Tools | Refcardz [ Paul Duvall writes again]
  • 11:35: Coding Horror: The Non-Programming Programmer [people have every incentive to bag a coding job even if they can’t]

Links for 2010-02-19

Here’s Friday’s links from My Twitter Stream.

  • 21:08: Where are the design patterns for software operations? – Blog – dev2ops [Alex should come to my spa talk]
  • 21:03: Deployment management design patterns for DevOps – Blog – dev2ops
  • 12:21: Planet DevOps now up and running – send or tweet me your “devops” category or tag feed #devops (via @kartar)

Links for 2010-02-18

Here’s Thursday’s links from My Twitter Stream.