Building Real Software: February 2013

Thursday, February 21, 2013

Hardening Sprints - in Portuguese

My post on Hardening Sprints has been translated into Portuguese at the Brazilian iMasters site. If your Portuguese is good, you can read the post here.

Wednesday, February 20, 2013

A Bug is a Terrible Thing to Waste

Some development teams, especially Agile teams, don’t bother tracking bugs. Instead of using a bug tracking system, when testers find a bug, they talk to the developer and get it fixed, or they write a failing test that needs to be fixed and add it to the Continuous Integration test suite, or if they have to, they write up a bug story on a card and post it on the wall so the team knows about it and somebody will commit to fixing it.

Other teams live by their bug tracking systems, using tools like Jira or Bugzilla or FogBugz to record bugs as well as changes and other work. There are arguments to be made for both of these approaches.

Arguments for tracking bugs – and for not tracking bugs

In Agile Testing: A Practical Guide for Testers and Agile Teams, Lisa Crispin and Janet Gregory examine the pros and cons of using a defect tracking system.

Using a system to track bugs can be the only effective way to manage problems for teams who can’t meet face-to-face - for example, distributed teams spread across different time zones. It can also be useful for teams who have inherited open bugs in a legacy system; and a necessary evil for teams who are forced to track bugs for compliance reasons. The information in a bug database is a potential knowledge base for testers and developers joining the team – they can review bugs that were found before in the area of the code that they are working on to understand what problems they should look out for. And bug data can be used to collect metrics and create trends on bugs – if you think that bug metrics are useful.

But the Lean/Agile view is that using a defect tracking system mostly gets in the way and slows people down. The team should stay focused on finding bugs and fixing them, and then forget about them. Bugs are waste, and everything about them is waste – dead information, and dead time that is better spent delivering value. Worse, using a defect tracking system prevents testers and developers from talking with each other, and encourages testers to take a “Quality Police” mindset. Without a tool, people have to talk to each other, and have to learn to play nicely together.

This is a short-term, tactical point of view, focused on what is needed to get the software out the door and working. It’s project-thinking, not product thinking.

Bugs over the Long Term

But if you’re working on a system over a long time like we are, if you're managing a product or running a service, you know that it’s not that simple. You can’t just look at what’s in front of you, and where you want to be in a year or more. You also have to look back, at the work that was done before, at problems that happened before, at decisions that were made before, to understand why you are where you are today and where you may be heading in the future.

Because some problems never go away. And other problems will come back unless you do something to stop them. And you’ll find out that other problems which you thought you had licked never really went away. The information from old bugs, what happened and what somebody did to fix them (or why they couldn't fix them), which workarounds worked (and which didn't) can help you understand and deal with the problems that you are seeing today, and help you to keep improving the system and how you build it and keep it running.

Because you should understand the history of changes and fixes to the code if you’re going to change it. If you like the way the code is today, you might want to know how and why it got this way. If you don’t like it, you’ll want to know how and why it got this way – it’s arrogant to assume that you won’t make the same mistakes or be forced into the same kinds of situations. Revision control will tell you what was changed and when and who did it; the bug tracking system will tell you why.

Because you need to know where you have instability and risk in the system. You need to identify defect-dense code or error-prone code – code that contains too many bugs, code that is costing you too much to maintain and causing too many problems, code that is too expensive to keep running the way that it is today. Code that you should rewrite ASAP to improve stability and reduce your ongoing costs. But you can’t identify this code without knowing the history of problems in the system.
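As a rough illustration of how bug history can point to defect-dense code, here is a minimal sketch. It assumes you can export, for each fixed bug, the module that the fix touched, along with each module's size. The bug IDs, module names and sizes are all made up, and a real tracker export would need its own parsing.

```python
from collections import Counter

# Hypothetical export from a bug tracker: (bug id, module the fix touched).
bug_fixes = [
    ("BUG-101", "billing/invoice.py"),
    ("BUG-102", "billing/invoice.py"),
    ("BUG-103", "auth/session.py"),
    ("BUG-104", "billing/invoice.py"),
    ("BUG-105", "reports/export.py"),
    ("BUG-106", "auth/session.py"),
]

# Hypothetical module sizes in lines of code.
module_sizes = {
    "billing/invoice.py": 300,
    "auth/session.py": 1000,
    "reports/export.py": 1000,
}


def defect_density(fixes, sizes):
    """Return (module, bugs per 1000 lines) pairs, most defect-dense first."""
    counts = Counter(module for _, module in fixes)
    density = {module: counts[module] * 1000 / sizes[module] for module in counts}
    return sorted(density.items(), key=lambda item: item[1], reverse=True)


ranking = defect_density(bug_fixes, module_sizes)
for module, per_kloc in ranking:
    print(f"{module}: {per_kloc:.1f} bugs per KLOC")
```

A ranking like this is only a starting point for a conversation about what to rewrite. It says nothing about severity or about how often each module changes.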

Because you may need to prove to auditors or regulators and customers and investors that you are doing a responsible job of testing and finding bugs and fixing them and getting the fixes out.

And because you want to know how effective the team is in finding, fixing and preventing bugs. Are you seeing fewer bugs today? Or more bugs? Are you seeing the same kinds of bugs – are you making the same mistakes? Or different mistakes?

Do you need to track every Bug?

As long as bugs are found early enough, there’s little value in tracking them. It’s when bugs escape that they need to be tracked: bugs that the developer didn't find right away on their own, or in pairing, or through the standard automated checks and tests that are run in Continuous Integration.

We don’t log:

defects found in unit tests or other automated tests – unless for some reason the problem can’t or won’t be fixed right away;
problems found in peer reviews – unless something in the review is considered significant and can’t get addressed immediately. Or a problem is found in a late review, after testing has already started, and the code will need to be retested. Or the reviewer finds something wrong in code that wasn't changed, an old bug – it’s still a problem that needs to be looked at, but we may not be prepared to deal with it right now. All problems found in external reviews, like a security review or an audit, are logged;
static analysis findings – most of the problems caught by these tools are simple coding mistakes that can be seen and fixed right away, and there’s also usually a fair amount of noise (false positives) that has to be filtered out. We run static analysis checks and review them daily, and only log findings if we agree that the finding is real but the developer isn't prepared to fix it immediately (which almost never happens, unless we’re running a new tool against an existing code base for the first time). Many static analysis tools have their own systems for tracking static analysis findings anyway, so we can always go back and review outstanding issues later;
bugs found when developers and testers decide to pair together to test changes early in development, when they are mostly exploring how something should work – we don’t usually log these bugs unless they can’t be / won’t be fixed (can’t be reproduced later for example).
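The rules above can be sketched as a small decision function. This is a simplification, not our actual process: the source names and the fields on Finding are hypothetical, and it leaves out judgment calls such as late reviews or old bugs found in unchanged code.

```python
from dataclasses import dataclass

# Hypothetical labels for places where problems are caught early.
EARLY_SOURCES = {
    "unit_test",
    "automated_test",
    "peer_review",
    "static_analysis",
    "pair_testing",
}


@dataclass
class Finding:
    source: str                     # where the problem was found
    fixed_right_away: bool          # can and will it be fixed immediately?
    external_review: bool = False   # security review, audit, and so on


def should_log(finding: Finding) -> bool:
    """Decide whether a finding belongs in the bug tracking system."""
    if finding.external_review:
        return True  # everything from external reviews is logged
    if finding.source in EARLY_SOURCES:
        # Caught early: only worth logging if it is not fixed on the spot.
        return not finding.fixed_right_away
    return True  # escaped bugs (testing, UAT, production) are always logged
```

For example, should_log(Finding("unit_test", fixed_right_away=True)) is False, while a problem found in production is logged even if it is fixed immediately.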

A Bug is a Terrible Thing to Waste

We log all other bugs regardless of whether they are found in production, in internal testing, partner testing, User Acceptance Testing, or external testing (such as a pen test). Because most of the time, when software is handed to a tester, it’s supposed to be working. If the tester finds bugs, especially serious ones, then this is important information to the tester, to the developer, and to the rest of the team. It can highlight risks. It can show where more testing and reviews need to be done. It can highlight deeper problems in the design, a lack of understanding that could cause other problems.

If you believe that testing provides important information not just about the state of your software, but also on how you are designing and building it – then everyone needs to be able to see this information, and understand it over time. Some problems can’t be seen or fully understood right away, or in 1-week or 2-week sprint-sized chunks. It can take a while before you recognize that you have a serious weakness in the design or that something is broken in your approach to development or in your culture. You’ll need to experience a few problems before you start to find relationships between them and before you can look for their root cause. You’ll need data from the past in order to solve problems in the future.

Tracking bugs isn't a waste if you learn from bugs. Throwing the information on bugs away is the real waste.

Wednesday, February 13, 2013

Releasing more often drives better Dev and better Ops

One of the most important decisions that we made as a company was to release less software, more often. After we went live, we tried to deliver updates quarterly, because until then we had followed a staged delivery lifecycle to build the system, with analysis and architecture upfront, and design and development and testing done in 3-month phases.

But this approach didn't work once the system was running. Priorities kept changing as we got more feedback from more customers, too many things needed to be fixed or tuned right away, and we had to deal with urgent operational issues. We kept interrupting development to deploy interim releases and patches and then re-plan and re-plan again, wasting everyone’s time and making it harder to keep track of what we needed to do. Developers and ops were busy getting customers on and firefighting, which meant we couldn't get changes out when we needed to. So we decided to shorten the release cycle down from 3 months to 1 month, and then shorten it down again to 3 weeks and then 2 weeks, making the releases smaller and more focused and easier to manage.

Smaller, more frequent releases change how Development is done

Delivering less but more often, whether you are doing it to reduce time-to-market and get fast feedback in a startup, or to contain risk and manage change in an enterprise, forces you to reconsider how you develop software. It changes how you plan and estimate and how you think about risks and how you manage risks. It changes how you do design, and how much design you need to do. It changes how you test. It changes what tools people need, and how much they need to rely on tools.

It changes your priorities. It changes the way that people work together and how they work with the customer, creating more opportunities and more reasons to talk to each other and learn from each other. It changes the way that people think and act – because they have to think and act differently in order to keep up and still do a good job.

Smaller, more frequent releases change how Development and Ops work together

Changing how often you release and deploy will also change how operations works and how developers and operations work together. There’s not enough time for heavyweight release management and change control with lots of meetings and paperwork. You need an approach that is easier and cheaper. But changing things more often also means more chances to make mistakes. So you need an approach that will reduce risk and catch problems early.

Development teams that release software once a year or so won’t spend a lot of time thinking about release and deployment and operations stuff in general because they don’t have to. But if they’re deploying every couple of weeks, if they’re constantly having to push software out, then it makes sense for them to take the time to understand what production actually looks like and make deployment – and roll-back – easier on them and easier on ops.

You don’t have to automate everything to start – and you probably shouldn't until you understand the problems well enough. We started with checklists and scripting and manual configuration and manual system tests. We put everything under source control (not just code), and then started standardizing and automating deployment and configuration and roll-back steps, replacing manual work and checklists with automated audited commands and health checks. We've moved away from manual server setup and patching to managing infrastructure with Puppet. We’re still aligning test and production so that we can test more deployment steps more often with fewer production-specific changes. We still don’t have a one-button deploy and maybe never will, but release and deployment today is simpler and more standardized and safer and much less expensive.
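To make the idea of audited deployment steps with health checks and roll-back a little more concrete, here is a minimal sketch. It is not our tooling: the deploy function, the step tuples and the in-memory state dictionary are all hypothetical stand-ins for real install and configuration commands.

```python
from typing import Callable, List, Tuple

# Each step is (name, apply, undo). All three are illustrative placeholders.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def deploy(steps: List[Step], health_check: Callable[[], bool],
           audit_log: List[str]) -> bool:
    """Apply steps in order; on any failure, undo completed steps in reverse."""
    done = []
    try:
        for name, apply, undo in steps:
            audit_log.append(f"apply {name}")
            apply()
            done.append((name, undo))  # only fully applied steps get undone
        if not health_check():
            raise RuntimeError("health check failed")
    except Exception as exc:
        audit_log.append(f"failed: {exc}")
        for name, undo in reversed(done):
            audit_log.append(f"undo {name}")
            undo()
        return False
    audit_log.append("deploy ok")
    return True


# Pretend system state; a real deployment would touch servers instead.
state = {"version": "1.0", "config": "old"}
steps = [
    ("install build",
     lambda: state.update(version="1.1"),
     lambda: state.update(version="1.0")),
    ("update config",
     lambda: state.update(config="new"),
     lambda: state.update(config="old")),
]
log: List[str] = []

# This health check is rigged to fail so that the roll-back path runs.
ok = deploy(steps, health_check=lambda: state["config"] == "old", audit_log=log)
```

Because every step records what it did and knows how to undo itself, the audit trail and the roll-back come from the same place as the deployment, which is what replaces the checklist.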

Deployment is just the start

Improving deployment is just the start of a dialogue that can extend to the rest of operations. Because they’re working together more often, developers and ops will learn more about each other and start to understand each other’s languages and priorities and problems.

To get this started, we encouraged people to read Visible Ops and sent ops and testers and some of the developers and even managers on ITIL Foundation training so that we all understood the differences between incident management and problem resolution, and how to do root cause analysis (RCA), and the importance of proper change management – it was probably overkill but it made us all think about operations and take it seriously. We get developers and testers and operations staff together to plan and review releases, to support production, and to take part in RCA whenever we have a serious problem, and we work together to figure out why things went wrong and what we can do to prevent them from happening again. Developers and ops pair up to investigate and solve operational problems and to improve how we design and roll out new architecture, how we secure our systems, and how we set up and manage development and test environments.

It sounds easy. It wasn't. It took a while, and there were disagreements and problems and backsliding, like any time you fundamentally change the way that people work. But if you do this right, people will start to create connections and build relationships and eventually trust and transparency across groups – which is what Devops is really about.

You don’t have to change your organization structure or overhaul the culture – in most cases, you won’t have this kind of mandate anyways. You don’t have to buy into Continuous Deployment or even Continuous Delivery, or infrastructure as code, or use Chef or Puppet or any other Devops tools – although tools do help.

Once you start moving faster, from deploying once a year or every few months to once a month, and as your organization’s pace accelerates, people will change the way that they work because they have to.

Today the way that we work, and the way that we think about development and operations, is much different and definitely healthier. We can respond to business changes and to problems faster, and at the same time our reliability record has improved dramatically. We didn't set out to “be Agile” – it wasn't until we were on our way to shorter release cycles that we looked more closely at Scrum and XP and later Kanban to see how these methods could help us develop software. And we weren't trying to “do Devops” either – we were already down the path to better dev and ops collaboration before people started talking about these ideas at Velocity and wherever else. All we did was agree as a company to change how often we pushed software into production. And that has made all the difference.

Wednesday, February 6, 2013

Code and Code Reviews: What’s in a Name?

In a code review a developer needs to look at the code from two different perspectives:

Correctness. Is the code logically correct, does it do what it is supposed to do? Will it hold up in the real world? Is it safe? Does it handle errors and exceptions? Does it check for bad input parameters and return values? Is it secure? And where performance is important, is it efficient?
Maintainability. Can I understand this code well enough to maintain it, could I change it safely myself? Is it readable and consistent? Is the logic too complex, are the pieces too big? Is it unit tested and if not can it be unit tested? This is where a reviewer looks out for too much copying and pasting, whether hand-rolled code could be done using standard libraries or language features instead, adherence to style guidelines and standards.

It’s obvious that using good names for classes and methods and variables is important in making code understandable (if you can’t understand it, you can’t tell if it is doing what it is supposed to do) and easier to maintain. Books like Clean Code and Code Complete have entire chapters on proper naming. But even good, experienced developers have a hard time coming up with the right abstractions and good, meaningful, intention-revealing names for them. It’s especially hard if they’re working on code that they don’t know that well.

Bad names cause reviewers to stumble, or to make the wrong assumptions about what the code is doing. There are lame names – lazy, ambiguous, generic names that don’t help the reader understand what’s happening in the code. Then there are names which are misleading or just plain wrong – names which used to be right, but aren't now because the logic changed but the name didn't. You’re reading through code, it calls postPayment, but postPayment doesn't just post a payment, it does a lot more now – or worse, it doesn't post a payment at all any more.
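Here is a small, made-up illustration of that kind of drift, written in Python (so post_payment stands in for postPayment). The ledger, the outbox and all of the function names are hypothetical. The first function has picked up a side effect that its name hides, and the functions after it show one way a reviewer might ask for the names to be made honest again.

```python
ledger = []   # posted payments as (account_id, amount_cents)
outbox = []   # customer notifications queued for sending


# Before: the name promises one thing, but the body now does more.
def post_payment(account_id, amount_cents):
    ledger.append((account_id, amount_cents))
    outbox.append(f"receipt for {account_id}")  # side effect the name hides


# After: each name says what the function actually does.
def record_payment(account_id, amount_cents):
    """Add the payment to the ledger and do nothing else."""
    ledger.append((account_id, amount_cents))


def queue_receipt(account_id):
    """Queue a receipt notification for the account."""
    outbox.append(f"receipt for {account_id}")


def record_payment_and_queue_receipt(account_id, amount_cents):
    """Do both steps, and say so in the name."""
    record_payment(account_id, amount_cents)
    queue_receipt(account_id)
```

A caller reading record_payment can now trust that nothing is sent to the customer, which was not true of the drifted post_payment.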

Focusing on naming becomes more important over time as more code gets changed more often. The design changes, responsibilities change, often incrementally and subtly. Code gets added, then later other code gets deleted or moved. The programmer making the latest change just wants to focus on what they’re doing, fix the code and get out, and doesn't look at the bigger picture, doesn't notice that the meaning of the code has changed or become less clear because of what they have just done.

Reviewers need to be on the watch for these kinds of problems.

But naming isn't just about making it easier for somebody else to read the code and maintain it – or even making it easier for you to read it and change it yourself when you come back to it later. As a colleague of mine has pointed out, if someone can’t come up with a good name for a method or class, it’s a sign that they are having problems understanding what they are working on. So in reviewing code, a bad name is more than just a distraction or an irritation – it’s a warning sign that there could be more problems in the code, that the developer might not have understood the design well enough to make the change correctly, or that they weren't paying close enough attention to what they were working on, and they may have made other mistakes that you – the reviewer – need to be on the lookout for.

Focusing on naming sometimes seems fussy and nitpicky, another battle in the endless “style wars” which end in hand-to-hand fighting over bracket placement and indentation. But good naming is much more important than aesthetics or standardization. It’s about making sure that code is working correctly, and that it will stay that way.