Thursday, April 24, 2014

Driving Devops

There is a lot of talk in the devops community about the importance of sharing principles and values, and about silo busting: breaking down the “wall of confusion” between developers and operations to create agile, cross-functional teams. Radical improvement through fundamental organizational changes and building an entirely new culture.

But it doesn’t have to be that hard. All it took us was 3 simple, but important, steps.

Reliability First

When we first launched our online platform, things were pretty crazy. Sales was busy with customer feedback and onboarding more customers. Development was still finishing the backlog of features that were supposed to already be done, responding to changes from sales and partners, and helping to support the system. Ops was trying to stabilize everything, help onboard more customers and address performance issues as more customers came on. We were all rushing forwards, but not always in the same direction.

Our CEO recognized this and made an important decision. He made reliability the #1 priority – the reliability and integrity of our systems and of our customers’ data, and the services we provided. For everyone: not just ops, but development, sales, marketing, compliance, admin. Above everything else. It was more important not to mess up the customers that we had than to get new customers or hit deadlines or cut costs.

Reliability, resilience, and integrity have remained our #1 driver for the company over several years as we continued to grow.

This meant that everyone was working towards the same goals – and the goals were easy to understand and measure: improving MTTF, MTTD, and MTTR (mean time to failure, mean time to detect, and mean time to recover); reducing bug counts and variability in response times; and improving the results of audits and pen tests.
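
Measuring these doesn't need anything fancy to start with – a few incident timestamps are enough. Here's a minimal sketch (the incident records and field names are made up for illustration):

    from datetime import datetime
    from statistics import mean

    # Illustrative incident records - the fields and timestamps are made up.
    incidents = [
        {"started": "2014-03-01 09:00", "detected": "2014-03-01 09:20", "resolved": "2014-03-01 11:00"},
        {"started": "2014-03-10 14:00", "detected": "2014-03-10 14:05", "resolved": "2014-03-10 15:30"},
    ]

    def minutes_between(earlier, later):
        fmt = "%Y-%m-%d %H:%M"
        return (datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)).total_seconds() / 60

    # MTTD: how long problems go unnoticed; MTTR: how long they take to resolve once detected.
    mttd = mean(minutes_between(i["started"], i["detected"]) for i in incidents)
    mttr = mean(minutes_between(i["detected"], i["resolved"]) for i in incidents)
    print(f"MTTD: {mttd:.0f} minutes, MTTR: {mttr:.0f} minutes")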

It gave people more reasons to work together at more levels.

It reduced politics and conflicts to a minimum.

Development’s first priority changed from pushing features out ASAP to making sure that the system was running optimally and that any changes wouldn't negatively impact customers. This meant more time spent with ops on understanding the run-time, more time troubleshooting operational issues, more reviews and regression testing and stress testing, anticipating compatibility issues, planning for roll-back and roll-forward recovery.

Smaller, more frequent releases

Spending some more time on testing and reviews and working with ops meant that it took longer to complete a feature. But we still had to keep up with customer demands – we still had to deliver.

We did this by shortening the software release cycle, from 2-3 months to 2-3 weeks or sometimes shorter. Delivering less in each release, sometimes only 1 new feature or some fixes, but delivering much more often. If a change or feature had to be delayed, if developers or testers needed more time to make sure that it was ready, it wasn't a big deal – if something wasn't ready this week, it would be ready next week or soon after, still fast enough for the business and for customers.

Planning and working in shorter horizons meant that development could respond faster to changes in direction and changing priorities, so developers were always focused on delivering what was most important at the time.

Shorter releases drove development to be more agile, to think and work faster. To automate more testing. To pay more attention to deployment, make the steps simpler and more efficient – and safer.

Fewer changes batched together made it easier to review and test. Fewer chances to make mistakes. Easier to understand what went wrong when we did make a mistake, and less time to fix it.

RCA – Learn from Mistakes

We still made mistakes, shit still happened. When something went seriously wrong, it was my job to explain it to our customers and owners. What went wrong, why, and what we were going to do to make sure that it didn’t happen again.

We didn't know about blameless post mortems, but this is the way we did it anyway. We got developers and testers and ops and managers together in Root Cause Analysis sessions to carefully examine what happened, what went wrong, understand why, and fix it.

We made sure that people focused on the facts and on problem solving: what happened, what happened next, what did we see, what didn’t we see, why? What could we do to fix it or to prevent it from happening again or to recognize and respond to problems like this more effectively in the future? Better training, better tools, better procedures, better documentation, better error handling, better testing and reviews, better configuration checks and run-time checking, better information and better ways of communicating it.

Focusing on details and problems, not people. Proving that it was ok to make mistakes, but not ok to hide them. We got much better: at operations, testing, design, deployment, monitoring, incident handling. And better as an organization. We built transparency and trust within and across teams. We learned how to move forward from failure, and to be more resilient and confident in our ability to deal with serious problems.

Delivering Better and Faster Together

We didn't restructure or change who we were as an organization. Dev and ops still work in separate organizations for different managers in different countries. They have their own projects and their own ways of working, and they don’t always speak the same language or agree on everything.

We have lots of checks and balances and handoffs and paperwork between dev and ops to make sure that things are done properly and to make the regulators happy. There are still more steps that we could automate or simplify, more we can do to build out our Continuous Delivery pipelines, more things we can get out of Puppet and Vagrant and other cool tools.

But if devops is about developers and operations sharing responsibility for the system, trusting each other and helping each other to make sure that the system is always working correctly and optimally, looking for better solutions together, delivering better and faster – then we’ve been doing devops for a while now.

Monday, April 14, 2014

Agile - What’s a Manager to Do?

As a manager, when I first started learning about Agile development, I was confused by the fuzzy way that Agile teams and projects are managed (or manage themselves), and frustrated and disappointed by the negative attitude towards managers and management in general.

Attempts to reconcile project management and Agile haven't answered these concerns. The PMI-ACP does a good job of making sure that you understand Agile principles and methods (mostly Scrum and XP with some Kanban and Lean), but is surprisingly vague about what an Agile project manager is or does. Even a book like the Software Project Manager’s Bridge to Agility, intended to help bridge PMI's project management practices and Agile, fails to come up with a meaningful job for managers or project managers in an Agile world.

In Scrum (which is what most people mean when they say Agile today), there is no place for project managers at all: responsibilities for management are spread across the Product Owner, the Scrum Master and the development team.

We have found that the role of the project manager is counterproductive in complex, creative work. The project manager’s thinking, as represented by the project plan, constrains the creativity and intelligence of everyone else on the project to that of the plan, rather than engaging everyone’s intelligence to best solve the problems.
In Scrum, we have removed the project manager. The Product Owner, or customer, provides just-in-time planning by telling the development team what is needed, as often as every month. The development team manages itself, turning as much of what the product owner wants into usable product as possible. The result is high productivity, creativity, and engaged customers.

We have replaced the project manager with the Scrum Master, who manages the process and helps the project and organization transition to agile practices.

Ken Schwaber, Agility and PMI, 2011

Project Managers have the choice of becoming a Scrum Master (if they can accept a servant leader role and learn to be an effective Agile coach – and if the team will accept them) or a Product Owner (if they have deep enough domain knowledge and other skills), or find another job somewhere else.

Project Manager as Product Owner

The Product Owner is a command-and-control position responsible for the “what” part of a development project. It's a big job. The Product Owner owns the definition of what needs to be built, decides what gets done and in what order, approves changes to scope and makes scope / schedule / cost trade-offs, and decides when work is done. The Product Owner manages and represents the business stakeholders, and makes sure that business needs are met. The Product Owner replaces the project manager as the person most responsible for the success of the project (“the one throat to choke”).

But they don’t control the team’s work, the technical details of who does the work or how. That’s decided by the team.

Some project managers may have the domain knowledge and business experience, the analytical skills and the connections in the customer organization to meet the requirements of this role. But it’s also likely to be played by an absentee business manager or sponsor, backed up by a customer proxy, a business analyst or someone else on the team without real responsibility or authority in the organization, creating potentially serious project risks and management problems. Some organizations have tried to solve this by sharing the role across two people: a project manager and a business analyst, working together to handle all of the Product Owner’s responsibilities.

Project Manager as Scrum Master

It seems like the most natural path for a project manager is to become the team’s Scrum Master, although there is a lot of disagreement over whether a project manager can be effective – and accepted – as a Scrum Master, whether they will accept the changes in responsibilities and authority, and be willing to change how they work with the team and the rest of the organization.

The Scrum Master is a “process owner” and coach, not a project manager. They help the team – and the Product Owner – understand how to work in an Agile process framework, what their roles and responsibilities are, set up and guide the meetings and reviews, and coach team members through change and conflict.

The Scrum Master works as a servant leader, a (nice) process cop, a secretary and a gofer. Somebody who supports the team and the Product Owner, “carries food and water” for them, tries to protect them from the world outside of the project and helps them solve problems. But the Scrum Master has no direct authority over the project or the team and does not make decisions for them, because Agile teams are supposed to be self-directing, self-organizing and self-managing.

Of course that’s not how things start off. Any group of people must work their way through Tuckman’s 4 stages of team development: Forming-Storming-Norming-Performing. It’s only when they reach the last stage that a group can effectively manage themselves. In the meantime, somebody (the Scrum Master / Coach) has to help the team make decisions that they aren’t ready to make on their own. It can take a long time for a team to reach this point, for people to learn to trust each other – and the organization – enough. And it may not last long, before something outside of the team’s control sets them back: a key person leaving or joining the team, a change in leadership, a shock to the project like a major change in direction or cuts to the budget. Then they need to be led back to a high performing state again.

Coaching the team and helping them out can be a full-time job in the beginning. After the team has got together and learned the process? Not so much. Which is why the Scrum Master is sometimes played part-time by a developer or sometimes even rotated between people on the development team.

But even when the team is performing at a high level, there’s more to managing an Agile project than setting up meetings, buying pizza and trying to stay out of the way. I've come to understand that Agile doesn't make a manager’s job go away. If anything, it expands it.

Managing Upfront

First, there’s all of the work that has to be done upfront at the start of a project – before Iteration Zero. Identifying stakeholders. Securing the charter. Negotiating the project budget and contract terms. Understanding and navigating the organization’s bureaucracy. Figuring out governance and compliance requirements and constraints, what the PMO needs. Working with HR, line managers and functional managers to put the team together, finding and hiring good people, getting space for them to work in and the tools that they need to work with. Lining up partners and suppliers and contractors. Contracting and licensing and other legal stuff.

The Product Owner might do some of this work - but they can't do it all.

Managing Up and Out

Then there’s the work that needs to be managed outside of the team.

Agile development is insular, insulated and inward-looking. The team is protected from the world outside so they can focus on building features together. But the world outside is too important to ignore. Every development project involves more than designing and building software – often much more than the work of development itself. Every project, even a small project, has dependencies and hand-offs that need to be coordinated with other teams in other places, with other projects, with specialists outside of the team, with customers and partners and suppliers. There is forward planning that needs to be done, setting and tracking drop-dead dates, defining and maintaining interfaces and integration points and landing zones.

Agile teams move and respond to change quickly. These changes can have impacts outside of the team, on the customer, other teams and other projects, other parts of the organization, suppliers and partners. You can try using a Scrum of Scrums to coordinate with other Agile teams up to a point, but somebody still has to keep track of dependencies and changes and delays and orchestrate the hand-offs.

Depending on the contracting model and your compliance or governance environment, formal change control may not go away either, at least not for material changes. Even if the Product Owner and the team are happy, somebody still has to take care of the paperwork to stay onside of regulatory traceability requirements and to stay within contract terms.

There are a lot of people who need to know what’s going on in a project outside of the development team – especially in big projects in big organizations. Communicating outwards, to people outside of the team and outside of the company. Communicating upwards to management and sponsors, keeping them informed and keeping them onside. Task boards and burn downs and big visible charts on the wall might work fine for the team, but upper management and the PMO and other stakeholders need a lot more: they need to understand development status in the overall context of the project or program or business change initiative.

And there’s cost management and procurement. Forecasting and tracking and managing costs, especially costs outside of development labor costs. Contracts and licensing need to be taken care of. Stuff needs to be bought. Bills need to be paid.

Managing Risks

Scrum done right (with XP engineering practices carefully sewn in) can be effective in containing many common software development risks: scope, schedule, requirements specification, technical risks. But there are other risks that still need to be managed, risks that come from outside of the team: program risks, political risks, partner risks and other logistical risks, integration risks, data quality risks, operational risks, security risks, financial risks, legal risks, strategic risks.

Scrum purposefully has many gaps, holes, and bare spots where you are required to use best practices – such as risk management.
Ken Schwaber

While the team and the Product Owner and Scrum Master are focused on prioritizing and delivering features and resolving technical issues, somebody has to look further out for risks, bring them up to the team, and manage the risks that aren't under the team’s control.

Managing the End Game

And just like at the start of a project, when the project nears the end game, somebody needs to take care of final approvals and contractual acceptance, coordinate integration with other systems and with customers and partners, data setup and cleansing and conversion, documentation and training. Setting up the operations infrastructure, the facilities and hardware and connectivity, the people and processes and tools needed to run the system. Setting up a support capability. Packaging and deployment, roll out planning and roll back planning, the hand-off to the customer or to ops, community building and marketing and whatever else is required for a successful launch. Never mind helping to make whatever changes to business workflows and business processes may be required by the new system.

Project Management doesn't go away in Agile

There are lots of management problems that need to be taken care of in any project. Agile spreads some management responsibilities around and down to the team, but doesn’t make management problems go away. Projects can’t scale, teams can’t succeed, unless somebody – a project manager or the PMO or someone else with the authority and skills required – takes care of them.

Thursday, March 27, 2014

Secure DevOps - Seems Simple

The DevOps security story is deceptively simple. It’s based on a few fundamental, straightforward ideas and practices:

Smaller Releases are Safer

One of these ideas is that smaller, incremental and more frequent releases are safer and cause fewer problems than big bang changes. Makes sense.

Smaller releases contain fewer code changes. Less code means less complexity and fewer bugs. And less risk, because smaller releases are easier to understand, easier to plan for, easier to test, easier to review, and easier to roll back if something goes wrong.

And easier to catch security risks by watching out for changes to high risk areas of code: code that handles sensitive data, or security features or other important plumbing, new APIs, error handling. At Etsy for example, they identify this code in reviews or pen testing or whatever, hash it, and automatically alert the security team when it gets changed, so that they can make sure that the changes are safe.
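
A sketch of what a check like this could look like: hash a list of security-sensitive files and raise a flag when any of them change. The file list, baseline location and alerting here are placeholders for illustration, not Etsy's actual implementation:

    import hashlib
    import json
    from pathlib import Path

    # Security-sensitive files to watch - placeholder paths for the example.
    HIGH_RISK_FILES = ["src/auth/login.py", "src/payments/transfer.py"]
    BASELINE = Path("security-hashes.json")

    def sha256(path):
        return hashlib.sha256(Path(path).read_bytes()).hexdigest()

    def check():
        baseline = json.loads(BASELINE.read_text()) if BASELINE.exists() else {}
        changed = [f for f in HIGH_RISK_FILES if baseline.get(f) != sha256(f)]
        if changed:
            # In practice this would notify the security team (email, chat, ticket).
            print("Security-sensitive files changed, review needed:", changed)
        # Record the current hashes as the new baseline.
        BASELINE.write_text(json.dumps({f: sha256(f) for f in HIGH_RISK_FILES}, indent=2))

    if __name__ == "__main__":
        check()

Run as part of the build or a nightly job, this turns "somebody should look at changes to that code" into an automatic, repeatable step.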

Changing the code more frequently may also make it harder for the bad guys to understand what you are doing and find vulnerabilities in your system – taking advantage of a temporary “Honeymoon Effect” between the time you change the system and the time that the bad guys figure out how to exploit weaknesses in it.

And changing more often forces you to simplify and automate application deployment, to make it repeatable, reliable, simpler, faster, easier to audit. This is good for change control: you can put more trust in your ability to deploy safely and consistently, you can trace what changes were made, who made them, and when.

And you can deploy application patches quickly if you find a problem.

“...being able to deploy quick is our #1 security feature”
Effective Approaches to Web Application Security, Zane Lackey

Standardized Ops Environment through Infrastructure as Code

DevOps treats “Infrastructure as Code”: infrastructure configurations are defined in code that is written and managed in the same way as application code, and deployed using automated tools like Puppet or Chef instead of by hand. Which means that you always know how your infrastructure is set up and that it is set up consistently (no more Configuration Drift). You can prove what changes were made, who made them, and when.

You can deploy infrastructure changes and patches quickly if you find a problem.

You can test your configuration changes in advance, using the same kinds of automated unit test and integration test suites that Agile developers rely on – including tests for security.
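
For example, a configuration test can be as simple as this sketch – the config file name and settings are hypothetical; the point is that security checks run in the same automated test suite as everything else:

    import configparser
    import unittest

    # Hypothetical checks against a web server config file before it is deployed.
    class WebServerConfigTest(unittest.TestCase):
        def setUp(self):
            self.cfg = configparser.ConfigParser()
            self.cfg.read("webserver.ini")  # placeholder config file

        def test_tls_only(self):
            # Only modern TLS should be allowed.
            self.assertEqual(self.cfg["server"]["min_tls_version"], "1.2")

        def test_debug_disabled(self):
            # Debug mode must never ship to production.
            self.assertFalse(self.cfg["server"].getboolean("debug"))

        def test_admin_port_not_public(self):
            # Admin interface should only listen locally.
            self.assertEqual(self.cfg["admin"]["bind_address"], "127.0.0.1")

    if __name__ == "__main__":
        unittest.main()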

And you can easily set up test environments that match (or come closer to matching) production, which means you can do a more thorough and accurate job of all of your testing.

Automated Continuous Security Testing

DevOps builds on Agile development practices like automated unit/integration testing in Continuous Integration, to include higher level automated system testing in Continuous Delivery/Continuous Deployment.

You can do automated security testing using something like Gauntlt to “be mean to your code” by running canned attacks on the system in a controlled way.

Other ways of injecting security into Devops include:

  1. Providing developers with immediate feedback on security issues through self-service static analysis: running Static Analysis scans on every check-in, or directly in their IDEs as they are writing code.
  2. Helping developers to write automated security unit tests and integration tests and adding them to the Continuous testing pipelines.
  3. Automating checks on Open Source and other third party software dependencies as part of the build or Continuous Integration, using something like OWASP’s Dependency Check to highlight dependencies that have known vulnerabilities.

Fast feedback loops using automated testing mean that you can catch more security problems – and fix them – earlier.
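
As a rough illustration of what a dependency check does (real tools like OWASP’s Dependency Check work off vulnerability databases – the vulnerable-versions list here is made up):

    import re

    # Made-up list standing in for a real vulnerability database.
    KNOWN_VULNERABLE = {
        ("examplelib", "1.2.0"): "CVE-XXXX-YYYY (illustrative)",
    }

    def check_requirements(path="requirements.txt"):
        """Flag pinned dependencies that appear on the known-vulnerable list."""
        findings = []
        for line in open(path):
            m = re.match(r"^\s*([A-Za-z0-9_.-]+)==([0-9][^\s#]*)", line)
            if m and (m.group(1).lower(), m.group(2)) in KNOWN_VULNERABLE:
                findings.append((m.group(1), m.group(2), KNOWN_VULNERABLE[(m.group(1).lower(), m.group(2))]))
        return findings

    if __name__ == "__main__":
        for name, version, advisory in check_requirements():
            print(f"Vulnerable dependency: {name}=={version} - {advisory}")

Wired into the build or Continuous Integration, a check like this fails fast when somebody pulls in (or keeps) a library with known problems.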

Operations Checks and Feedback

DevOps extends the idea of feedback loops to developers from testing all the way into production, allowing (and encouraging) developers visibility into production metrics and getting developers and ops and security to all monitor the system for anomalies in order to catch performance problems and reliability problems and security problems.

Adding automated asserts and health checks to deployment (and before start/restart) in production to make sure key operational dependencies are met, including security checks: that the configurations are correct, ports that should be closed are closed, ports that should be open are open, permissions are correct, SSL is set up properly…

Or even killing system processes that don’t conform (or sometimes just to make sure that they failover properly, like they do at Netflix).
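
A minimal sketch of what post-deployment asserts like these could look like – the host name and port lists are placeholders:

    import socket
    import ssl
    from datetime import datetime

    HOST = "app.example.com"      # placeholder host
    MUST_BE_OPEN = [443]          # ports the service needs
    MUST_BE_CLOSED = [23, 8080]   # ports that should never be reachable

    def port_open(host, port, timeout=3):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def cert_days_left(host, port=443):
        # Connect, verify the certificate, and report how long it is still valid.
        ctx = ssl.create_default_context()
        with ctx.wrap_socket(socket.create_connection((host, port)), server_hostname=host) as s:
            not_after = s.getpeercert()["notAfter"]
        expires = datetime.utcfromtimestamp(ssl.cert_time_to_seconds(not_after))
        return (expires - datetime.utcnow()).days

    for port in MUST_BE_OPEN:
        assert port_open(HOST, port), f"expected port {port} to be open"
    for port in MUST_BE_CLOSED:
        assert not port_open(HOST, port), f"expected port {port} to be closed"
    assert cert_days_left(HOST) > 30, "TLS certificate expires within 30 days"
    print("post-deploy checks passed")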

People talking to each other and working together to solve problems

And finally DevOps is about people talking together and solving problems together. Not just developers talking to the business/customers. Developers talking to ops, ops talking to developers, and everybody talking to security. Sharing ideas, sharing tools and practices. Bringing ops and security into the loop early. Dev and ops and security working together on planning and on incident response and learning together in Root Cause Analysis sessions and other reviews. Building teams across silos. Building trust.

Making SecDevOps Work

There are good reasons to be excited by what these people are doing and the path that they are going down. It promises a new, more effective way for developers and security and ops to work together.

But there are some caveats.

Secure DevOps requires strong engineering disciplines and skills. DevOps engineering skills are still in short supply. And so are information security (and especially appsec) skills. People who are good at both DevOps and appsec are a small subset of these small subsets of the talent available.

Outside of configuration management and monitoring, the tooling is limited – you’ll probably have to write a lot of what you need yourself (which leads quickly back to the skills problem).

A lot more work needs to be done to make this apply to regulated environments, with enforced separation of duties and where regulators think of Agile as “the A Word” (so you can imagine what they think of developers pushing out changes to production in Continuous Deployment, even if they are using automated tools to do it). A small number of people are exploring these problems in a Google discussion group on DevOps for managers and auditors in regulated industries, but so far there are more people asking questions than offering answers.

And getting dev and ops and security working together and collaborating across development, ops and security might take an extreme makeover of your organization’s structure and culture.

Secure DevOps practices and ideas aren't enough by themselves to make a system secure. You still need all of the fundamentals in place. Even if they are releasing software incrementally and running lots of automated tests, developers still need to understand software security and design security in and follow good software engineering practices. Whether they are using "Infrastructure as Code" or not, Ops still has to design and engineer the datacenter and the network and the rest of the infrastructure to be safe and reliable, and run things in a secure and responsible way. And security still needs to train everyone and follow up on what they are doing, run their scans and pen tests and audits to make sure that all of this is being done right.

Secure DevOps is not as simple as it looks. It needs disciplined secure development and secure ops fundamentals, and good tools and rare skills and a high level of organizational agility and a culture of trust and collaboration. Which is why only a small number of organizations are doing this today. It’s not a short term answer for most organizations. But it does show a way for ops and security to keep up with the high speed of Agile development, and to become more agile, and hopefully more effective, themselves.

Thursday, March 20, 2014

Implementing Static Analysis isn't that easy

Static Analysis Testing (SAST) for software bugs and vulnerabilities should be part of your application security – and software quality – program. All that you need to do is run a tool and it will find bugs in the code, early in development when they are cheaper and easier to fix. Sounds easy.

But it takes more than just buying a tool and running a scan – or uploading code to a testing service and having them run the scans for you. You need the direct involvement and buy-in from developers, and from their managers. Because static analysis doesn't find bugs. It finds things in the code that might be bugs, and you need developers to determine which are real problems and which aren't.

This year’s SANS Institute survey on Appsec Programs and Practices, which Frank Kim and I worked on, found that the use of static analysis ranks towards the bottom of the list of tools and practices that organizations find useful in their appsec programs.

This is because you need a real commitment from developers to make static analysis testing successful, and securing this commitment isn't easy.

You’re asking developers to take on extra work and extra costs, and to change how they do their jobs. Developers have to take time from their delivery schedules to understand and use the tools, and they need to understand how much time this is going to require. They need to be convinced that the problems found by the tools are worth taking time to look at and fix. They may need help or training to understand what the findings mean and how to fix them properly. They will need time to fix the problems and more time to test and make sure that they didn't break anything by accident. And they will need help with integrating static analysis into how they build and test software going forward.

Who Owns and Runs the Tools?

The first thing to decide is who in the organization owns static analysis testing: setting up and running the tools, reviewing and qualifying findings, and getting problems fixed.

Gary McGraw at Cigital explains that there are two basic models for owning and running static analysis tools.

In some organizations, Infosec owns and runs the tools, and then works with developers to get problems fixed (or throws the results over the wall to developers and tells them that they have a bunch of problems that need to be fixed right away). This is what McGraw calls a “Centralized Code Review Factory”. The security team can enforce consistent policies and make sure that all code is scanned regularly, and follows up to make sure that problems get fixed.

This saves developers the time and trouble of having to understand the tool and setting up and running the scans, and the Infosec team can make it even easier for developers by reviewing and qualifying the findings before passing them on (filtering out false positives and things that don’t look important). But developers don’t have control over when the scans are run, and don’t always get results when they need them. The feedback cycle may be too slow, especially for fast-moving Agile and Devops teams who rely on immediate feedback from TDD and Continuous Integration and may push out code before the scan results can even get back to them.

A more scalable approach is to make the developers directly responsible for running and using the tools. Infosec can help with setup and training, but it’s up to the developers to figure out how they are going to use the tools and what they are going to fix and when. In a “Self Service” model like this, the focus is on fitting static analysis into the flow of development, so that it doesn't get in the way of developers’ thinking and problem solving. This might mean adding automated scanning into Continuous Integration and Continuous Delivery toolchains, or integrating static analysis directly into developers’ IDEs to help them catch problems immediately as they are coding (if this is offered with the tool).

Disciplined development and Devops teams who are already relying on automated developer testing and other quality practices shouldn't find this difficult – as long as the tools are set up correctly from the start so that they see value in what the tools find.

Getting Developers to use Static Analysis

There are a few simple patterns for adopting static analysis testing that we've used, or that I have seen in other organizations, patterns that can be followed on their own or in combinations, depending on how much software you have already written, how much time you have to get the tools in, and how big your organization is.

Drop In, Tune Out, Triage

Start with a pilot, on an important app where bugs really matter, and that developers are working on today. The pilot could be done by the security team (if they have the skills) or consultants or even the vendor with some help from development; or you could make it a special project for a smart, senior developer who understands the code, convince them that this is important and that you need their help, give them some training if they need it and assistance from the vendor, and get them to run a spike – a week or two should be enough to get started.

The point of this mini-project should be to make sure that the tool is installed and set up properly (integrate it into the build, make sure that it is covering the right code), understand how it provides feedback, make sure that you got the right tool, and then make it practical for developers to use. Don’t accept how the tool runs by default. Run a scan, see how long it takes to run, review the findings and focus on cutting the false positives and other noise down to a minimum. Although vendors continue to improve the speed and accuracy of static analysis tools, most static analysis tools err on the side of caution by pointing out as many potential problems as possible in order to minimize the chance of false negatives (missing a real bug). Which means a lot of noise to wade through and a lot of wasted time.

If you start using SAST early in a project, this might not be too bad. But it can be a serious drag on people’s time if you are working with an existing code base: depending on the language, architecture, coding style (or lack of), the size of the code base and its age, you could end up with hundreds or thousands of warnings when you run a static analysis scan. Gary McGraw calls this the “red screen of death” – a long list of problems that developers didn't know that they had in their code yesterday, and are now told that they have to take care of today.

Not every static analysis finding needs to be fixed, or even looked at in detail. It’s important to figure out what’s real, what’s important, and what’s not, and cut the list of findings down to a manageable list of problems that are worth developers looking into and maybe fixing. Each application will require this same kind of review, and the approach to setup and tuning may be different.

A good way to reduce false positives and unimportant noise is by looking at the checkers that throw off the most findings – if you’re getting hundreds or thousands of the same kind of warning, it’s less likely to be a serious problem (let’s hope) than an inaccurate checker that is throwing off too many false positives or unimportant lint-like nitpicky complaints that can safely be ignored for now. It is expensive and a poor use of time and money to review all of these findings – sample them, see if any of them make sense, get the developer to use their judgement and decide whether to filter them out. Turn off any rules that aren't important or useful, knowing that you may need to come back and review this later. You are making important trade-off decisions here – trade-offs that the tool vendor couldn't or wouldn't make for you. By turning off rules or checkers you may be leaving some bugs or security holes in the system. But if you don’t get the list down to real and important problems, you run the risk of losing the development team’s cooperation altogether.

Put most of your attention on what the tool considers serious problems. Every tool (that I've seen anyway) has a weighting or rating system on what it finds, a way to identify problems that are high risk and a confidence rating on what findings are valid. Obviously high-risk, high-confidence findings are where you should spend most of your time reviewing and the problems that probably need to be fixed first. You may not understand them all right away, why the tool is telling you that something is wrong or how to fix it correctly. But you know where to start.
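
A sketch of that triage step: suppress the checkers you've decided to turn off, then sort what's left so high-risk, high-confidence findings come first. The findings format here is made up – real tools export their own formats (XML, JSON, SARIF) – and severity and confidence are assumed to be numeric scores:

    import json

    # Checkers the team has decided to turn off for now (placeholder names).
    SUPPRESSED_CHECKERS = {"STYLE_NITPICK", "UNUSED_IMPORT"}

    def triage(findings_file="scan-results.json"):
        """Drop suppressed checkers and sort the rest by severity, then confidence."""
        findings = json.load(open(findings_file))
        kept = [f for f in findings if f["checker"] not in SUPPRESSED_CHECKERS]
        kept.sort(key=lambda f: (f["severity"], f["confidence"]), reverse=True)
        return kept

    if __name__ == "__main__":
        for f in triage():
            print(f"sev={f['severity']} conf={f['confidence']} {f['checker']} {f['file']}:{f['line']}")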

Cherry Picking

Another kind of spike that you can run is to pick low hanging fruit. Ask a smart developer or a small team of developers to review the results and start looking for (and fixing) real bugs. Bugs that make sense to the developer, bugs in code that they have worked on or can understand without too much trouble, bugs that they know how to fix and are worth fixing. This should be easy if you've done a good job of setting up the tool and tuning upfront.

Look for different bugs, not just one kind of bug. See how clearly the tool explains what is wrong and how to correct it. Pick a handful and fix them, make sure that you can fix things safely, and test to make sure that the fixes are correct and you didn't break anything by accident. Then look for some more bugs, and as the developers get used to working with the tool, do some more tuning and customization.

Invest enough time for the developers to build some confidence that the tool is worth using, and to get an idea of how expensive it will be to work with going forward. By letting them decide what bugs to fix, you not only deliver some real value upfront and get some bugs fixed, but you also help to secure development buy-in: “see, this thing actually works!” And you will get an idea of how much it will cost to use. If it took this long for some of your best developers to understand and fix some obvious bugs, expect it to take longer for the rest of the team to understand and fix the rest of the problems. You can use this data to build up estimates of end-to-end costs, and for later trade-off decisions on what problems are or aren't worth fixing.

Bug Extermination

Another way to get started with static analysis is to decide to exterminate one kind of bug in an application, or across a portfolio. Pick the “Bug of the Month”, like SQL injection – a high risk, high return problem. Take some time to make sure everyone understands the problem, why it needs to be fixed, how to test for it. Then isolate the findings that relate to this problem, figure out what work is required to fix and test and deploy the fixes, and “get er done”.

This helps to get people focused and establish momentum. The development work and testing work is simpler and lower risk because everyone is working on the same kind of problem, and everyone can learn how to take care of it properly. It creates a chance to educate everyone on how to deal with important kinds of bugs or security vulnerabilities, patch them up and hopefully stop them from occurring in the future.

Fix Forward

Reviewing and fixing static analysis findings in code that is already working may not be worth it, unless you are having serious reliability or security problems in production or need to meet some compliance requirement. And as with any change, you run the risk of introducing new problems while trying to fix old ones, making things worse instead of better. This is especially the case for code quality findings. Bill Pugh, the father of Findbugs, did some research at Google which found that

“many static warnings in working systems do not actually manifest as program failures.”

It can be much less expensive and much easier to convince developers to focus only on reviewing and fixing static analysis findings in new code or code that they are changing, and leave the rest of the findings behind, at least to start.

Get the team to implement a Zero Bug Tolerance program or some other kind of agreement within the development team to review and cleanup as many new findings from static scans as soon as they are found – make it part of their “Definition of Done”. At Intuit, they call this “No New Defects”.

Whatever problems the tools find should be easy to understand and cheap to fix (because developers are working on the code now, they should know it well enough to fix it) and cheap to test – this is code that needs to be tested anyway. If you are running scans often enough, there should only be a small number of problems or warnings to deal with at a time. Which means it won’t cost a lot to fix the bugs, and it won’t take much time – if the feedback loop is short enough and the guidance from the tool is clear enough on what’s wrong and why, developers should be able to review and fix every issue that is found, not just the most serious ones. And after developers run into the same problems a few times, they will learn to avoid them and stop making the same mistakes, improving how they write code.

To do this you need to be able to differentiate between existing (stale) findings and new (fresh) issues introduced with the latest check-in. Most tools have a way to do this, and some, like Grammatech's CodeSonar are specifically optimized to do incremental analysis.
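
If your tool doesn't do this for you, one way to approximate it yourself is to fingerprint each finding and report only what wasn't in the previous scan's baseline. The fingerprint fields and file formats in this sketch are illustrative:

    import json

    def fingerprint(finding):
        # Keying on file + checker + a hash of the offending code is one common
        # approach; line numbers alone shift too easily between scans.
        return (finding["file"], finding["checker"], finding["snippet_hash"])

    def new_findings(baseline_file="baseline.json", current_file="current.json"):
        baseline = {fingerprint(f) for f in json.load(open(baseline_file))}
        return [f for f in json.load(open(current_file)) if fingerprint(f) not in baseline]

    if __name__ == "__main__":
        for f in new_findings():
            print(f"NEW: {f['checker']} in {f['file']} line {f['line']}")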

This is where fast feedback and a Self-Service approach can be especially effective. Instead of waiting for somebody else to run a scan and pass on the results or running ad hoc scans, try to get the results back to the people working on the code as quickly as possible. If developers can’t get direct feedback in the IDE (you’re running scans overnight, or on some other less frequent schedule instead), there are different ways to work with the results. You could feed static analysis findings directly into a bug tracker. Or into the team’s online code review process and tools (like they do at Google) so that developers and reviewers can see the code, review comments and static analysis warnings at the same time. Or you could get someone (a security specialist or a developer) to police the results daily, prioritize them and either fix the code themselves or pass on bugs or serious warnings to whoever is working on that piece of code (depending on your Code Ownership model). It should only take a few minutes each morning – often no time at all, since nothing may have been picked up in the nightly scans.
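
Feeding findings into a bug tracker can be a small script in the pipeline – something like this sketch, where the tracker URL, token and payload fields are hypothetical and would need to be adapted to whatever issue tracker you actually use:

    import json
    import urllib.request

    TRACKER_URL = "https://tracker.example.com/api/issues"  # hypothetical endpoint
    API_TOKEN = "replace-me"                                # hypothetical credential

    def file_issue(finding):
        """Create one tracker issue for a new static analysis finding."""
        payload = {
            "title": f"[SAST] {finding['checker']} in {finding['file']}:{finding['line']}",
            "description": finding.get("message", ""),
            "labels": ["security", "static-analysis"],
        }
        req = urllib.request.Request(
            TRACKER_URL,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json", "Authorization": f"Bearer {API_TOKEN}"},
        )
        with urllib.request.urlopen(req) as resp:
            return resp.status

    if __name__ == "__main__":
        for f in json.load(open("new-findings.json")):
            file_issue(f)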

Fixing forward gets you started quicker, and you don’t need to justify a separate project or even a spike to get going – it becomes just another part of how developers write and test code, another feedback loop like running unit tests. But it means that you leave behind some – maybe a lot of – unfinished business.

Come Back and Clean House

Whatever approach you take upfront – ignoring what’s there and just fixing forward, or cherry picking, or exterminating one type of bug – you will have a backlog of findings that still should be reviewed and that could include real bugs which should be fixed, especially security vulnerabilities in old code. Research on “The Honeymoon Effect” shows that there can be serious security risks in leaving vulnerabilities in old code unfixed, because this gives attackers more time to find them and exploit them.

But there are advantages to waiting until later to review and fix legacy bugs, until the team has had a chance to work with the tool and understand it better, and until they have confidence in their ability to understand and fix problems safely. You need to decide what to do with these old findings. You could mark them and keep them in the tool’s database. Or you could export them or re-enter them (at least the serious ones) into your defect tracking system.

Then schedule another spike: get a senior developer, or a few developers, to review the remaining findings, drop the false positives, and fix, or put together a plan to fix, the problems that are left. This should be a lot easier and less expensive, and safer, now that the team knows how the tool works, what the findings mean, what findings aren't bugs, what bugs are easy to fix, what bugs aren't worth fixing and what bugs they should be careful with (where there may be a high chance of introducing a regression bug by trying to make the tool happy). This is also the time to revisit any early tuning decisions that you made, see if it is worthwhile to turn some checkers or rules back on.

Act and Think for the Long Term

Don’t treat static analysis testing like pen testing or some other security review or quality review. Putting in static analysis might start with a software security team (if your organization is big enough to have one and they have the necessary skills) or some consultants, but your goal has to be more than just handing off a long list of tool findings to a development lead or project manager.

You want to get those bugs fixed – the real ones at least. But more importantly, you want to make static analysis testing an integral part of how developers think and work going forward, whenever they are changing or fixing code, or whenever they are starting a new project. You want developers to learn from using the tools, from the feedback and guidance that the tools offer, to write better, safer and more secure code from the beginning.

In “Putting the Tools to Work: How to Succeed with Source Code Analysis” Pravir Chandra, Brian Chess and John Steven (three people who know a lot about the problem) list five keys to successfully adopting static analysis testing:

  1. Start small – start with a pilot, learn, get some success, then build out from there.
  2. Go for the throat – rather than trying to stomp out every possible problem, pick the handful of things that go wrong most often and go after them first. You’ll get a big impact from a small investment.
  3. Appoint a champion – find developers who know about the system, who are respected and who care, sell them on using the tool, get them on your side and put them in charge.
  4. Measure the outcome – monitor results, see what bugs are being fixed, how fast, which bugs are coming up, where people need help.
  5. Make it your own – customize the tools, write your own application-specific rules. Most organizations don’t get this far, but at least take time early to tune the tools to make them efficient and easy for developers to use, and to filter out as much noise as soon as possible.

Realize that all of this is going to take time, and patience. Be practical. Be flexible. Work incrementally. Plan and work for the long term. Help people to learn and change. Make it something that developers will want to use because they know it will help them do a better job. Then you've done your job.

Thursday, March 13, 2014

Application Security – Can you Rely on the Honeymoon Effect?

I learned about some interesting research from Dave Mortman at this year’s RSA conference in San Francisco which supports the Devops and Agile arguments that continuous, incremental, iterative changes can be made safely: a study by the MIT Lincoln Lab (Milk or Wine: Does Software Security Improve with Age?) and The Honeymoon Effect, by Sandy Clark at the University of Pennsylvania.

These studies show that most software vulnerabilities are foundational (introduced from start of development up to first release), the result of early decisions and not the result of later incremental changes. And there is a “honeymoon period” after software is released, before bad guys understand it well enough to find and exploit vulnerabilities. Which means the more often that you release software changes, the safer your system could be.

Understanding the Honeymoon Effect

Research on the honeymoon period, the time “after the release of a software product (or version) and before the discovery of the first vulnerability”, seems to show that finding security vulnerabilities is “primarily a function of familiarity with the system”. Software security vulnerabilities aren't like functional or reliability bugs which are mostly found soon after release, slowing down over time:

“…we would expect attackers (and legitimate security researchers) who are looking for bugs to exploit to have the easiest time of it early in the life cycle. This, after all, is when the software is most intrinsically weak, with the highest density of ”low hanging fruit” bugs still unpatched and vulnerable to attack. As time goes on, after all, the number of undiscovered bugs will only go down, and those that remain will presumably require increasing effort to find and exploit.

But our analysis of the rate of the discovery of exploitable bugs in widely-used commercial and open-source software, tells a very different story than what the conventional software engineering wisdom leads us to expect. In fact, new software overwhelmingly enjoys a honeymoon from attack for a period after it is released. The time between release and the first 0-day vulnerability in a given software release tends to be markedly longer than the interval between the first and second vulnerability discovered, which in turn tends to be longer than the time between the second and the third…”

It may take a while for attackers to find the first vulnerability, but then it gets progressively easier – because attackers use information from previous vulnerabilities to find the next ones, and because the more vulnerabilities they find, the more confident they are in their ability to find even more (there's blood in the water for a reason).

This means that software may actually be safest when it should be the weakest:

“when the software is at its weakest, with the ‘easiest’ exploitable vulnerabilities still unpatched, there is a lower risk that this will be discovered by an actual attacker on a given day than there will be after the vulnerability is fixed!”

Code Reuse Shortens your Honeymoon

Clark’s team also found that re-use of code shortens the honeymoon, because this code may already be known to attackers:

“legacy code resulting from code-reuse [whether copy-and-paste or using frameworks or common libraries] is a major contributor to both the rate of vulnerability discovery and the numbers of vulnerabilities found…

We determined that the standard practice of reusing code offers unexpected security challenges. The very fact that this software is mature means that there has been ample opportunity to study it in sufficient detail to turn vulnerabilities into exploits.”

In fact, reuse of code can lead to “less than Zero day” vulnerabilities – software that is already known to be vulnerable before your software is released.

Leveraging Open Source or frameworks and libraries and copying-and-pasting from code that is already working obviously saves time and reduces development costs, and helps developers to minimize technical risks, including security risks – it should be safer to use a special-purpose security library or the security features of your application framework than it is to try to solve security problems on your own. But this also brings along its own set of risks, especially the dangers of using popular software components with known vulnerabilities – software that attackers know and can easily exploit on a wide scale. This means that if you’re going to use Open Source (and just about everybody does today), then you need to put in proactive controls to track what code is being used and make sure that you keep it up to date.

Make the Honeymoon Last as Long as you can

One risk of Agile development and Devops is that security can’t keep up with the rapid pace of change - at least not the way that most organizations practice security today. But if you’re moving fast enough, the bad guys might not be able to keep up either. So speed can actually become a security advantage:

“Software that was changed more frequently had a significantly longer median honeymoon before the first vulnerability was discovered.”

The idea of constant change as protection is behind Shape Shifter, an interesting new technology which constantly changes attributes of web application code so that attackers, especially bots, can’t get a handle on how the system works or execute simple automated attacks.

But speed of change isn't enough by itself to protect you, especially since a lot of changes that developers make don’t materially affect the Attack Surface of the application – the points in the system that an attacker can use to get into (or get data out of) an application. Changes like introducing a new API or file upload, or a new user type, or modifying the steps in a key business workflow like an account transfer function could make the system easier or harder to attack. But most minor changes to the UI or behind the scenes changes to analytics and reporting and operations functions don't factor in.

The honeymoon can’t last forever anyway: it could be as long as 3 years, or as short as 1 day. If you are stupid or reckless or make poor technology choices or bad design decisions it won’t take the bad guys that long to find the first vulnerability, regardless of how often you fiddle with the code, and it will only get worse from there. You still have to do a responsible job in design and development and testing, and carefully manage code reuse, especially use of Open Source code – whatever you can to make the honeymoon last as long as possible.

Wednesday, March 5, 2014

Appsec and Devops at RSA 2014

At last week’s RSA security conference in San Francisco the talk was about how the NSA is compromising the security of the Internet as well as violating individual privacy, the NSA and RSA (did RSA take the money or not), the NSA and Snowden and how we all have to take insider threats much more seriously, Apple’s SSL flaw (honest mistake, or plausibly deniable back door?) and how big data is rapidly eroding privacy and anonymity on the Internet anyways (with today’s – or at least tomorrow’s – analytic tools and enough Cloud resources, everyone’s traces can be found and put back together). The Cloud is now considered safe – or at least as safe as anywhere else. Mobile is not safe. Bitcoin is not safe. Critical infrastructure is not safe. Point of Sale systems are definitely not safe – and neither are cars or airplanes or anything in the Internet of Things.

I spent most of my time on Appsec and Devops security issues. There were some interesting threads that ran through these sessions:

Third Party Controls

FS ISAC’s recent paper on Appropriate Software Security Control Types for Third Party Service and Product Providers got a lot of play. It outlines a set of controls that organizations (especially enterprises) should require of their third party service providers – and that anyone selling software or SaaS to large organizations should be prepared to meet:

  1. vBSIMM - a subset of BSIMM to evaluate a provider’s secure development practices
  2. Open Source Software controls – it’s not enough for an organization to be responsible for the security of the software that they write; they also need to be responsible for the security of any software that they use to build their systems, especially Open Source software. Sonatype has done an especially good job of highlighting the risk of using Open Source software, with scary data like “90% of a typical application is made up of Open Source components”, “2/3 of developers don’t know what components they use” and “more than 50% of the Global 500 use vulnerable Open Source components”.
  3. Binary Static Analysis testing – it’s not enough that customers ask how a provider secures their software, they should also ask for evidence, by demanding independent static analysis testing of the software using Veracode or HP Fortify on Demand.

Threat Modeling

There was also a lot of talk about threat modeling in secure development: why we should do it, how we could do it, why we aren’t doing it.

In an analysis of results from the latest BSIMM study on software security programs, Gary McGraw at Cigital emphasized the importance of threat modeling / architectural risk analysis in secure development. However, he also pointed out that it does not scale. While 56 of the 67 firms in the BSIMM study who have application security programs in place conduct threat modeling, they limit this to security features only.

Jim Routh, now CISO at Aetna (and who led the FS ISAC working group above while he was at JPMC), has implemented successful secure software development programs in 4 different organizations, but admitted that he failed at injecting threat modeling into secure development in all of these cases, because of how hard it is to get designers to understand the necessary tradeoff decisions.

Adam Shostack at Microsoft outlined a pragmatic and understandable approach to threat modeling. Ask 4 simple questions:

  1. What are you building?
  2. What can go wrong?
  3. What are you going to do about the things that can go wrong?
  4. Did you do a good job of 1-3?

Asking developers to “think like an attacker” is like asking developers to “think like a professional chef” or “think like an astronaut”. They don’t know how. They still need tools to help them. These tools (like STRIDE, Mitre’s CAPEC, and attack trees) are described in detail in his new book Threat Modeling: Designing for Security which at over 600 pages is unfortunately so detailed that few developers will get around to actually reading it.

Devops, Unicorns and Rainbows

A panel on Devops/Security Myths Debunked looked at security and Devops, but didn’t offer anything new. The same people and same stories, the same Unicorns and rainbows: thousands of deploys per day at Amazon, developer self-service static analysis at twitter, Netflix’s Chaos Monkey… If Devops is taking the IT operations world by storm, there should be more leaders by now, with new success stories outside of big Cloud providers or social media.

The Devops story when it comes to security is simple. More frequent, smaller changes are better than less frequent big changes because small changes are less likely to cause problems, are easier to understand and manage and test, and easier to debug, fix or roll back if something goes wrong. To deploy more often you need standardized, automated deployment – which is not just faster but also simpler and safer and easier to audit, and which makes it faster and simpler and safer to push out patches when you need to. Developers pushing changes to production does not mean developers signing on in production and making changes to production systems (see automated deployment above).

Dave Mortman tried to make it clear that Devops isn’t about, or shouldn’t only be about, how many times a day you can deploy changes: it should really be about getting people talking together and solving problems together. But most of the discussion came back to speed of development, reducing cycle time, cutting deployment times in half – speed is what is mostly driving Devops today. And as I have pointed out before, infosec (the tools, practices, ways of thinking) is not ready to keep up with the speed of Agile and Devops.

Mobile Insecurity

The state of mobile security has definitely not improved over the past couple of years. One embarrassing example: the RSA conference mobile app was hacked multiple times by security researchers in the days leading up to and during the conference.

Dan Cornell at Denim Group presented some research based on work that his firm has done on enterprise mobile application security assessments. He made it clear that mobile security testing has to cover much more than just the code on the phone itself. Only 25% of serious vulnerabilities were found in code on the mobile device. More than 70% were in server code, in the web services that the mobile app called (web services that “had never seen adversarial traffic”), and the remaining 5% were in third party services. So it’s not the boutique full of kids that your marketing department hired to write the pretty mobile front end that is your biggest security problem – it’s the enterprise developers writing the web services behind it.

Denim Group found more than half of serious vulnerabilities using automated testing tools (58% vs 42% found in manual pen testing and reviews). Dynamic analysis (DAST) tools are still much less effective in testing mobile than for web apps, which means that you need to rely heavily on static analysis testing. Unfortunately, static analysis tools are extremely inefficient - for every serious vulnerability found, they also report hundreds of false positives and other unimportant noise. We need better and more accurate tools, especially for fuzzing web services.

Tech, Toys and Models

Speaking of tools... The technology expo was twice the size of previous years. Lots of enterprise network security and vulnerability management and incident containment (whatever that is) solutions, governance tools, and application firewalls and NG firewalls and advanced NNG firewalls, vulnerability scanners, DDOS protection services, endpoint solutions, a few consultancies and some training companies. A boxing ring, lots of games, coffee bars and cocktail bars, magicians, a few characters dressed up like Vikings, some slutty soccer players – the usual RSA expo experience, just more of it.

Except for log file analysis tools (Splunk and LogRhythm) there was no devops tech at the tradeshow. There were lots of booths showing off Appsec testing tools (including all of the major static analysis players) and a couple of interesting new things:

Shape Shifter, from startup Shape Security, is a new kind of application firewall technology for web apps: “a polymorphic bot wall” that dynamically replaces attributes of HTML or Javascript with random strings to deter automated attacks. Definitely cool. It will be interesting to see how effective it is.
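
As a toy illustration of the polymorphic idea (my own sketch of the concept, not Shape Security’s implementation): rewrite field names to one-time random aliases on the way out and map them back on the way in, so scripted attacks that hard-code field names stop working.

# Toy sketch of a "polymorphic" defence: rewrite form field names to random
# one-time aliases in every response, keep the mapping server-side (e.g. in
# the session), and translate submitted aliases back to the real names.
# This illustrates the concept only; it is not Shape Security's product.

import secrets

def randomize_fields(html, field_names):
    """Replace known field names in the HTML with one-time random aliases."""
    mapping = {}
    for name in field_names:
        alias = "f_" + secrets.token_hex(8)
        mapping[alias] = name
        html = html.replace('name="' + name + '"', 'name="' + alias + '"')
    return html, mapping

def restore_fields(form_data, mapping):
    """Translate submitted aliases back to the real field names."""
    return {mapping.get(key, key): value for key, value in form_data.items()}

page = '<input name="username"><input name="password">'
rewritten, mapping = randomize_fields(page, ["username", "password"])
print(rewritten)  # the field names are different on every request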

Denim Group showed off a major upgrade to ThreadFix, a tool that maps software vulnerabilities found through different testing methods including automated dynamic analysis, static analysis and manual testing. This gives you the same kind of vulnerability management capabilities that you get with enterprise appsec testing suites: hybrid mapping of the attack surface with drill-down to the line of code, tracking metrics across applications and portfolios, integration with different bug tracking systems and the ability to create virtual patches for WAFs, but in a cross-vendor way. You can correlate findings from HP Fortify static analysis with dynamic testing results from IBM AppScan, or from a wide range of other tools and security testing platforms, including Open Source tools like FindBugs, Brakeman, ZAP, Arachni and Skipfish. And ThreadFix is Open Source itself (available for free, or in a supported enterprise version).

The best free giveaway at the conference was viaProtect, an app from mobile security firm viaForensics that lets you check the security status of your iOS or Android device. Although it is just out of beta and still a bit wobbly, it provides a useful summary of the security status of your mobile devices, and if you choose to register, you can track the status of multiple devices on a web portal.

Another year, another RSA

This was my third time at RSA. What has changed over that time? Compliance, not fear of being hacked, is still driving infosec: “attackers might show up; auditors definitely will”. But most organizations have accepted that they will be hacked soon, or have already been hacked and just haven’t found out yet, and are trying to be better prepared for when the shit hits the fan. More money is being poured into technology and training and consulting, but it’s not clear what’s making a real difference. It’s not money or management that’s holding many organizations back from improving their security programs – it’s a shortage of infosec skills.

With a few exceptions, the general quality of presentations wasn't as good as in previous years – disappointing and surprising given how hard it is to get a paper accepted at RSA. There wasn't much that I hadn't already heard at other conferences like OWASP AppSec, or couldn't catch in a free vendor-sponsored webinar. The expert talks that were worth going to were often over-subscribed. But at least the closing keynote was a good one: Stephen Colbert was funny, provocative and surprisingly well-informed about the NSA and related topics. And the parties – if getting drunk with security wonks and auditors is your idea of crazy fun, nothing, but nothing, beats RSA.

Monday, February 3, 2014

Data Privacy and Security in ThoughtWorks Radar, Sort of

Once or twice a year the thought leaders at ThoughtWorks, including their Chief Scientist and author Martin Fowler, get together and put out a Radar report listing software development techniques and technologies (tools, platforms, languages and frameworks) that they think are interesting, and that they think other developers should be interested in too. Unlike analyses from Gartner, the Radar only includes things that ThoughtWorks teams have actually tried and seen work, or tried and seen not work, or are trying and think might work.

The Radar is always a good read, a way to keep up with the latest fashions, and is an especially good resource on practices and tools for mobile, Web and Cloud development projects, and Open Source tools and platforms for automated testing and build and deployment.

ThoughtWorks was a pioneer in continuous build and Continuous Integration, and in devops: ideas and tools for Continuous Deployment and Continuous Delivery have been included in the Radar going back to 2009, and ThoughtWorks has an entire practice built around Continuous Delivery.

And now, maybe because they were shamed into this by Matt Konda at Jemurai Security, ThoughtWorks have included data privacy and application security in the latest Radar, although in an unfortunately obscure and limited way.

Data Privacy – Assess Datensparsamkeit

There are four rings in the ThoughtWorks Radar:

  • Adopt (ThoughtWorks feels strongly that everyone should be doing this)
  • Trial (worth pursuing, but maybe start off carefully)
  • Assess (try it out, it might work, at least learn something about it)
  • Hold (proceed with caution – i.e., you should probably not do/use this, or if you are doing/using this you should probably stop doing/using this)

Concerns for Data Privacy were added to the Jan 2014 Radar. The idea is sound:

“only store as much personal information as is absolutely required for the business or applicable laws… If you never store the information, you do not need to worry about someone stealing it.”
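
A minimal sketch of what that looks like in code (the field names are made-up examples, not from the Radar):

# Data minimization: keep only the fields the business actually needs before
# persisting anything. Data you never store can never be leaked or stolen.
# The field names below are made-up examples for illustration.

REQUIRED_FIELDS = {"account_id", "email", "country"}  # e.g. needed for billing

def minimize(signup_form):
    """Drop everything that is not strictly required before saving."""
    return {k: v for k, v in signup_form.items() if k in REQUIRED_FIELDS}

submitted = {
    "account_id": "A-1001",
    "email": "user@example.com",
    "country": "DE",
    "date_of_birth": "1980-01-01",  # not needed, so never stored
    "phone": "+49 30 1234567",      # not needed, so never stored
}

print(minimize(submitted))  # only account_id, email and country are kept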

But the way it was presented was unfortunate. Data privacy was added as a Radar blip in the early stage “Assess” try-it-out ring, and with a cute but obscure label (“Datensparsamkeit”) taken from German privacy legislation.

This is a recognized good practice, demanded by many regulations. Why is this in “Assess”, and why is it hidden under a German name?

Application Security – Hold Ignoring OWASP Top 10

ThoughtWorks has now recognized that security is important:

“Barely a week goes by without the IT industry being embarrassed by yet another high profile loss of data, leak of passwords, or breach of a supposedly secure system.”

The way that this report works, people should stop doing what is in the “Hold” ring, and focus most of their attention on what is in the “Adopt” ring, because these are proven, key technologies and practices that are worth following. Instead of asking developers to Adopt secure design and development practices, they've added security as a “first-class concern during software construction” by putting “Ignoring OWASP Top 10” in the Hold ring.

Like “Assess Datensparsamkeit”, “Hold Ignoring OWASP Top 10” won’t make a lot of sense to most developers, unless they take extra time to read and understand more on their own.

Oh Well, at least this is something for now

Although this could have been done in a much more understandable and straightforward way, at least this Radar shows that ThoughtWorks is actively thinking about security and privacy in their projects, and that they think that other developers should too. The ThoughtWorks Radar will reach a different (and probably bigger) audience than most software security-focused publications, including developers who have never heard of the OWASP Top 10 or Datensparsamkeit, so this is a good thing.

All of this is likely to be temporary, however, because of the attention-deficit way that the Radar works. ThoughtWorks only lists things that they currently find interesting in each report. A few practices and technologies stay on the Radar for a while as they move from Assess to Trial to Adopt (if they prove to be key) or Hold (if they don’t work out), and because they are fundamental to the way that ThoughtWorks teams work (like evolutionary architecture and continuous build and automated testing). But most ideas and tools drop off the Radar often and quickly, as ThoughtWorkers move on to the next shiny new thing.

So, for the moment at least, security and privacy will get some extra attention from ThoughtWorks and the developers that they influence.

Past Radars

If you are interested in following the changing ideas, cool tools and recent fashions in software development highlighted in the Radar, here are links going back to 2009:

Jan 2014 (the Radar discussed in this post)

May 2013

October 2012

March 2012

July 2011

January 2011

August 2010

April 2010

January 2010

November 2009
