By Paul Kehrer
It is a truism in modern software development that a robust continuous integration (CI) system is necessary. But many projects suffer from CI that feels brittle, frustrates developers, and actively impedes development velocity. Why is this? What can you do to avoid the common CI pitfalls?
Continuous Integration Needs a Purpose
CI is supposed to provide additional assurance that a project’s code is correct. However, the tests a developer writes to verify the expected functionality are at their least useful when they are initially written. This is perhaps counterintuitive because a developer’s familiarity with the code is greatest when they first write it. They’ve thought about it from numerous angles, considered many of the possible edge cases, and implemented something that works pretty well!
Unfortunately, writing code is the easiest part of programming. The real challenge is building code that others can read so your project can thrive for many years. Software entropy increases over time. Developers, especially ones not familiar with large long-term codebases, can’t anticipate how their code may be integrated, refactored, and repurposed to accommodate needs that weren’t originally considered.
When these sorts of refactors and expansions occur, tests are the only way changes can be made confidently. So why do developers end up with systems that lack quality testing?
When writing tests, especially for high code-coverage metrics, the most common complaint is that some tests are trivial and exercise nothing interesting or error-prone in the codebase. These complaints are valid when thinking about the code as it exists today, but now consider that the software could be repurposed from its original intention. What once was trivial might now be subtle. Failing to test trivial cases may lead your work into a labyrinth of hidden traps rooted in unobservable behavior.
Remember these three things:
- No test is trivial in the long run.
- Tests are documentation of expected behavior.
- Untested code is subject to incidental behavioral change.
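As a small illustration of these points, here is a sketch of a test that looks trivial but pins down behavior. The `parse_port` function and `DEFAULT_PORT` value are hypothetical and exist only for this example; the tests are plain functions with bare assertions, so they run under pytest or when called directly.

```python
DEFAULT_PORT = 8080  # hypothetical default, chosen only for this sketch


def parse_port(value):
    """Return the port as an int, falling back to DEFAULT_PORT for empty input."""
    if value is None or value.strip() == "":
        return DEFAULT_PORT
    return int(value)


def test_empty_value_uses_default():
    # Trivial today, but it documents the fallback so that a later refactor
    # cannot change it without a test failing.
    assert parse_port("") == DEFAULT_PORT
    assert parse_port("   ") == DEFAULT_PORT
    assert parse_port(None) == DEFAULT_PORT


def test_numeric_string_is_converted():
    assert parse_port("443") == 443
```

If someone later repurposes `parse_port` and decides empty input should raise an error, the first test forces that decision to be deliberate rather than incidental.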
Unreliable CI is poison for developers. For internal projects, it saps productivity and makes people hate working on them. And for open-source projects, it drives away contributors faster than they can arrive.
Find what’s causing your tests to be unreliable and fix it. Unreliable CI commonly manifests as flaky tests, and tools exist to mark tests as flaky until you can find the root cause. This will allow immediate improvement in your CI without crippling the team.
You may find yourself with an excessively long CI cycle time. This is problematic because a quality development process requires that all CI jobs pass. If the cycle is so long or complex that it’s impractical to run locally, developers will create workarounds. These workarounds may take many forms, but the most common is ballooning PR sizes: no one wants to put in a 2-line PR, wait an hour for it to merge, and then rebase their 300-line PR on top of it when they can just make a few unrelated changes in a single PR. This causes problems for code reviewers and lowers the quality of the project.
Developers aren’t wrong to do this; CI has failed them. When building CI systems, it’s important to keep a latency budget in mind that goes something like, “CI should never be slower than time t, where t is chosen a priori.” If CI becomes slower than that, effort should be spent on improving it, even if that encroaches on the development of new features.
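A latency budget is easy to check mechanically. The sketch below is one possible shape for such a check, under two assumptions that are not from the original text: the budget value (10 minutes is only a placeholder for t) and that jobs run in parallel, so the cycle time equals the slowest job. The function name `check_latency_budget` is hypothetical.

```python
from datetime import timedelta

# Assumption: the team picks t up front; 10 minutes is only a placeholder.
CI_BUDGET = timedelta(minutes=10)


def check_latency_budget(job_durations, budget=CI_BUDGET):
    """Compare one CI run against the latency budget.

    job_durations maps job name -> timedelta. Jobs are assumed to run in
    parallel, so the cycle time is the duration of the slowest job.
    Returns (cycle_time, offenders), with offenders sorted slowest first.
    """
    if not job_durations:
        return timedelta(0), []
    cycle_time = max(job_durations.values())
    offenders = sorted(
        (name for name, duration in job_durations.items() if duration > budget),
        key=lambda name: job_durations[name],
        reverse=True,
    )
    return cycle_time, offenders
```

Run against the timings of a recent pipeline, the list of offenders shows where optimization effort should go first.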
Coverage is difficult
Part of responsible testing is knowing which lines of code your tests are exercising—a nice, simple number that tells you everything. So why is coverage so commonly ignored?
First, the technical challenge. Modern software runs against many disparate targets. To be useful, CI should run against numerous targets, each submitting its coverage data to a hosted system that can combine the results. (The frustration of tools like this failing and how to maintain development velocity despite all software being hot garbage is another discussion.) Absent this, software developers often fail to notice missed coverage as it becomes lost in the noise of “expected” missed lines.
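To show why combining matters, here is a minimal sketch of merging per-target coverage. The data shape (a mapping from file path to the set of executed line numbers) and the function names are assumptions made for this example, not the format of any specific coverage tool.

```python
def combine_coverage(reports):
    """Union the executed line numbers per file across all target reports."""
    combined = {}
    for report in reports:
        for path, lines in report.items():
            combined.setdefault(path, set()).update(lines)
    return combined


def missed_lines(executable, combined):
    """Return {path: sorted missed line numbers} for files with any gap.

    executable maps each file path to the set of lines that can run at all.
    """
    missed = {}
    for path, lines in executable.items():
        gap = set(lines) - combined.get(path, set())
        if gap:
            missed[path] = sorted(gap)
    return missed
```

A platform-specific branch looks uncovered in any single target’s report. Only after the union do the lines that remain missed represent real gaps rather than “expected” noise.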
Now let’s talk about the social challenges. Software is typically written in a way that makes it difficult to test small pieces of functionality. This issue gave rise to the test-driven development (TDD) trend, where tests are written first to help developers factor their code in a testable manner. This is generally a net win in readability and testability but requires discipline and a different approach to development that doesn’t come naturally to many people.
The perceived drudgery in making more of a codebase testable causes complaints that coverage is an imperfect metric. After all, not all code branches are created equal, and depending on your language, some code paths should never be exercised. These are not good reasons to dismiss coverage as a valuable metric, though on specific occasions there may be a compelling reason not to spend the effort to cover something with tests. However, be aware that if you fail to cover a piece of code with tests, its behavior is no longer part of the contract future developers will uphold during refactoring.
What do we do?
So how do we get to CI nirvana given all these obstacles? Incrementally. An existing project is a valuable asset, and we want to preserve what we have while increasing our ability to improve it in the future. (Rewrites are almost universally a bad idea.) This necessitates a graduated approach that, while specifically customized to a given project, has a broad recipe:
- Make CI reliable
- Speed up CI
- Improve test quality
- Improve coverage
We should all spend time investing in the longevity of our projects. This sort of foundational effort pays rapid dividends and ensures that your software projects can be world-class.
I totally agree. Once the build is regularly failing due to flaky tests, it’s easy to lose trust in the test suite. Before long, you find yourself clicking “rebuild” without even looking at the failing test to see if it might be a legitimate failure.
We got hit pretty hard by this while working on Atom (https://atom.io). It was a lot of manual work trying to keep track of which tests were flaky, and we never felt like we had a reliable assessment of which tests were flaky and just how flaky they were.
Thanks for helping to spread awareness that there are tools nowadays to help teams deal with flaky tests! I’m hoping that tools like https://buildpulse.io (which I’m currently working on) can make it easier for teams to keep their test suite healthy. By automatically detecting flaky tests and showing their failure rate over time, everyone gets a shared understanding of how/whether flaky tests are impacting the project. And by identifying the most disruptive flaky tests, teams can see exactly where to start in order to have the most impact on improving the reliability of the test suite.
Hi there! The correct URL for pytest failure skips is now https://docs.pytest.org/en/latest/how-to/skipping.html
Fixed! Thank you.