How Brex uses Signadot to scale developer testing across 100s of Engineers
Brex, a company at the forefront of financial services technology, encountered significant hurdles with their Kubernetes namespace-based preview environments. These challenges ranged from operational inefficiencies to the high costs of maintaining isolated environments for testing. Recognizing the need for a more scalable and efficient solution, Brex turned to Signadot, a platform designed to streamline testing and previews by creating isolated sandboxes within a shared cluster. This move not only aimed to address the immediate issues of cost and reliability but also sought to enhance developer productivity by providing more realistic and up-to-date testing. This article delves into how Brex leveraged Signadot to overcome the limitations of traditional preview environments, setting a new standard for software development and testing practices.
Adopting Signadot brought rapid and measurable improvements in the developer experience. Previewing changes for a service took 80% less time when using Signadot. User satisfaction (CSAT) scores were 28 points higher for Signadot than for the previous tool, and by some measures the infrastructure cost of allowing developers to preview their changes was reduced by 99% with Signadot.
The Problem: Unmaintained Preview Environments
Preview environments in Kubernetes namespaces, while powerful at inception, are now hitting significant operational roadblocks. The architecture, designed to isolate and replicate production-like environments for development and testing, is showing its age.
A few years ago, Brex implemented a system that used duplicated namespaces to create isolated environments for developer testing and experimentation. By giving developers direct access to these namespaces, the system enabled ‘inner loop’ testing, described in more detail below.
Brex developers ran enough microservices in each of these environments that they were soon pushing the scaling limits of a single Kubernetes control plane. Availability issues eventually pushed some teams to test more in staging, abandoning their preview environment services to disrepair. Developers would notice these unmaintained services and attribute problems to the platform. The platform team scaled the system to multiple clusters, but the unmaintained services continued to be tested primarily in staging.
As Connor Braa, Software Engineering Manager at Brex, puts it:
“The primary problem is just the cost of getting something wrong or breaking something in a preview environment was never high enough to prevent folks from leaving things broken. And so they were constantly in a state of disrepair.”
The effect of unmaintained preview environments is a hit to developer velocity. When the first realistic, high-fidelity testing happens in Staging, Staging inevitably breaks more often. This matters because Staging going down blocks other teams’ testing on that environment, compounding the delays. Further, tests happening on Staging are an ‘outer loop’ of testing, with feedback arriving only after the full merge process and integration tests.
Reliability problems with duplicated environments
Reliability problems with duplicated dev environments have downstream effects that behave almost like a natural stream flowing around an obstacle: developer teams create workarounds to avoid using dev environments at all:
“There was a group of teams building very important product features that just eschewed development against preview environments entirely, and instead focused on unit tests. After unit tests, it was tested on Staging, and were required to use feature flags way more heavily. It made the data quality and reliability problems with moving to staging way worse.”
It’s easy to empathize with the decision of these teams to work around their preview environments, but without this critical step more testing happens either on Staging, where problems will block the work of other teams, or even on Production.
Business effects of poor reliability of previews
While previous sections have touched on how unreliable preview environments affect the development lifecycle, these issues have real effects on business goals as well. Costs are one concern.
An issue with isolated developer environments is that you’re running a large number of isolated services that aren’t used most of the time. Connor explains:
“The other big problem is cost, having isolated environments for our internal software that’s hundreds of microservices solely for the purpose of testing gets insanely expensive. We’re talking about replicating 800+ services just so a developer can test changes on 1-2 services. That’s a great deal of compute and memory.”
What about some hidden costs? When the team doesn’t feel like they’re working at their best level, it can hurt retention, job satisfaction, and performance. Connor describes it as “developers not feeling like they have the tools to ship the level of quality of software that they want to ship.”
With single developer hires costing six figures, it’s worth asking if your team can afford to lose engineers because your best people are frustrated when they can’t ship code.
Signadot: testing on staging with isolation
The adoption of Signadot, which allows isolating sandboxes within a shared cluster, meant that Staging could be used as an environment where developers can experiment without conflicting with others’ work. This is superior to giving each developer an entire isolated cluster in a number of ways, starting with cost: only the ‘sandboxed’ services that a developer wants to manipulate need to be run separately from the main Staging baseline services. With a time-to-live setting, sandboxes can shut down after a certain amount of time, meaning no huge infrastructure bills to run services that no one ever uses.
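To make the time-to-live idea concrete, here is a minimal Python sketch of how a cleanup job might decide which sandboxes to shut down. The `Sandbox` record, its field names, and the `sandboxes_to_delete` helper are hypothetical and exist only for illustration; this is not Signadot's API or Brex's tooling.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Sandbox:
    """Hypothetical sandbox record: only the forked services run separately."""
    name: str
    forked_services: list[str]
    created_at: datetime
    ttl: timedelta

    def is_expired(self, now: datetime) -> bool:
        # A sandbox is due for cleanup once its time-to-live has elapsed.
        return now >= self.created_at + self.ttl


def sandboxes_to_delete(sandboxes: list[Sandbox], now: datetime) -> list[str]:
    """Return the names of sandboxes whose TTL has elapsed."""
    return [s.name for s in sandboxes if s.is_expired(now)]


start = datetime(2024, 1, 1, 9, 0)
sandboxes = [
    Sandbox("pr-101", ["payments"], start, timedelta(hours=4)),
    Sandbox("pr-102", ["ledger", "cards"], start, timedelta(days=2)),
]
expired = sandboxes_to_delete(sandboxes, now=start + timedelta(hours=5))
print(expired)  # ['pr-101']
```

Five hours after creation, only the sandbox with the four-hour TTL is flagged, so its forked services stop consuming resources while the baseline Staging services keep running for everyone else.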
The Culture of Staging
A key benefit for Brex of the switch to Signadot is that it didn’t require a major change in practices, but helped lead to a cultural change of running both unit tests and integration tests as part of an ‘inner loop’ before the deploy process. Since Staging was always the ‘last word’ in testing code changes before deployment, it’s expected to be kept up-to-date. As Connor puts it:
“Staging has always been a place where we have a bit higher standard. It needs to be working, and there are real business consequences when it breaks. Being able to run Signadot in Staging alleviates this pressure where no one wants to maintain a separate dev environment just for dev testing.”
Another benefit is the quality of the data. We’ve all been there: when setting up a temporary dev environment we fill all the customer databases with johnsmith1,123fakestreet,usa for 2,000 rows and call it a day. But as Connor continues:
“The data quality that we have on Staging and Production is so much higher in terms of being able to test stuff in a realistic way [...] we could solve this [on developer environments] but it would require a huge cultural shift to prioritize realistic data on a replication environment. As a platform team changing everyone’s behavior isn’t a road we want to go down.”
Accelerating Testing: A Tale of Two Loops
Any time your shop is larger than three engineers, there are two ways that code changes are tested. In one loop the developer changes the code and sees the results of the changes with testing. This ‘inner loop’ is fast by definition: the person who wrote the code, and who is best suited to fix the problem, sees the feedback right away.
On the 'outer loop', changes are validated after code has been reviewed and merged to "main". They might be deployed to staging, tested, then rolled to production, or carefully feature flagged against real production data. On the outer loop, tests are more realistic, but the cost of fixing a flaw found in outer loop testing is dramatically higher, as each iteration must pass through review and release processes. Brex wants developers to find as many bugs as possible on their inner loop.
At Brex, developers were doing a great job with writing unit tests, meaning that their initial code evaluation inner loop was both readily available and highly accurate. But these kinds of unit tests run by developers are inherently limited in scope and can’t cover everything that would be covered in integration tests.
With Signadot, individual developers can manually create a sandbox and perform highly accurate tests of their services on a shared cluster with all needed dependencies. This, combined with unit tests, greatly expands the number of problems identified before a pull request is submitted. With more issues found in the inner testing loop, the result is a huge improvement in developer velocity.
The Application Infrastructure team at Brex has built an integration testing framework which can pick what tests to run based on scanning a PR, and will be adding more automated integration tests using Signadot. While everything after the PR phase is technically the outer loop, automated feedback can be read directly by the original developer and still accelerate development.
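One simple way to picture test selection from a PR is a mapping from changed file paths to the integration suites that cover them. The sketch below is an assumption about how such a selector could work, not a description of Brex's framework; the path patterns, suite names, and `select_suites` function are all hypothetical.

```python
from fnmatch import fnmatch

# Hypothetical mapping from path patterns to integration test suites.
SUITES_BY_PATTERN = {
    "services/payments/*": {"payments-integration"},
    "services/ledger/*": {"ledger-integration", "payments-integration"},
    "libs/auth/*": {"auth-integration"},
}


def select_suites(changed_files):
    """Return the sorted test suites triggered by the files changed in a PR."""
    selected = set()
    for path in changed_files:
        for pattern, suites in SUITES_BY_PATTERN.items():
            # fnmatch's "*" also matches "/" so nested paths are covered.
            if fnmatch(path, pattern):
                selected |= suites
    return sorted(selected)


changed = ["services/ledger/api/handler.py", "docs/README.md"]
print(select_suites(changed))  # ['ledger-integration', 'payments-integration']
```

A change to the ledger service triggers both its own suite and the suite of a dependent service, while a documentation-only change triggers nothing.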
By the numbers: cost, velocity, and user satisfaction
Cost
It’s worth coming back to the cost advantages of using Signadot vs. replicating whole environments for developers. With Signadot, only the services being tested needed to be replicated. With replicated environments, instead of running a handful of forked services, Brex had to run 800+ services just for that preview environment. As Connor puts it:
“On the margin, with the Signadot approach, 99.8% of the isolated environment's infrastructure costs look wasteful. That percentage looks like an exaggeration, but it's really not.”
Since the infrastructure costs for replication and preview environments can be significant, this massive cut can have a real impact on your overall budget for developer experience.
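The figure is easy to sanity-check with the numbers quoted earlier: roughly 800 services replicated so that a developer can test changes on one or two of them. The sketch below assumes, purely for illustration, that every service costs about the same to run; real per-service costs will vary, and the function name is hypothetical.

```python
def wasted_fraction(total_services, services_under_test):
    """Fraction of a fully replicated environment that is not under test."""
    return 1 - services_under_test / total_services


# 800 replicated services, with 1 or 2 actually being changed by the developer.
for under_test in (1, 2):
    share = wasted_fraction(800, under_test)
    print(f"{under_test} service(s) under test: {share:.3%} of the environment is idle")
```

With two services under test, 798 of 800 replicas (99.75%) exist only as dependencies, which lines up with the roughly 99.8% cited in the quote.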
Velocity
While not the only goal of a developer platform, a good measure for success is how long it takes to go from writing code to seeing it run, whether from a deployment, preview environment, or local testing. Signadot was a key part of a significant acceleration at Brex. John Salem, Senior Software Engineer at Brex, describes the change:
“A typical deploy in our Signadot based tooling, end-to-end, is less than five minutes versus the 30-60 minute deploy times we would see with preview environments. Local sandboxes allow us to take that down to almost instantaneous testing.”
Satisfaction
For platform engineers, focusing on the development team's Customer Satisfaction (CSAT) scores is crucial. It's not merely about ensuring that end-users are content; it's about recognizing that developers are, in many ways, your internal customers. High CSAT scores among developers can lead to more efficient development cycles, improved software quality, and ultimately, a stronger product.
John Salem saw marked satisfaction improvements with Signadot:
“Our Signadot based tooling had a CSAT that was 28 points higher than our older Platform Engineering tech.”
In the end the job of Developer Experience, or Platform Engineering, is to make developers’ lives easier. By improving satisfaction scores significantly the team at Brex made big gains on the health of their product team.
Conclusions: Signadot for Developer Velocity
The transition to Signadot for managing preview environments at Brex marks a significant evolution in how development and testing are approached, addressing the core issues of cost, reliability, and efficiency that plagued their previous system. By integrating isolated sandboxes within a shared staging cluster, Brex not only streamlined the testing process but also significantly reduced the overhead associated with maintaining numerous isolated environments. This shift enhances the quality of testing by ensuring that developers work with realistic data and dependencies. The move towards a more centralized and efficient staging environment underscores a pragmatic approach to software development, where the focus is squarely on improving developer velocity and product quality without the unnecessary duplication of resources.
By enabling developers to test changes in a production-like environment without stepping on each other's toes, Brex has effectively shortened the feedback loop for developers, thereby accelerating the development process. This approach not only ensures that developers have the tools and environments necessary to ship high-quality software but also addresses the hidden costs associated with developer dissatisfaction and turnover. Ultimately, the adoption of Signadot at Brex represents a forward-thinking solution to the challenges of “shifting left” realistic testing, setting a benchmark for others in the industry to follow.