How Brex uses Signadot to scale developer testing across 100s of Engineers

Brex, a company at the forefront of financial services technology, encountered significant hurdles with their Kubernetes namespace-based preview environments. These challenges ranged from operational inefficiencies to the high costs of maintaining isolated environments for testing. Recognizing the need for a more scalable and efficient solution, Brex turned to Signadot, a platform designed to streamline testing and previews by creating isolated sandboxes within a shared cluster. This move not only aimed to address the immediate issues of cost and reliability but also sought to enhance developer productivity by providing more realistic and up-to-date testing.

Adopting Signadot brought rapid, measurable improvements to the developer experience. Previewing changes to a service took 80% less time with Signadot. User satisfaction (CSAT) scores were 28 points higher for Signadot than for the previous tool, and by some measures the infrastructure cost of letting developers preview their changes fell by 99%.

By the numbers: 80% less time to preview a service · CSAT 28 points higher · 99% infrastructure cost reduction

The Problem: Unmaintained Preview Environments

A few years ago, Brex built a system that duplicated Kubernetes namespaces to create isolated environments for developer testing and experimentation. These namespace-based preview environments were powerful at first, but over time they hit significant operational roadblocks.

Brex developers ran so many microservices in each of these environments that they soon pushed the scaling limits of a single Kubernetes control plane. Availability issues eventually drove some teams to do more of their testing in Staging, leaving their preview environments to fall into disrepair.

“The primary problem is just the cost of getting something wrong or breaking something in a preview environment was never high enough to prevent folks from leaving things broken. And so they were constantly in a state of disrepair.” Connor Braa · Software Engineering Manager, Brex

Unmaintained preview environments slow developers down. When the first realistic, high-fidelity testing happens in Staging, Staging inevitably breaks more often, blocking other teams and causing further delays.

Reliability problems with duplicated environments

Reliability problems with duplicated dev environments had downstream effects: like a stream finding the path of least resistance, developers routed around the unreliable environments. As Connor noted, “There was a group of teams building very important product features that just eschewed development against preview environments entirely, and instead focused on unit tests. After unit tests, it was tested on Staging, and were required to use feature flags way more heavily. It made the data quality and reliability problems with moving to staging way worse.”

Business effects of poor reliability of previews

Unreliable preview environments don't only slow down the development lifecycle; they also affect business goals.

“The other big problem is cost, having isolated environments for our internal software that’s hundreds of microservices solely for the purpose of testing gets insanely expensive. We’re talking about replicating 800+ services just so a developer can test changes on 1-2 services. That’s a great deal of compute and memory.” Connor Braa · Software Engineering Manager, Brex

There are hidden costs too. When a team doesn’t feel it is working at its best, retention, job satisfaction, and performance suffer. Connor describes it as “developers not feeling like they have the tools to ship the level of quality of software that they want to ship.” With a single engineering hire costing six figures, it’s worth asking whether your team can afford to lose engineers because your best people are frustrated that they can’t ship code.

Signadot: testing on staging with isolation

Signadot creates isolated sandboxes within a shared cluster, so adopting it meant developers could experiment in Staging without conflicting with each other’s work. Only the ‘sandboxed’ services a developer wants to change run separately from the baseline Staging services. With a time-to-live (TTL) setting, sandboxes shut down after a set amount of time, so there are no large infrastructure bills for services no one is using.
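To make the sandbox model concrete, here is a minimal Python sketch of the two ideas in the preceding paragraph: a sandbox adds only its forked services on top of the shared baseline, and a TTL marks when it is due for teardown. The `Sandbox` class, `extra_workloads` and `reap_expired` functions, and the service names are hypothetical and for illustration only; this is not Signadot's API or sandbox spec format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


# Hypothetical model for illustration only; not Signadot's API or spec format.
@dataclass
class Sandbox:
    name: str
    forked_services: list[str]  # only the services the developer is changing
    created_at: datetime
    ttl: timedelta  # time-to-live: how long the sandbox may keep running

    def expired(self, now: datetime) -> bool:
        """True once the TTL has elapsed since the sandbox was created."""
        return now >= self.created_at + self.ttl


def extra_workloads(sandbox: Sandbox) -> int:
    """Workloads a sandbox adds on top of the shared baseline: just its forks."""
    return len(sandbox.forked_services)


def reap_expired(
    sandboxes: list[Sandbox], now: datetime
) -> tuple[list[Sandbox], list[Sandbox]]:
    """Split sandboxes into (still running, due for teardown) based on their TTL."""
    alive = [s for s in sandboxes if not s.expired(now)]
    expired = [s for s in sandboxes if s.expired(now)]
    return alive, expired


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
    boxes = [
        Sandbox("alice-payments", ["payments"], start, timedelta(hours=2)),
        Sandbox("bob-ledger", ["ledger", "notifications"], start, timedelta(hours=8)),
    ]
    now = start + timedelta(hours=3)
    alive, expired = reap_expired(boxes, now)
    print("still running:", [s.name for s in alive])  # ['bob-ledger']
    print("tear down:", [s.name for s in expired])  # ['alice-payments']
    print("extra workloads:", sum(extra_workloads(s) for s in alive))  # 2
```

The design point the sketch tries to capture is that a sandbox's footprint scales with the number of services it forks, not with the size of the cluster, and that forgotten sandboxes stop costing money once their TTL passes.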

The Culture of Staging

A key benefit of switching to Signadot was that it didn’t require Brex to overhaul its practices, yet it helped drive a cultural shift toward running both unit tests and integration tests in the ‘inner loop’, before the deploy process. As Connor put it, “Staging has always been a place where we have a bit higher standard. It needs to be working, and there are real business consequences when it breaks. Being able to run Signadot in Staging alleviates this pressure where no one wants to maintain a separate dev environment just for dev testing.”

Another benefit is the quality of the data. As Connor continued, “The data quality that we have on Staging and Production is so much higher in terms of being able to test stuff in a realistic way […] we could solve this [on developer environments] but it would require a huge cultural shift to prioritize realistic data on a replication environment. As a platform team changing everyone’s behavior isn’t a road we want to go down.”

Accelerating Testing: A Tale of Two Loops

In any shop with more than three engineers, code changes are tested in two loops. In the ‘inner loop’, developers change the code and test it themselves, so the person who wrote the code sees feedback right away. In the ‘outer loop’, changes are validated after the code has been reviewed and merged to “main”. Brex wants developers to find as many bugs as possible in their inner loop.

With Signadot, individual developers can manually create a sandbox and run highly accurate tests of their services on a shared cluster with all needed dependencies. Combined with unit tests, this catches far more problems before a pull request is submitted.

By the numbers: cost, velocity, and user satisfaction

Cost

With Signadot, only the services under test need to be replicated. With fully duplicated environments, Brex had to run 800+ services for each preview environment instead of a handful of forked services.

“On the margin, with the Signadot approach, 99.8% of the isolated environment’s infrastructure costs look wasteful. That percentage looks like an exaggeration, but it’s really not.” Connor Braa · Software Engineering Manager, Brex
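Connor’s 99.8% figure follows from the service counts quoted earlier. The sketch below assumes, as a simplification not stated in the source, that every service costs roughly the same to run, so the wasted share of a fully duplicated environment is just the fraction of services that are not under test. `wasted_fraction` is a hypothetical helper name.

```python
def wasted_fraction(total_services: int, services_under_test: int) -> float:
    """Share of a fully duplicated environment spent on services nobody is changing.

    Simplifying assumption: every service costs roughly the same to run.
    """
    if not 0 < services_under_test <= total_services:
        raise ValueError("services_under_test must be between 1 and total_services")
    return (total_services - services_under_test) / total_services


if __name__ == "__main__":
    # 800 services duplicated so a developer can test changes to 1-2 of them.
    for changed in (1, 2):
        share = wasted_fraction(800, changed)
        print(f"{changed} of 800 services under test: {share:.3%} of the copy is unchanged baseline")
```

For one or two services under test out of 800, this gives 99.875% and 99.75%, and the share only grows when more than 800 services are duplicated, which is consistent with the 99.8% figure.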

Velocity

Signadot was a key part of a significant acceleration at Brex. John Salem, Senior Software Engineer at Brex, described the change: “A typical deploy in our Signadot based tooling, end-to-end, is less than five minutes versus the 30-60 minute deploy times we would see with preview environments. Local sandboxes allow us to take that down to almost instantaneous testing.”
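John’s deploy times are broadly consistent with the “80% less time” headline figure. A minimal sketch of the arithmetic, treating five minutes as an upper bound for the Signadot-based deploy (the quote says “less than five minutes”); `time_saved` is a hypothetical helper name.

```python
def time_saved(before_minutes: float, after_minutes: float) -> float:
    """Fractional reduction in wall-clock time going from `before` to `after`."""
    if before_minutes <= 0:
        raise ValueError("before_minutes must be positive")
    return 1 - after_minutes / before_minutes


if __name__ == "__main__":
    # Preview-environment deploys took 30-60 minutes; Signadot-based ones take under 5.
    for before in (30, 60):
        print(f"{before} min -> 5 min: {time_saved(before, 5):.0%} less time")
    # prints "30 min -> 5 min: 83% less time" and "60 min -> 5 min: 92% less time"
```

Because five minutes is an upper bound, the actual reduction in deploy time is at least about 83%.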

Satisfaction

For platform engineers, focusing on the development team’s Customer Satisfaction (CSAT) scores is crucial. John Salem noted the improvement with Signadot: “Our Signadot based tooling had a CSAT that was 28 points higher than our older Platform Engineering tech.”

Conclusions: Signadot for Developer Velocity

Brex’s move to Signadot for preview environments marks a significant change in how it approaches development and testing, addressing the cost, reliability, and efficiency problems that plagued its previous system. By running isolated sandboxes within a shared Staging cluster, Brex streamlined testing and sharply reduced the overhead of maintaining many duplicated environments.

By enabling developers to test changes in a production-like environment without stepping on each other’s toes, Brex has effectively shortened the feedback loop for developers, thereby accelerating the development process. Ultimately, the adoption of Signadot at Brex represents a forward-thinking solution to the challenges of “shifting left” realistic testing, setting a benchmark for others in the industry to follow.