Yellow.ai and Signadot: Pioneering Parallel Feature Development for Faster Releases
When you’ve got over 100 developers and a single testing environment, you’re bound to run into conflicts. At Yellow.ai, conflicts and failed deployments to QA were adding significantly to the time it took to ship code. With Signadot’s model of request isolation, developers could run highly accurate tests and exploratory work on a shared cluster without impacting each other’s work. The result was a 2-3x improvement in developer productivity.
Yellow.ai is a conversational AI platform, enabling enterprises to unlock business potential at scale. The platform is trusted across 85+ countries by 1000+ enterprises, including Domino’s, Sephora, Hyundai, MG Motors, and Biogen International. This is the story of how they came to adopt Signadot.
The problem: many developers sharing Staging
In conversation with Lokeshwaran, a senior Platform Engineer at Yellow, we learned about the state of staging and integration testing at Yellow.ai before Signadot:
“We had a single staging environment which over 100 devs used to push code for testing, a lot of unstable code would get pushed, and it created a lot of friction while testing new features. It was very time consuming and annoying!”
One thing we don’t discuss enough is how failed pushes to Staging add stress for the developer, especially when that environment is shared. (Your team may not call the environment right before Production “Staging,” but we all have one, whether it’s called QA, Test, or something else.)
Picture our product engineer: she’s just put in the time to develop a feature, write unit tests, and make sure it meets whatever contract testing is in place. All of this happened at her own pace, likely on her own laptop. She’s confident the work is ready to release, so she submits her PR and moves on. A while later, after she’s started a new task, she gets a report that her branch broke Staging. She may have no indication of how Staging is broken or what the logs are showing, since it’s likely another engineer who has to document and describe the problem to her. Now the problem is not only harder to find and fix, but other developers are waiting to push to Staging. With less information, out-of-band work, and a time crunch, fixing a simple bug becomes a lot more stressful.
How did this affect Yellow.ai’s business? By hurting overall velocity. As Lokeshwaran describes:
“These conflicts and failed merges meant the time taken to push the code from QA to production was significantly longer than expected, leading to much less developer productivity and an increase in overall feature development time.”
One attempted fix: per-developer clusters
While the problem of conflicts over Staging is nearly universal in large microservice architectures, the possible solutions are quite varied. One that Lokeshwaran and his team tried was creating a new virtual cluster for each developer:
“We were trying out having virtual clusters per developer, but maintaining services and database schemas in sync was the biggest challenge, especially in a non-collaborative environment.”
This points to the biggest problem with per-developer clusters at large scale: keeping all of them correctly up to date. Isolation is their big advantage, but in a fast-paced environment each cluster may require daily updates, which adds cost and effort and slows developers down. The work done to update one isolated cluster isn’t shared with the others, so every update is multiplied across the fleet. Another concern is cost, as Lokeshwaran continues:
“We were looking at building a virtual cluster using vcluster per developer, but it ended up costing quite a lot.”
This brings up the second drawback: if you only run these clusters when needed, your developers will be waiting for them to start up at least once a day. If they run continuously, infrastructure costs become an issue, which is especially galling given that the clusters sit completely idle the vast majority of the time!
“We were looking out for solutions to improve developer productivity,” says Lokeshwaran, “and that’s where we came across Signadot and evaluated the product.”
Choosing Signadot for a shared cluster with developer isolation
Signadot’s request isolation means developers can work on branches with modified versions of one or more services while still sending requests to services in a shared cluster. Other developers can use and test against the same cluster without impacting each other. This lets developers test on a shared cluster earlier, an effective way of ‘shifting left’ with integration testing. Cost was a motivating factor for Lokeshwaran and the rest of the Platform Engineering team at Yellow.ai:
“[Signadot is] cost effective because of the time-to-live (TTL) of sandboxes, and we can deploy only the application that is required to be tested as a new deployment, while all the other microservices need not be affected.”
Two critical features there: limited deployments and TTL. Since request isolation means you only need to deploy the services you actually need to fork, deployments are smaller and faster. And with a TTL setting on sandboxes, there’s no need to run even these smaller deployments for longer than necessary, cutting infrastructure costs.
Signadot makes it easy to get started
“We found Signadot promising because it is very simple to use and adopt,” writes Lokeshwaran. “Setup and implementation was quite straightforward. We had to install the Signadot Operator, and for requests to be routed to a sandbox, we would need to add an annotation to that deployment, which would in turn add a sidecar to route the request to the appropriate service.”
In order for request-based isolation to work, there needs to be a system in place for context propagation, so that requests meant for a particular sandbox are routed there, while requests that expect the ‘baseline’ version of a service go to the service in the shared cluster. For most teams, OpenTelemetry is the best project for adding context propagation, and Yellow.ai was no exception:
“Context propagation was a prerequisite for Signadot to work, but this was no problem since we had adopted OpenTelemetry recently. Signadot worked very well for us without any hassle for all the HTTP use cases, which cover more than 90% of the platform currently.”
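To make that prerequisite concrete, here is a minimal sketch of what propagation looks like for an HTTP service. It uses Python with Flask, requests, and the opentelemetry-api package; the service and endpoint names are made up for illustration, and this is not Yellow.ai’s actual code. The idea is simply that whatever propagation headers arrive on an incoming request (trace context, baggage, and with them the routing information Signadot relies on) get re-injected into every outgoing call. In practice, OpenTelemetry’s auto-instrumentation does this for you, which is why having OTel already in place made adoption painless.

```python
# Minimal sketch: propagate incoming context headers to downstream calls.
# Assumes Flask, requests, and opentelemetry-api; names are illustrative.
import requests
from flask import Flask, request
from opentelemetry import context
from opentelemetry.propagate import extract, inject

app = Flask(__name__)


@app.route("/orders")
def orders():
    # Pull the propagation headers (trace context, baggage) off the
    # incoming request into an OpenTelemetry context.
    ctx = extract(dict(request.headers))
    token = context.attach(ctx)
    try:
        # Re-inject the same context into the outgoing call, so routing
        # information keeps flowing to downstream services.
        headers = {}
        inject(headers)
        resp = requests.get("http://inventory.internal/stock", headers=headers)
        return resp.json()
    finally:
        context.detach(token)
```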
With this basic use case handled, the Yellow.ai platform engineering team was able to move on to some more advanced flows within their cluster.
Non-HTTP requests: message isolation with Kafka and RabbitMQ queues
If your Test/Staging cluster includes a queue like Kafka or RabbitMQ, there are specific requirements to ensure that context propagation works properly for events passed through it. Only slightly more work was required to add support for requests that weren’t HTTP or gRPC, as Lokeshwaran explains:
“Only HTTP and gRPC requests are supported since context propagation is supported from these protocols only. We had to write monkey patches of client libraries for Kafka and RabbitMQ to push the routing key in the message header and some logic from consumer side to route it into sandbox/baseline”
The work described here is making sure that consumers of a queue take only the messages intended for them, based on the Signadot routing key.
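As a rough illustration of that pattern (not Yellow.ai’s actual patch), here is a sketch using kafka-python. The header name sd-routing-key, the topic, and the consumer-group names are assumptions for the example: the producer copies the routing key it received with the incoming request into the message headers, a sandboxed consumer handles only messages tagged with its own key, and the baseline consumer skips those.

```python
# Sketch of routing-key propagation through Kafka with kafka-python.
# The header name, topic, and group ids are illustrative assumptions.
from typing import Optional

from kafka import KafkaConsumer, KafkaProducer

ROUTING_HEADER = "sd-routing-key"  # assumed name for the propagated routing key


def publish(producer: KafkaProducer, payload: bytes, routing_key: Optional[str]) -> None:
    # Producer side: carry the routing key from the incoming request
    # context (if any) into the message headers so it survives the async hop.
    headers = [(ROUTING_HEADER, routing_key.encode())] if routing_key else []
    producer.send("orders", value=payload, headers=headers)


def consume(sandbox_routing_key: Optional[str]) -> None:
    # Consumer side: the sandboxed consumer runs in its own consumer group so
    # both it and the baseline consumer see every message, then each decides
    # whether a given message is meant for it.
    group = f"orders-{sandbox_routing_key}" if sandbox_routing_key else "orders"
    consumer = KafkaConsumer("orders", bootstrap_servers="kafka:9092", group_id=group)
    for msg in consumer:
        raw = dict(msg.headers or []).get(ROUTING_HEADER)
        key = raw.decode() if raw else None
        if sandbox_routing_key is not None and key != sandbox_routing_key:
            continue  # tagged for a different sandbox (or for baseline); skip
        if sandbox_routing_key is None and key is not None:
            continue  # baseline consumer skips sandbox-tagged messages
        handle(msg.value)


def handle(payload: bytes) -> None:
    ...  # application-specific processing
```

The consumer-group split is one design choice among several; the essential point, as Lokeshwaran describes, is that routing decisions made for HTTP requests survive the hop through the queue.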
What’s next for Signadot at Yellow.ai: more complex transformations
Asked what’s next for Signadot adoption, Lokeshwaran says he looks forward to adding the last few flows that aren’t currently covered:
“There are still some cases where Signadot can’t be used, for example DB migrations, schema changes, and other stateful changes to dependencies. We’ll be working to build some solutions using Signadot in the near future.”
We can’t wait to see what the Yellow.ai Platform Engineers do with Signadot next!
Results: for Yellow.ai, a 2-3x improvement in developer velocity
Adoption of the Signadot workflow, creating sandboxes in a shared cluster, has been a success at Yellow.ai. As Lokeshwaran put it:
“[Signadot] has boosted team productivity and delivered a significant ‘wow factor’ for those who have used it.”
Local sandboxes, where developers host the forked versions of services on their own machines, are seeing good use as well:
“Local sandboxes are a great feature. It improves local testing without devs having to run other services/api-gateways locally.”
It’s been a great experience for developers, with impressive results:
“Signadot has definitely improved developer productivity by 2-3x, since sandboxes can’t be polluted by multiple people, and testing is seamless!”
If you’d like to learn how Signadot can help you, or you want to hear others’ experiences, join our Slack community to find out more.