AIs, Assemble! | Introducing Agentic Consensus
How we made multiple AIs work together to cut our manual review time by 95%
At FreeStand, empathy is one of our core values. We are an emotional bunch: we whoop at every win and lament every loss. This ability to just feel is, I believe, what lets my co-founders sense the pains of today that might become the torments of tomorrow for our clients, and it has led FreeStand to develop some amazing solutions. One such recent discovery was the problem of consumers trying to claim multiple samples of our clients' brands from the same address - the address duplication problem.
We had basic checks in place to catch strictly duplicate addresses, but Indian consumers are crafty, and many bypassed them using what I call linguistic tricks.
In one such instance, consumers tried to claim samples by giving

7-A, ABC Street, City Name, State Name, Pincode

and

VII-A, ABC Street, City Name, State Name, Pincode

as their addresses.
Upon review, such addresses would have to be marked as duplicates, and we would dispatch a sample to only one of them. We receive thousands of sample claims daily, which creates a review hell for anyone verifying addresses before they are processed for delivery. As a startup, every resource at hand is super valuable, and time is the most precious of them all: every moment even one member spends not creating value incurs a significant cost on everyone else. At regular volumes, this review took about 8 hours a day; on high-volume days it could stretch to 12.
Since we are in the business of dealing with our clients' end consumers, we always make sure that no genuine address gets missed in this process of review and detection. Fearing that numerous such cases could occur, I was sceptical about taking human judgement out of the process.
Similarity Score
One of the core tenets of building things at FreeStand is iteration, and this problem was solved accordingly. It started when one of our developers and the CEO were catching up after work hours. The developer decided to build the first iteration of an enhanced review system by computing a similarity score for the reviewer. With vector embeddings to create the score and some UI/UX magic to streamline the review process, he brought the time down to about 5 hours on regular days, though high-volume days still ran to 8 hours.
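To make the idea concrete, here is a minimal sketch of what such a similarity score can look like, assuming OpenAI's text-embedding-3-small model and plain cosine similarity; the post doesn't say which embedding model, scoring, or thresholds FreeStand actually used.

```python
# A minimal sketch of an embedding-based similarity score. The model
# choice (text-embedding-3-small) and cosine similarity are assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def similarity(addr_a: str, addr_b: str) -> float:
    """Cosine similarity between two address embeddings, in [-1, 1]."""
    a, b = embed(addr_a), embed(addr_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The Roman-numeral trick from earlier likely scores high despite the
# different surface strings, so the reviewer sees the pair flagged first.
print(similarity(
    "7-A, ABC Street, City Name, State Name, Pincode",
    "VII-A, ABC Street, City Name, State Name, Pincode",
))
```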
The results spoke for themselves and my earlier reservations about algorithmic intervention in the judgement process were lifted.
LLM Opinion
A follow-up conversation with my co-founders (who had also participated in a few review sessions) and other regular reviewers made it clear to me just how excruciating the experience still was. The existing solution was evidently due for an upgrade. The developer who had built the first iteration was occupied with another crucial build (a follow-up piece on that is coming soon!), so the second iteration was taken over by someone else.
We started discussing possible ways to solve it, and he suggested prompting an LLM to make the judgement call on whether address pairs with mid-range similarity scores were duplicates; these pairs made up the bulk of the review queue. The proposition intrigued me, and I told him to run a test on a subset of the data. The experiment involved running several variations of prompts against different OpenAI models, tracking the accuracy of each prompt-model combination while keeping cost and latency in mind. The results showed great promise, and we decided to implement it.
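Below is a hedged sketch of this second iteration. The prompt wording, the gpt-4o-mini default, and the mid-range band thresholds are illustrative assumptions of mine, not the values from the experiment.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Prompt wording is assumed, not FreeStand's actual prompt.
PROMPT = (
    "Two delivery addresses follow. Decide whether they refer to the same "
    "household, watching for tricks like house numbers rewritten as Roman "
    "numerals or spelled-out words. Answer with exactly DUPLICATE or DISTINCT.\n"
    "Address 1: {a}\nAddress 2: {b}"
)

def needs_llm(score: float, low: float = 0.45, high: float = 0.90) -> bool:
    """Only the ambiguous mid-similarity band goes to the LLM (thresholds assumed)."""
    return low <= score <= high

def llm_verdict(addr_a: str, addr_b: str, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # keep the judgement call as deterministic as possible
        messages=[{"role": "user", "content": PROMPT.format(a=addr_a, b=addr_b)}],
    )
    return resp.choices[0].message.content.strip().upper()
```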
This implementation further cut our review time by HALF! We were down to 2-2.5 hours from the original 8-12. Another win in the process, another moment of kaafi khushi (sheer joy) for us.
Agentic Consensus
But I felt more could be done. More importantly, I felt it was one iteration shy of achieving its true potential.
Amid the constant stream of new model releases and the buzz of election season in Delhi, it struck me—we could further reduce ambiguity in the review process by having multiple LLMs weigh in on dubious address duplicates.
A consensus mechanism would be reliable enough to reject duplicate addresses on its own. So we decided to build an "Agentic Consensus" layer on top of the system, using three different LLMs: xAI's Grok, OpenAI's GPT-4o mini, and Anthropic's Claude 3.5. Only when all three held the same opinion on whether an address was a duplicate would our review system accept that result; if even one of them disagreed, the address stayed in the human review queue. And again, with a little UI/UX magic on top, we brought our review time down to under 30 minutes. A 95% reduction.
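Here is a minimal sketch of what that consensus layer can look like, with one judge per provider through their public APIs. The exact model identifiers (gpt-4o-mini, grok-beta, claude-3-5-sonnet-20240620), the prompt, and the escalation label are my assumptions; the post doesn't disclose FreeStand's exact configuration.

```python
import os

from anthropic import Anthropic
from openai import OpenAI

# Prompt wording is assumed, not FreeStand's actual prompt.
PROMPT = (
    "Do these two delivery addresses refer to the same household? "
    "Watch for house numbers disguised as Roman numerals or words. "
    "Answer with exactly DUPLICATE or DISTINCT.\n"
    "Address 1: {a}\nAddress 2: {b}"
)

openai_client = OpenAI()  # OPENAI_API_KEY
grok_client = OpenAI(  # xAI exposes an OpenAI-compatible endpoint
    base_url="https://api.x.ai/v1", api_key=os.environ["XAI_API_KEY"])
claude_client = Anthropic()  # ANTHROPIC_API_KEY

def ask_gpt(q: str) -> str:
    r = openai_client.chat.completions.create(
        model="gpt-4o-mini", temperature=0,
        messages=[{"role": "user", "content": q}])
    return r.choices[0].message.content.strip().upper()

def ask_grok(q: str) -> str:
    r = grok_client.chat.completions.create(
        model="grok-beta", temperature=0,
        messages=[{"role": "user", "content": q}])
    return r.choices[0].message.content.strip().upper()

def ask_claude(q: str) -> str:
    r = claude_client.messages.create(
        model="claude-3-5-sonnet-20240620", max_tokens=8,
        messages=[{"role": "user", "content": q}])
    return r.content[0].text.strip().upper()

def agentic_consensus(addr_a: str, addr_b: str) -> str:
    """Accept a verdict only if all three models agree; else escalate."""
    q = PROMPT.format(a=addr_a, b=addr_b)
    verdicts = {ask(q) for ask in (ask_gpt, ask_grok, ask_claude)}
    return verdicts.pop() if len(verdicts) == 1 else "NEEDS_HUMAN_REVIEW"
```

Unanimity is a deliberately conservative rule: three independent models disagreeing is treated as a signal that a human should look, which is what keeps a genuine address from being wrongly rejected.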
I am told that the reviewers now wake up a lot happier, and to days full of possibilities.