
Lessons Learned: I Built a Successful Generative AI Chatbot that Failed

By Wayan Vota on May 16, 2024


Grab your popcorn! This is gonna be a great crash and burn story about Generative AI success and failure in 2024!

I recently built a GenAI chatbot that successfully parsed the US Foreign Affairs Manual for the second time (more on that later), yet this time it failed to gain market traction and I shut it down.

In the spirit of Fail Festivals that I still run, here is my story in all its pain and glory.

Core Problem

The Foreign Affairs Manual (FAM) and associated Handbooks (FAHs) are a single, comprehensive, and authoritative source for the State Department’s organization structures, policies, and procedures. They are also an immense list of rules and regulations that print out to over 10,000 pages. Yes, really.

The FAM/FAH website has a search function, though it’s hard to use. You can get false positives with any search, and you may want to read through sections manually rather than searching for an answer.

This can leave Foreign Service staff in the State Department, USAID, and elsewhere lost. They then overwhelm EXOs with basic questions, errors can happen even with good intentions, and money can be wasted on trying to comply. Worse, rules can be broken and careers damaged through plain ignorance and confusion.

My LLM Inspiration

Large language models (LLMs) that power Generative AI solutions intuitively understand large bodies of text. They are skilled in presenting dense government rules or verbose corporate regulations in easily understood human-readable forms. This is what GenAI does best.

I saw the opportunity to generate natural language in 2020. Fast forward to January 2023, and a friend of mine asked me how I could help her understand US government regulations. I immediately started designing a US regulation chatbot with my previous employer that would live behind government firewalls, in a data-safe environment.

I move fast.

By April 2023 we had an MVP. In mid-2023 my GenAI Retrieval Augmented Generation (RAG) solution was shown to be more accurate than Google Search for the Federal Acquisition Regulation, faster than the State Department’s own search engine for the Foreign Affairs Manual, and more comprehensive than either public GenAI tool on IRS’ Internal Revenue Manuals.

My success did not go unnoticed. My employer started to sell this idea to every government entity. By the time I left in September 2023, they’d sold over $30 million in new contracts to Federal and state government agencies based on my design and had countless existing contracts re-written to include this tool at an additional price.

Do you want similar success with AI? Hire me in October!

That solution is still behind a corporate paywall. I wanted to replicate its success in a publicly available tool using my own personal funds and friends. I dreamed of developing a FAM Policy Chatbot that would generate income for the creators and happiness for the users.

I did not dream of the many challenges and obstacles I’d face building a tool without corporate resources. But I’m getting ahead of myself.

Generative AI Solution

My initial idea was to use existing ChatGPT Plus resources, where you can train ChatGPT to parse data sets according to your instructions. Previously, I built GenAI chatbots for the FAR, ADS, and AIDAR. Oddly, while you can download the FAR as one file (and the FAR is 4X the size of the Bible), you have to download each FAM/FAH separately, which adds up to a 10,000 page PDF.

ChatGPT Plus has a current file size limitation that does not allow one to train it on 10,000 pages, no matter how you cut up the files. Stymied by ChatGPT, I decided to invest my own funds to build a FAM Policy Chatbot using open-source software.

Using my personal initiative, resources, and time, I hired two great Zambian software developers to build a custom Generative AI Large Language Model trained on the FAM.

I spent $1,500 of my personal funds to build the FAM Policy Chatbot using Llama, an open-source large language model, and a vector database to do Retrieval Augmented Generation. We uploaded the entire 10,000 page FAM. We had wanted to combine the FAM, FAH, and ADS, but it was too much data formatted poorly.

FAM Policy Chatbot worked!!

You could ask the FAM Policy Chatbot questions about the State Department rules and procedures, and get accurate, understandable answers with links back to the source documents to check the LLM’s responses.
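For readers who want to see the shape of that retrieve-then-answer flow, here is a minimal sketch. It is not the code my developers wrote: the passages and URLs are invented placeholders, the bag-of-words `embed` function stands in for a real embedding model, and the in-memory list stands in for a vector database. The prompt returned by `build_prompt` is what you would hand to an LLM such as Llama.

```python
import math
import re
from collections import Counter

# Placeholder passages: the URLs and wording are invented for illustration,
# not real FAM text.
PASSAGES = [
    {"source": "https://example.org/fam/section-a",
     "text": "Employees must submit travel vouchers within five days of completing official travel."},
    {"source": "https://example.org/fam/section-b",
     "text": "Annual leave requests require supervisor approval before the leave begins."},
    {"source": "https://example.org/fam/section-c",
     "text": "Official travel must be authorized in writing before any expenses are incurred."},
]

def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, passages, top_k=2):
    """Return up to top_k (relevance_score, passage) pairs, best match first."""
    query = embed(question)
    scored = [(cosine(query, embed(p["text"])), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(score, p) for score, p in scored if score > 0][:top_k]

def build_prompt(question, hits):
    """Assemble the text sent to the LLM, with each source link kept visible."""
    lines = ["Answer using only the sources below and cite their links."]
    for i, (score, passage) in enumerate(hits, start=1):
        lines.append(f"[{i}] {passage['source']} (relevance {score:.2f}): {passage['text']}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)

if __name__ == "__main__":
    question = "When do I submit travel vouchers?"
    print(build_prompt(question, retrieve(question, PASSAGES)))
```

Because each retrieved passage carries its source link into the prompt, the answer can point users back to the original section so they can check the LLM's response.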

Now it didn’t have the full functionality of my original policy bot. It didn’t give a relevance score for each link, it was wonky in some answers, and there wasn’t a clear way for users to give training feedback. Still, it was usable, and I received very positive feedback.

Chatbot Market Failure

The technology challenges were annoying but not a real problem. We could correct and adapt as we improved the software and user interface. Regardless, users loved the chatbot experience.

I received dozens of emails shortly after launch praising the FAM Policy Chatbot. It was helping staff subject to the FAM understand its rules. It was relieving pressure on EXOs to answer routine queries. It showed that creating a tool to parse the FAM was possible with easily accessible tools.

And yet, no one wanted to pay for it.

When I asked how much users would be willing to pay for the FAM Policy Chatbot, I kept hearing two very similar responses:

  • Why are you building this? Shouldn’t USG have this already?
  • Why should I pay for it? USG should buy this for us.

The FAM Policy Chatbot was providing a needed solution, and it was pleasing its core customers. However, there wasn’t a commercial market for its services. No one was paying to use it, while I was paying $30 a day to run it. A clear market failure. Hence, I shut it down when I got my first $1,500 server bill (ouch!).
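The arithmetic behind that decision is simple enough to sketch. The $30 a day and the $1,500 bill are my real numbers; the $20 monthly subscription price and the 30-day month are assumptions for illustration only, since I never settled on a price.

```python
import math

def days_covered(bill, daily_cost):
    """How many days of hosting a bill of this size represents."""
    return bill / daily_cost

def break_even_subscribers(daily_cost, monthly_price, days_per_month=30):
    """Paying users needed each month just to cover hosting (ignores build cost)."""
    return math.ceil(daily_cost * days_per_month / monthly_price)

if __name__ == "__main__":
    print(days_covered(1500, 30))              # days of hosting behind the first bill
    print(break_even_subscribers(30, 20))      # assumes a hypothetical $20/month price
```

Under those assumptions the chatbot needed a few dozen paying subscribers just to keep the servers on, and I had none.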

GenAI Lessons Learned

Building the FAM Policy Chatbot was a fun learning experience for me. It was costly in time, money, and brain space, but worth it to (re)learn a few key lessons that are often present in Fail Festivals.

1. Building an LLM is easy.

Yes, really! I do not have amazing software development skills. In fact, I cannot code my way out of a paper bag. However, if you have a clear business need and defined use cases, a competent software developer can create an LLM that answers your need. Or if your need is simple, you can quickly create a custom ChatGPT app for $20 a month.

2. Cost is negligible.

Don’t get me wrong. Investing $3,000 of my own money in a failure hurts. I haven’t told my wife yet. When she reads this, I’ll have some explaining to do. But investing $3,000 and 3 FTE for a month to test if an LLM can solve a multi-million-dollar business problem is a rounding error in opportunity cost for an international NGO. Every firm should be investing in their own LLMs in 2024 or buying services like these right now.

3. Market fit is hard.

I can easily show an Internal Rate of Return for a USAID implementer to invest in LLMs to solve key business needs. However, creating a business focused on retail customers is very hard, especially when they (USG staff) expect that State or USAID should provide a functioning FAM/FAH search tool or internal LLM that could parse regulations for them.

4. Personal growth is harder.

I firmly believe that everyone recognizes regulations like the FAM/FAH are really confusing, and there needs to be a better way to understand them. Anyone could build an LLM in 2024. The tools are there. Someone should build a functioning search tool for the FAM/FAH and deliver us from regulatory pain. Yet no one is trying to solve this problem (that I know of) – except me.

I created not one, but two FAM Policy Chatbots. One is a commercial success that is now replicated across Federal and state governments. The other is a market failure that I killed off on Monday. And today I share my failure story with you.

We should do the hard things. Not because they are easy, but because they are hard. Doing hard things, and occasionally failing at those hard things, then talking about our failure, is even harder. Yet it’s the only way we grow.

Since you’ve read this far, consider hiring me in October! I can bring this entrepreneurial spirit to your organization too.

Filed Under: Featured, Solutions

Written by
Wayan Vota co-founded ICTworks. He also co-founded Technology Salon, MERL Tech, ICTforAg, ICT4Djobs, ICT4Drinks, JadedAid, Kurante, OLPC News and a few other things. Opinions expressed here are his own and do not reflect the position of his employer, any of its entities, or any ICTWorks sponsor.

7 Comments to “Lessons Learned: I Built a Successful Generative AI Chatbot that Failed”

  1. Kurt Moses says:

    Wayan, considerable thanks for your step by step approach and doing what you do best–illustrating how to see both success and failure. The Fail Fest Idea remains so powerful as long as you keep up the momentum. Come to me anytime for either a reference or an “atta boy.” Keep on keeping on.
    PS. My first recommendation in a recent journal article is that all the key development agencies should be combining their masses of data in specific developing regions and beginning massive linked searches. That is where corporate scale can really reap rewards. Implementation of the findings will always be the crucible, but the research takes a corporate approach or a massive shared approach, funded by a corporate implementer willing to use its scale to fund it.

  2. Matt Berg says:

    “I haven’t told my wife yet” – lol!

    • Wayan Vota says:

      Yes, this was my business idea for the year. Unlike FailFestival, JadedAid, and KinderPerfect, and like many other efforts, it was not a success. I treasure her patience with me.

  3. tahzeeb says:

    I would love to join your team; you’re doing a fantastic job!

  4. Gawain says:

    Amazing, as usual. $3000 doesn’t include your time and expertise, but even with that included, it seems like an amazing deal if budget holders could only see it. This technology is absolutely critical to any “localization” agenda to help smaller, locally- or nationally-based, non-English speaking actors understand and access US support. If USAID were serious about localization, they would stand you up to provide this for that purpose.

  5. Glen says:

    So what would be a next step here? It doesn’t sound like you failed at GenAI, but rather at a business model challenge. That was always going to be a heavy lift with the FAM, as those who are good at translating it see that as a differentiator for them that keeps “competition” out of the way. And for them, hallucinations would conceivably have serious repercussions. There is a very real challenge around GenAI business models that probably should be part of the discussion from the beginning of rolling an LLM solution out: why would someone pay $3000 for it? How do you get them to see value?

    I loved this idea from the beginning, and glad we get to learn from you trying it. I just wish there was someone (cough…USG) who could see the value in deploying it…

  6. Olivier says:

    Thanks for sharing your experiences! Having built many digital platforms for the WASH space, the business model (or cost recovery model) is definitely much harder than the tech side.

    It’s ironic that Microsoft will reap the market benefits when it’s right as they integrate Azure AI tools and govs will see the value and finally pay for it.

    My experience with WASH AI so far (which is a much larger system than FAM it seems(?), with over 500k chunk records and a multi-agent RAG backend using 5 different open source models for aggregation and inference) .. is to continually rebuild the stack using cheaper and more effective models, only fine tune when needed and don’t go public until it’s funded. Finding the funders is the hard part. Governments are too slow on the AI train, maybe their innovation departments but it’ll take years before it reaches the departments that matter. NGOs might pay but only the mid size ones if you can find a champion with decision making power, then finally the consulting firms which are probably a good middle.

    It’s true building LLM apps is becoming easier by the day.. helping users see value is key.