The challenge of real-time AI: How to drive down latency and cost
Learn about reducing latency, managing costs, and achieving real-time access, as discussed by Dr. Sharon Zhou, co-founder of Lamini, at the Real-Time Data Summit.
The value of artificial intelligence (AI) is significantly enhanced when it can operate in real time, enabling faster, more impactful decisions and actions. But it’s tough for large language models (LLMs) to be real time because of the computational load they demand and the cost involved. The good news is that latency, cost, and real-time access are interrelated, and finding ways to reduce latency also reduces costs and makes real-time access easier to achieve.
That’s according to Dr. Sharon Zhou, co-founder and CEO of Lamini, a Menlo Park, California, company that builds an integrated LLM fine-tuning and inference engine for enterprises. Previously, she was on the computer science faculty at Stanford, where she led a research group in generative AI (GenAI) and earned her PhD in GenAI under Dr. Andrew Ng. She spoke at the Real-Time Data Summit, a virtual event intended to advance the market and equip developers for the rapid growth of real-time data and its use in AI, about how to address this latency, cost, and real-time access conundrum.
The key challenges of real time
Zhou cites three main areas that make it difficult to deliver AI-driven applications in real time:
Computation
It’s no secret that AI applications, particularly LLMs, require an enormous amount of computation. And we’re talking on the order of billions of computations, just for a single question. That’s a big change from the machine learning (ML) of the past. "Even a million parameters was enormous and almost unthinkable, and that was just a few years ago," Zhou says.
Take a seemingly simple example. "Even when you ask [an LLM], 'Hi, what's up?' it's doing all the computation on the word 'Hi,' then 'what's,' and then 'up,'" Zhou says. "It's just very expensive to get through all of that."
But that’s just the start. Not only does the LLM need to interpret the input, it needs to create output, and that’s just as complex. "For example, 'Hi, what's up? I am good,'" Zhou explains. "It takes time for it to read, and it takes time for it to write. The bigger the model, the more computations there are. For a 100-billion-parameter model, which the original GPT-3 and ChatGPT-like models were, that is very expensive."
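As a rough back-of-the-envelope illustration of why that is, consider what a single short exchange demands of a 100-billion-parameter model. The 2-FLOPs-per-parameter-per-token figure below is a common rule of thumb for transformer inference, not a number from Zhou's talk:

```python
# Rough estimate of inference compute for a short exchange.
# Assumption: ~2 floating-point operations per parameter per generated token,
# a common rule of thumb for transformer inference (not from the talk).

PARAMS = 100e9        # 100-billion-parameter model, as in the GPT-3 example
FLOPS_PER_PARAM = 2   # assumed cost per parameter per token

def inference_flops(prompt_tokens: int, output_tokens: int) -> float:
    """Approximate FLOPs to read a prompt and generate a reply."""
    return (prompt_tokens + output_tokens) * PARAMS * FLOPS_PER_PARAM

# "Hi, what's up?" answered with "I am good": a handful of tokens each way
print(f"{inference_flops(prompt_tokens=5, output_tokens=4):.1e} FLOPs")
# -> roughly 1.8e+12 FLOPs for even a trivial exchange
```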
Now imagine a real-world application where you're not just saying, "Hi, what's up," but giving the LLM additional data. "Let's say you want to detect whether their sign-in was fraudulent or not," Zhou says. "That's a lot of different user data you can pass through into the model. It needs to read all of that and then produce an answer for you."
Actual applications also require more complex output. "You want it to give an explanation for why it said what it said, whether it's yes or no," Zhou explains.
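A minimal sketch of what that fraud-check call might look like in practice follows; the call_llm helper, the prompt wording, and the event fields are illustrative assumptions, not anything Lamini-specific or taken from the talk:

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever LLM endpoint you actually use."""
    raise NotImplementedError

def check_sign_in(event: dict) -> str:
    # The model must read every token of this context before answering,
    # then write a verdict plus an explanation; both sides add latency and cost.
    prompt = (
        "Here is a sign-in event as JSON:\n"
        f"{json.dumps(event, indent=2)}\n\n"
        "Is this sign-in fraudulent? Answer yes or no, then explain your reasoning."
    )
    return call_llm(prompt)

# Illustrative user data passed through to the model
event = {
    "user_id": "u-123",
    "ip_address": "203.0.113.7",
    "device": "unrecognized Android phone",
    "location": "different country from last sign-in",
    "failed_attempts_last_hour": 4,
}
# verdict = check_sign_in(event)
```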
Cost
There are three main drivers that make real-time LLMs so expensive. "One is that just running the model once is expensive, certainly more expensive than pinging a website," Zhou says. It might cost just a fraction of a cent, but, as the rough sketch below illustrates, that adds up as you feed more data to the model.
Plus, computational loads are so heavy that completely new, reliable infrastructures need to be built to handle them, Zhou says. "It’s a really hard and very complex software engineering problem."
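To see how that first driver, the per-request cost, compounds at scale, here is a rough sketch; the token prices and volumes are made-up placeholders, not real vendor pricing or figures from the talk:

```python
# All numbers below are illustrative placeholders, not real vendor pricing.
PRICE_PER_1K_INPUT_TOKENS = 0.003   # dollars, assumed
PRICE_PER_1K_OUTPUT_TOKENS = 0.006  # dollars, assumed

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one LLM call under the assumed per-token pricing."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT_TOKENS \
         + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT_TOKENS

# One fraud check: a page or so of user data in, a short verdict with explanation out
per_call = request_cost(input_tokens=500, output_tokens=100)
print(f"${per_call:.4f} per call")                         # a fraction of a cent
print(f"${per_call * 1_000_000:,.0f} per million calls")   # but it adds up fast
```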