If you watched the StoreAI launch in May, you’ll have seen the demo of Andreas talking to himself. More accurately, talking to an AI-driven avatar of himself on a big screen. In this example, he asked the avatar for advice on what phone charger to buy. The AI recommended suitable products, and then gave him directions to find the one he wanted.
It almost feels like something out of science fiction, but it was actually surprisingly simple to construct.
One of the things I love about working with this new generation of AI tools is what you can do by combining multiple tools and how easily they all fit together. (Slight digression here – a couple of months ago, I made a short promotional video in the style of a children’s storybook by getting ChatGPT to write the script, prompting MidJourney to generate images, and then using text-to-speech from ElevenLabs to create the narration. It was fun, easy, and – unbelievably – took less than two hours. But let’s get back to StoreAI...)
There were three components to AI Andreas: the content, the voice and the image. Let’s take a quick look at each of those and see how we did it.
- The content is generated using a combination of ChatGPT and Pointr running on Ombori Grid. The bulk of the conversation works just like a text-based chat, except that instead of typing, Andreas could just speak to the AI – no different to using Siri, Alexa, or Google. The text is supplemented with an animated map generated using Pointr, and an image and QR code pulled from the product database held in Grid Products.
- The next step is to turn the text into audio. To do that, we used ElevenLabs, one of my favorite AI text-to-speech tools. We cloned Andreas’s voice using a few short samples, which lets us turn the output from OpenAI into audio that sounds like Andreas talking. It’s not quite real time, as you may have noticed, but it’s close. There’s a delay of a second or so while the audio is generated, but as the technology improves, that should come down.
- Finally, we get AI Andreas to talk. There are many tools available for animating a photorealistic image and real-time lip syncing based on text or audio, and they’re getting better all the time. We used D-ID for this demo, because it does a great job of animating a face from just a single portrait, which was all we needed.
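To make the first step concrete, here’s a rough sketch of how a shopper’s transcribed question might be wrapped in a chat-completion request, with stock data from the product database injected as context. The payload shape follows OpenAI’s chat API; the system prompt, product fields, and model name are illustrative assumptions, not details from the actual demo.

```python
# Sketch of step one: ground the AI's answers in the store's stock list.
# The prompt wording and product fields below are invented for illustration.

def build_chat_request(question: str, products: list[dict]) -> dict:
    """Assemble a chat-completion payload with product context."""
    catalog = "\n".join(f"- {p['name']} (aisle {p['aisle']})" for p in products)
    return {
        "model": "gpt-4",  # placeholder model name
        "messages": [
            {"role": "system",
             "content": "You are a store assistant. Recommend only items "
                        "from this stock list:\n" + catalog},
            {"role": "user", "content": question},
        ],
    }

req = build_chat_request(
    "Which phone charger should I buy?",
    [{"name": "65W USB-C charger", "aisle": 7}],
)
```

The speech side is just the usual voice-assistant loop: a speech-to-text layer transcribes the question before it reaches this builder, the same way Siri or Alexa would.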
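For the second step, a call to ElevenLabs’ text-to-speech REST API might be assembled like this. The endpoint shape follows ElevenLabs’ public API, but the voice ID, API key, and model ID here are placeholders, and the demo’s actual integration code isn’t shown in this post.

```python
# Sketch of step two: turn the AI's reply into audio in a cloned voice.
# VOICE_ID and the API key are placeholders, not values from the demo.

ELEVENLABS_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"

def build_tts_request(text: str, voice_id: str, api_key: str) -> dict:
    """Assemble the HTTP request for a text-to-speech call."""
    return {
        "url": ELEVENLABS_URL.format(voice_id=voice_id),
        "headers": {
            "xi-api-key": api_key,           # account API key
            "Content-Type": "application/json",
        },
        "json": {
            "text": text,                    # the reply to vocalize
            "model_id": "eleven_monolingual_v1",
        },
    }

req = build_tts_request("It's in aisle 7, on your left.",
                        "andreas-clone", "sk-placeholder")
```

The returned dict can be passed straight to `requests.post(**req)`; the second-or-so delay mentioned above is the time this call takes to generate the audio.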
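And for the third step, D-ID’s talks endpoint can animate a single still portrait, driven by the audio produced in step two. Again, the endpoint shape follows D-ID’s public API, while the image URL, audio URL, and key are placeholders.

```python
# Sketch of step three: lip-sync a still portrait to generated audio.
# The URLs and key below are placeholders, not assets from the demo.

DID_TALKS_URL = "https://api.d-id.com/talks"

def build_talk_request(portrait_url: str, audio_url: str, api_key: str) -> dict:
    """Assemble the request that animates a portrait to match audio."""
    return {
        "url": DID_TALKS_URL,
        "headers": {
            "Authorization": f"Basic {api_key}",
            "Content-Type": "application/json",
        },
        "json": {
            "source_url": portrait_url,   # one still photo is enough
            "script": {"type": "audio", "audio_url": audio_url},
        },
    }

req = build_talk_request("https://example.com/andreas.jpg",
                         "https://example.com/reply.mp3", "key-placeholder")
```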
And that’s it – we turned the output of ChatGPT from raw text into a conversation with an on-screen avatar that looks, and sounds, like Andreas. We added some extra graphics such as QR codes and maps to increase functionality, and placed it on a large screen that encourages people to interact.
The potential for this kind of real-time video is phenomenal, and we are only just beginning to explore the ways it could be used. Some retailers may choose to use AIs to define their brand and ensure consistency across all locations: wherever you are, you’ll talk to the same avatars. Others may go the other way and give customers the option to choose who they speak to: they may prefer talking to a woman, a man, or someone of their own age or ethnic origin. Or maybe we’ll see this as an opportunity for senior executives to get their faces in front of customers – if someone wants to know about sustainability, who better than the Chief Sustainability Officer to give them the lowdown on what the company is doing?
I can even imagine a situation where there are multiple expert systems, each of which has a distinct personality and appearance. So if you’re in a department store and you’re asking about children’s clothes, you may find yourself talking to someone who looks like a mom. Then you want to ask about fountain pens or art supplies, and you’re now talking to someone else. This helps humanize the entire process and reinforces the illusion that you’re talking to knowledgeable staff, not to a bot.
AI Andreas was just a first step. Who knows what video chatbots will be like this time next year?