[Image: a man using audio dictation]

Why multimodal search should be a part of your strategy

Search engines and AI models no longer need you to type exact phrases. They can process a spoken description, an uploaded photo and even a short video clip alongside text or audio to understand intent.

Shayna Burns

12 November 2025

4 minute read

Recently in Australia, Google has been running ads demonstrating how the new Pixel 10 smartphone (with Gemini Live) lets people share a photo or video and ask a question about it, either verbally or in writing:

[Image: a Google AI ad]

Another Google ad you may have seen on free-to-air TV shows a woman holding her phone up to a display case of sunglasses, saying she has a heart-shaped face and asking which pair would look best on her.

The process of using multiple search inputs (text, voice, video, photo) is called multimodal search, and it’s one of the most natural ways we look for information.

Examples you may already know:

  • Google Lens: Upload an image to identify a plant, landmark or product.
  • Voice search: “Hey Google, where’s the nearest late-night chemist?”
  • AI models: GPT-4o and Gemini can accept text, images and voice in a single conversation.

Behind the scenes, AI systems answer these multi-input queries using techniques like retrieval-augmented generation (RAG) – retrieving relevant external data and combining it with the user’s input – to ground their answers.

Searching with different modes mirrors how we naturally ask questions

This new search experience lets people move from keyword searches to context-based queries, mirroring our natural behaviour and reducing friction.

  • From keywords: “cheap flights Melbourne Tokyo”
  • To conversational queries: “What’s the cheapest way to fly from Melbourne to Tokyo in the next three months?”
  • To multimodal queries: [holds up photo of outer cover of passport] “What visa do I need to visit Japan with this document?”

Ultimately, this progression lowers the barrier to asking a question. You don’t need the right words – you just need to show or describe what you mean to get a suitable answer.

A multimodal search strategy matters for all sectors

It’s easy to dismiss multimodal search as a retail gimmick: most of what we photograph or record is physical objects, and objects are usually products for sale. But the implications – and opportunities – go far wider.

Here are use cases across industries:

  • Travel: Upload a photo of a beach and ask, “Find me somewhere like this in Asia.”
  • Higher education: A prospective student takes a photo of a course brochure and asks, “What are the career pathways from this program?”
  • Healthcare: Take a photo of a rash and ask, “Is this serious enough to warrant a trip to the hospital?”
  • Customer support: Point your camera at a bill and ask, “What’s this fee?”
  • Public sector: Snap a broken street sign and report it directly to the local council.

The thread running through all of these examples is that multimodal search makes discovery and problem-solving more human.

Why having diverse content types is critical for multimodal search

Creating diverse content – images, videos and audio – is the entry ticket to multimodal search. Without it, you won’t even be in the game.

To succeed, you need to create multimedia assets that AI systems can recognise and surface, based on what your target audience is most likely to search for.

  • Images: Product photography, diagrams, infographics and contextual lifestyle shots all feed visual search engines like Google Lens.
  • Video: Walkthroughs, demonstrations and explainers often answer “show me” and “how to” queries better than text ever can.
  • Audio: Podcasts, interviews and recorded snippets open doors to voice-led discovery, especially when paired with transcripts.

Text is still critical for structure and context, but in a multimodal world, text is the scaffolding and images, video and audio are the assets that get surfaced.

How to optimise your assets for multimodal search

The good news is that making your digital assets friendly for multimodal search doesn’t mean rebuilding your digital presence. Rather, it’s about finally implementing the best practices that SEO, content and UX specialists have long been recommending:

1. Make your visuals machine-readable

  • Use descriptive alt text and clear file names
  • Use ImageObject schema alongside contextual entity schema (like Product or HowTo) so machines can understand both the image and what it represents (see the sketch after this list)
  • Avoid baking text into an image with no accompanying written copy
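
As a rough sketch of what this looks like in practice, here is how ImageObject might sit inside Product schema on a product page. Everything in this snippet – the product, file names and copy – is an invented placeholder:

<!-- Hypothetical product page markup: the image is described by ImageObject,
     nested inside the Product entity it belongs to -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Heart-frame sunglasses",
  "description": "Polarised sunglasses with frames suited to heart-shaped faces",
  "image": {
    "@type": "ImageObject",
    "contentUrl": "https://example.com/images/heart-frame-sunglasses.jpg",
    "name": "Heart-frame sunglasses, front view",
    "description": "Tortoiseshell heart-frame sunglasses photographed on a white background"
  }
}
</script>

Note how the descriptive file name, the image description and the surrounding Product entity all reinforce each other – exactly the context a visual search engine needs.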

2. Make your audio and video searchable

  • Always provide transcripts and captions
  • Add schema markup (e.g. VideoObject or PodcastEpisode – a sketch follows this list)
  • Ensure key takeaways are present in both the audio/video and the text description
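
To illustrate, a minimal VideoObject snippet for a hypothetical explainer video might look like this. The titles, URLs, date and duration are placeholders; the transcript property is what makes the spoken content machine-readable:

<!-- Hypothetical explainer video markup: VideoObject with a transcript
     so the spoken content can be indexed alongside the video file -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "VideoObject",
  "name": "How to choose sunglasses for your face shape",
  "description": "A two-minute walkthrough matching frame styles to face shapes",
  "thumbnailUrl": "https://example.com/thumbnails/face-shape-guide.jpg",
  "contentUrl": "https://example.com/videos/face-shape-guide.mp4",
  "uploadDate": "2025-11-12",
  "duration": "PT2M10S",
  "transcript": "In this video we look at which frame styles suit round, square and heart-shaped faces..."
}
</script>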

3. Optimise for voice and natural language

  • Include FAQs and conversational content on your site (a markup sketch follows this list)
  • Answer the kinds of prompts people might search for about your business
  • Write in complete sentences that can be cleanly quoted by a search engine or LLM
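
For instance, a question-and-answer pair can be exposed as FAQPage markup, making it easy for a search engine or LLM to lift the answer verbatim. The question and answer below are invented for illustration:

<!-- Hypothetical FAQ markup: each Question/Answer pair is a complete,
     quotable unit written in full sentences -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "Which sunglasses suit a heart-shaped face?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Aviator and cat-eye frames tend to balance a heart-shaped face. Our fit guide shows examples of each style."
      }
    }
  ]
}
</script>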

Multimodal search in summary

Google’s recent ads are more than hype. They’re a signal that search no longer needs to be just about typing text. The businesses that build multimodal search strategies now will be more discoverable, relevant and trusted in an AI-driven world.

The question isn’t whether your customers will use voice, image or video to search; it’s whether your brand will be ready when they do.
