Multimodality: voice and images for a more human-like AI

Artificial Intelligence

8 April 2026

Multimodality is now considered a fundamental standard in AI.

The global multimodal AI market is expected to reach $98.9 billion by 2037, and Google has listed it among the Google Cloud AI Business Trends 2025.

This comes as no surprise. If AI is to become a true ally for both citizens and professionals, it must be able to understand the world in a way that more closely resembles human perception. This means embracing multimodality: the ability to interpret information across multiple formats—text, voice, images, and audio—simultaneously, much like the human brain does.

As McKinsey notes: “[Multimodal AI models] mirror the brain’s ability to combine sensory inputs for a nuanced, holistic understanding of the world, much like how humans use their variety of senses to perceive reality.”

Only in this way can AI become what people truly need: a tool capable of fully understanding context and simplifying real-world tasks.

For instance, this is particularly valuable for railway construction workers, who can speak to an AI assistant or share on-site images to receive real-time operational guidance.

In public administration, multimodality can simplify interactions between citizens and digital services. For example, a user could upload a photo of an administrative document and ask for guidance via voice.

Interacting with AI systems as naturally as we interact with people will make work and everyday life simpler, faster, and in many cases safer.

What multimodal AI is and how it works

Multimodal AI refers to systems capable of analyzing and processing not only text, but also images, audio, video, and code at the same time, identifying complex patterns and correlations.

This allows AI to extract insights from a much broader range of contextual sources, producing more accurate and personalized outputs.

In recent years, many AI models have been designed from the ground up as multimodal systems, capable of both receiving and generating text, images, and voice, making interactions more natural.

For example, multimodal AI can analyze combined text and image data to support medical diagnoses or assess tone of voice and facial expressions during video calls to estimate conversational sentiment.

It can also generate high-quality music or realistic images from simple text prompts.
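
To make the input side concrete, here is a minimal sketch of a combined text-and-image request to a vision-capable chat model, using the OpenAI Python SDK. The model name, question, and image URL are placeholders; any vision-capable model could stand in.

```python
# Minimal sketch: one request carrying both text and an image.
# Model name and image URL are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable chat model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Is this form filled in completely?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/form.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```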

Multimodal AI relies on different data fusion techniques, which can occur at various stages:

  • Early fusion: different data types (text, images, audio, etc.) are processed together from the start, transformed into a shared representation within the model.
  • Mid fusion: each modality is processed separately at first, then combined before generating the final output.
  • Late fusion: each modality is handled by a dedicated model, and the outputs are merged at the end.

Through these techniques, multimodal AI integrates information from multiple sources to build a more complete understanding of context—much like humans combining words, sounds, and visual cues.
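
A minimal sketch may help make the distinction concrete. The PyTorch code below contrasts early fusion (concatenate modality features first, process them jointly) with late fusion (a dedicated model per modality, predictions merged at the end). All dimensions, layer choices, and class names are illustrative, not taken from any specific model.

```python
# Illustrative early vs. late fusion; sizes and modules are examples only.
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate raw modality features first, then process jointly."""
    def __init__(self, text_dim=128, image_dim=256, hidden=64):
        super().__init__()
        self.joint = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
    def forward(self, text_feat, image_feat):
        return self.joint(torch.cat([text_feat, image_feat], dim=-1))

class LateFusion(nn.Module):
    """Process each modality with its own model; merge outputs at the end."""
    def __init__(self, text_dim=128, image_dim=256, hidden=64):
        super().__init__()
        self.text_net = nn.Sequential(
            nn.Linear(text_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.image_net = nn.Sequential(
            nn.Linear(image_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
    def forward(self, text_feat, image_feat):
        # Average the per-modality predictions; weighting or voting also work.
        return (self.text_net(text_feat) + self.image_net(image_feat)) / 2

text = torch.randn(4, 128)   # batch of 4 text embeddings
image = torch.randn(4, 256)  # batch of 4 image embeddings
print(EarlyFusion()(text, image).shape)  # torch.Size([4, 1])
print(LateFusion()(text, image).shape)   # torch.Size([4, 1])
```

Mid fusion sits between the two: each modality gets its own encoder, but the encoded representations are combined before the final prediction layer rather than after it.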

From text to context: a new level of understanding

Text-based models remain highly effective and efficient for many use cases. However, for more complex scenarios, multimodal AI is often the better choice.

In 2026, voice and images will be the most transformative aspects of multimodality, as they bring AI closer to real-world understanding and make interactions more natural in practical workflows.

MIT neuroscientists have found that the human brain can identify an image in just 13 milliseconds, and research suggests that images are processed much faster than text.

Similarly, in AI systems, images provide richer contextual information. When combined with text or voice input, they help models better interpret situations and generate more relevant responses.

Voice interaction, on the other hand, makes AI more accessible and natural, especially in environments where typing is impractical or impossible.

As a result, multimodality is evolving from a feature into a standard. According to Gartner, by 2030, 80% of enterprise software will be multimodal.


Benefits of multimodal AI for public administrations and businesses

According to Google, multimodal AI represents a major step forward in how developers can build and expand next-generation applications. It moves AI closer to being a true expert assistant rather than just a tool.

Today, these models already offer several advantages across both public and private sectors:

1. Better contextual understanding

Combining diverse data types allows AI to generate more complete and relevant outputs. For example, integrating text, voice, and images enables systems to better understand real-world situations and respond more accurately—especially in healthcare, where medical images can complement written data.

2. More human-like interactions

Multimodal AI can interpret not just words, but also tone of voice, facial expressions, and body language. This allows virtual assistants and chatbots to better understand emotions and context, making interactions feel more natural.

3. Greater adaptability to real-world scenarios

Multimodal AI excels in complex environments such as autonomous driving, construction sites, healthcare, industrial maintenance, and citizen services, where processing multiple signals is essential.

4. Enhanced security and recognition

By combining different modalities, AI enables more advanced and reliable authentication systems. Multimodal AI underpins biometric technologies that integrate facial recognition, voice analysis, and motion detection.

5. More intuitive and accessible interfaces

Multimodal interfaces reduce friction and make technology more inclusive. Voice interaction and intuitive interfaces help elderly users, people with disabilities, and those speaking different languages engage more easily with systems.

6. Improved operational efficiency

Integrating multimodal AI into daily workflows leads to faster decision-making, more accurate responses, and reduced time spent on support requests and manual verification tasks.

5 real-world use cases in public and private sectors

How can multimodal AI support professionals such as doctors, field workers, or customer service operators? Let’s explore some practical applications:

1. Automated analysis of administrative documents

Public administrations manage large volumes of documents every day—not only text, but also charts, floor plans, and photos. Multimodal AI makes it possible to analyze all these different types of documents together within administrative processes. It can automatically verify whether documentation is complete and detect any inconsistencies. As a result, it speeds up review procedures and significantly reduces the manual workload for public sector staff.

2. Decision Support Systems (DSS) in healthcare

A hospital Decision Support System (DSS) powered by multimodal AI can process and analyze both textual data and medical images. This enables clinical teams to make faster and more accurate decisions about treatments or surgical interventions. The benefits are especially significant in cases involving severe conditions or situations where timely action can make a critical difference.

3. Intelligent information kiosks in tourism and customer service

In the tourism sector and public services, multimodal AI can be used to develop intelligent information kiosks capable of interacting with visitors in their native language. By combining speech recognition, natural language understanding, and image analysis, these kiosks let tourists ask for information verbally, share an image of a monument, and receive personalized guidance on itineraries, transportation, or nearby points of interest, all in their own language.

4. Transcription of doctor–patient conversations

During a clinical visit, multimodal AI can transcribe the conversation between doctor and patient in real time from the audio alone. The resulting text can then be processed to generate summaries or automatically populate medical records. This reduces the time spent on documentation, allowing doctors to focus entirely on patient care.
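
As a rough illustration of the transcription step, the sketch below uses an off-the-shelf ASR pipeline from the Hugging Face transformers library. The Whisper checkpoint and file name are examples only, not the specific model a clinical deployment would use.

```python
# Illustrative transcription of a recorded consultation with a generic
# ASR pipeline; model and file name are placeholders.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

result = asr("consultation.wav", return_timestamps=True)
print(result["text"])           # full transcript
for chunk in result["chunks"]:  # timestamped segments, useful for summaries
    print(chunk["timestamp"], chunk["text"])
```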

5. Field support in the railway sector

In a railway construction environment, multimodal AI enables workers to submit both visual and verbal inputs—such as a component image and a spoken description of an issue. The system integrates these data sources, matches them against technical documentation and historical records, and generates diagnostic insights along with recommended procedures.
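
The matching step can be pictured as an embedding similarity search. In the sketch below, random vectors stand in for real embeddings; in practice a multimodal encoder would embed the worker's image and spoken description, and a text encoder would embed the technical documentation.

```python
# Toy similarity search over technical documents; embeddings are random
# stand-ins for the output of real multimodal and text encoders.
import numpy as np

rng = np.random.default_rng(0)
doc_titles = ["brake assembly manual", "signal relay guide", "track fastener spec"]
doc_embeddings = rng.normal(size=(3, 64))  # one vector per document

# Simulate a query that is close to the second document.
query_embedding = doc_embeddings[1] + 0.1 * rng.normal(size=64)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(query_embedding, d) for d in doc_embeddings]
best = int(np.argmax(scores))
print(f"Best match: {doc_titles[best]} (score {scores[best]:.2f})")
```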


Multimodality today: expanding AI’s real-world applications

The use cases discussed so far highlight a key point: multimodality doesn’t just make AI more powerful—it expands its range of applications, enabling scenarios that were previously out of reach.

Until recently, AI was best suited to digital, structured environments, where information was already available in textual form or easily interpretable.

As we’ve seen, however, real-world situations are often far more complex. The introduction of images and voice fundamentally changes the approach: being able to show a situation through a photo or describe it verbally reduces friction and makes it possible to use AI in contexts much closer to real operational settings.

In this sense, multimodality is not just a technological advancement, but a shift in perspective. AI is no longer confined to digital environments—it becomes a practical tool that can be used in real-world contexts, alongside people who observe, speak, and act.

Privacy and security: protecting multimodal data

Multimodal AI is primarily applied in highly regulated sectors such as healthcare and public administration.

In these contexts, the data it relies on often includes biometric images, audio recordings, and sensitive personal or medical documents. This makes data protection an even more pressing concern when working with such systems.

Unlike text-only systems, multimodal data can contain sensitive information in implicit ways: a voice, a face, or an image can reveal personal details and context that are difficult to fully anonymize. This also makes data management and governance more complex, as information must be handled consistently across different modalities to ensure traceability and control throughout its lifecycle.

To safeguard sensitive data and privacy, it is essential to rely on robust measures such as anonymization and pseudonymization, while also ensuring compliance with regulations (such as GDPR) and maintaining transparency in how data is used and stored.
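
As a toy illustration of pseudonymization, the sketch below replaces detected personal names in a transcript with stable tokens, so records stay linkable without exposing identities. The regex-based detection is deliberately naive; a production system would rely on a proper NER model and a secured mapping store.

```python
# Naive pseudonymization sketch: each distinct detected name gets a
# stable token. Detection here is a toy regex, for illustration only.
import re

PSEUDONYMS = {}

def pseudonymize(text: str) -> str:
    def repl(match: re.Match) -> str:
        name = match.group(0)
        if name not in PSEUDONYMS:
            PSEUDONYMS[name] = f"<PERSON_{len(PSEUDONYMS) + 1}>"
        return PSEUDONYMS[name]
    # Toy pattern: two consecutive capitalized words (e.g., "Maria Rossi").
    return re.sub(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", repl, text)

print(pseudonymize("Transcript: Maria Rossi reported chest pain to Paolo Bianchi."))
# -> "Transcript: <PERSON_1> reported chest pain to <PERSON_2>."
```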

It is within this context that solutions like Velvet Speech 2B come into play, offering advanced speech processing capabilities while placing strong emphasis on data protection and regulatory compliance.


Velvet Speech 2B and multimodality: voice AI with responsible data management

Velvet Speech 2B is the first multimodal model in Almawave’s Velvet model family.

Compact and versatile, it is designed for dynamic professional interactions, with the ability to process and understand spoken language. Users can provide input either through text or voice, while outputs remain in text format.

Speech 2B retains the strengths of Velvet 2B—lightweight design and edge deployment capabilities—while adding voice capabilities such as:

  • Automatic Speech Recognition (ASR)
  • Spoken query and question answering
  • Understanding of written and spoken Italian and English, even in mixed conversations
  • Speech emotion recognition

What makes Speech 2B particularly well suited for highly regulated sectors is its strong focus on data protection and data quality, a decisive advantage in contexts where the use of personal data is especially critical.

With Velvet Speech 2B, voice interaction can be introduced without altering the existing infrastructure. Speech is converted into text and can then be managed according to established policies, without creating new data flows that are difficult to control.
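
The flow can be pictured as follows. This is a hypothetical sketch only: the model identifier and the transformers-style interface are assumptions for illustration, not Almawave's published API.

```python
# Hypothetical "voice in, text out" flow: speech is transcribed to text,
# then handled by the existing text pipeline and its policies, unchanged.
from transformers import pipeline

# Hypothetical checkpoint name, for illustration only.
asr = pipeline("automatic-speech-recognition", model="Almawave/velvet-speech-2b")

def existing_text_pipeline(text: str) -> str:
    # Placeholder for the organization's current text handling:
    # logging, retention, and access-control policies apply unchanged.
    return text.strip()

def handle_voice_request(audio_path: str) -> str:
    transcript = asr(audio_path)["text"]        # speech becomes plain text here
    return existing_text_pipeline(transcript)   # no new data flows introduced
```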

This approach allows organizations to expand the range of real-world use cases for Velvet Speech 2B, while maintaining full compliance with regulations and ensuring the protection of user privacy.

Want to learn more about Velvet Speech 2B and multimodal AI?

Discover more