From text to voice: the evolution of AI interaction with Velvet Speech 2B



Artificial Intelligence

19 February 2026

Today, most GenAI interactions happen through written text. Individuals and professionals type requests asking AI to translate content, draft documents, or summarize information, and they receive text-based responses in return.

This interaction model works well in many business contexts. However, there are numerous situations where AI can support processes in different ways. In these environments, communication primarily happens through voice—either between people or between a person and an automated system. Examples include conferences, corporate meetings, digital communication channels, customer care activities, and especially sensitive fields such as healthcare.

In these scenarios, voice can make a real difference. The ability to interact with an AI system that understands spoken commands allows users to retrieve and organize information, ask questions, and launch analyses, translations, or transcriptions without interrupting their ongoing work.

It is no coincidence that the latest AI models are evolving toward multimodal interaction—combining written input with other forms of communication, particularly voice.

Velvet Speech 2B was created to meet this need. It extends the Velvet family of large language models (LLMs) with a new voice-based interaction capability designed for dynamic, real-world professional environments.


Velvet Speech 2B: interacting with AI through voice

Velvet Speech 2B is the first multimodal model in the Velvet family. Compact and versatile, it is built for dynamic interactions and can process and understand spoken language. Users can submit requests either in writing or through voice input, while responses are delivered in text format.

This innovation builds on established expertise: Almawave has been developing speech and voice recognition technologies in its research labs for years. These capabilities now enhance the evolution of its language models.

From a technical standpoint, Velvet Speech 2B retains the strengths of Velvet 2B while adding advanced voice-related features, including automatic speech recognition along with spoken queries and question answering.

The model supports both Italian and English, even within mixed-language conversations. It also integrates speech emotion recognition capabilities, enabling it to analyze tone and vocal patterns to better understand context.

Below are the key features that distinguish Velvet Speech 2B.

Automatic speech recognition

Automatic speech recognition enables the model to listen to recordings or live conversations and convert them into written text. This capability is particularly useful when spoken exchanges—such as meetings, public hearings, or interviews—need to be quickly transformed into structured documents.
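The post-processing step is easy to picture in code. The sketch below assumes ASR output arrives as timestamped, speaker-attributed segments (a common convention, not Velvet Speech 2B's documented output format) and shows how such segments can be folded into a simple structured minutes document:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float   # seconds from the start of the recording
    speaker: str
    text: str

def to_minutes(title: str, segments: list[Segment]) -> str:
    """Format ASR segments as a simple timestamped minutes document."""
    lines = [f"# {title}", ""]
    for seg in segments:
        mm, ss = divmod(int(seg.start), 60)
        lines.append(f"[{mm:02d}:{ss:02d}] {seg.speaker}: {seg.text}")
    return "\n".join(lines)

segments = [
    Segment(0.0, "Chair", "The session is open."),
    Segment(12.5, "Member A", "I move to approve the agenda."),
]
print(to_minutes("Council meeting, 19 February 2026", segments))
```

In a real deployment, the segments would come from the model's transcription output rather than being constructed by hand.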

Spoken queries and question answering

Users can ask questions verbally—for example, “Show me all open cases from the last 30 days”—and the system processes the request exactly as it would a written command, returning a clear and structured response.

Seamless interaction between voice and text

Whether a request is typed or spoken, the system interprets it consistently and delivers coherent responses. There are not two separate systems for voice and text; the experience remains uniform regardless of the input method.
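Architecturally, this unified behavior can be sketched as a single request handler with transcription as an optional front step. The function names and the stand-in transcriber below are illustrative assumptions, not Velvet Speech 2B's actual API:

```python
from typing import Callable

def handle_request(query: str) -> str:
    # Single entry point: the same logic serves typed and spoken input.
    return f"Processed: {query}"

def handle_voice(audio: bytes, transcribe: Callable[[bytes], str]) -> str:
    # Voice input differs only by a transcription step in front of
    # the very same handler -- there is no parallel "voice system".
    return handle_request(transcribe(audio))

# Stand-in transcriber for illustration; a deployment would call the model.
fake_transcribe = lambda audio: "Show me all open cases from the last 30 days"

typed = handle_request("Show me all open cases from the last 30 days")
spoken = handle_voice(b"<audio>", fake_transcribe)
assert typed == spoken  # identical result regardless of input channel
```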

Bilingual support (Italian and English)

Speech 2B can understand and transcribe both Italian and English, even when the two languages alternate within the same conversation. This makes it particularly suitable for institutional and corporate settings where multilingual exchanges are common, ensuring accuracy within a single, unified information-processing workflow.

Speech emotion recognition

Beyond the literal meaning of words, the model analyzes vocal elements such as tone and rhythm to detect emotional signals. This feature is especially valuable in contexts where emotional nuance plays an important and sensitive role, such as doctor–patient interactions or public-facing customer care services.
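To make "tone and rhythm" concrete, the toy sketch below computes two classic low-level acoustic cues, loudness (RMS energy) and a rough voicing proxy (zero-crossing rate). Real speech-emotion models like the one described here learn far richer representations; this only illustrates the kind of signal such features capture:

```python
import math

def acoustic_cues(samples: list[float]) -> dict[str, float]:
    """Compute RMS energy (loudness) and zero-crossing rate (a crude
    voicing/pitch proxy) from raw audio samples in [-1, 1]."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    zcr = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    ) / (len(samples) - 1)
    return {"rms_energy": rms, "zero_crossing_rate": zcr}

# A louder (e.g. agitated) signal yields a higher RMS value.
quiet = [0.1 * math.sin(i / 5) for i in range(1000)]
loud = [0.9 * math.sin(i / 5) for i in range(1000)]
assert acoustic_cues(loud)["rms_energy"] > acoustic_cues(quiet)["rms_energy"]
```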

Compact and versatile design

One of Velvet Speech 2B’s most distinctive characteristics is its compact architecture and internal optimization. It is a lightweight model that can be integrated into infrastructures with limited computing power, without requiring complex environments or external dependencies.

This makes it especially well-suited for organizations where data must remain on-premises—such as public administrations, healthcare institutions, or companies managing sensitive information—ensuring strong data governance and control.


From public administration to operational environments: putting voice to work

Velvet Speech 2B can be successfully deployed in a wide range of public and private settings. Its lightweight design makes it ideal for local infrastructures and edge devices. Combined with its focus on data protection and data quality, it is particularly promising in sectors where personal data management is critical.

These include public administration and healthcare, where sensitive information is handled daily. In such contexts, it is essential to maintain full control over where data resides and who has access to it.

With Velvet Speech 2B, organizations can enable voice interaction without changing their existing infrastructure. Spoken input is converted into text and managed according to established policies, without introducing additional layers of data exposure.

Here are some potential application scenarios.

Public administration: automatic transcription and summarization of public sessions

During city council meetings or public hearings, Velvet Speech 2B can transcribe public discussions and generate official minutes or summaries of key points. This leads to significant time savings, greater administrative efficiency, and immediate access to transparent documentation.

Healthcare: written summaries of doctor–patient consultations

In healthcare settings, physicians often have to manually enter information from consultations into digital systems. With Velvet Speech 2B, doctors can focus entirely on the patient while the model accurately records the conversation and produces summaries that support medical reporting.

Healthcare: structured pre-triage

Pre-triage typically involves collecting basic patient information—an activity that AI models can effectively support. Patients answer guided questions verbally about symptoms, duration, and medical history. Velvet Speech 2B transcribes responses and can automatically generate a preliminary report to be validated by medical staff.

Field operations: regulatory consultation

Construction sites and other high-intensity operational environments are often noisy and require hands-free work. In these settings, voice-based document consultation becomes not just beneficial but essential.

Technicians can verify regulations and procedures by speaking directly to the system, accessing critical information quickly without relying on printed manuals.

Additional use cases

The potential applications of Velvet Speech 2B extend well beyond these examples: it can streamline workflows, reduce wait times, enhance professional productivity, and minimize errors.

In citizen services or public customer care, for example, it can automatically transcribe calls and organize incoming requests. In corporate operational meetings or technical briefings, it can transform discussions into structured minutes and key takeaways.

Voice does not replace writing—it enhances it. With Velvet Speech 2B, AI expands its interaction capabilities and better adapts to real-world professional environments, where flexibility, speed, and data control are essential.

In a landscape where security, governance, and reliability are central priorities, integrating voice makes AI not only more powerful but also more aligned with the real needs of businesses and institutions.

Would you like to learn more about the Velvet family?

Visit our website