From text to voice: the evolution of AI interaction with Velvet Speech 2B
Artificial Intelligence
19 February 2026
Today, most GenAI interactions happen through written text. Individuals and professionals type requests asking AI to translate content, draft documents, or summarize information, and they receive text-based responses in return.
This interaction model works well in many business contexts. However, there are numerous environments where communication happens primarily through voice—either between people or between a person and an automated system—and where AI can support processes in different ways. Examples include conferences, corporate meetings, digital communication channels, customer care activities, and especially sensitive fields such as healthcare.
In these scenarios, voice can make a real difference. The ability to interact with an AI system that understands spoken commands allows users to retrieve and organize information, ask questions, and launch analyses, translations, or transcriptions without interrupting their ongoing work.
It is no coincidence that the latest AI models are evolving toward multimodal interaction—combining written input with other forms of communication, particularly voice.
Velvet Speech 2B was created to meet this need. It extends the Velvet family of large language models (LLMs) with a new voice-based interaction capability designed for dynamic, real-world professional environments.
Velvet Speech 2B: interacting with AI through voice
Velvet Speech 2B is the first multimodal model in the Velvet family. Compact and versatile, it is built for dynamic interactions and can process and understand spoken language. Users can submit requests either in writing or through voice input, while responses are delivered in text format.
This innovation builds on established expertise: Almawave has been developing speech and voice recognition technologies in its research labs for years. These capabilities now enhance the evolution of its language models.
From a technical standpoint, Speech 2B retains the strengths of Velvet 2B while adding advanced voice-related features, including automatic speech recognition as well as spoken queries and question answering.
The model supports both Italian and English, even within mixed-language conversations. It also integrates speech emotion recognition capabilities, enabling it to analyze tone and vocal patterns to better understand context.
Below are the key features that distinguish Velvet Speech 2B.
Automatic speech recognition
Automatic speech recognition enables the model to listen to recordings or live conversations and convert them into written text. This capability is particularly useful when spoken exchanges—such as meetings, public hearings, or interviews—need to be quickly transformed into structured documents.
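As a rough illustration of how such a transcription step might be wired up in practice, the sketch below uses the Hugging Face transformers ASR pipeline. The model ID "Almawave/Velvet-Speech-2B" is a placeholder assumption, not a confirmed checkpoint name, and the chunking helper is generic boilerplate for long recordings rather than part of any published Velvet interface.

```python
def chunk_audio(samples, sample_rate, chunk_seconds=30):
    """Split a raw audio signal (a list or array of samples) into
    fixed-length chunks so long recordings can be transcribed in parts."""
    step = int(sample_rate * chunk_seconds)
    return [samples[i:i + step] for i in range(0, len(samples), step)]

if __name__ == "__main__":
    # Assumption: the model is distributed as a Hugging Face ASR checkpoint;
    # the model ID below is illustrative only.
    from transformers import pipeline
    asr = pipeline("automatic-speech-recognition",
                   model="Almawave/Velvet-Speech-2B")
    print(asr("council_meeting.wav")["text"])
```

The chunking step matters for meeting-length audio: most ASR pipelines process bounded windows, so long sessions are transcribed segment by segment and the texts concatenated.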
Spoken queries and question answering
Users can ask questions verbally—for example, “Show me all open cases from the last 30 days”—and the system processes the request exactly as it would a written command, returning a clear and structured response.
Seamless interaction between voice and text
Whether a request is typed or spoken, the system interprets it consistently and delivers coherent responses. There are not two separate systems for voice and text; the experience remains uniform regardless of the input method.
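One way to picture this single-pipeline design is the sketch below: spoken input is transcribed first, after which every query, typed or spoken, flows through the same answering logic. The `Request` shape and function names are hypothetical, chosen only to illustrate the pattern, not an actual Velvet Speech 2B API.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Request:
    text: Optional[str] = None        # typed input
    audio_path: Optional[str] = None  # spoken input

def handle(request: Request,
           transcribe: Callable[[str], str],
           answer: Callable[[str], str]) -> str:
    """Single entry point for both modalities: spoken input is
    transcribed, then all queries share one answering pipeline."""
    if request.text is not None:
        query = request.text
    elif request.audio_path is not None:
        query = transcribe(request.audio_path)
    else:
        raise ValueError("request carries neither text nor audio")
    return answer(query)
```

Because transcription is the only modality-specific step, policies, logging, and response formatting are defined once and apply identically to voice and text.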
Bilingual support (Italian and English)
Speech 2B can understand and transcribe both Italian and English, even when the two languages alternate within the same conversation. This makes it particularly suitable for institutional and corporate settings where multilingual exchanges are common, ensuring accuracy within a single, unified information-processing workflow.
Speech emotion recognition
Beyond the literal meaning of words, the model analyzes vocal elements such as tone and rhythm to detect emotional signals. This feature is especially valuable in contexts where emotional nuance plays an important and sensitive role, such as doctor–patient interactions or public-facing customer care services.
Compact and versatile design
One of Velvet Speech 2B’s most distinctive characteristics is its compact architecture and internal optimization. It is a lightweight model that can be integrated into infrastructures with limited computing power, without requiring complex environments or external dependencies.
This makes it especially well-suited for organizations where data must remain on-premises—such as public administrations, healthcare institutions, or companies managing sensitive information—ensuring strong data governance and control.
From public administration to operational environments: putting voice to work
Velvet Speech 2B can be successfully deployed in a wide range of public and private settings. Its lightweight design makes it ideal for local infrastructures and edge devices. Combined with its focus on data protection and data quality, it is particularly promising in sectors where personal data management is critical.
These include public administration and healthcare, where sensitive information is handled daily. In such contexts, it is essential to maintain full control over where data resides and who has access to it.
With Velvet Speech 2B, organizations can enable voice interaction without changing their existing infrastructure. Spoken input is converted into text and managed according to established policies, without introducing additional layers of data exposure.
Here are some potential application scenarios.
Public administration: automatic transcription and summarization of public sessions
During city council meetings or public hearings, Velvet Speech 2B can transcribe public discussions and generate official minutes or summaries of key points. This leads to significant time savings, greater administrative efficiency, and immediate access to transparent documentation.
Healthcare: written summaries of doctor–patient consultations
In healthcare settings, physicians often have to manually enter information from consultations into digital systems. With Velvet Speech 2B, doctors can focus entirely on the patient while the model accurately records the conversation and produces summaries that support medical reporting.
Healthcare: structured pre-triage
Pre-triage typically involves collecting basic patient information—an activity that AI models can effectively support. Patients answer guided questions verbally about symptoms, duration, and medical history. Velvet Speech 2B transcribes responses and can automatically generate a preliminary report to be validated by medical staff.
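The step from transcribed answers to a preliminary report can be pictured as a simple schema-filling pass, as in the sketch below. The question fields and the validation flag are illustrative assumptions, not part of any published Velvet Speech 2B interface.

```python
# Illustrative pre-triage schema; real deployments would define their own.
TRIAGE_QUESTIONS = ["symptoms", "duration", "medical_history"]

def build_pretriage_report(transcribed_answers: dict) -> dict:
    """Map transcribed patient answers onto a fixed schema; anything the
    patient did not answer is flagged so staff can follow up."""
    report = {field: transcribed_answers.get(field, "NOT PROVIDED")
              for field in TRIAGE_QUESTIONS}
    report["validated_by_staff"] = False  # a clinician must confirm the draft
    return report
```

Keeping the draft explicitly unvalidated mirrors the workflow described above: the model produces a structured starting point, and medical staff retain the final say.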
Field operations: regulatory consultation
Construction sites and other high-intensity operational environments are often noisy and require hands-free work. In these settings, voice-based document consultation becomes not just beneficial but essential.
Technicians can verify regulations and procedures by speaking directly to the system, accessing critical information quickly without relying on printed manuals.
Additional use cases
The potential applications of Velvet Speech 2B extend well beyond these examples. It can streamline workflows, reduce wait times, enhance professional productivity, and minimize errors.
In citizen services or public customer care, for example, it can automatically transcribe calls and organize incoming requests. In corporate operational meetings or technical briefings, it can transform discussions into structured minutes and key takeaways.
Voice does not replace writing—it enhances it. With Velvet Speech 2B, AI expands its interaction capabilities and better adapts to real-world professional environments, where flexibility, speed, and data control are essential.
In a landscape where security, governance, and reliability are central priorities, integrating voice makes AI not only more powerful but also more aligned with the real needs of businesses and institutions.
Would you like to learn more about the Velvet family?