When computers listen to how we speak: how artificial intelligence helps detect dementia early

Dr Karol Chlasta, Assistant Professor at the Department for Management in Networked and Digital Societies, shows how a simple conversation with a doctor can reveal the first signs of dementia years before visible symptoms appear.

This is no longer science fiction. A research team from Kozminski University in Warsaw, led by Dr Karol Chlasta, has developed a new method that analyzes patients’ speech using artificial intelligence. Their study shows that a computer can significantly help detect subtle changes in a person’s speech that signal emerging memory problems and early stages of dementia.

Dementia is not only a medical problem but also a massive social challenge. According to estimates published in Lancet Public Health, the number of people living with dementia will rise sharply: from 57.4 million in 2019 to as many as 152.8 million in 2050. This means that over the course of three decades, the number of patients will nearly triple. According to a report by Alzheimer Europe, in Poland, where the population is aging rapidly, problems related to progressive neurodegenerative diseases of the brain will be particularly acute in the future.

The biggest challenge in the fight against dementia is that it is often diagnosed too late. By the time a patient or their family notices the first symptoms – memory problems, difficulty performing daily activities, changes in personality – the disease is already at a considerably advanced stage. The reason is that the brain has a remarkable ability to compensate: it can hide progressive damage for years by using healthy areas to perform functions that would normally be handled by the damaged regions.

Speech as a window into the brain

We made a fascinating discovery in our study: the way we speak changes much earlier than other cognitive functions. Speech is a complex process that involves various areas of the brain, from planning what to say, through word selection, to controlling the muscles responsible for articulation. When dementia begins to damage the brain, subtle changes in speech are among the very first warning signs.

What changes exactly? Patients begin to take longer pauses between words, as if searching for the right expression. Their speech becomes less fluent, with repetitions and errors appearing. Syntax, the way sentences are constructed, becomes simplified. Semantic errors also appear, that is, problems with choosing the right words to express thoughts. At first, these changes are so subtle that neither the patient nor those close to them notice. But a computer does.
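To make two of these markers concrete, here is a minimal Python sketch that measures pauses and immediate word repetitions from word timings. It is only an illustration, not the code used in the study: the input format (word, start time, end time) and the one-second threshold for a "long" pause are assumptions made for this example.

```python
from statistics import mean

def pause_features(words, long_pause=1.0):
    """words: list of (token, start_sec, end_sec) tuples in spoken order."""
    pairs = list(zip(words, words[1:]))
    # Silence between the end of one word and the start of the next.
    gaps = [max(0.0, nxt[1] - cur[2]) for cur, nxt in pairs]
    # Immediate repetitions such as "is is", compared case-insensitively.
    repeats = sum(1 for cur, nxt in pairs if cur[0].lower() == nxt[0].lower())
    return {
        "mean_pause": mean(gaps) if gaps else 0.0,
        "long_pauses": sum(1 for g in gaps if g >= long_pause),  # assumed threshold
        "immediate_repetitions": repeats,
    }

# Hypothetical timings for a short fragment of a picture description.
sample = [
    ("the", 0.0, 0.2), ("boy", 0.3, 0.6), ("is", 0.7, 0.8),
    ("is", 2.0, 2.1), ("reaching", 2.2, 2.8),
]
features = pause_features(sample)
print(features)
```

In this toy fragment the speaker hesitates for more than a second before repeating "is", so the sketch reports one long pause and one immediate repetition.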

We conducted this study as part of the IEEE PROCESS Signal Processing Grand Challenge, an international competition held during the ICASSP 2025 conference. Teams from around the world competed in analyzing speech recordings to predict cognitive function and aid in the diagnosis of dementia. In my team, which also included Piotr Struzik and Grzegorz Marcin Wójcik, we used three standard neuropsychological tasks that have been used for years in the diagnosis of cognitive disorders.

  • The first is the Cookie Theft picture description task: the patient describes the scene depicted in the drawing, which allows us to assess their ability to create a coherent narrative.
  • The second task is semantic fluency: it involves naming as many animals as possible within a set time limit.
  • The third, finally, is phonemic fluency: coming up with words that begin with the letter P.

These seemingly simple tasks are very complex for the brain. They require not only access to vocabulary but also speech planning, attention control, and flexible thinking. That is why these exercises are so sensitive to early cognitive changes.
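For readers curious how the two fluency tasks are traditionally tallied, the following Python sketch counts unique valid answers, repetitions, and invalid answers. It is a simplified illustration under stated assumptions: the tiny animal lexicon and the function name are hypothetical, and our system did not rely on such hand-written rules.

```python
def score_fluency(responses, is_valid):
    """Count unique valid answers, repeated answers, and invalid answers."""
    seen = set()
    repetitions = 0
    invalid = 0
    for raw in responses:
        word = raw.strip().lower()
        if not is_valid(word):
            invalid += 1          # e.g. "table" in the animal task
        elif word in seen:
            repetitions += 1      # a valid answer said a second time
        else:
            seen.add(word)
    return {"correct": len(seen), "repetitions": repetitions, "invalid": invalid}

# Hypothetical toy lexicon; a real scorer would need a far larger animal list.
ANIMALS = {"dog", "cat", "horse", "cow", "pig", "parrot"}

semantic = score_fluency(["Dog", "cat", "table", "dog", "horse"],
                         lambda w: w in ANIMALS)
phonemic = score_fluency(["pen", "paper", "apple", "pen"],
                         lambda w: w.startswith("p"))
print(semantic, phonemic)
```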

Three types of artificial intelligence in a single system

The key to my team’s success was combining three different artificial intelligence technologies, each of which analyzes speech from a different angle:

The first technology is Hidden-Unit BERT (HuBERT), a self-supervised model of speech representation. It can be compared to a sensitive ear that listens not only to what we say, but also to how we say it. HuBERT analyzes the rate of articulation, variations in pitch, and the spectral structure of the voice – the distribution of the voice’s acoustic energy across its individual frequencies – and uses this to create a 1,024-dimensional feature vector, which is a mathematical description of how we speak. It’s like a fingerprint of our speech.
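A model such as HuBERT produces one vector for every short time frame of audio, so a whole recording has to be summarized into a single fixed-length vector. The Python sketch below shows one common way to do that, averaging over frames. This is an assumption made for illustration: the article does not state which pooling method was used, and the toy frames have 4 dimensions rather than 1,024.

```python
def mean_pool(frames):
    """Average a list of equal-length frame vectors into one vector."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    if any(len(f) != dim for f in frames):
        raise ValueError("all frames must have the same dimension")
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

# Toy stand-in: 3 frames of 4 dimensions instead of many frames of 1,024.
frames = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
utterance_vector = mean_pool(frames)
print(utterance_vector)
```

Whatever the length of the recording, the result has the same number of dimensions, which is what lets recordings of different durations be compared.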

The second technology is the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS), a standard set of acoustic voice features widely used in machine learning and in the analysis of emotional speech. Using the open-source tool openSMILE (Speech and Music Interpretation by Large-space Extraction), we extracted 88 standard paralinguistic parameters, that is, voice features that are meaningful from a sound-engineering perspective: jitter (frequency irregularity), shimmer (amplitude irregularity), HNR (Harmonics-to-Noise Ratio, the ratio of vocal fold – so-called harmonic – vibrations to noise), and mel-frequency cepstral coefficients (MFCCs), a standard way of extracting features from sound that is widely used in speech recognition systems and music analysis. Together, these parameters describe the quality and stability of the voice.
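To give a feel for what jitter and shimmer measure, here is a small Python sketch using a common textbook definition: the average change between neighbouring vocal-fold cycles, relative to the average value. The measurements are invented for the example, and openSMILE's own implementation differs in detail, so this should be read as an illustration rather than a reproduction of the eGeMAPS features.

```python
from statistics import mean

def relative_perturbation(values):
    """Mean absolute difference between neighbours, divided by the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return mean(diffs) / mean(values)

# Hypothetical measurements for five consecutive vocal-fold cycles.
periods_ms = [8.0, 8.2, 7.9, 8.1, 8.0]        # cycle lengths
amplitudes = [0.50, 0.48, 0.52, 0.49, 0.51]   # peak amplitudes

jitter = relative_perturbation(periods_ms)    # frequency irregularity
shimmer = relative_perturbation(amplitudes)   # amplitude irregularity
print(f"jitter={jitter:.2%} shimmer={shimmer:.2%}")
```

A perfectly steady voice would give zero for both values; the less regular the cycles, the higher the numbers.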

The third set of technologies we used, from OpenAI, consists of the Whisper and GPT-4o models. First, the Whisper model transcribes the recording into text; then, GPT-4o evaluates various aspects of language: content accuracy, linguistic fluency, grammar and syntax, repetition and intrusion errors, and content ambiguity. Each dimension is rated on a scale from 0 to 10, resulting in a set of features that can be easily interpreted in clinical practice.
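The step from a language model's ratings to usable features can be sketched as follows. The Python example assumes the model is asked to reply in JSON with one score per dimension; that reply format and the key names are hypothetical, chosen only to mirror the dimensions listed above, and no real API call is made.

```python
import json

# Dimension names follow the article; the JSON keys themselves are hypothetical.
DIMENSIONS = [
    "content_accuracy",
    "linguistic_fluency",
    "grammar_and_syntax",
    "repetition_and_intrusion_errors",
    "content_ambiguity",
]

def scores_to_features(reply_text):
    """Parse a JSON reply and return the scores in a fixed order, checked to be 0-10."""
    data = json.loads(reply_text)
    features = []
    for name in DIMENSIONS:
        value = float(data[name])  # KeyError if the model skipped a dimension
        if not 0.0 <= value <= 10.0:
            raise ValueError(f"{name} out of range: {value}")
        features.append(value)
    return features

# Made-up reply for illustration only, not real model output.
reply = json.dumps({
    "content_accuracy": 7, "linguistic_fluency": 5, "grammar_and_syntax": 8,
    "repetition_and_intrusion_errors": 3, "content_ambiguity": 4,
})
linguistic_features = scores_to_features(reply)
print(linguistic_features)
```

Checking the range and the presence of every dimension matters in practice, because a language model can occasionally return a malformed or incomplete answer.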

Surprising results

Our results turned out to be quite astonishing, and it was particularly exciting to analyze which features proved the most important for predictions.

It turned out that HuBERT, the model that analyzes speech acoustics, accounted for as much as 75.7% of the predictive power. Linguistic features scored by GPT-4o accounted for 13.5%, and the acoustic features from eGeMAPS made up the remaining 10.8%. This suggests that how we speak is at least as informative as what we say.
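Percentages like these are typically obtained by adding up the importance of individual features within each group. The Python sketch below illustrates the arithmetic under that assumption; the article does not say how importance was measured, and the six feature names and weights are invented toy values, not results from the study.

```python
def group_shares(importances, groups):
    """Sum per-feature importances by group and express each group as a percentage."""
    totals = {}
    for feature, weight in importances.items():
        group = groups[feature]
        totals[group] = totals.get(group, 0.0) + weight
    grand_total = sum(totals.values())
    return {g: 100.0 * t / grand_total for g, t in totals.items()}

# Hypothetical toy importances for six features; real models have many more.
importances = {
    "hubert_0": 0.40, "hubert_1": 0.35,
    "egemaps_jitter": 0.06, "egemaps_hnr": 0.05,
    "gpt_fluency": 0.08, "gpt_grammar": 0.06,
}
groups = {name: name.split("_")[0] for name in importances}
shares = group_shares(importances, groups)
print(shares)

# The share left for eGeMAPS, from the two figures reported in the article.
egemaps_share = round(100.0 - 75.7 - 13.5, 1)
print(egemaps_share)
```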

Our team’s results proved to be a great success. Our university team ranked 10th out of 80 teams from around the world in the task of predicting the score of the Mini-Mental State Examination (MMSE), a standard cognitive function test, and outperformed teams from such renowned universities as the Singapore University of Technology, Donghua University in Shanghai, Cooper Union in New York, the University of Edinburgh, and KU Leuven in Belgium.

The practical applications of this technology are potentially revolutionary:

  • First, speech can be analyzed remotely, frequently, and inexpensively. It is an ideal tool for telemedicine and senior care. Imagine a future where senior citizens regularly converse with a voice assistant that monitors their cognitive functions and alerts them to concerning changes.
  • Second, this technology offers interpretable results. In contrast to many AI solutions that operate like a black box, the features evaluated by the language model are easy for doctors to interpret, as we can see different aspects of speech correctness for each subject. This allows doctors to independently analyze why in a specific case the system indicates a higher risk of dementia: whether it is due to a decline in the semantic coherence of speech, problems with fluency, or a higher number of grammatical errors in the given patient.
  • Third, this technology enables early detection of subtle changes that are invisible in traditional tests. The combination of acoustic and linguistic analysis provides a more complete picture of a person’s cognitive functioning than either of these methods used separately.
  • Finally, this technology has enormous implementation potential. It relies solely on audio recordings, and these can be easily collected using smartphones or tablets. Thanks to this simplicity, it can support primary care physicians, speech therapists, and long-term care systems.

Challenges and avenues for development

Nevertheless, we acknowledge the limitations of our findings: errors in automatic speech recognition can affect linguistic assessment. Data used to train the AI models also lacked certain demographic details, such as a person’s exact age. At the same time, the tasks related to speech fluency were relatively simple, and the dataset was fairly small. Further research should focus on more natural recordings, dialogues, longer narratives, and the analysis of so-called longitudinal data, which is data based on patient records collected over the years. This will allow us to fully harness the potential of large language models in the analysis of cognitive changes.

In sum, our work shows that we are on the cusp of a new era in neuropsychology, as well as in healthcare management. Modern language models can complement traditional diagnostic tools by providing objective, quantitative indicators of cognitive changes. This does not mean replacing a doctor with a computer but rather equipping the former with powerful tools to aid them in their diagnosis.

The proposed approach offers a scalable, data-driven method for early cognitive screening. With an aging society, such solutions could become an essential component of healthcare systems. Early detection of dementia means treatment can start sooner, care can be planned more effectively, and patients, as well as their families, have a better chance of maintaining a good quality of life.

___

The research article is available in Frontiers in Neuroinformatics: https://doi.org/10.3389/fninf.2025.1679664

The article was machine translated.
