Tuesday 20 April 2010

A hard limit on the performance of automatic speech recognition?

I've just finished reading Jerry Feldman's book, 'From Molecule to Metaphor', which I found quite compelling, especially with regard to his insistence that meaning must be grounded in embodied experience. What I found less savoury was one of his final questions: "if the meaning of language is based in bodily experience and if computers cannot share our subjective experience, will we ever be able to communicate naturally with computers?" (page 340).

This is a fundamental issue in human-machine (and human-robot) interaction that appears to be rarely addressed. I'm always 'banging on' about not using human-like voices in interactive speech synthesis applications, insisting instead that we should employ voices that are 'appropriate' to the application/agent (e.g. a voice-enabled garbage can should have a garbage can voice). I've always argued that this is vitally important if we're to avoid a user overestimating the capabilities of an automated system, and thus stepping outside the range of behaviours that such a system could handle.

However, the question raised by Feldman goes even further - maybe it will never be possible to hold a meaningful conversation with a machine/robot, simply because of the lack of common grounding in real-world experiences.

If this is true, and if one couples it with the argument that a key reason human beings can recognise speech in difficult circumstances is context (expressed as a dynamic statistical prior), then there must be a hard limit on the accuracy we can expect from automatic speech recognition systems.
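To make the 'dynamic statistical prior' idea concrete, here's a minimal sketch (not anything from Feldman's book) of the standard Bayesian decision rule used in speech recognition, where a context-conditioned prior can flip the decoding of acoustically identical hypotheses. The word strings, probabilities and contexts below are invented purely for illustration.

```python
import math

# Standard noisy-channel decision rule for speech recognition:
#     W* = argmax_W  P(acoustics | W) * P(W | context)
# Context does its work through the prior P(W | context).

def decode(acoustic_loglik, context_logprior):
    """Return the hypothesis maximising acoustic log-likelihood
    plus context-conditioned log-prior."""
    return max(acoustic_loglik,
               key=lambda w: acoustic_loglik[w] + context_logprior[w])

# Two acoustically near-identical hypotheses: the acoustics alone
# cannot separate them (equal log-likelihoods).
acoustic = {"recognise speech": math.log(0.5),
            "wreck a nice beach": math.log(0.5)}

# A dynamic prior resolves the ambiguity: the same acoustics decode
# differently in a lecture than in a seaside conversation.
lecture_prior = {"recognise speech": math.log(0.9),
                 "wreck a nice beach": math.log(0.1)}
seaside_prior = {"recognise speech": math.log(0.1),
                 "wreck a nice beach": math.log(0.9)}

print(decode(acoustic, lecture_prior))  # -> recognise speech
print(decode(acoustic, seaside_prior))  # -> wreck a nice beach
```

The point of the sketch is that if a machine lacks the embodied experience needed to form the right prior, no amount of acoustic modelling can close the gap.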

So, are we there yet?