Human-Machine Voice Interfaces
Toward Human Performance
Roberto Pieraccini
Director of Advanced Conversational Technology
“at some company somewhere in the US”
TAL Conference, Torino, January 21, 2014
Human-machine voice communication
Kubrick & Clarke's prediction for 2001 (1969)
The reality in 2001
Design: Jonathan Bloom
Realization: Peter Krogh
From an idea by Roberto Pieraccini
What has happened since then?
!   In many cases, technological reality has surpassed the prediction
!   Internet, Web, Wikipedia, Social Networks
!   PC, tablets, smartphones, wireless
!   Genomics, personalized medicine, brain research
!   Big data, quantum computing
BUT NOT FOR DISTINCTLY HUMAN FUNCTIONS
!   Vision, voice, general intelligence, and “common sense”
TASK                         APPROXIMATE AMOUNT OF DATA (HOURS)
RESOURCE MANAGEMENT          30
ATIS                         40
WALL STREET JOURNAL          100
MEETING SPEECH               100
SWITCHBOARD                  300
BROADCAST NEWS               10,000
Siri, Google voice search    Unknown, but practically unlimited

60 years of history of computers that understand speech
[Timeline figure: from Von Kempelen's speaking machine (1769) and Radio Rex (1920), through Dudley's Voder, Bell Labs AUDREY (1952), Dynamic Time Warping, the statistical approach, and the start of the speech technology industry, to Siri and Google Voice Search (2011)]
Siri and Google Voice Search
Are we at a turning point?
!   Speech recognition available, for the first time, to the majority of the population
  !   Siri, Google Voice Search, Google Translate
  !   The use of unlimited amounts of data yields measurable improvements
!   Google, Apple, and Amazon are investing massively in research on speech recognition and natural language
  !   An unprecedented number of job openings in the speech field
!   A new wave of enthusiasm for artificial intelligence
  !   Google hires Ray Kurzweil and acquires Geoffrey Hinton's “deep learning” company; it buys a D-Wave quantum computer for AI research and keeps acquiring robotics companies.
  !   Facebook announces a new “AI lab” in New York; IBM follows suit.
!   Have we seen this movie before?
  !   AI has “died” about four times in fifty years because of unfounded enthusiasm and big promises left unkept, and two of those times with neural networks (Yann LeCun, NYU, and head of the new AI Lab at Facebook)
Bill Gates had predicted it …
[Slide: a collage of Bill Gates quotes on speech recognition, dated 1 October 1997, June 1998, 24 March 1999, 10 March 2000, July 2003, 25 February 2004, 14 September 2005, 14 October 2005, and 9 June 2011, each predicting that speech recognition would be solved, go mainstream, or become a standard, built-in part of the interface “within this decade”, “in the next couple of years”, or “in the three-to-six-year timeframe”. For example, 1 October 1997: “In this 10-year time frame, I believe that we'll not only be using the keyboard and the mouse to interact, but during that time we will have perfected speech recognition and speech output well enough that those will become a standard part of the interface.”]
Recognizing speech is … HARD!
!   Despite everything, human performance is still unbeatable
!   …and we expect machines to perform just as well
!   PROBLEMS
  !   Background noise, reverberation
  !   Variation in vocal characteristics, accent, and language
  !   Limited vocabulary
  !   Overlapping speakers
Speech recognition and background noise
Digit recognition accuracy vs. noise level (AURORA-2). From Hirsch, Pearce, ISCA ITRW ASR2000; word accuracy as a percentage, multi-condition training.

Table 1 (test set A, partial; noise conditions unlabeled):
  SNR/dB                cond. 1  cond. 2  cond. 3  cond. 4  Average
  5                     88.36    87.55    87.80    87.60    87.82
  0                     66.90    62.15    53.44    64.36    61.71
  -5                    26.13    27.18    20.58    24.34    24.55
  Average (0 to 20 dB)  88.75    87.95    86.52    88.03    87.81

Table 2 (test set B):
  SNR/dB                Restaurant  Street  Airport  Train-station  Average
  clean                 98.68       98.52   98.39    98.49          98.52
  20                    96.87       97.58   97.44    97.01          97.22
  15                    95.30       96.31   96.12    95.53          95.81
  10                    91.96       94.35   93.29    92.87          93.11
  5                     83.54       85.61   86.25    83.52          84.73
  0                     59.29       61.34   65.11    56.12          60.46
  -5                    25.51       27.60   29.41    21.07          25.89
  Average (0 to 20 dB)  85.39       87.03   87.64    85.01          86.27

A further table (MIRS-filtered conditions):
  SNR/dB  Subway(MIRS)  Street(MIRS)  Average
  clean   98.50         98.58         98.54
  20      97.30         96.55         96.92
  15      96.35         95.53         95.94
  10      93.34         92.50         92.92
  5       82.41         82.53         82.47
New words
The cocktail party effect
Source separation
From: Audio Alchemy: Getting Computers to Understand Overlapping Speech
J. R. Hershey, P. A. Olsen, S. J. Rennie, A. Aaron, Scientific American, April 2011
SPEAKER MASKING ALGORITHM
MIXED SPEECH
Speaker 1: Lay white at K 5 again.
Speaker 2: Bin blue by M zero now.
Speaker 3: Set green in M 7 please.
Speaker 4: Lay green with S 7 please
SEPARATION BY SPEAKER MASKING
Speaker 1: Lay white at K 5 again.
Speaker 2: Bin blue by M zero now.
Speaker 3: Set green in M 7 please.
Speaker 4: Lay green with S 7 please
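A toy sketch of the masking idea behind this figure: keep each time-frequency cell of the mixture for whichever talker dominates it. Here the mask is an “ideal” one computed from the clean sources (which a real system does not have), and plain sine waves stand in for the talkers; everything is illustrative only.

```python
# Toy illustration of separation by spectral masking: an ideal binary mask keeps
# each time-frequency cell of the mixture for whichever source dominates it.
import numpy as np
from scipy.signal import stft, istft

fs = 8000
t = np.arange(fs) / fs
s1 = np.sin(2 * np.pi * 440 * t)          # stand-ins for two talkers
s2 = np.sin(2 * np.pi * 1000 * t)
mix = s1 + s2

_, _, S1 = stft(s1, fs, nperseg=256)
_, _, S2 = stft(s2, fs, nperseg=256)
_, _, M = stft(mix, fs, nperseg=256)

mask1 = np.abs(S1) > np.abs(S2)           # cells where talker 1 dominates
_, est1 = istft(M * mask1, fs, nperseg=256)

# est1 should approximate s1; the correlation with the true source is close to 1.
n = min(len(est1), len(s1))
print(np.corrcoef(est1[:n], s1[:n])[0, 1])
```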
Reverberation
Close talking.
2 meters: twice as many errors.
4 meters: four times as many errors.
From: Sub-band temporal modulation envelopes and their normalization for automatic speech recognition in reverberant
environments, X. Lu, M. Unoki, S. Nakamura, Computer Speech and Language, July 2011
Voice interface architecture
FRONT-END: from speech to features
  “I want to fly to San Francisco leaving from New York in the morning”
SEARCH: from features to words
  Acoustic Models: representations of speech units derived from data
LANGUAGE UNDERSTANDING: from words to meaning
  Language Models: representations of sequences of words derived from data
  request(flight) origin(NYC) destination(SFO) time(morning)
DIALOG: from meaning to actions
  “What date do you want to leave?”
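A schematic sketch of the four-stage pipeline on this slide, with stub functions wired together only to make the data flow explicit; the return values simply echo the slide's example and are not real components.

```python
# Stub pipeline: front end -> search -> language understanding -> dialog.
from typing import List, Dict

def front_end(audio: bytes) -> List[List[float]]:
    """From speech to features: one feature vector per frame (placeholder)."""
    return [[0.0] * 26]

def search(features: List[List[float]]) -> str:
    """From features to words, combining acoustic and language models (stub)."""
    return "I want to fly to San Francisco leaving from New York in the morning"

def understanding(words: str) -> Dict[str, str]:
    """From words to meaning: a frame of concept/value pairs (stub)."""
    return {"request": "flight", "origin": "NYC", "destination": "SFO", "time": "morning"}

def dialog(meaning: Dict[str, str]) -> str:
    """From meaning to actions: decide the next system move (stub)."""
    return "What date do you want to leave?" if "date" not in meaning else "Booking..."

print(dialog(understanding(search(front_end(b"...")))))
```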
The Front End
Even today, the front ends of commercial recognizers use relatively simple spectral quantization techniques.
We now know that the human auditory system uses much more sophisticated representation mechanisms.
Front end – like the human one?
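As an illustration of the relatively simple spectral representations referred to here, a minimal log-mel front-end sketch in plain NumPy; the frame length, hop, and filter count are common textbook defaults, not the settings of any particular commercial recognizer.

```python
# Minimal log-mel front end: framing, windowing, FFT, mel filterbank, log energies.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fbank

def log_mel_features(signal, sr=16000, frame_ms=25, hop_ms=10, n_filters=26):
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_fft = 512
    window = np.hamming(frame_len)
    fbank = mel_filterbank(n_filters, n_fft, sr)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame, n_fft)) ** 2   # power spectrum
        feats.append(np.log(fbank @ spectrum + 1e-10))      # log-mel energies
    return np.array(feats)

if __name__ == "__main__":
    x = np.random.randn(16000)          # one second of noise as a stand-in for speech
    print(log_mel_features(x).shape)    # (frames, n_filters)
```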
Acoustic models
The unbeatable Hidden Markov Models
The statistical independence assumption made by Markov models now limits our ability to improve performance and to fully exploit the enormous amount of available data (Wegmann, Morgan, Cohen, 2013).
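A minimal sketch of the HMM forward computation, which makes the independence assumption visible: each observation is scored given only the current hidden state. The transition and emission tables are toy values; real acoustic models emit continuous feature vectors through Gaussian mixtures or neural networks.

```python
# Toy HMM forward algorithm over a discrete observation alphabet.
import numpy as np

A = np.array([[0.6, 0.4, 0.0],     # transition probabilities between 3 states
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
B = np.array([[0.8, 0.1, 0.1],     # emission probabilities P(symbol | state)
              [0.2, 0.6, 0.2],
              [0.1, 0.2, 0.7]])
pi = np.array([1.0, 0.0, 0.0])     # start in the first state

def forward_log_likelihood(observations):
    """P(observation sequence | model), summing over all state paths."""
    alpha = pi * B[:, observations[0]]
    for o in observations[1:]:
        alpha = (alpha @ A) * B[:, o]    # propagate state belief, weight by emission
    return np.log(alpha.sum())

print(forward_log_likelihood([0, 0, 1, 1, 2, 2]))
```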
Do you remember the “templates” technique?
[Figure: dynamic time warping alignment of two utterances of “seven” (s-e-v-e-n) along time axes running from 0.0 to 1.0 sec]
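A minimal dynamic time warping sketch of the template technique recalled in the figure: an input utterance is aligned non-linearly against stored example utterances and assigned to the closest one. The templates and the test utterance below are random toy feature sequences.

```python
# DTW-based template matching: pick the stored word whose template aligns best.
import numpy as np

def dtw_distance(x, y):
    """Cost of the best monotonic alignment of feature sequences x and y."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(x[i - 1] - y[j - 1])      # frame-to-frame distance
            D[i, j] = local + min(D[i - 1, j],                # insertion
                                  D[i, j - 1],                # deletion
                                  D[i - 1, j - 1])            # match
    return D[n, m]

templates = {"seven": np.random.randn(70, 13), "eight": np.random.randn(55, 13)}
utterance = np.random.randn(63, 13)
print(min(templates, key=lambda w: dtw_distance(utterance, templates[w])))
```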
A brief history of speech recognition mechanisms
Computer power keeps increasing exponentially – Moore's Law
The return of “templates”
From “Exemplar-Based Processing for Speech Recognition”, Sainath et al., IEEE Signal Processing Magazine, 2012
Van Compernolle et al. (Univ. of Leuven, Belgium), Nguyen and Zweig (MS Research), Sainath, Ramabhadran, Nahamoo, Kanevsky et al. (IBM Research)
If millions of templates are used, performance comparable to that of Markov models can be reached: the power of empirical statistics versus parametric statistics.
Neural networks in speech recognition
!   Speech/non-speech classification (Morgan, 1983)
!   Speech event classification (Makino, 1983)
!   Recurrent ANNs (Fallside, Robinson, 1989)
!   Time-Delay Neural Networks (Alex Waibel et al., 1989)
!   Hybrid HMM/ANN (Morgan, Bourlard, 1989)
!   Hidden Control Neural Networks (Esther Levin, 1990)
!   After the initial attempts to use them to recognize speech directly, neural networks came to be employed essentially as models of the statistical distribution within the states of Markov models.
The return of neural networks
[Figure: a network with an INPUT LAYER, a single HIDDEN LAYER, and an OUTPUT LAYER]
The return of neural networks
DEEP NEURAL NETWORKS
[Figure: the same network with many stacked HIDDEN LAYERs between the INPUT LAYER and the OUTPUT LAYER]
Encouraging results with “deep neural networks”
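A minimal sketch of the hybrid DNN/HMM idea behind these results: a deep feed-forward network maps a window of feature frames to posterior probabilities over HMM states, taking the place of the Gaussian mixtures. Layer sizes and weights are arbitrary here; a real acoustic model is trained on very large amounts of transcribed speech.

```python
# Forward pass of a toy deep network producing HMM-state posteriors.
import numpy as np

rng = np.random.default_rng(0)

def layer(n_in, n_out):
    return rng.standard_normal((n_in, n_out)) * 0.1, np.zeros(n_out)

# 11 frames x 26 log-mel features in, a few hidden layers, posteriors over HMM states out.
sizes = [11 * 26, 512, 512, 512, 2000]
weights = [layer(a, b) for a, b in zip(sizes[:-1], sizes[1:])]

def state_posteriors(feature_window):
    h = feature_window.reshape(-1)
    for W, b in weights[:-1]:
        h = np.maximum(h @ W + b, 0.0)          # hidden layers with ReLU
    W, b = weights[-1]
    logits = h @ W + b
    e = np.exp(logits - logits.max())
    return e / e.sum()                          # softmax over HMM states

window = rng.standard_normal((11, 26))          # stand-in for 11 stacked feature frames
p = state_posteriors(window)
print(p.shape, p.sum())                         # (2000,) ~1.0
```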
And what about language understanding and dialog?
Understanding: Conceptual Models (CHRONUS -- Pieraccini, Levin, 1991)
[Figure: a network of concepts (CONCEPT 1, CONCEPT 2, CONCEPT 3) linked by concept-transition probabilities P(C2|C1), P(C3|C1), P(C3|C2); inside each concept, states Sij with state-transition probabilities such as P(S12|S11), P(S13|S12), P(S23|S22), and word trigram probabilities P(wt|wt-1, wt-2, Sij)]
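A toy sketch in the spirit of CHRONUS: a Viterbi search assigns a concept label to every word by combining concept-transition probabilities with per-concept word probabilities (simplified to unigrams here, instead of the trigrams in the figure). All probabilities and labels are invented for illustration.

```python
# Viterbi decoding of a word sequence into concept labels.
import numpy as np

concepts = ["ORIGIN", "DESTINATION", "TIME"]
# P(word | concept): made-up unigram emission tables (trigrams on the slide).
emit = {
    "ORIGIN":      {"from": 0.4, "new": 0.25, "york": 0.25},
    "DESTINATION": {"to": 0.4, "san": 0.25, "francisco": 0.25},
    "TIME":        {"in": 0.3, "the": 0.3, "morning": 0.3},
}
# P(next concept | current concept), including staying in the same concept.
trans = np.array([[0.6, 0.2, 0.2],
                  [0.2, 0.6, 0.2],
                  [0.2, 0.2, 0.6]])

def decode(words, floor=1e-4):
    """Viterbi over concept labels; returns the best concept for each word."""
    score = np.log([emit[c].get(words[0], floor) for c in concepts])
    back = []
    for w in words[1:]:
        e = np.log([emit[c].get(w, floor) for c in concepts])
        cand = score[:, None] + np.log(trans)       # previous concept -> next concept
        back.append(cand.argmax(axis=0))
        score = cand.max(axis=0) + e
    path = [int(score.argmax())]
    for b in reversed(back):
        path.append(int(b[path[-1]]))
    return [concepts[i] for i in reversed(path)]

print(decode("to san francisco from new york in the morning".split()))
```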
Automatic Learning of Statistical Grammars
TRANSCRIPTIONS → ANNOTATIONS
want to cancel the account → CANCEL_ACCOUNT
cancel service → CANCEL_ACCOUNT
I cant send a particular message to a certain group of people → CANNOT_SEND_RECEIVE_EMAIL
cancellation of the service → CANCEL_ACCOUNT
I need to setup my email → EMAIL_SETUP
they registered my modem in from my internet and I need to get my email address → EMAIL_SETUP
my emails are not been received at the address I sent it to → CANNOT_SEND_RECEIVE_EMAIL
…
Language Model for Speech Recognition
Statistical Semantic Classifier
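A minimal sketch of a statistical semantic classifier trained from transcription/annotation pairs like those above; a bag-of-words naive Bayes model stands in for whatever classifier a production system would actually use, and the tiny training set simply reuses the slide's examples.

```python
# Naive Bayes call-reason classifier over bag-of-words features.
from collections import Counter, defaultdict
import math

data = [
    ("want to cancel the account", "CANCEL_ACCOUNT"),
    ("cancel service", "CANCEL_ACCOUNT"),
    ("cancellation of the service", "CANCEL_ACCOUNT"),
    ("I cant send a particular message to a certain group of people", "CANNOT_SEND_RECEIVE_EMAIL"),
    ("my emails are not been received at the address I sent it to", "CANNOT_SEND_RECEIVE_EMAIL"),
    ("I need to setup my email", "EMAIL_SETUP"),
]

word_counts = defaultdict(Counter)
class_counts = Counter()
vocab = set()
for text, label in data:
    words = text.lower().split()
    word_counts[label].update(words)
    class_counts[label] += 1
    vocab.update(words)

def classify(utterance):
    """Pick the class maximizing log P(class) + sum log P(word | class), with add-1 smoothing."""
    best, best_score = None, -math.inf
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / len(data))
        for w in utterance.lower().split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

print(classify("I would like to cancel my service"))
```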
Continuous Learning
D. Suendermann, J. Liscombe, and R. Pieraccini: How to Drink from a Fire Hose: One Person Can Annoscribe 693 Thousand Utterances in One Month. In Proc. of SIGDIAL 2010, 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Tokyo, Japan, September 2010.
[Figure: the same transcription → annotation examples as above (CANCEL_ACCOUNT, CANNOT_SEND_RECEIVE_EMAIL, EMAIL_SETUP, …), continuously feeding the Language Model for Speech Recognition and the Statistical Semantic Classifier]
Dialog – finite-state control models
Machine-learning models of the dialog function
!   Introduction of “reinforcement learning” models based on MDPs (Markov Decision Processes) (Levin, Pieraccini, 2001); a toy sketch follows below
!   Introduction of POMDPs (Partially Observable Markov Decision Processes) (Williams, Young, 2006)
!   Learning the dialog function is still an academic problem
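The toy sketch below illustrates reinforcement learning of a dialog policy on an MDP: states count the filled slots, actions either ask for a slot or close the dialog, and Q-learning against a simulated user learns when to stop asking. All dynamics and rewards are invented.

```python
# Q-learning of a toy slot-filling dialog policy on an MDP.
import random

ACTIONS = ["ask_slot", "close"]
N_SLOTS = 2
Q = {(s, a): 0.0 for s in range(N_SLOTS + 1) for a in ACTIONS}

def step(filled, action):
    """Simulated user/environment: returns (next_state, reward, done)."""
    if action == "close":
        return filled, (20.0 if filled == N_SLOTS else -20.0), True
    understood = random.random() < 0.8          # the slot value is recognized 80% of the time
    nxt = min(filled + 1, N_SLOTS) if understood else filled
    return nxt, -1.0, False                     # each extra turn has a cost

alpha, gamma, eps = 0.1, 0.95, 0.1
for _ in range(20000):
    s, done = 0, False
    while not done:
        a = random.choice(ACTIONS) if random.random() < eps else max(ACTIONS, key=lambda x: Q[(s, x)])
        s2, r, done = step(s, a)
        target = r if done else r + gamma * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

print({s: max(ACTIONS, key=lambda x: Q[(s, x)]) for s in range(N_SLOTS + 1)})
# Expected learned policy: keep asking until both slots are filled, then close.
```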
“Online” dialog optimization
D. Suendermann and R. Pieraccini: One Year of Contender: What Have We Learned about Assessing and Tuning Industrial Spoken
Dialog Systems? In Proc. of the Workshop on Future Directions and Needs in the Spoken Dialog Community at the Conference of the
North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Montreal, Canada, June
2012.
[Slide: excerpt from the Contender paper. Table 1, “Statistics of example Contenders”, lists Contenders deployed in TV, Internet, and TV/Internet applications (problem capture, cable box reboot order, outage prediction, on demand, input source troubleshooting, account lookup, troubleshooting paths I and II, computer monitor instruction, opt in), with call volumes ranging from 9,627 (account lookup) to 13,477,810 (problem capture), the split of traffic across competing call-flow alternatives, and the resulting gain ∆R, calculated by multiplying the observed difference in automation rate ∆A by the number of monthly calls hitting the Contender. A plot tracks the average number of automated calls gained per month, with probability recalculation points at 10,000 and 20,000 calls. From the paper's conclusion: “We have seen that the use of Contenders (a method to assess and tune arbitrary components of industrial spoken dialog systems) can be very beneficial in multiple respects. Applications can self-correct as soon as reliable data becomes available without additional manual analysis and intervention. Moreover, performance can increase substantially in applications implementing Contenders.”]
Conclusions
!   Renewed enthusiasm for human-machine voice interaction, driven by mass-market applications
!   Still far from human performance, especially in non-optimal acoustic conditions
!   The search for new solutions and the revisiting of some old ideas could narrow the gap
!   Limited progress in language understanding and dialog management.