SINAI at CLEF 2004: Using Machine Translation resources with

Transcript

SINAI at CLEF 2004: Using Machine Translation resources with
SINAI at CLEF 2004: Using
Machine Translation resources
with mixed
2-step RSV merging algorithm
University of Jaén - SPAIN
Fernando Martínez Santiago
Miguel Ángel García Cumbreras
Manuel Carlos Díaz Galiano
Alfonso Ureña López
1
SINAI at CLEF 2004
Multilingual CLEF task
•2-step
RSV calculus requires that given a word of the
{
original query, its translation to the rest of languages must
be known.
•MT translates the whole of the phrase better than word for
word,
{ we can’t use MT with 2-step RSV merging algorithm
directly.
•We propose an straightforward and efective algorithm to
align the original query and its translation at term level.
Only queries are translated:
merging algorithms are required
{
Main interest: testing Machine
Translation(MT) with mixed 2-step
RSV merging algorithm.
University of Jaén -- SINAI at CLEF 2004
2
Content
{
{
{
{
{
{
How 2-step RSV merging algorithm
works?
An algorithm to align parallel text at
term level based on MT
Experimentation framework
Results
Global pseudo-relevance feedback
Conclusions a future work
University of Jaén -- SINAI at CLEF 2004
3
2-step RSV
method
Conflict of Interest in Italy?
Conflicto
intereses
Italia
Spanish IR
Conflits
intérêt
Italie
Conflitto
interessi
Italia
Italian IR
#1 (conflicto, conflitto, conflits)
#2 (intereses, interessi, intérêt)
#3 (Italia, Italia, Italie)
French IR
Spanish, Italian, French retrieved
documents
Indexing concepts
(#1,#2,#3)
STEP 2
Spanish
Concept IR
STEP 1
Italian
French
Concept
Term
df
Term
df
Term
df
Id.
df
conflicto
1000
conflitto
1500
conflits
1250
#1
intereses
5000
interessi
6000
Intérêt
4000
#2
Italia
3500
Italia
6000
Italie
4500
#3
1000+1500
+1250=3750
5000+6000
+4000=15000
3500+6000
+4500=14000
Multilingual list of
Ranked documents
An algorithm to align at term level a phrase and its
translations by using machine translation resources
STEP 1
Pen Æ
Pesticides in
baby food
Machine
Translation
STEP 2
Psp Æ
Pesticidas en
alimento para
niños
Unigrams (Pen) =
{Pesticides, baby,
food}
Bigrams (Pen) =
{Pesticides baby,
baby food}
U (Psp) = {Pesticidas,alimento,bebés}
B (Psp) = {Pesticidas bebés,
alimento para niños}
...
STEP 5
STEP 4
To Align Unigrams
Non Stopwords Stemmed
...
STEP 3
Psp Æ Pesticida alimento niño
U (Psp) =
{Pesticida,alimento,bebé}
B (Psp) = {Pesticida bebé,
alimento niño}
Pesticida alimento niño, {Pesticida, alimento, bebé}
Pesticide baby food, {
Pesticide
,
Food ,
...
baby}
STEP 6
To Align Bigrams
Pesticida alimento niño, {Pesticida bebé, alimento niño}
...
Pesticide baby food, {Pesticide baby,
University of Jaén -- SINAI at CLEF 2004
baby
food
}
5
An algorithm to align at term level a phrase and its
translations by using machine translation resources
Spanish German French Italian
91%
87%
86%
88%
Percentage of aligned non-empty words
(CLEF2001+CLEF2002+CLEF2003 query set,
Title+Description fields, Babelfish machine Translation)
Finnish
French
Russian
100%
85%
80%
Percentage of aligned non-empty words (CLEF2004
query set, Title+Description fields,
MT for French and Russian. MDR for Finnish)
University of Jaén -- SINAI at CLEF 2004
6
Mixed 2-step RSV:Queries partially
aligned
{
Raw mixed 2-step RSV:
Conflicto de intereses en Italia
{
SP123
RSValigned terms
Aligned
terms
Logistic regression:
573
Dinamic 2-step RSV index
Non- aligned
terms
RSVnot
aligned terms
39
Monolingual index (Spanish)
University of Jaén -- SINAI at CLEF 2004
7
Experimentation framework –
language dependent features
English Finnish
French
Russian
Preprocessing stop words removed and stemming
Additional
preprocessing
compounds
words to
simple words
CyrillicÆASCII
Query aligment at word
level algorithm based on
MT
Translation
approach
FinnPlace
MDR
University of Jaén -- SINAI at CLEF 2004
Reverso Prompt MT
MT
8
Experimentation framework –
language independent features
{
ZPrise IR engine
{
OKAPI probabilistic model (fixed at b =
0.75 and k1 = 1.2 for every language and
for the 2-step RSV index)
{
Neither blind feedback nor query
expansion (no improvement except of
French)
University of Jaén -- SINAI at CLEF 2004
9
Results
University of Jaén -- SINAI at CLEF 2004
10
Global pseudo-relevance feedback
{
The idea:
z
{
Since 2-step RSV creates a only index for all collections
and it returns a only multingual list of documents, why
don’t apply PRF with such index?
The implementation:
1.
2.
3.
4.
5.
Merge the document rankings using 2-step RSV.
Apply blind relevance feedback to the top-N documents
ranked into the multilingual list of documents.
Add the top-N more meaningful terms to the query.
Expand the concept query with the selected terms.
Apply again 2-step RSV over the ranked lists of
documents, but by using the expanded query instead of
the original query.
University of Jaén -- SINAI at CLEF 2004
11
Conflict of Interest in Italy?
Conflicto
intereses
Italia
Spanish IR
#1 (conflicto, conflitto, conflits)
#2 (intereses, interessi, intérêt)
#3 (Italia, Italia, Italie)
OCSE, concept#2, politica
concept#1, broadcasting
Conflits
intérêt
Italie
Conflitto
interessi
Italia
Italian IR
French IR
Spanish, Italian, French retrieved
documents
Indexing concepts
(#1,#2,#3)
Concept IR
Multilingual list of
Ranked documents
Global pseudo-relevance feedback
{
But global PRF doesn’t work:
z
z
Usually, blind relevance feedback is poorly suited to
CLEF document collections.
We use the expanded query to apply 2-step RSV reweighting the documents retrieved for each language,
but the list of retrieved documents does not change (it
only changes the score of such documents).
University of Jaén -- SINAI at CLEF 2004
13
Conclusions and future work
{
{
{
2-step RSV merging algorithm works well with
Machine Translation!
We propose a new word-level aligment algorithm
based on MT
In the future:
z
z
Partially aligned queries Æ integration of two scores
for documents must be improved by using other
algorithms (normalized versions of raw mixed 2step RSV, SVM and neural and bayesian
networks...)
Global PRF idea must be more investigated
University of Jaén -- SINAI at CLEF 2004
14