Web Corpora

Italian Corpora.
Spoken, written and web corpora for
research and language teaching
Massimo Moneglia
LABLITA (University of Florence)
Corpora di italiano (Marco Baroni)
Entry in the Enciclopedia dell'Italiano (2010)
http://www.treccani.it/enciclopedia/corpora-diitaliano_(Enciclopedia-dell'Italiano)/
Isabella Chiari Web page
http://www.alphabit.net/home/index.php?option=com_content&view=section&id=6&Itemid=11
Manuel Barbera Linguistica dei corpora e linguistica dei corpora
italiana. Un'introduzione
http://www.bmanuel.org/man/cl-HOME.htm
Types of Resources
• Spoken corpora
• Broadcasting corpora
• Written corpora
• Learners corpora
• Web corpora
Distribution means
• Corpora available through a web service
• Corpora on DVD
• Corpora available for free download
Query types
• Concordances
– by lemma / form / phrase
– Ordering by context
– Ranking of types
• Patterns
• Collocation
– General
– Restricted per PoS
– Word sketches
– Sketch difference
• CQL
• Frequency lists
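The concordance and frequency-list queries listed above can be illustrated with a minimal sketch. It assumes a corpus that is already tokenized into a plain Python list; the function names (`kwic`, `sort_by_right_context`, `frequency_list`) and the toy sentence are hypothetical and do not reproduce the interface of any of the corpora described here.

```python
from collections import Counter

def kwic(tokens, keyword, width=3):
    """Return (left context, node, right context) for every hit of a word form."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, tok, right))
    return hits

def sort_by_right_context(hits):
    """Order concordance lines alphabetically by the words following the node."""
    return sorted(hits, key=lambda h: [t.lower() for t in h[2]])

def frequency_list(tokens):
    """Rank word forms (types) by number of occurrences."""
    return Counter(t.lower() for t in tokens).most_common()

# Toy corpus, invented for illustration only.
tokens = "vado a casa e poi vado al mare ma non vado a scuola".split()
hits = sort_by_right_context(kwic(tokens, "vado", width=2))
for left, node, right in hits:
    print(f"{' '.join(left):>12} [{node}] {' '.join(right)}")
print(frequency_list(tokens)[:3])
```

Searching by lemma instead of form would require a lemmatized corpus, in which each token carries its lemma alongside the word form.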
Corpora in the classroom
• TALC conferences (since 1994)
• data-driven learning
– Students explore concordances
• Discover language facts for themselves
– Real language
– Test hypotheses
– If they learn like this, they will remember
• After twenty years
– Minority interest
– Advanced level (university) only
– Most teachers haven't heard of it
Do they meet student needs?
• Dictionary is much easier
• Concordances
– slow and arduous
– distractions, confusions
• Motivation
– Not sexy
• “I want to learn English, not Corpus Linguistics”
Adam Kilgarriff
Spoken corpora
• Acoustic information
• Context variation
• Lessico di frequenza dell'italiano parlato (LIP)
(De Mauro, Mancini, Vedovelli & Voghera, 1993)
– Around 500,000 words (57 hours of speech)
– No acoustic source available
– Diaphasic and diatopic variation (recordings in Milano, Firenze, Roma and Napoli)
– Now searchable online. URL: http://badip.uni-graz.at/
• API/AVIP/IPAR (1999-2001)
– Map task in different Italian varieties (Pisa, Napoli, Bari)
– High acoustic quality
– 75 minutes segmented into phonemes
– Distributed on CD-ROM by CIRASS and through ftp.cirass.unina.it
URL: http://www.cirass.unina.it/
www.parlaritaliano.it (server not active at the moment)
• Corpora Linguistici per l'Italiano Parlato e Scritto (CLIPS, 2001-2003)
– Around 100 hours of spoken Italian (50% male, 50% female voices), partially transcribed, with phonetic transcription
– Recordings in Bari, Bergamo, Bologna, Cagliari, Catanzaro, Firenze, Genova, Lecce, Milano, Napoli, Palermo, Parma, Perugia, Roma, Venezia. For each city:
- Broadcasting (news, interviews, talk shows)
- Semi-spontaneous dialogues (240 map-task speakers)
- Texts read by non-professional speakers (20 sentences covering the high-frequency lexicon)
- Telephone conversations (300 speakers)
- Texts read by 20 professional speakers (160 sentences covering high-frequency phonotactic sequences) in an anechoic room
– Free download. URL: http://www.clips.unina.it/
Corpus LABLITA-C-ORAL-ROM
• Spontaneous Spoken Italian Corpus (Cresti, 2000)
• Large context variation (recorded in Tuscany since 1965; 1.00.000 W.)
– Corpus design
– Text / speech synchronization by utterance
– Tools for the exploitation of the acoustic information
– Comparative approach
• Partially published as the C-ORAL-ROM Italian corpus within the multilingual Romance corpus C-ORAL-ROM (300,000 words, 32 hours)
– DVD encrypted edition for personal use (Cresti & Moneglia eds., C-ORAL-ROM. Integrated Reference Corpora for Spoken Romance Languages. Amsterdam: Benjamins)
http://www.benjamins.nl/#catalog/books/scl.15/
– DVDs for laboratories, distributed by ELDA (Paris)
– http://www.elda.org/catalogue/en/speech/S0172.html
Corpus Design
Acoustic source
• Segmentation of spontaneous speech by learners
– exploitation of transcripts
– utterance boundaries
• isolated listening
• repetition
• lexical and prosodic patterns
IPIC Data Base
• C-ORAL-ROM Italian Informal
• Multilingual platform with a comparable Brazilian Portuguese minicorpus (Spanish in preparation)
• Annotation
– PoS/Lemma
– Information structure / prosodic parsing
• Available on the web from the LABLITA web site
• http://lablita.dit.unifi.it/ipic/ipic_access
Broadcasting
• Lessico di frequenza dell'italiano radiofonico (LIR)
– Around 60 hours (transcribed, lemmatized, aligned)
– On CD-ROM; at present available only on site at the Accademia della Crusca, but expected online within the VIVIT infrastructure
• Il Lessico Italiano Televisivo
• LIT and DIA-LIT: multimodal, multichannel recordings with transcripts, searchable online
– LIT: sample of Rai and Mediaset broadcasts during 2006 (around 168 hours)
– DIA-LIT: 40 hours from 1954 to the present
• http://www.italianotelevisivo.org/
Written corpora
• Language of the origins
• Literature
• Targeted resources
• Learners corpora
• Standard Italian
Language of the Origins
• Tesoro della lingua italiana delle origini (TLIO)
• The TLIO is based on the OVI text corpus of Old Italian, which can be consulted in full.
– Text database: 2,001 texts; 21,911,171 tokens (2012); 26,504 lemmas
– Italian written before 1375 (death of Boccaccio), both poetry and prose
– Searchable online
• Banche Dati dell'Opera del Vocabolario Italiano
• http://www.ovi.cnr.it/index.php?page=banchedati
• http://gattoweb.ovi.cnr.it/(S(fbi1pu45pc0jdxqnq2d2wi55))/CatForm01.aspx
• CT "Corpus Taurinense"
– Corpus of Old Italian (21 texts, 13th century, Firenze): 259,299 tokens; 21,087 types; 7,599 lemmas
– Built from the same set of Old Florentine texts chosen by Lorenzo Renzi and Giampaolo Salvi for their ItalAnt, Grammatica dell'italiano antico
• This set of texts is a subset of the TLIO, Tesoro della lingua italiana delle origini, kindly supplied by Pietro Beltrami (OVI)
– Lemma and PoS tagging according to EAGLES specs
– http://www.corpora.unito.it/italant/index.html
Italian Literature
Letteratura Italiana Zanichelli (Picchi & Stoppelli, CD-ROM)
1000 works of Italian literature by 245 authors, from Francesco
d’Assisi’s Cantico delle creature to Italo Svevo’s La Coscienza di
Zeno. The search interface allows for the creation of word
indices by alphabet, frequency, incipits.
Primo Tesoro della Lingua Letteraria del Novecento
(De Mauro ed.)
• Selection of 100 novels among those presented at the Premio Strega from 1947 to 2006 (the winners and some of the other most significant titles)
• 8 million words, lemma/PoS annotation
• DVD published by UTET with internal search engine and statistics
• Corpus e Lessico di Frequenza dell'Italiano Scritto (CoLFIS)
• 3,150,075 words taken from newspapers, magazines and miscellaneous texts, balanced according to their impact on the Italian audience
• Description: http://www.istc.cnr.it/material/database/colfis/
• Download (partial): http://www.ge.ilc.cnr.it/strumenti.php
Targeted Corpora from UNITO and UNIBO
• Athenaeum Corpus
Corpus of written academic Italian from the Università di Torino
– Various text types
– 306,927 tokens; 32,221 types; 11,748 lemmas
• Jus Jurium
(in progress) A free Italian corpus covering the full legal universe of discourse current in Italy.
http://www.bmanuel.org/projects/
• The Bononia Legal Corpus (BoLC)
A multilingual comparable legal corpus: parallel corpora in Italian and English.
• http://corpora.ficlit.unibo.it/
Learners corpora
Corpus LIPS (Lessico Italiano Parlato di Stranieri)
• Transcripts from CILS (Certificazione di Italiano come Lingua Straniera) of the Università per Stranieri di Siena
• Oral exams only (bidirectional and monodirectional exchanges)
• Around 700,000 words (100 hours)
• Lemmatized with TreeTagger
• Frequency lists for each learning level
• Free download from www.parlaritaliano.it (possibly not active at present)
VALICO: Varietà di Apprendimento della Lingua Italiana
VINCA: Varietà di Italiano di Nativi Corpus Appaiato
(Barbera, Marello & Corino)
VALICO: multilingual learners corpus of free texts, translations, and written texts elicited by iconic stimuli
– Main languages: English, French, Spanish, German
– Samples of texts from less represented languages (Maltese, Polish, Japanese, Arabic, Serbian, Portuguese, Hungarian)
VINCA: parallel corpus of texts written by native-speaker informants
http://www.valico.org/
Main Italian corpora on the web
Reference corpus
– CoRIS/Codis
Opportunistic
– Corpus la Repubblica
Web Corpora
– NUNC
– Webbit
– ItWaC
– itTenTen
– Paisà
– RIDIRE
Corpus di Italiano Scritto contemporaneo CORIS/CODIS
– Around 130 million words (updated every two years)
– Corpus design: balanced as a reference corpus
– PoS-tagged / lemmatized
– Searchable online
– Allows selection of subcorpora for domain-specific research
• http://corpora.dslo.unibo.it/coris_ita.html
– COrpus di Riferimento dell'Italiano Scritto (CoRIS)
– COrpus Dinamico dell'Italiano Scritto (CoDIS)
Allows searches on balanced subcorpora of the CoDIS corpus, for example:
[lemma="andare"] [pos="PREP"]
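The query above asks for any form of the lemma "andare" immediately followed by a preposition. A minimal sketch of how such a two-token pattern can be matched over PoS-tagged, lemmatized text is given below. It is not a CQL parser and not the CoRIS/CoDIS engine: the query is written by hand as a list of attribute constraints, the helper names are hypothetical, and apart from "PREP" (taken from the query) the tags in the toy sentence are invented.

```python
import re

def match_at(tokens, i, query):
    """Check whether the token sequence starting at position i satisfies the query."""
    if i + len(query) > len(tokens):
        return False
    for tok, constraints in zip(tokens[i:], query):
        for attr, pattern in constraints.items():
            # Each constraint is a regular expression that must match the whole value.
            if not re.fullmatch(pattern, tok[attr]):
                return False
    return True

def search(tokens, query):
    """Return every token sequence that matches the query."""
    return [tokens[i:i + len(query)]
            for i in range(len(tokens))
            if match_at(tokens, i, query)]

# Hand-built equivalent of [lemma="andare"] [pos="PREP"]
query = [{"lemma": "andare"}, {"pos": "PREP"}]

# Toy tagged text, invented for illustration.
tagged = [
    {"word": "Ieri", "lemma": "ieri", "pos": "ADV"},
    {"word": "andavo", "lemma": "andare", "pos": "VERB"},
    {"word": "a", "lemma": "a", "pos": "PREP"},
    {"word": "Roma", "lemma": "Roma", "pos": "NOUN"},
    {"word": "vado", "lemma": "andare", "pos": "VERB"},
    {"word": "spesso", "lemma": "spesso", "pos": "ADV"},
]

matches = search(tagged, query)
for m in matches:
    print(" ".join(t["word"] for t in m))
```

Only "andavo a" is returned: "vado spesso" contains the lemma but is not followed by a preposition.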
Corpus La Repubblica (Bologna-Forlì), 2004
• Searchable online
• http://dev.sslmit.unibo.it/corpora/corpus.php?path=&name=Repubblica
• Opportunistic corpus taken from the newspaper "La Repubblica", 1985-2000
– PoS-tagged / lemmatized
– Categorized in terms of genre and topic
General labels: news report and comment
Topic labels: church, culture, economics, education, news, politics, science, society, sport, weather
New generation Web corpora
• Representativeness
• Technical problems (boilerplate cleaning and deduplication tools)
– Cleaning HTML pages: defining what counts as text (as opposed to HTML code, images, banners, menus, headers, links)
– Duplicated pages
– Ephemeral pages
– Processing format
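The cleaning and deduplication steps listed above can be sketched as follows. This is only a simplified illustration using the Python standard library: the list of tags treated as boilerplate is an assumption, and only exact duplicates are detected (after whitespace and case normalization), whereas real web-corpus pipelines use dedicated boilerplate-removal and near-duplicate detection tools.

```python
import hashlib
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text, skipping elements assumed to hold code or navigation."""
    SKIP = {"script", "style", "nav", "header", "footer"}  # illustrative choice

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

def html_to_text(html):
    """Strip markup and skipped elements, then normalize whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()

def deduplicate(pages):
    """Keep the first copy of each page whose cleaned text is identical."""
    seen = set()
    unique = []
    for html in pages:
        text = html_to_text(html)
        key = hashlib.sha1(text.lower().encode("utf-8")).hexdigest()
        if text and key not in seen:
            seen.add(key)
            unique.append(text)
    return unique

# Toy pages, invented for illustration.
pages = [
    "<html><nav>Home | Contatti</nav><p>Il corpus è  grande.</p></html>",
    "<html><header>Menu</header><p>Il corpus è grande.</p><script>x=1</script></html>",
    "<p>Un altro testo.</p>",
]
cleaned = deduplicate(pages)
print(cleaned)
```

The first two pages differ only in their navigation and script content, so they collapse into a single text once the boilerplate is removed.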
Representativeness of the language on the web
• The Internet is the largest repository of linguistic information
• It is the main environment for the use of written information in all domains
NUNC "Newsgroups UseNet Corpora"
• Multilingual corpus based on newsgroups in various semantic domains
• More than 600 million words per language: It. De. Fr. En. Es. Ma. Su. Ee. Pt.
• NUNC Italiano (part I)
• NUNC Italiano (part II)
• NUNC Cucina
• NUNC Motori
• NUNC Foto
• NUNC Cinema
M. Barbera, S. Colombo, E. Corino, C. Marello
http://www.corpora.unito.it
http://www.bmanuel.org/projects/
WEBBIT
• Corpus of Italian web pages, over 150 million words
http://clic.cimec.unitn.it/marco/webbit/
• Searchable online
Sampling strategy
• 1. Selection of keywords: 500 frequent forms
• 2. Querying Google: 5,000-8,000 queries with 4-word strings
• 3. Downloading: processing of the first 10 pages returned
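Step 2 of the sampling strategy above, combining seed words into 4-word query strings, can be sketched as follows. The seed list, the function name `build_queries` and the use of random sampling without repetition are assumptions made for illustration; the actual seed forms and the way WEBBIT combined them are not specified in these slides, and the sketch stops before sending anything to a search engine.

```python
import math
import random

def build_queries(seed_words, n_queries, words_per_query=4, rng_seed=0):
    """Draw distinct random combinations of seed words to use as search queries."""
    max_queries = math.comb(len(seed_words), words_per_query)
    if n_queries > max_queries:
        raise ValueError("not enough seed words for that many distinct queries")
    rng = random.Random(rng_seed)
    queries = set()
    while len(queries) < n_queries:
        # Sorting makes each combination order-independent, so duplicates are caught.
        combo = tuple(sorted(rng.sample(seed_words, words_per_query)))
        queries.add(combo)
    return [" ".join(combo) for combo in sorted(queries)]

# Hypothetical seed forms; the real list contained 500 frequent forms.
seeds = ["casa", "tempo", "anno", "giorno", "parte", "vita", "mondo", "lavoro"]
queries = build_queries(seeds, n_queries=5)
for q in queries:
    print(q)
```

With 500 seed forms the number of possible 4-word combinations is far larger than the 5,000-8,000 queries mentioned above, so distinct queries are easy to obtain.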
WaCky: Web-As-Corpus Kool Yinitiative
ItWaC: the first Italian web corpus
– 2 billion words from the web in the .it domain
– PoS/lemma tagged
Marco Baroni, Adam Kilgarriff, Jan Pomikálek, Pavel Rychlý (2006). WebBootCaT: a web tool for instant corpora. Proc. Euralex, Torino, Italy.
WaCky / ItWaC sampling strategy
– Selection of seeds: querying Google with pairs of mid-frequency words taken from the la Repubblica corpus, plus a basic Italian vocabulary list
– Crawling of the web sites corresponding to the seeds
• Marco Baroni, Silvia Bernardini, Adriano Ferraresi, Eros Zanchetta (2009). The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation.
ItWaC
ItWaC is searchable through various web interfaces based on the IMS/CWB and NoSketch Engine corpus tools
Download free of charge on request
Free search from:
http://nl.ijs.si/noske/wacs.cgi/first_form
Information and frequency lists:
http://wacky.sslmit.unibo.it/doku.php?id=download
TenTen corpora
• New generation of Web corpora.
• Created by Web crawling and processed with the latest
boilerplate cleaning and de-duplication tools.
• The name "TenTen" refers to the target size of the corpora: 10^10 (10 billion) words.
itTenTen
Initial version: 3.1 billion tokens
https://the.sketchengine.co.uk/login/
massimo.moneglia
7uFp3Bh2ma
Web corpus Paisà
• Web corpus (around 250 million tokens) built from Creative Commons texts
• PoS tagging (Istituto di Linguistica Computazionale, Pisa)
• Syntactic dependencies in CoNLL format, produced with the DeSR parser
• Free download
• Searchable online
– Form and lemma search, plus syntactic distribution
– http://www.corpusitaliano.it/
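A dependency annotation in CoNLL format stores one token per line with tab-separated columns. The sketch below reads such lines and lists the dependency relations. It assumes the CoNLL-X column order (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL); the exact layout of the Paisà download should be checked against its own documentation, and the sentence, tags and relation labels shown here are invented for illustration.

```python
def parse_conll(block):
    """Turn CoNLL-style lines into a list of token dictionaries."""
    tokens = []
    for line in block.strip().splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        tokens.append({
            "id": int(cols[0]),
            "form": cols[1],
            "lemma": cols[2],
            "pos": cols[3],
            "head": int(cols[6]),   # 0 means the token is the root
            "deprel": cols[7],
        })
    return tokens

def dependencies(tokens):
    """Return (relation, head form, dependent form) triples."""
    forms = {t["id"]: t["form"] for t in tokens}
    return [(t["deprel"], forms.get(t["head"], "ROOT"), t["form"]) for t in tokens]

# Toy sentence "Maria legge libri"; tags and labels are hypothetical.
rows = [
    ("1", "Maria", "Maria", "S", "SP", "_", "2", "subj"),
    ("2", "legge", "leggere", "V", "V", "_", "0", "ROOT"),
    ("3", "libri", "libro", "S", "S", "_", "2", "obj"),
]
sample = "\n".join("\t".join(r) for r in rows)

deps = dependencies(parse_conll(sample))
for rel, head, dep in deps:
    print(f"{rel}({head}, {dep})")
```

Counting such triples per lemma is one simple way to obtain the kind of syntactic distribution that the online search reports.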
Relations
RIDIRE targeted web corpus (around 2 billion words)
Sampling strategy: target language usage in the domains that can be of interest for a learner
• Domains characterized by a functional use of the language
• Semantic domains in which Italian culture excels
Semantic Domains vs Functional Domains
Semantic domains
1 - Cooking (100 MLN)
2 - Literature and Theatre (100 MLN)
3 - Architecture & Design (100 MLN)
4 - Sport (100 MLN)
5 - Fashion (100 MLN)
6 - Music (100 MLN)
7 - Religion (100 MLN)
8 - Cinema (100 MLN)
9 - Fine arts (100 MLN)
Functional domains
1 - News (400 MLN)
2 - Law & Administration (300 MLN)
3 - Business (300 MLN)
http://www.ridire.it/it.drwolf.ridire/home.seam
Beta version: 750 million words
User: demo
Pass: demolima
(release of version 1.0: December 2013)
CORpora DIdattiCi - LABLITA
• CorDIC-scritto
• CorDIC-parlato
– Two strictly comparable resources for comparing the spoken and written varieties for didactic purposes
• http://corporadidattici.lablita.it/
• 500,000 words each, in 200 samples (2,500 words on average)
Written
Domain         N. texts   N. words   %
Art            40         101,299    20.15%
Bureaucracy    40          98,814    19.66%
Creative       40         101,725    20.24%
Economy        40         100,072    19.91%
Newspapers     40         100,755    20.04%
Total          200        502,665

Spoken
Context        N. texts   N. words   %
Private        82         193,905    38.86%
Public         86         198,468    39.77%
Broadcasting   32         106,638    21.37%
Total          200        499,011

Interaction              N. texts   N. words   %
Dialogues                115        266,095    67.82%
Monologues               53         126,278    32.18%
Total natural context    168        392,373