Curriculum - Valter Crescenzi


Curriculum Vitae
Valter Crescenzi
February 2012
Contact Info
Valter Crescenzi
Via della Vasca Navale, 79 — I-00146 Rome, Italy
Tel. +39 06 5733 3535
e-mail: [email protected]
Current Position: Assistant Professor at ‘Università degli Studi Roma Tre’
Research Activities
Research Positions
• Assistant Professor at ‘Facoltà di Ingegneria’ of ‘Università degli Studi Roma
Tre’ (2005–)
• Junior Researcher at ‘Dipartimento di Informatica ed Automazione’ of ‘Università degli Studi Roma Tre’. (2003–2004)
• Research Fellow (project “Gestione dei dati per i processi decisionali: acquisizione, integrazione e presentazione”) at ‘Dipartimento di Informatica ed
Automazione’ of ‘Università degli Studi Roma Tre’, under the supervision of
Prof. Paolo Atzeni. (2002–2003)
• PhD in Computer Engineering received in February 2002 from ‘Dipartimento
di Sistemistica’ of ‘Università degli Studi di Roma La Sapienza’. Dissertation:
“On Automatic Data Extraction from Large Websites” [40]. Supervisors: Prof.
Paolo Atzeni and Prof. Giansalvatore Mecca.
The main results have been published in an international journal [2] and presented
at an international conference [8].
• Computer Engineering degree (“Laurea in Ingegneria Informatica”) in 1998
from ‘Università degli Studi Roma Tre’, with a thesis titled “Un riconoscitore
di grammatiche formali con gestione delle eccezioni”, under the supervision of
Prof. Paolo Atzeni and Prof. Giansalvatore Mecca.
The main results have been published in an international journal [1].
Research Topics
While working on his master's thesis, Valter Crescenzi developed an interest in research
topics related to information extraction from web sources.
Initially (1998–1999) he was interested in the definition of a new formalism for the
manual yet effective specification of wrapper software modules, i.e. programs able
to extract structured information from unstructured web pages. He developed a
formalism aiming to combine the advantages of declarative languages (such as grammars)
and procedural languages (such as editing scripts) for expressing effective and
precise extraction rules [1].
During his PhD studies (1999–2002), he researched how to further improve the
level of automation of wrapper production for large websites [24, 6, 25], whose pages
are generally produced by querying an underlying database and embedding the query
results into a fixed HTML template. Even though these websites contain a large number
of pages, the pages can usually be classified into a relatively small number of classes [10]
composed of structurally similar pages.
These research activities produced results in two phases, from 1999 to 2002 and
from 2002 to 2004:
• in the former phase (1999–2002) an innovative algorithm for inferring regular
expressions was proposed: the algorithm is based on a progressive and
comparative analysis of sample pages obeying a regular grammar picked
from a family of grammars crafted on purpose [8, 26, 9];
• in the latter phase (2002–2004) the relationships between this algorithm, presented to the data extraction community, and the learning algorithms presented within the much more established grammar inference community [41, 42]
were clarified, with interesting results for both communities [27, 2].
Namely, it has been clarified that many inference algorithms taking only positive
samples as input (a paradigm known as identification in the limit [41]), which were
studied by researchers in the grammar inference community, were not useful as a
tool for producing wrappers.
One of the goals of that community is to study how to learn expressive classes of
languages, but the more expressive the class of languages inferred, the less likely is
the availability of a representative and finite sample of pages [42]. Since a wrapper
generation tool requires a non-expert user to provide these samples, they should
be obtainable by randomly picking a small number of sample pages [27]. A class of
languages (called “Prefix Mark-Up Languages”) identifiable in the limit has been
proposed and formally studied as a first example of a class of languages suitable
for wrapper generation [2].
Subsequent research (2004–2008) can be summarized in two main lines:
the grammar inference algorithm has been refined [29, 13, 11] to deal with many
structures frequently occurring on the Web; the class of languages identifiable in the
limit has been expanded while maintaining the simplicity of the characteristic samples [4].
Further research (2002–2008) aimed at scaling out the extraction process to cover many classes of pages from several large websites. Many research issues
arise, including the effective crawling of sample pages within a website [12, 15, 3],
and the classification of downloaded pages into classes suitable for automatic wrapping. Both approaches based on the analysis of the regularities in the inner structure of pages [10, 14] and approaches based on the analysis
of the regularities in the topology of large websites [30, 3, 34, 18] have been pursued.
Recently, this research line has been further expanded to the web scale [17, 31, 5],
tackling the additional issues related to searching and retrieving websites publishing
relevant information [36, 31], their integration [32], and the scalability of the overall
approach [38]. In this context, the idea naturally arises of characterizing probabilistically the quality of the extracted information [37, 19, 20] and the accuracy of the
sources involved [21, 33, 39, 22], even in the presence of copiers among them.
Most of these research activities have been carried out in the context of international
and national research projects.
Participation in Research Projects
• international research project INTAS: “Modeling and Management of Semi
Structured Data for Dynamic World Wide Web Applications” (1999–2000).
• national research project MURST (ex 40%) Data–X: “Gestione, Trasformazione e Scambio di Dati in Ambiente Web” (1999–2000).
• FIRB-MIUR project MAIS: “Multichannel adaptive information systems”
• European research project (Vfp) MOSES: “MOdular and Scalable Environment for the Semantic web” (2002–2006).
• national research project MIUR ECD: “Tecnologie per arricchire e fornire
accesso a contenuti” (2002–2005).
• national research project (PRIN) WISDOM: “Ricerca Intelligente su Web
basata su Ontologie di Dominio” (2004–2006).
• principal investigator of a project for building an industrial demonstrator of
a web data extractor. The project was funded by “progetto DOCUP
Obiettivo 2 Regione Lazio – Programma 2000-2006 – sottomisura II.5.2.”
• national research project MIUR “NGS: Nuove Tecnologie e Strumenti per
l’Interrogazione di Servizi di Ricerca su Web” (2007–2009).
• project MORNING - “Metodologie e strumenti per analizzare dati da sorgenti
del Social Web.” FILAS-RS-2009-1132, funded by “CUP F87I10000750007,
POR FERS Lazio 2007/2013 Asse I Attività I.1.” (2009–2012)
• national research project (PRIN) “EASE: Identificazione, riconciliazione, estrazione e integrazione di Entità dal Web” (2010–2012).
Other Collaborations
• From 1999 to March 2004 he took part in the design, creation and management
of the XML version of the ACM SIGMOD Record online site. In particular,
he worked on extracting data from web sources and converting them into XML format. The outcome of the initiative has been the subject of many studies
• Member of the program committee of several national and international conferences: Workshop on Adaptive Text Extraction and Mining (ATEM 2003),
Workshop on Adaptive Text Extraction and Mining (ATEM 2006), International Conference on Web Information Systems Engineering (WISE 2008),
Sistemi Evoluti per Basi di Dati (SEBD 2012)
• External reviewer for many conferences, including SAC 2002, ACM SIGMOD
2003, ICWE 2004, VLDB 2004, ACM SIGMOD 2005, ICDE 2006, EDBT
2006, VLDB 2007
• Reviewer for international journals such as Information Systems (Kluwer Publishers), Software: Practice and Experience (Wiley), Data And Knowledge
Engineering (Elsevier), Journal of Intelligent Information Systems (Springer)
• He has been the presenting author at these international conferences: SAC
2002 (Madrid, Spain), WebDB 2003 (San Diego, USA), ATEM 2003 (San
Jose, USA), WEBIST 2005 (Miami, USA), ICDE 2006 Workshops (Atlanta,
• Panelist during the Workshop on Adaptive Text Extraction and Mining (ATEM
• Co-founder of the academic spin-off “Chi-Technologies” s.r.l., a company partly owned by ‘Università degli Studi Roma Tre’ whose goal is the industrial
exploitation of the research results on automatic information extraction
from the Web
Teaching Experience
Institutional Teaching Activities
He has taught the following academic courses at ‘Facoltà di Ingegneria’,
‘Università degli Studi Roma Tre’:
• Sistemi Operativi II, 2003/2004, 2004/2005
• Programmazione Concorrente, 2005/2006, 2006/2007, 2007/2008, 2008/2009,
2009/2010, 2010/2011 and 2011/2012
• Elementi di Informatica, 2010/2011, 2011/2012
• Programmazione Orientata agli Oggetti, 2004/2005, 2005/2006, 2006/2007,
He has been a teaching assistant for the following courses at ‘Facoltà di Ingegneria’,
‘Università degli Studi Roma Tre’:
• Sistemi Operativi, academic year 2000/2001
• Sistemi Operativi 1, Sistemi Operativi 2, 2002/2003
• Programmazione Orientata agli Oggetti, 2002/2003
• Ingegneria del Software, 2003/2004
• Progetto di Sistemi Informatici, 2004/2005, 2005/2006, 2006/2007 and 2007/2008
He taught in the following second-level master courses of ‘Università degli Studi
Roma Tre’:
• Basi di Dati — Master Universitario in Economia e Tecnologia della Società
dell’Informazione, academic years 2001/2002, 2002/2003, and 2003/2004
• Basi di Dati — Master Universitario in Governance, Sistema di Controllo e
Auditing, academic years 2005/2006 and 2006/2007
• Programmazione orientata agli oggetti, Basi di dati ed XML, Metodi per lo
sviluppo agile — Master Universitario in Governo dei Sistemi Informativi:
sviluppo, gestione, monitoraggio, 2007/2008
• Basi di dati and Metodi per lo sviluppo agile –
Master Universitario in Governo dei Sistemi Informativi: sviluppo, gestione,
monitoraggio, 2009/2010 and 2011/2012
He is a tutor for the following academic courses at Facoltà di Ingegneria, Università
Telematica Internazionale UNINETTUNO:
• Sistemi Informativi e Basi di dati — Corso di Studi in Ingegneria Informatica
ed Ingegneria Gestionale, academic year 2011/2012
• Ingegneria del Software e Programmazione ad Oggetti — Corso di Studi in
Ingegneria Informatica, 2011/2012
Other Institutional Teaching Activities
During academic years 2004/2005, 2005/2006, 2006/2007, and 2007/2008 he designed and supervised the development of a web application for partially automating the exams of several programming courses of ‘Facoltà di Ingegneria’,
‘Università degli Studi Roma Tre’, including Programmazione Orientata agli Oggetti, Fondamenti di Informatica I, Laboratorio di Informatica, and Programmazione
Professional Teaching Experience
He has taught the following courses:
• Progettazione Banche Dati for Engineering Ingegneria Informatica SpA (2000–
• Progettazione Banche Dati, Il linguaggio XML, Sistemi Operativi for “Direzione Corsi Elettronica, Optoelettronica ed Informatica” for Ministero della
Difesa (2003–2007)
• Basi di Dati for Scuola di Polizia Tributaria.
• Specialista Sviluppo Applicazioni Object Oriented, Analista Programmatore for
Centro Italiano Opere Femminili Salesiane - Formazione Professionale
• Il linguaggio UML, for Sudgest S.C.p.a
• Progettista di Siti Web, for ENAIP Lazio, 2010.
International Journals
[1] V. Crescenzi and G. Mecca. Grammars Have Exceptions. Information Systems,
23(8): 539-565 (1998)
[2] V. Crescenzi and G. Mecca. Automatic information extraction from large
websites. Journal of the ACM, 51(5): 731-779 (2004)
[3] V. Crescenzi, P. Merialdo and P. Missier. Clustering Web pages based on their
structure. Data & Knowledge Engineering, 54(3): 279-299 (2005)
[4] V. Crescenzi and P. Merialdo. Wrapper Inference for Ambiguous Web Pages.
Applied Artificial Intelligence, 22(1): 21-52 (2008)
[5] L. Blanco, V. Crescenzi and P. Merialdo. Structure and Semantics of Data-intensive Web Pages: an Experimental Study of their Relationships. Journal of
Universal Computer Science. Special Issue on Wrapping Web Data Islands.
International Conference Proceedings
[6] G. Mecca, P. Merialdo, P. Atzeni and V. Crescenzi. The (short) Araneus Guide to Web Site Development. Second Workshop on Databases and
the Web (WebDb’99) in conjunction with ACM SIGMOD’99, Philadelphia
(Pennsylvania), (June 1999).
[7] V. Crescenzi, G. Mecca and P. Merialdo. The RoadRunner Project: towards
Automatic Extraction of Web Data. International Workshop on Automatic
Text Extraction Methods (ATEM 2001) in conjunction with Seventeenth International Joint Conference on Artificial Intelligence (IJCAI 2001), Seattle
(Washington), (2001).
[8] V. Crescenzi, G. Mecca and P. Merialdo. RoadRunner: Towards Automatic
Data Extraction from Large Web Sites. Proceedings of the 27th International
Conference on Very Large Databases (VLDB 2001), Rome (Italy), pp. 109–119,
Morgan Kaufmann, (2001).
[9] V. Crescenzi, G. Mecca and P. Merialdo. Automatic Web Information Extraction in the RoadRunner System. International Workshop on Data Semantics
in Web Information Systems (DASWIS 2001) in conjunction with 20th International Conference on Conceptual Modeling (ER 2001), Yokohama (Japan).
Lecture Notes in Computer Science 2465 Springer, (2002).
[10] V. Crescenzi, G. Mecca and P. Merialdo. Wrapping-oriented classification of
web pages. ACM Symposium on Applied Computing (SAC), March 10-14, 2002,
Madrid (Spain). ACM Press (2002).
[11] L. Arlotta, V. Crescenzi, G. Mecca and P. Merialdo. Automatic annotation
of data extracted from large Web sites. Sixth Int. Workshop on Databases
and the Web (WebDb’03) in conjunction with ACM SIGMOD’03, San Diego
(California), (June 2003).
[12] V. Crescenzi, P. Merialdo and P. Missier. Fine-grain Web Site Structure Discovery. Fifth ACM CIKM International Workshop on Web Information and Data
Management (ACM WIDM 2003), November 2003, New Orleans (Louisiana).
ACM Press (2003).
[13] V. Crescenzi, G. Mecca and P. Merialdo. Handling irregularities in RoadRunner.
The AAAI-04 International Workshop on Adaptive Text Extraction and Mining
(ATEM 2004), July 26th, 2004, San Jose (California) (2004).
[14] V. Crescenzi, G. Mecca, P. Merialdo and P. Missier. An Automatic Data Grabber for Large Web Sites. Proceedings of the 30th International Conference
on Very Large Databases (VLDB 2004), September 2004, Toronto (Ontario,
Canada) (2004).
[15] L. Blanco, V. Crescenzi, and P. Merialdo. Efficiently Locating Collections
of Web Pages to Wrap. First International Conference on Web Information
Systems and Technologies, May 2005, Miami (Florida) (2005).
[16] V. Crescenzi, and P. Merialdo. Efficient Techniques for Effective Wrapper Induction. Proceedings of the 22nd International Conference on Data Engineering
Workshops, ICDE 2006, April 2006, Atlanta (Georgia) USA.
[17] L. Blanco, V. Crescenzi, P. Merialdo and P. Papotti. Flint: Google-basing the
Web. 11th International Conference on Extending Database Technology, Nantes,
France, March 2008.
[18] C. Bertoli, V. Crescenzi, and P. Merialdo. Crawling Programs for Wrapper-based Applications. The 2008 IEEE International Conference on Information
Reuse and Integration (IEEE IRI-08), July 13-15, 2008 - Las Vegas, USA.
[19] L. Blanco, M. Bronzi, V. Crescenzi, P. Merialdo and P. Papotti. Exploiting
information redundancy to wring out structured data from the web. The 19th
International Conference on World Wide Web, WWW 2010, Raleigh, North
Carolina, USA, April 26-30, 2010.
[20] L. Blanco, M. Bronzi, V. Crescenzi, P. Merialdo and P. Papotti. Redundancy-Driven Web Data Extraction and Integration. The 13th International Workshop
on the Web and Databases, WebDB 2010, Indianapolis, Indiana, USA, June 6,
[21] L. Blanco, V. Crescenzi, P. Merialdo and P. Papotti. Probabilistic Models
to Reconcile Complex Data from Inaccurate Data Sources. The 22nd International Conference on Advanced Information Systems Engineering, CAiSE’10,
Hammamet, Tunisia, June 2010.
[22] L. Blanco, V. Crescenzi, P. Merialdo and P. Papotti. Automatically Building
Probabilistic Databases from the Web. The 20th International Conference on
World Wide Web, WWW 2011, Hyderabad, India, March 28-April 1, 2011.
[23] M. Bronzi, V. Crescenzi, P. Merialdo and P. Papotti. Wrapper Generation for
Overlapping Web Sources. Web Intelligence 2011, WebDB 2010, Lyon, France,
August 22-27, 2011.
National Conference Proceedings
[24] G. Mecca, P. Merialdo, P. Atzeni and V. Crescenzi. The ARANEUS Guide
to Web–Site Development. (extended version of [6]) Atti del Settimo Convegno
Nazionale su Sistemi Evoluti per Basi di Dati (SEBD’99): pp. 167–177, Como,
June 23–25, 1999.
[25] G. Mecca, P. Merialdo, P. Atzeni and V. Crescenzi. Experiences in XML data
management. Atti dell’Ottavo Convegno Nazionale su Sistemi Evoluti per Basi
di Dati (SEBD2000): pp. 109–119, L’Aquila, June 24–26, 2000.
[26] V. Crescenzi, G. Mecca and P. Merialdo. The RoadRunner Web Data Extraction System. Atti del Nono Convegno Nazionale su Sistemi Evoluti per Basi di
Dati (SEBD2001), Venice, June 27–29, 2001.
[27] V. Crescenzi, G. Mecca and P. Merialdo. Back to Gold’s Age: Bridging the
Gap Between Traditional Grammar Inference and Web Information Extraction. Atti del Decimo Convegno Nazionale su Sistemi Evoluti per Basi di Dati
(SEBD2002), Isola d’Elba, June 2002.
[28] L. Arlotta, V. Crescenzi, G. Mecca and P. Merialdo. Automatic annotation of
data extracted from large Web sites. Atti dell’Undicesimo Convegno Nazionale
su Sistemi Evoluti per Basi di Dati (SEBD2003), Cetraro (CS), June 2003.
[29] V. Crescenzi, G. Mecca and P. Merialdo. Improving the expressiveness of RoadRunner. Atti del Dodicesimo Convegno Nazionale su Sistemi Evoluti per Basi
di Dati (SEBD2004), Cagliari, June 2004.
[30] L. Blanco, V. Crescenzi, and P. Merialdo. Harvesting Structurally Similar Pages. Atti del Tredicesimo Convegno Nazionale su Sistemi Evoluti per Basi di
Dati (SEBD2005), Bressanone, June 2005.
[31] L. Blanco, V. Crescenzi, P. Merialdo. Searching Entities on the Web by Sample.
Atti del Sedicesimo Convegno Nazionale su Sistemi Evoluti per Basi di Dati
(SEBD2008), Mondello (PA), June 2008.
[32] L. Blanco, V. Crescenzi, P. Merialdo. Data Extraction and Integration from
Imprecise Web Sources. Atti del Diciassettesimo Convegno Nazionale su Sistemi
Evoluti per Basi di Dati (SEBD2009), Camogli (GE), June 2009.
[33] L. Blanco, V. Crescenzi, P. Merialdo. Probabilistic Reconciliation of Records
from Inaccurate Web Sources. Atti del Diciottesimo Convegno Nazionale su
Sistemi Evoluti per Basi di Dati (SEBD2010), Rimini, June 2010.
Technical Reports
[34] V. Crescenzi, P. Merialdo and P. Missier. Discovering the structure of large web
sites. Technical Report RT-DIA-89-2004, Università degli Studi “Roma Tre”,
Dipartimento di Informatica e Automazione (2004).
[35] L. Blanco, V. Crescenzi and P. Merialdo. Automatically Generating Reports
from Large Web Sites. Technical Report RT-DIA-90-2004, Università degli
Studi “Roma Tre”, Dipartimento di Informatica e Automazione (2004).
[36] L. Blanco, V. Crescenzi, P. Merialdo and P. Papotti. Searching Entities on the
Web by Sample. Technical Report RT-DIA-121-2007, Università degli Studi
“Roma Tre”, Dipartimento di Informatica e Automazione (2007).
[37] L. Blanco, V. Crescenzi, P. Merialdo and P. Papotti. A Probabilistic Model
to Characterize the Uncertainty of Web Data Integration: What Sources Have
The Good Data? Technical Report RT-DIA-146-2009, Università degli Studi
“Roma Tre”, Dipartimento di Informatica e Automazione (2009).
[38] L. Blanco, V. Crescenzi, P. Merialdo and P. Papotti. Exploiting Information Redundancy to Extract and Integrate Data from the Web. Technical Report RT-DIA-151-2009, Università degli Studi “Roma Tre”, Dipartimento di Informatica
e Automazione (2009).
[39] L. Blanco, V. Crescenzi, P. Merialdo and P. Papotti. Probabilistic Models to
Reconcile Complex Data from Inaccurate Data. Technical Report RT-DIA-170-2010, Università degli Studi “Roma Tre”, Dipartimento di Informatica e
Automazione (2010).
PhD Thesis
[40] V. Crescenzi. On Automatic Information Extraction from Large Websites. Collana delle Tesi di Dottorato, Università degli Studi di Roma La Sapienza
Other Cited Publications
[41] E. M. Gold. Language identification in the limit. Information and Control.
10(5), 447–474.
[42] D. Angluin. Inference of Reversible Languages. Journal of the ACM. 29(3),