Alma Mater Studiorum - Università di Bologna
Facoltà di Scienze Matematiche, Fisiche e Naturali
Corso di Laurea Specialistica in Informatica
Materia: Tecnologie Web
Conversione automatica
di documenti: un modello
e un'implementazione
Tesi di Laurea di
Silvio Peroni
Relatore
Chiar.mo Prof. Fabio Vitali
Parole chiave:
Document segmentation, PML, Pattern, Pentaformat, XML
Sessione III
Anno Accademico 2006-2007
In Xanadu did Kubla Khan
A stately pleasure-dome decree:
Where Alph, the sacred river, ran
Through caverns measureless to man
Down to a sunless sea.
From Kubla Khan or, A Vision in a Dream. A Fragment. by Samuel Taylor Coleridge
Table of contents
Sommario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7
Chapter 1: Once upon a time (or Introduction) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Chapter 2: On the way to Content Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Section 2.1: What users want (or What is content?) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Section 2.2: What about images and tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .22
Section 2.3: There is not content only . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Section 2.4: How to extract data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Section 2.5: So what? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Chapter 3: Pentaformat Markup Language and other stories . . . . . . . . . . . . . . . . . . . . . . . . . 31
Section 3.1: The Pentaformat model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Section 3.1.1: Five easy dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Section 3.1.2: The need of a five-dimensions segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Section 3.1.3: The Pentaformat segmentation for (X)HTML documents: an example . . . . . . . . . . . . . . . . 38
Section 3.2: Pentaformat Markup Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Section 3.2.1: Terminology and syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Section 3.2.2: Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Section 3.3: The new Extraction of Layout Information via Structural Analysis . . . . . . . . . . . . . . . . 47
Section 3.3.1: elISA: a rib of ISAWiki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Section 3.3.2: New features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Section 3.4: From PML to IML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Section 3.4.1: The issue of structured content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .55
Section 3.4.2: Seven patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Section 3.4.3: PML and IML: what is the difference? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Section 3.4.4: Patterning process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Section 3.4.5: PML patterns (PMLp) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .64
Section 3.5: So what? (take 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .68
Chapter 4: Features hole and its monsters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Section 4.1: elISA engine: the infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Section 4.1.1: Three steps in five phases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .72
Section 4.1.2: Rules and thresholds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Section 4.2: Pattern engine: the infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Section 4.2.1: How to define a patterning rule-set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Section 4.2.2: The configuration file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Section 4.3: elISA Server Side . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Section 4.4: Summarizing all the infrastructures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Chapter 5: Happily ever after (or Conclusions) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Sommario
This thesis proposes a rule-based mechanism for the segmentation of XML documents according to a model called Pentaformat. This model allows every element of any textual document to be identified as belonging to one or more of the five dimensions defined by the model itself: content, structure, presentation, dynamic behaviour and metadata. The ultimate goal of this work is to enable the conversion of a segmented document into a new document using all or some of these dimensions.
After a brief overview of the technological context in which we operate, we present the first result of our work: the XML language Pentaformat Markup Language, or PML. It allows one or more Pentaformat dimensions to be associated with every node of an XML document through a dedicated declaration. Using this language, we have developed an engine that segments any XML document on the basis of given structural rules: elISA 2.0 (Extraction of Layout Information via Structural Analysis). These rules are stored in a further XML document and can easily be modified and extended in compliance with a specific grammar we have developed.
The segmented XML document returned by elISA 2.0 is the starting point for conversion into other formats on the basis of the Pentaformat dimensions identified for the initial document. In particular, in this work we have focused on the conversion from PML to IML (ISAWiki Markup Language). IML is the storage format for documents used in ISAWiki, a client/server platform that implements the concept of global editability for the whole Web, so that every registered user can freely modify the content of almost any page. The idea is to embed elISA 2.0 within this platform in order to identify the content of any web document. All the modifications made to a page, and all the new pages created, are stored by ISAWiki as IML documents. Our goal is to convert the PML document returned by elISA 2.0 into an IML document.
IML is a format based on two of the five Pentaformat dimensions: content and structure. At a first analysis, the conversion from PML to IML may seem simple, considering that it means moving from a five-dimensional model to one with a smaller dimensional subset. But the number of dimensions is not the only difference between PML and IML. The latter, in fact, imposes a strict structuring of content based on seven specific structural patterns. By definition, the Pentaformat (and consequently PML) does not impose any hierarchy on its dimensions: the author is free to structure his document as he sees fit.
The problem is the following: to convert a PML document into IML, the former must somehow be patterned according to the seven structural patterns of the latter. In order to realize this conversion, we have developed a further engine that patterns any XML document on the basis of given rules. As with elISA 2.0, the rules for this new engine, stored in an XML document, can easily be modified and extended in compliance with a specific grammar we have developed. Through this patterning operation we obtain a PML document that is perfectly patterned according to the model of structural patterns used in IML.
The final result presented as the conclusion of this work is a web application, called elISA Server Side, that implements the whole conversion process through the consecutive use of elISA 2.0 and the patterning engine. Starting from an ordinary web document, elISA 2.0 identifies its dimensions and returns a PML document. A structural patterning is then applied to this document, yielding a new PML document that complies with the seven structural patterns used in IML. Finally, this new patterned PML document is converted into an IML document composed of all the content identified by elISA 2.0 in the initial document.
The benefits introduced by our work are essentially two. The first concerns document segmentation. Through the five-dimensional identification of the roles of the various elements of an XML document, new documents can be built starting from only some of the identified dimensions. The conversion from PML to IML is an example of this operation: through the conversion process realized in our work, a new document is obtained by taking into account only the content and structure dimensions of the original one. Clearly, this reasoning can also be applied to other dimensions. For example, we could decide to use only the structure and the presentation of a document to create a sort of template for the creation of new documents, and so on.
The other benefit introduced by our work concerns structural patterns. On the elements of a document that is known to be patterned, simple deduction operations can be performed so as to understand, by analysing their structure, which pattern they follow and, consequently, what they may or may not contain. In this case it becomes easier, for example, to identify which elements may contain textual content - only some patterns, in fact, admit text inside them - and which elements are instead used only for the logical structuring of the document. The patterning engine we have realized makes it possible to assume a patterned document on which these automatic deductions can be applied.
Chapter 1
Once upon a time (or Introduction)
In this thesis we propose a rule-based mechanism to segment XML documents according
to a five-dimensional model called Pentaformat [Dii07], in order to convert them automatically
into new documents using one or more of the constituents introduced by the model: content,
structure, presentation, behaviour and metadata.
The Pentaformat is a model suggested by Di Iorio in his Ph.D. thesis [Dii07]. This model
concerns the recognition of the roles that the elements of any document can play. The goal of
this recognition is to segment a document according to five particular constituents in order to
reuse parts of it in different contexts. Every constituent - also called a dimension - represents a
point of view on the document under analysis.
We can identify as content all the information written by the author of the document. For
example, consider a common newspaper article: in this case the content of this document is the
article itself, leaving out all the typographical elements such as the font family, the font size,
et cetera. We can associate these last kinds of elements with presentation.
Presentation is the dimension that concerns what the document looks like. Font attributes, title
layout, content placement, spaces between parts of the document, additional information not
related to content: all these items concern the presentation. Presentation concerns only the layout
of elements. It does not take care of the logical organization of content: presentation lays out
paragraphs, titles and images, but it does not identify what role they have in the document. This
distinction is what the structure dimension refers to. We ordinarily use structures - such as
paragraphs, containers, headers, inline elements - to arrange content. The goal of this dimension
is to identify what these structures are, leaving out all hierarchical relations among them.
Content, presentation and structure are not the only points of view on a document. There
is also information about the document itself. In a newspaper article there are some items,
such as the heading, that are not only content of the document but also define particular
relations between themselves and the document. Consider the content
A that represents the author of the article. This particular content does not only belong to the
content dimension: it also defines a relation between itself and the document, namely that A is
the author of the document. This kind of relation is what the metadata dimension refers to. While
the previous four dimensions - content, presentation, structure and metadata - can be applied to
any document [GM02], the last one - behaviour - is specific to digital documents.
It identifies all items specifying interaction or dynamism for the document or its parts, such
as links or scripts for web pages. It is important to understand that even if these dimensions
are completely different, they are also connected. For example, an element that structurally is
a picture can be treated as content of the document as well; the content of an element “h1” in
a web document can be the title - from the point of view of the metadata - of the document
itself; and so on.
The Pentaformat model is the model we have used to develop a segmentation tool
for XML documents. We have chosen it because we think it is the best model to segment this
kind of document, such as (X)HTML documents, which are often characterized by all five
dimensions. We perform the segmentation using a rule-based mechanism to identify which
dimensions are associated with the elements of a document. The reason for using a rule-based
tool for this process is the following: if we have rules that segment any (X)HTML
document, and the structure of this language changes in a future version, we can rewrite
our rules according to the new language definition without any change to the tool's core.
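To make this idea concrete, here is a minimal sketch in Python of such an externalized rule set. The rule contents are invented for illustration (the real elISA 2.0 rules live in a separate XML document and comply with a specific grammar); the point is that the core function never changes when the rules do:

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: each pairs an element test with a Pentaformat
# dimension. In elISA 2.0 the rules live in a separate XML document;
# here they are a plain list so the sketch stays self-contained.
RULES = [
    (lambda e: e.tag == "script", "behaviour"),
    (lambda e: e.tag == "meta", "metadata"),
    (lambda e: e.tag in ("b", "i", "font"), "presentation"),
    (lambda e: e.tag in ("p", "div", "h1"), "structure"),
]

def segment(root):
    """Return {element: dimension} for every element some rule matches.
    The tool core (this function) never changes; only RULES does."""
    result = {}
    for elem in root.iter():
        for test, dimension in RULES:
            if test(elem):
                result[elem] = dimension
                break
    return result

doc = ET.fromstring("<html><p>text</p><script>x()</script><meta/></html>")
dims = segment(doc)
```

If a new (X)HTML version renamed its tags, only the `RULES` list would need rewriting; `segment` itself stays untouched.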
The segmentation mechanism we have developed, which represents the contribution of my
work, is one of the tools used in ISA* [Dii07]. As we can see in Picture 1, ISA* is an architecture
- developed at the University of Bologna and applied in scenarios ranging from web editing, to
e-learning, to book printing - that offers a general-purpose model for structuring frameworks based
on document transformation and analysis. This architecture takes a digital document as input in
order to segment it according to the Pentaformat model. This segmentation can then be used by an
application logic that exploits the five dimensions, or a subset of them, for example to convert
the input document to another format or to reformat the document using another presentation.
The tool developed for segmenting documents can be used in this architecture for the
phases concerning pre-parsing (the generation of well-formed XML), post-parsing (adding
or removing features) and content analysis (document segmentation). One of the frameworks
developed at the University of Bologna that uses the ISA* architecture is ISAWiki [DV04]. It
is a framework that implements the concept of global editability for web pages on the model
of Ted Nelson's Xanadu project [Nel80]. To achieve this goal, ISAWiki includes a client
application and a server application that let registered users edit any web page and store it on
an appropriate server. In order to identify which parts of a web document users can modify, this
platform uses an engine called elISA (Extraction of Layout Information via Structural Analysis)
[DVV04] for the segmentation of all web documents according to two main dimensions:
content and (a small set of) presentation. The main goal of this engine is to extract content from
any web document in order to convert it into an IML (ISAWiki Markup Language) document
[San06]: this is the format used by ISAWiki to store documents. The newly planned version of
ISAWiki will take into consideration document segmentation according to all the Pentaformat
dimensions. Our segmentation tool, called elISA 2.0, is what this new ISAWiki version can
use to perform this five-dimensional segmentation. In this context, a multi-dimensional point of
view allows any combination of these dimensions to be used to perform operations on the input
document such as document conversion, presentation replacement, content filtering and so on.
Picture 1 The ISA* architecture
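As a purely illustrative sketch, the three phases mentioned above - pre-parsing, post-parsing and content analysis - can be chained as a simple pipeline. Only the phase names come from the architecture description; the function bodies here are invented stand-ins, not the actual implementation:

```python
def pre_parse(raw_html):
    # Pre-parsing: generation of well-formed XML. Here just a
    # stand-in normalization of one void element.
    return raw_html.replace("<br>", "<br/>")

def post_parse(xml_text):
    # Post-parsing: adding or removing features; identity in this sketch.
    return xml_text

def content_analysis(xml_text):
    # Content analysis: document segmentation would happen here;
    # we only record which dimensions the model provides.
    return {"source": xml_text,
            "dimensions": ["content", "structure", "presentation",
                           "behaviour", "metadata"]}

def isa_pipeline(raw_html):
    # The three phases applied in sequence, as in the ISA* architecture.
    return content_analysis(post_parse(pre_parse(raw_html)))

segmented = isa_pipeline("<p>Hello<br></p>")
```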
To achieve the main goal of ISAWiki - global editability - we need to convert
automatically the segmented document returned by elISA 2.0 into an IML document. This is
not an easy process because, besides the dimensional issue, there is another great difference
between the output of elISA 2.0 and IML: the former, according to the Pentaformat model, does
not force any hierarchical order on structures, while the latter complies with a structural pattern
theory [DDD07] in order to arrange content. This theory is based on seven patterns that can
be used to structure any XML document: marker (an empty element whose meaning depends
on its position or its existence), atom (an element that can contain text only), inline and block
(elements that can contain text and repeatable inline/atom/marker elements), container (an element
that can contain any element except inline), table (an element that contains homogeneous
non-inline elements) and record (a sequence of optional but non-repeatable and non-inline
elements). We therefore need to pattern the output of elISA 2.0 before converting it into an IML
document. For this reason we have developed another rule-based engine called the patterning
engine. It allows any XML document to be patterned using a set of patterning rules based on
some patterning operations. The combined use of these two engines allows any web document
to be converted into an IML document, preserving all the information related to the Pentaformat
dimensions.
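As an illustration only, the kind of deduction these patterns enable can be sketched in Python with deliberately simplified definitions (the formal theory of [DDD07] also constrains the patterns of the children, which this toy version ignores):

```python
import xml.etree.ElementTree as ET

def guess_pattern(elem):
    """Deduce a plausible structural pattern from an element's shape.
    Simplified: real pattern assignment also checks the children's own
    patterns (inline vs block, non-inline content for table/record)."""
    children = list(elem)
    has_text = bool((elem.text or "").strip()) or any(
        (c.tail or "").strip() for c in children)
    if not children and not has_text:
        return "marker"            # empty element
    if not children and has_text:
        return "atom"              # text only
    if children and has_text:
        return "inline-or-block"   # both patterns admit mixed content
    tags = {c.tag for c in children}
    if len(tags) == 1:
        return "table"             # homogeneous children
    if len(tags) == len(children):
        return "record"            # no repeated child element
    return "container"

root = ET.fromstring(
    "<doc><hr/><title>Hi</title><p>Mixed <em>text</em></p>"
    "<list><item>a</item><item>b</item></list></doc>")
patterns = [guess_pattern(c) for c in root]
```

Running this on the sample document labels `hr` a marker, `title` an atom, `p` a mixed-content element and `list` a table, while `doc` itself (four distinct, non-repeated children, no text) behaves like a record.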
We can enlarge this context in order to illustrate the main field in which we have
carried out our work. The ISA* architecture is a general model that includes the analysis of
digital documents in order to recognize the roles of their elements. This analysis concerns the
extraction of data, referring especially to the content extraction [BBC07b] of web documents. Before
discussing this matter we must understand what content is in a digital document. Considering
the Web context, we can give two intuitive definitions of content: it is what the author of a
web document has written (leaving out all data added by automatic processes); or it is what
users search for when googling. Using these intuitive definitions, we can introduce some examples of
web documents in order to recognize in a visual way - for example, looking at a web page - whether an item is or is not content. Understanding whether a picture in a web page is related
to content or is only a presentational item represents a significant example of this kind
of recognition. The main point of several works - such as [LLY03], [CGG04] and [AHR01] - is
to understand what the content of a web document is, leaving out all the remaining non-content
items. We agree that content is the most important part of a document. However, we think
the recognition of the roles of the other, non-content elements is important too. What these works
do not comply with is a multi-dimensional model, such as the Pentaformat, for the recognition of
the roles of all the elements of a web document.
This last argument is another reason why we have developed elISA 2.0 according to the
Pentaformat model. In order to use this model to segment web documents we have developed
a new language to make declarations about the elements of XML documents: the Pentaformat
Markup Language, or PML. A pml declaration is formed by four main items:
• the Pentaformat dimension that characterizes the declaration;
• a name that describes the chosen dimension;
• a reference to the element which the declaration refers to;
• the content, i.e., the value of the declaration.
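The concrete PML syntax is introduced in Chapter 3; purely as an illustration of the four items just listed, a declaration might be encoded like the following. The element and attribute names here are invented for the example, not the real PML vocabulary:

```xml
<!-- Hypothetical encoding of a pml declaration: a dimension, a name
     describing it, a reference to the element it talks about, and the
     value of the declaration as its content. -->
<declaration dimension="metadata" name="author" ref="//div[@id='byline']">
  Michael M. Gordon and Stephen Farrell
</declaration>
```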
The output of elISA 2.0 is a PML document that is identical to the input document but
contains some pml declarations specified by qualified elements. Applying the elISA 2.0 output to the
patterning engine, we obtain a patterned PML document that we can easily convert into an IML
document. This whole conversion process is performed by a web application called elISA Server
Side. We have developed it to allow this conversion from any browser. This web application
represents the final product of my work: it implements the goal that we set out
at the beginning of this chapter.
The rest of the thesis is structured as follows. In Chapter 2 we will discuss some content
extraction techniques, introducing the matter of content and illustrating some works related
to this issue. In Chapter 3 we will introduce our language (PML), used to segment XML
documents according to the Pentaformat model. In addition, we will illustrate the main
features of the two engines we developed (elISA 2.0 and the patterning engine), used to convert a
web page into an IML document. In Chapter 4 we will go deeper into the architecture of these
engines and introduce the web application that uses them to perform the conversion
process from a web page to IML. Final remarks and ideas for future work are in Chapter 5.
Chapter 2
On the way to Content Extraction
In order to understand the technological context of our work we need to clarify some
concepts concerning the extraction of data from digital documents. In this chapter we are
going to introduce content extraction [BBC07b] from web documents as the main technological
context in which this thesis was developed. Content extraction means identifying the relevant
content of a web document while excluding all other data. The first issue is to understand what content
is. We can describe it using two different definitions: content is what the author of a document
has written; or it is what users search for when googling. As we will see in depth in Section 2.1, in most
cases these two definitions coincide.
Another issue is understanding when images or tables are properly content and when
we can regard them as presentation. This point involves a markup analysis of all the tags used to
insert an image or a table into a web document, but also a pattern recognition analysis [KT06]
of the documents, in order to understand whether an image is a logo, a banner or a picture related to
the content, and whether a table is genuine or presentational [Bag04]. We will go deeper into this matter in
Section 2.2.
As we know, a web document is not formed by content and presentational elements only.
In addition, there are other items, such as metadata [Nis04], that play important roles in the
Web context. Introduced by a specific element, such as the tag “meta” for (X)HTML [JLR99],
or by a standard language (RDF [BM04], OWL [BDH04], RDFa [AB07], microformats [All07]),
they are used to solve some tasks more easily, for example helping users to find what they want
on the Web. All these technologies represent the basis of the Semantic Web [BHL01]: an
ambitious project promoted by the W3C [http://www.w3.org] to express web content in a format
that can be read and used by automated tools. In Section 2.3 we discuss these issues in depth.
The last topic that we want to discuss concerns a selection of articles about methods
for data extraction, referring in particular to the extraction of content from web documents. We
will give a brief explanation of them in order to introduce a work carried out by Gottron [Got07]
in which he analyzes the performance of each method. We introduce these issues in Section
2.4.
In the conclusions of this chapter (Section 2.5) we will briefly revisit all the subjects
concerning the problem of content extraction and emphasize the shortcomings of the
existing solutions. This will justify our approach to the problem.
2.1 What users want (or What is content?)
First of all, to understand what we mean by “content extraction” we must explain what
the content of a web document is. We can try to describe it using two different definitions:
1. content is what the author of a document has written;
2. content is what users search for when googling.
To understand the first definition we can think about an article in a web newspaper such
as The New York Times [http://www.nytimes.com]. In this case we can easily identify the
content: if the contents of a newspaper are articles, then the content of an article is the article
itself. But in a web newspaper, and consequently in all its articles, there are some presentational
aspects that do not belong to the content of the article but are inserted by automatic
processes. We can take a look at Picture 2 and ask what motivated users to read that article.
Probably they reached it from the home page because they were
captured by the headline, or they were interested in all the articles written by Michael M. Gordon
and Stephen Farrell (the authors of that article). But principally readers click on an article link
for one main reason: they want to read what the author has written about the topic. Nothing
else. So they know unconsciously what the content of the article is, because it is what they want.
When a reader reads the article there is the content only: the text of the article and all the images or
other items related to it. Everything else - any menu, banner, logo, advertisement, video, et cetera - is not related to the article: it is presentation.
In the context of search engines we have defined content as what people search for. This
definition is broadly true because search engines usually perform their searches on some pieces of
content, not all. They perform a sort of content extraction during the indexing process [BP00], in
which they collect the most relevant parts of each document. This kind of extraction concerns
a few but meaningful parts of the content of a web document, which we call information [Flo05].
Considering a datum as “a putative fact regarding some differences or lacks of uniformity within
some contexts”, Floridi defines information (D) through a tripartite definition:
• D consists of one or more data;
• the data in D are well-formed, i.e. all data are clustered together correctly, according
to the rules (syntax) that govern the chosen system, code or language being analysed;
• the well-formed data in D are meaningful.
Picture 2 “Iraq Lacks Plan on the Return of
Refugees, Military Says” from The New York Times
According to this definition, we call information extraction [BBC07b] a process that
automatically extracts data having a pragmatic meaning for a certain domain. This kind of
extraction is what the indexing process of search engines performs.
Picture 3 Google results for “iraqi” query
Then, when users look for some content using specific keywords, search engines look for
relevant information about these keywords and return some plausible results. Obviously there is
no guarantee that the results returned by a search engine are what users want. For example,
suppose that a user wants to find the article in Picture 2 remembering only the word “Iraqi” and
the website.
The Google search engine returns this article as the third result, as we can see in Picture 3. With
one simple keyword, it finds the content that the (imaginary) user wants. Not all search
engines return exactly the same results. For example, in Picture 4 we show the results of the Yahoo
search engine for the same query. In this case all the returned results concern the word “iraqi”,
so they refer to the domain that the user wants. But the desired result is not among them. The user,
then, if he wants to find the article come hell or high water, can perform another search or use one of
the suggestions proposed by the search engine, such as “iraqi flag”.
Picture 4 Yahoo results for “iraqi” query
Microsoft Live Search returns the same results as Yahoo, with one small difference. As
we can see in Picture 5, the only suggestion from the search engine corrects the word “iraqi”
to “iraq”. Though the two words are similar, “iraq” is not what the user wants.
Regardless of the results, all the search engines are able to identify the correct context for the
queries by basing their assumptions on the content of web documents: this is the important point.
We have just explained what the word “content” means. But sometimes it is not simple to
decide whether particular elements of (X)HTML documents, for example images or tables, are content
or not, because they can be used for many purposes. In the next section (Section 2.2) we try to
explain when these elements refer to the content and when they refer to the presentation.
Picture 5 Microsoft Live Search results for “iraqi” query
2.2 What about images and tables
In Section 2.1 we explained what content is and why it is so important in the
context of web applications such as search engines: it is “what users want”. In this section we
continue discussing content, but we focus on all the items of a web document that are not text,
for example images and tables.
As we know, not all the images of a web document are properly content. As we can see
in Picture 6, there are (at least) two images: the first one (top-left) is the logo of the website;
the second one (middle-right) is a picture of the subject of the article, Tim Berners-Lee. Are
both content? Obviously the answer is no, because the logo is inserted by an automatic process
(the wiki engine itself) while the picture has been specified by one of the authors of the article.
For a human being the distinction is clear, but it is much more difficult to distinguish their
roles using an automatic process.
Picture 6 The article of Wikipedia about Tim Berners-Lee
There are two main approaches to identify the real role of an image in a web document:
• analyzing the metadata of the “img” tag of each image, also considering the location the image has in the source of the document;
• applying pattern recognition algorithms that try to disambiguate the content of the image in order to understand its role.
The first approach works well if and only if there is enough context around the image. For example, if an image is inside an element classified as content of the article, such as a “div” with a “class” attribute set to “bodyContent”, it is more likely to be content. On the contrary, if an image appears in the first 20% of the structure of a web document [DGK02], it is probably the logo of the website.
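To make the first approach concrete, here is a minimal Python sketch of these two cues (a toy illustration of ours, not a tool from the literature; the markup, the “bodyContent” convention and the 20% threshold come from the discussion above):

```python
import xml.etree.ElementTree as ET

# A tiny, hypothetical XHTML fragment used only for this illustration.
PAGE = """
<html>
  <body>
    <div class="header"><img src="logo.png" /></div>
    <div class="bodyContent">
      <p>Article text...</p>
      <img src="tim-berners-lee.jpg" />
    </div>
  </body>
</html>
"""

def classify_images(xhtml):
    """Guess whether each img is content or presentation, using two cues:
    the class of the containing element, and the position of the img in
    document order (an image among the first 20% of elements is often a
    logo)."""
    root = ET.fromstring(xhtml)
    elements = list(root.iter())                      # document order
    parent = {c: p for p in root.iter() for c in p}   # child -> parent map
    results = {}
    for img in root.iter("img"):
        container = parent.get(img)
        in_content = container is not None and \
            "content" in container.get("class", "").lower()
        early = elements.index(img) < 0.2 * len(elements)
        results[img.get("src")] = (
            "content" if in_content and not early else "presentation")
    return results

print(classify_images(PAGE))
# {'logo.png': 'presentation', 'tim-berners-lee.jpg': 'content'}
```

A real classifier would of course combine many more rules, but the sketch shows why the approach needs enough context: with a bare `img` outside any classified container, neither cue can fire.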
The second approach is based on pattern recognition theory [KT06]. As Koutroumbas and Theodoridis say, “pattern recognition is the scientific discipline whose goal is the classification of objects into a number of categories or classes”. A specific application of this discipline concerns the classification (and thus the disambiguation) of images. On the basis of this theory, Choochaiwattana et al [CNS07] suggest a heuristic approach to classify all the images of any web page according to four categories: human images, icons, banners, scenic images.
Mixing these two approaches we can build a program that tries to distinguish automatically whether an image of a web document is content or not, and in most cases it will probably work well. The problem is that in the Web 2.0 era we want to classify not only images but also other multimedia objects such as animations or videos. Their classification is obviously more difficult than image classification because they are not static. To explore this topic further, see [EFH02] and [MRS02].
A similar disambiguation issue concerns the use of tables [JLR99]. In most cases the (X)HTML element “table” is used by web designers to arrange the layout of a web page as well as to display tabular data. This dual use has probably been encouraged by a weak definition of the element itself: “the HTML table model allows authors to arrange data - text, preformatted text, images, links, forms, form fields, other tables, etc. - into rows and columns of cells”. We want to point out a specific matter: the authors of this definition used the word “data”, and that word covers not only the content but also all the presentational elements of a web document. So using a table for layout is covered by the definition.
However, the same authors have specified in another document [CJV99] that “tables should be used to mark up truly tabular information” and not “to lay out pages”. One reason to avoid layout tables concerns people who browse the Web with a screen reader. Such users are probably not interested in any presentational element; in this case the sentence “content is what users want” is even more true.
In this context it is useful to have an automatic mechanism that identifies all the layout tables of a web document. Seen from another angle, identifying layout tables means understanding which tables are properly content and which are not. A possible approach to this issue has been suggested by Vitali et al [DVV04] and Bagnasco [Bag04]: a rule-based solution that tries to identify whether a table is genuine (a data table) or not (a layout table).
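The rule-based idea can be illustrated with a minimal sketch (the two rules below are our own toy example, not the actual rule-set of [DVV04] or [Bag04]):

```python
import xml.etree.ElementTree as ET

def is_data_table(table):
    """A rough rule-based guess: a table with header cells or a caption,
    and no nested tables, is probably genuine tabular data; otherwise we
    treat it as a layout table."""
    has_headers = (table.find(".//th") is not None
                   or table.find("caption") is not None)
    has_nested = table.find(".//table") is not None
    return has_headers and not has_nested

# A layout table (no headers, used to place a logo next to a menu)...
layout = ET.fromstring(
    "<table><tr><td><img src='logo.png'/></td><td>menu</td></tr></table>")
# ...and a genuine data table with header cells.
data = ET.fromstring(
    "<table><tr><th>Year</th><th>Title</th></tr>"
    "<tr><td>1865</td><td>Alice's Adventures in Wonderland</td></tr>"
    "</table>")

print(is_data_table(layout), is_data_table(data))  # False True
```

A production rule-set would include many more cues (cell count, text density, presence of forms), but the structure is the same: each rule votes on whether the table is genuine.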
Having discussed what content is and how it concerns not only text but also structured elements such as images and tables, we want to point out that a web document is not formed by content and presentation alone: it also contains structural items and metadata. We will explore this issue in Section 2.3.
2.3 There is not content only
In Section 2.1 and Section 2.2 we have analyzed what content is and which elements of web documents it refers to. Specifically, our discussion concerned the identification of the real role of the elements of a web document in general - such as text, images and tables - in order to understand whether we can consider a particular element content or presentation. As an example, think about the difference between data tables and layout tables.
We think that distinguishing only two roles is not enough to make a good segmentation of a web document. For example, let us consider metadata [Nis04]. By a general definition, metadata are data about data, i.e. pieces of information that describe, explain or locate another information resource. We use them every day in every context. Think about a book such as “Alice's Adventures in Wonderland” [Car65]. In the text we distinguish two kinds of information: information contained in the book - the content - and information about the book - the metadata. For example, the name “Lewis Carroll” is part of the content of the book (it appears on the cover), but we also consider it the author of the book itself. This piece of information - “the author of the book Alice's Adventures in Wonderland is Lewis Carroll” - belongs to the metadata set related to the book. Other metadata can be the title, the edition, the publishing house, the release date and so on.
In an (X)HTML document we can define metadata about the document itself using the tag “meta” [JLR99]. By specifying a property (through the attribute “name”) and a value (through the attribute “content”) we can make a simple metadata declaration about the document. For example, look at Code 1: there the sentence “the author of Alice's Adventures in Wonderland is Lewis Carroll” is expressed by the simple metadata declaration <meta name="author" content="Lewis Carroll" />.
Code 1 “Alice's Adventures in Wonderland” in an XHTML document
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Alice's Adventures in Wonderland</title>
<meta name="author" content="Lewis Carroll" />
</head>
<body>
<h1>Alice's Adventures in Wonderland</h1>
<h2>Chapter 1: Down the Rabbit Hole</h2>
<p>
Alice was beginning to get very tired of sitting by her sister
on the bank, and of having nothing to do: once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, <q>and what is the use of a
book,</q> thought Alice, <q>without pictures or conversation?</q>
</p>
<p>
So she was considering, in her own mind (as well as she could,
for the hot day made her feel very sleepy and stupid), whether
the pleasure of making a daisy-chain would be worth the trouble
of getting up and picking the daisies, when suddenly a White
Rabbit with pink eyes ran close by her.
</p>
[...]
</body>
</html>
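To show how a machine can read such declarations, here is a short Python sketch (our own illustration, using the standard xml.etree module; the document string echoes a fragment of Code 1):

```python
import xml.etree.ElementTree as ET

XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Alice's Adventures in Wonderland</title>
    <meta name="author" content="Lewis Carroll" />
  </head>
  <body><p>...</p></body>
</html>"""

def extract_meta(doc):
    """Collect the name/content pairs from every meta element in head."""
    ns = {"x": "http://www.w3.org/1999/xhtml"}
    root = ET.fromstring(doc)
    return {m.get("name"): m.get("content")
            for m in root.findall("x:head/x:meta", ns)
            if m.get("name") is not None}

print(extract_meta(XHTML))  # {'author': 'Lewis Carroll'}
```

This is essentially what an indexing process does: it never renders the page, yet it recovers the assertion “the author is Lewis Carroll” from the markup alone.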
As we know, these metadata are not shown directly in a web page, so a user who reads the text does not realize whether any metadata are specified. So, why do we use metadata if users do not view them? The answer is simple: to help machines retrieve information [Rij79] about a document. For example, in the context of search engines all metadata are very important: search engines use them to reply to a query in the best possible way, because by evaluating metadata they have more meaningful information with which to work out the results. For this reason we can reformulate the previous answer: we use metadata in order to allow machines to help users.
In the context of web documents there are other languages that allow defining metadata or relations about something. The need to define relations between elements is so important that Tim Berners-Lee et al [BHL01] have proposed a sort of new version of the Web based on these kinds of relations. The result is known as the Semantic Web, which “provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries”. This framework is originally based on two main languages: the Resource Description Framework or RDF [BM04] and the Web Ontology Language or OWL [BDH04].
RDF is a language, commonly serialized in XML, that allows expressing relations among resources through subject-predicate-object triples. Such a triple describes a directed graph in which the subject and the object are nodes and the predicate is the arc connecting the subject to the object. For example, we can express the previous sentence “the author of Alice's Adventures in Wonderland is Lewis Carroll” as “Lewis Carroll” (subject) “is author of” (predicate) “Alice's Adventures in Wonderland” (object). We use RDF to represent information in the Web.
OWL is a language that extends RDF in order to define ontologies [Gru92]. An ontology is an explicit specification of a conceptualization, i.e. a description of the concepts and relationships that can exist between classes of items. For example, let us suppose that we have three classes such as “Dogs”, “Cats” and “Animals”. In this context many relations can exist among these classes: “Dogs” and “Cats” are subclasses of “Animals”, “Dogs” hate “Cats”, “Cats” are slier than “Dogs”, and so on. All these relations are expressed through RDF triples or through appropriate OWL constructs.
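The triples discussed above can be sketched as plain tuples (a deliberately naive model of ours: real RDF identifies resources with URIs, while here we use bare strings):

```python
# Each statement is one (subject, predicate, object) arc of a directed graph.
triples = {
    ("Lewis Carroll", "is author of", "Alice's Adventures in Wonderland"),
    ("Dogs", "is subclass of", "Animals"),
    ("Cats", "is subclass of", "Animals"),
    ("Dogs", "hate", "Cats"),
}

def objects(subject, predicate):
    """Follow the arcs labelled `predicate` that leave the node `subject`."""
    return {o for s, p, o in triples if s == subject and p == predicate}

print(objects("Lewis Carroll", "is author of"))
print(objects("Dogs", "is subclass of") == objects("Cats", "is subclass of"))
```

Even this toy graph supports simple reasoning: since “Dogs” and “Cats” both reach the node “Animals” along “is subclass of” arcs, a machine can infer that they share a superclass, which is exactly the kind of inference OWL constructs formalize.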
The ambitious, still work-in-progress goal of the Semantic Web is pursued one step at a time, as described by the “layer cake” in Picture 7. As a first step, users need to understand how and why they can use these technologies. This step is difficult because RDF and OWL are not as easy to understand as (X)HTML. A possible solution is to build programs, such as the Calais project [http://www.opencalais.com/] by Reuters, that take an XML document and enrich it with semantic information.
Picture 7 The Semantic Web Layer Cake (© W3C - CC-BY-3.0)
Another possible solution is to get users accustomed to easier technologies related to the Semantic Web. We can think of RDFa [AB07] (based on RDF) or microformats [All07]. They are languages embedded in (X)HTML that allow easily defining relations among items of a web document using (X)HTML attributes such as “href”, “link”, “property”, “content” or “class”.
As we have seen, the extraction of metadata from web documents has had a great impact on its audience (users, companies, communication) in recent years. For this reason we clearly identify more than two roles when segmenting web documents - not everything on the Web is content or presentation - in order to consider other constituents such as metadata. Several studies have proposed approaches to segmenting a web document based on the identification of content, presentation or other constituents. We will analyze some of them in Section 2.4.
2.4 How to extract data
In the previous sections we have analyzed what content extraction is. To support the explanation we have defined some concepts such as content, information and datum. After that, we have seen some examples of “what is content” in web documents in order to understand the difference between content and presentation, using as examples some typical items of a web document such as text, images and tables. But not all the elements of a web document concern content or presentation: other items, such as metadata, are also important because, through automated tools, they help users find “what they want”. In this section we introduce some works related to data extraction - in particular content extraction - in order to survey the possible approaches to this operation.
A first work [LLY03] concerns the identification of the noise of a common web document. Yi et al define two different types of noise: global noise, related to a large-granularity point of view, such as duplications of the same pages through a mirror, old versions of a page, et cetera; and local noise - the topic of their investigation - which concerns all the parts of a web document that are unrelated to the content, for example navigation menus, banners or ads. Their approach is based on the analysis of the Document Object Model (DOM) [BCL04] of a web page, starting from the following assumption: in all the pages of a web site, such as a commercial web site, there are structures that never change (menus, logos, et cetera) and that tend to follow the same layout, because they are (often) generated automatically. Their goal is to identify these kinds of structures - the local noise of a web page - in order to filter them out and obtain the content alone.
Gupta et al [CGG04] introduce a tool, called Crunch, that uses a structural analysis based on the DOM of web pages in order to identify their content. This framework allows handling a web page through completely customizable filters. Moreover, it defines an application programming interface (API) through which it can be extended with other filters and plugins.
Rahman et al [AHR01] mix structural analysis with a contextual analysis of the different zones of a web document in order to reformat its important content for devices with small screens such as PDAs or cellular phones. They propose a five-step approach:
1. analyze the structure of the web document;
2. split the web document into sub-documents based on the analyzed structure;
3. analyze each sub-document considering its specific context;
4. summarize each sub-document in order to make a table of contents (TOC) for the original document that follows the original sub-document order;
5. sort this TOC on the basis of the relative importance of each sub-document.
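The five steps above can be sketched as a pipeline of functions (the bodies below are placeholder stubs of ours, not Rahman et al's actual algorithms; they only show how the stages compose):

```python
def analyze_structure(doc):
    """Step 1 (stub): blank-line blocks stand in for real DOM analysis."""
    return [b for b in doc.split("\n\n") if b.strip()]

def split_into_subdocuments(blocks):
    """Step 2 (stub): here every structural block is one sub-document."""
    return blocks

def contextual_importance(sub):
    """Step 3 (stub): longer sub-documents count as more important."""
    return len(sub.split())

def summarize(sub):
    """Step 4 (stub): the first sentence becomes the TOC entry."""
    return sub.split(".")[0].strip()

def reformat_for_small_screen(doc):
    """Steps 1-5 chained: build a TOC sorted by relative importance."""
    subs = split_into_subdocuments(analyze_structure(doc))
    toc = [(summarize(s), contextual_importance(s)) for s in subs]
    return [title for title, _ in sorted(toc, key=lambda t: -t[1])]

doc = "Short menu.\n\nA long article body. It has many words in it, truly."
print(reformat_for_small_screen(doc))  # ['A long article body', 'Short menu']
```

The interesting design point is the last stage: the TOC, not the full text, is what fits on a small screen, and sorting it by importance lets the user reach the main content first.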
Another tool that identifies the content of a web document and some presentational elements - logos, banners, ads, navigation menus, et cetera - is called elISA (Extraction of Layout Information via Structural Analysis) [DVV04]. It was developed by the University of Bologna and it is based on XSL Transformations (XSLT) [Cla99], a transformation language for XML documents. This engine uses an XML document with a user-defined rule-set - based on the structure of web documents - and a meta-stylesheet to produce a new stylesheet that segments the input document. All the rules of the rule-set document are written through XPath 1.0 [CD99], a language that allows addressing parts of an XML document. In the context of XPath queries, empirical studies such as [AKK06] suggest using relative XPath expressions to address the XML nodes of a document, because they are more robust than absolute expressions. This robustness concerns the sensitivity to changes in the structure of an XML document: Kowalkiewicz et al say that, if the structure of a document changes, a relative XPath expression is more likely to remain valid than an absolute one.
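Kowalkiewicz et al's point can be demonstrated in a few lines, here using the limited XPath subset of Python's standard xml.etree module (the thesis engine uses full XPath 1.0 inside XSLT, but the robustness argument is the same):

```python
import xml.etree.ElementTree as ET

before = "<html><body><div><p id='x'>text</p></div></body></html>"
# The same page after a structural change: an extra wrapper div appears.
after = "<html><body><div><div><p id='x'>text</p></div></div></body></html>"

absolute = "./body/div/p"    # tied to the exact element hierarchy
relative = ".//p[@id='x']"   # tied only to a property of the target node

for doc in (before, after):
    root = ET.fromstring(doc)
    print(root.find(absolute) is not None,   # absolute breaks on "after"
          root.find(relative) is not None)   # relative keeps matching
```

Running this prints `True True` for the original page and `False True` for the changed one: the absolute path stops matching as soon as a wrapper element is inserted, while the relative expression survives the change.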
In the context of some of these works and other investigations - for example [FKS01], [BCC02] and [CMO05] - Gottron [Got07] has carried out, using his framework, an interesting study about the performance of these content extraction methods. He concludes his article suggesting Branstein et al's work (with some adaptations) as the best performing method.
The only aim of this section has been to introduce some working methods and tools to extract data from web documents. The conclusions of the chapter, in Section 2.5, will summarize all the issues related to data extraction in order to introduce our contributions on these topics.
2.5 So what?
In this chapter we have introduced the main context in which we work: the extraction of data from web documents. To explain it, in Section 2.1 we have clarified what the content of a web document is and why it is so important. To understand the differences between content and presentation, in Section 2.2 we have introduced some examples of the possible roles that images and tables can assume. Then we have shown (Section 2.3) that web documents are not formed by content and presentational elements alone: they can also carry further information, such as metadata, which is very useful for improving the quality of search engine results. After this explanation about content, we have introduced some methods and tools to extract data from web documents, especially methods to extract content, and we have concluded the section (Section 2.4) by introducing an analysis [Got07] of the performance of some of these methods.
All these methods concern the recognition of content, leaving out the analysis of the other, non-content elements of a document. This is their shortcoming: in most cases these methods analyze a document using a flat model - an element either is or is not content, and that is all. We think that the analysis of the content is important, but identifying the roles of all the non-content elements is important too. For this reason, in Chapter 3 we will introduce our work on the segmentation of web documents, based on a five-dimensional model - called Pentaformat [Dii07] - for segmenting any document, digital or not. We will introduce an implementation of this model for XML documents and we will describe two different engines that use this implementation to segment and transform web documents according to the Pentaformat model.
Chapter 3
Pentaformat Markup Language and other stories
In the previous chapter (Chapter 2) we have illustrated the technological context in which we have developed this thesis. We have discussed data extraction and its related topics: what we mean by the content of web documents, why content extraction processes are so important for Web users, and what other constituents, such as presentation or metadata, we can identify to characterize the elements of a web document.
Considering this context and recalling the claim of our work - “we propose a rule-based mechanism in order to segment XML documents” (Chapter 1) - in this chapter we propose a language to specify the roles that each element of an XML document can have. Before introducing this language, we introduce the model it complies with: the Pentaformat [Dii07]. This model is used to segment any document - digital or not - according to five different, connected constituents called dimensions: content, structure, presentation, behaviour and metadata. We will describe this model in depth in Section 3.1, introducing an example.
Then we will introduce the first result of our thesis: the Pentaformat Markup Language, or PML. This is a language defined in Relax NG [Oas01] that allows segmenting XML documents according to the Pentaformat model. In Section 3.2 we will define the PML terminology and syntax and we will show a couple of examples.
Developing an engine capable of segmenting XML documents using PML is one of the main goals of this thesis. To achieve it we have rewritten the Extraction of Layout Information via Structural Analysis (or elISA) engine [DVV04] - the same rule-based engine introduced in Section 2.4 - so that it handles PML. The old version of elISA identifies the content and some presentational elements of a web document: it transforms the input document - through the application of a meta-XSLT document [Kay07] with some rules - into a new XML document with the content and the presentational elements identified. The version we have developed (elISA 2.0), instead, segments XML documents according to the Pentaformat model. By applying three different meta-XSLT documents with some rules, the input document is transformed into a new XML document with dimensional declarations expressed through PML tags. We will introduce this new engine in depth in Section 3.3.
elISA - both the old and the new version - is not only an engine to segment XML documents; it is also an important module of an ambitious project developed by the University of Bologna: ISAWiki [DV04]. This is a client/server platform, inspired by Ted Nelson's Xanadu project [Nel80], in which every registered user can create, modify or reuse any web page using a client-side editor called the ISAWiki editor. The editor uses elISA to identify the editable content of a web page. All the pages created or modified through this editor are saved on an ISAWiki server in an intermediate language called ISAWiki Markup Language or IML [San06], which stores the structured content of the document alone, leaving out the presentation. The goal is to transform the output of elISA 2.0, a PML document, into an IML document. This operation seems easy on the surface, because we have to transform a document with five constituents into another with only two constituents, the content and the structure. It is not so, because IML - unlike PML - has another important feature: it complies with seven structural patterns [Gub04] for organizing the content. In order to produce a pattern-compliant PML document we have developed a specific language called PML patterns, or PMLp. PMLp allows rebuilding an XML document according to some patterning operations. Using another rule-based engine we transform the input PML document into a patterned PML+PMLp document. The latter is easily transformable into an IML document through a simple meta-XSLT. We will discuss these issues in depth in Section 3.4.
3.1 The Pentaformat model
As we have already seen in the introduction of the chapter, in order to understand the language we have developed to segment XML documents, we must present the model that our language complies with. The Pentaformat [Dii07] is a model that can be used to segment any kind of document (not only digital ones). It also allows re-introducing its data (or parts of them) in different contexts, such as the layout adaptation of a specific web page for a “small-screen” device.
First of all we will describe the five constituents, called dimensions, that characterize the model. Even though they are distinct, these dimensions - content, structure, presentation, behaviour, metadata - are also connected. We will explain their characterization in Section 3.1.1. After that, we will discuss why we use a five-dimensional model to segment documents: we will justify the choice to exclude a “content-structure-presentation” model [GM02] for handling any document and we will show what benefits a five-dimensional model brings to document segmentation. We will discuss these matters in depth in Section 3.1.2.
Later, to understand how we can use the Pentaformat model, we will introduce an example based on an article from the online edition of The New York Times [http://www.nytimes.com] and we will analyze it through the five dimensions of the model. We will report this analysis in Section 3.1.3.
3.1.1 Five easy dimensions
To understand the Pentaformat model, and consequently our language to segment XML documents (which we will introduce in Section 3.2), we must describe all its constituents. We now introduce, one by one, the five dimensions illustrated in Picture 8.
Picture 8 The Pentaformat model
We call content all the non-structured information written by the author of the document. Thinking about a classic (X)HTML document for an article of an online newspaper, such as “The New York Times”, we identify as content:
• the text of the article;
• the close-up image;
• the small, clickable pictures related to the article.
Picture 9 “Iraq Lacks Plan on the Return of
Refugees, Military Says” from The New York Times
The other parts, such as the main menu or all the elements added automatically by scripts, are not considered content. As we can see in Picture 9, we can associate with the content dimension the text of the article, the picture with a woman in the foreground and many children in the background, and the figure with a map of Baghdad. These items are related to what the author has written. All the other elements - such as menus (the “Most popular” menu and the printing facilities menu), advertising images or videos (at the top and on the right of Picture 9), the internal search box and so on - do not belong to the content dimension.
The structure dimension concerns the logical organization of the whole information of a document. It describes what kind of structure is used to contain a specific group of information, such as text, images, video, a menu, etc. For example, in the first paragraph of the article in Picture 9 we can recognize two structures, as we can see in Code 2: a “p” element and an “a” element, representing a paragraph - the first paragraph - and a link respectively.
Code 2 Structures related to the first paragraph of Picture 9
<p>
BAGHDAD, Nov. 29 — As
<a
href="http://topics.nytimes.com/top/news/international/countriesandterritories/iraq/iraqi_refugees/
index.html?inline=nyt-classifier"
title="Recent and archival news about Iraqi refugees.">
Iraqi refugees
</a>
begin to stream back to Baghdad, American military officials say the Iraqi government has yet to
develop a plan to absorb the influx and prevent it from setting off a new round of sectarian
violence.
</p>
Generally speaking, the presentation concerns how all the elements of a document look. This is not the whole story, because more than one layer of presentation exists in a document, especially in a digital document such as an (X)HTML document. The most obvious layer refers to the placement of the various (and structured) elements that compose the document. Another layer concerns the typographical layout - colors, background, fonts, etc. - of the document. A third layer refers to all the elements that are not written by the author but are inserted into the document by automatic processes, e.g. the contextual information that we can see in any article of a web newspaper (the “Most popular” menu in Picture 9 is a good example of this kind of process) or the dynamic ads so often visible on websites (e.g. the ones using the Google AdSense [https://www.google.com/adsense] platform).
The dynamic elements of a digital document, such as ads, banners, logos and so on, are not related to the presentation alone. In particular, all the elements that have some sort of “dynamism” or any kind of interaction with the users can be described by the behaviour dimension. From this point of view, a link to another document belongs to this dimension as much as any script used to handle banners, the AJAX technologies, or the interaction with the visitors of the site.
The last dimension of the Pentaformat model is related to any information about the document itself or about parts of it. These pieces of meta-information, called metadata, enrich the document with assertions about the author, the creation date, the title, and so on. In this manner we allow metadata to be used by machines, intelligent agents and indexing processes. Meta-information systems, such as the Dublin Core [http://dublincore.org/] metadata system, represent a cornerstone of the ambitious W3C project known as the Semantic Web [http://www.w3.org/2001/sw/], as we can deduce from [BHL01].
In this section we have analyzed the five dimensions of the Pentaformat model, understanding what they refer to. In order to answer the questions “Why do we use a five-dimensional model to segment a document?” and “Is a three-dimensional model not enough?” we have written a short explicative section (Section 3.1.2).
3.1.2 The need for a five-dimensional segmentation
In Section 3.1.1 we have explained what the dimensions of the Pentaformat model are and what they refer to. In this section we explain why Di Iorio has suggested a five-dimensional model [Dii07] to address the problem of document segmentation. After that, we illustrate the benefits of the Pentaformat model.
As [GM02] suggests, to analyze a document such as a poster or a book we can use a three-layer model in which we distinguish three fundamental constituents:
• the content, i.e. all the information concerning the document itself, which answers the question “what is it”;
• the structure, i.e. where the content is located and in which structure it is contained, which answers the question “where is it”;
• the presentation, i.e. how the structured content is shown, which answers the question “what does it look like”.
Generally these three constituents are enough to segment a non-digital document. With the beginning of the Semantic Web (microformats, RDFa, RDF, OWL) and with the coming of the AJAX technologies (which have permitted the birth of Web 2.0), metadata and dynamic interaction (or dynamic behaviour) became fundamental keywords for today's digital documents such as web pages. Thus, to segment any digital or non-digital document in the best way, without any loss, [Dii07] has suggested a five-dimensional model that allows:
• the reuse of parts of a document for many purposes in different contexts;
• the composition of parts of different documents in order to easily create a new document based on multiple sources;
• the portability of a document in order to permit a platform-independent visualization.
The major benefit of this five-dimensional approach to document segmentation is a sort of multiple but interconnected view of the same thing. To permit this view, it is necessary to connect the five dimensions of the model without any hierarchy: the users define the hierarchy by structuring the document the way they prefer. For this reason the Pentaformat model permits seeing the same document from different points of view designed to work together. Combining these five points of view, we obtain a complete and sophisticated analysis of any document that allows an all-round reuse of parts of it in multiple contexts.
Having seen (Section 3.1.2) the benefits of the five-dimensional model described in Section 3.1.1, in Section 3.1.3 we will propose an example of segmentation of an article from the online edition of The New York Times [http://www.nytimes.com] in order to illustrate how we can use the Pentaformat to segment documents.
Picture 10 Content identification for Picture 9
3.1.3 The Pentaformat segmentation for (X)HTML documents: an
example
After introducing in Section 3.1.1 the Pentaformat model, which we have used to develop our language to segment XML documents, and after explaining in Section 3.1.2 the benefits of using this model, in this section we present a simple and clear example based on an analysis of Picture 9. We consider the dimensions one by one.
The identification of the content of the article is quite simple, because we base all our deductions on the question “What has the author written?”. On this basis we can identify the content as shown in Picture 10.
Moreover, the “title” attributes of the “a” elements contained in the article body have been written either by the author or by an automatic process. If we believe the first hypothesis, then the attribute “title” can be considered content; otherwise it cannot. As we can see in Code 3, it is very difficult to understand who has written the text of the attribute (probably the author, but it is not certain).
Code 3 The first element “a” of the first paragraph of the article
<a
href="http://topics.nytimes.com/top/news/international/countriesandterritories/iraq/iraqi_refugees/
index.html?inline=nyt-classifier"
title="Recent and archival news about Iraqi refugees.">
Iraqi refugees
</a>
The structure identification is quite easy because in (X)HTML code every tag is
related to a particular structure. For this reason the element “p” is a paragraph, the element “a”
is a link, the element “div” is a section or a divider, and so on. There can be some cases in which
the use of a particular tag is ambiguous, as we can see in Code 4. In this particular case the
element “div” is not a section or a divider but has the structure of a paragraph.
Code 4 An element “div” that can be considered a paragraph
<div class="credit">
Michael Kamber for The New York Times
</div>
As we have seen above, the presentation of a document is characterized by a multi-layer
segmentation. In this case, the placement of all the elements and all the typographical entities
is specified by some Cascading Style Sheets assertions ([BCH07]) using the tags “link” and
“style”, as we can see in Code 5.
Code 5 Use of CSS in Picture 9
<link
rel="stylesheet"
type="text/css"
href="http://graphics8.nytimes.com/css/common/global.css" />
<style type="text/css">
@import url(http://graphics8.nytimes.com/css/common/screen/article.css);
</style>
The other presentational layer - which refers to any text, image, video, etc. not written
by the author - is identified in Picture 11.
Picture 11 Some presentational entities for Picture 9
In the web page context, all the elements that allow any interaction with users or that work
dynamically with parts of the document belong to the behaviour dimension. As we can see in
Picture 12, search engine boxes, links, videos, animations, items based on AJAX
technologies and scripts are good examples of the behaviour dimension.
Picture 12 Dynamic behaviour in Picture 9
Last but not least, there is the metadata dimension. In a (X)HTML document there are
many ways to define metadata, from “meta” elements - as we can see in Code 6 - to Semantic
Web technologies such as microformats, RDFa, RDF and OWL.
Code 6 A part of “meta” elements in Picture 9
<meta
name="description"
content="U.S. military officials said the Iraqi government had yet to develop a plan
to absorb returning refugees and keep them from setting off a new round of violence.">
<meta
name="keywords"
content="Iraq,Immigration and Refugees,United States International Relations,
Sunni Muslims,Shiite Muslims">
<meta
http-equiv="Content-Type"
content="text/html;
charset=iso-8859-1">
<meta
name="geo"
content="Iraq">
<meta
name="dat"
content="November 30, 2007">
<meta
name="tom"
content="News">
Not all metadata are described by elements such as “meta”: there are also hidden
metadata (we say “hidden” because they are not defined through a dedicated element) that refer to
a specific element and not only to the document itself. In this context, all the “src”
attributes are metadata of the related “img” element, all the “title” attributes are metadata of
the related element, and so on. In addition, some metadata can be hidden in any part of
the document. As we can see in Code 7, the text nodes of the “a” elements represent the
article authors. In other words, they are metadata of the document, but they are not indicated with
a particular element: they are hidden in the text.
Code 7 Hidden metadata in the article body of Picture 9
<div class="byline">
By
<a
href="http://topics.nytimes.com/top/reference/timestopics/people/g/michael_r_gordon/index.html?
inline=nyt-per"
title="More Articles by Michael R. Gordon">
MICHAEL R. GORDON
</a>
and
<a
href="http://topics.nytimes.com/top/reference/timestopics/people/f/stephen_farrell/index.html?
inline=nyt-per"
title="More Articles by Stephen Farrell">
STEPHEN FARRELL
</a>
</div>
In this section we have presented an example of the Pentaformat model in order
to understand how we can segment documents such as web pages. This example and the
introduction of this model (Section 3.1.1 and Section 3.1.2) are necessary to describe our
language, called Pentaformat Markup Language or PML, which allows us to segment XML
documents according to the Pentaformat model. As we have mentioned in the introduction of
Chapter 3, this language is used in elISA 2.0 to segment XML documents. In Section 3.2 we
will analyze the terminology and the syntax of PML and we will use a couple of examples in
order to understand how to segment XML documents.
3.2 Pentaformat Markup Language
As we have seen in Section 3.1, the Pentaformat [Dii07] is a good model to segment
documents, such as web documents, identifying the content, the presentation and some other
dimensions (structure, behaviour and metadata). This segmentation can be useful in the context
of data extraction that we have introduced in Chapter 2.
In this section we focus on the Pentaformat Markup Language (PML), the language that we
have developed to segment XML documents according to the Pentaformat model. Technically
speaking, PML is a “parasite” language: we use it to indicate whether some elements of an
existing document, written in another XML language (called host), belong to one or
more Pentaformat dimensions or not. We use this language to extend the old version of elISA
[DVV04] in order to segment documents according to the Pentaformat model.
In the following sections we introduce the PML terminology and the XML syntax (Section
3.2.1) and show how we can use it through a couple of examples (Section 3.2.2).
3.2.1 Terminology and syntax
In this section we present the PML syntax that we use to perform document segmentations
according to the Pentaformat model, introducing some terminology related to our language.
The most important thing to understand is how we can make a pml declaration, i.e. a declaration
in which we identify the Pentaformat dimension associated with some elements of an XML
document. A pml declaration is formed by the following four elements:
• the Pentaformat dimension that we consider;
• name, a specific type related to the current dimension. The allowed values are specific to each dimension. All these values must be strings of alphabetical characters without numbers or spaces;
• ref, which represents the items related to the current dimension;
• content, which represents the object of the declaration.
A pml declaration can be written as we report in Code 8. In this definition, “ref” and
“content” are two XPath 2.0 [BBC07a] queries: for the former the context node
is the document root, while for the latter the context node is the “ref” sequence.
Code 8 Pml declaration
<pml
[dimension]
[name]
[ref]
[content]
>
Any pml declaration has a particular name depending on its “ref” and “content”. If
both values refer to a one-item sequence we call it an atomic pml declaration,
otherwise we call it a complex pml declaration. We illustrate this first simple
difference using the example in Code 9.
Code 9 An extract from Alice's Adventures in Wonderland
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Extract from Alice's Adventures in Wonderland</title>
</head>
<body>
<h1>
Extract from <em>Alice's Adventures in Wonderland</em>
</h1>
<p>
Alice was beginning to get very tired of sitting by her sister
on the bank, and of having nothing to do: once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, <q>and what is the use of a
book,</q> thought Alice, <q>without pictures or conversation?</q>
</p>
<p>
So she was considering, in her own mind (as well as she could,
for the hot day made her feel very sleepy and stupid), whether
the pleasure of making a daisy-chain would be worth the trouble
of getting up and picking the daisies, when suddenly a White
Rabbit with pink eyes ran close by her.
</p>
</body>
</html>
We can see some examples of these pml declarations in Code 10.
Code 10 Two declarations for Code 9
<pml
content
Text
//(p|h1|q)
text()
>
<pml
content
Text
//em
text()
>
The first declaration described in Code 10 is labelled complex because there is more than
one item in the “ref” sequence and, sometimes, more than one item in the “content” sequence. On
the other hand, the second declaration reported in Code 10 is labelled atomic because both the
“ref” and the “content” sequences are formed by one item only.
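To make the atomic/complex distinction concrete, here is a minimal Python sketch (not part of the PML implementation; ElementTree is used as a rough stand-in for a real XPath 2.0 engine, and the text() query is approximated by collecting text nodes):

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"

DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <h1>Extract from <em>Alice's Adventures in Wonderland</em></h1>
    <p>Alice was beginning to get very tired of sitting by her sister...</p>
  </body>
</html>"""

def classify(root, ref_tags):
    """Label a declaration 'atomic' when both the ref sequence and the
    content sequence hold exactly one item, 'complex' otherwise."""
    refs = [el for tag in ref_tags for el in root.iter(XHTML + tag)]
    contents = [t for el in refs for t in el.itertext() if t.strip()]
    return "atomic" if len(refs) == 1 and len(contents) == 1 else "complex"

root = ET.fromstring(DOC)
print(classify(root, ["p", "h1"]))  # first declaration of Code 10 -> complex
print(classify(root, ["em"]))       # second declaration of Code 10 -> atomic
```

Run on the extract above, the first declaration selects several “ref” items and is therefore complex, while the declaration on “em” selects a single element with a single text node and is atomic.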
A pml sequence is the sequence identified by the query “([ref])/([content])”, which
selects the document items that are objects of the pml set (the set of all declarations related to the
document). Considering the example in Code 10, the pml sequence for the first pml declaration
is (//(p|h1|q))/(text()). The sequence of all the pml sequences related to a specific
dimension is called a dimension sequence. An example of this kind of sequence, referred to the
content dimension, is shown in Code 11.
Code 11 The content sequence for the pml declaration in Code 10.
(
((//(p|h1|q))/(text())),
((//em)/(text()))
)
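The construction of pml sequences and dimension sequences is purely syntactic, so it can be sketched in a few lines of Python (a hypothetical helper, not part of the PML implementation):

```python
def pml_sequence(ref, content):
    """Build the "([ref])/([content])" query of a single pml declaration."""
    return "({})/({})".format(ref, content)

def dimension_sequence(declarations):
    """Collect the pml sequences of one dimension, as in Code 11."""
    items = ", ".join("({})".format(pml_sequence(r, c)) for r, c in declarations)
    return "({})".format(items)

content_decls = [("//(p|h1|q)", "text()"), ("//em", "text()")]
print(pml_sequence(*content_decls[0]))   # (//(p|h1|q))/(text())
print(dimension_sequence(content_decls))
# (((//(p|h1|q))/(text())), ((//em)/(text())))
```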
In order to attach these pml declarations to an XML document, such as Code 9, we have defined
a small grammar that permits the insertion of a small group of elements and attributes into any XML
document. All the information about the dimensions is introduced by a specific qualified
“dimensions” element. A general rule specifies that there must be at most one “dimensions”
element in the document (its position is irrelevant). This element contains all the
pml declarations, specified by five qualified elements - “content”, “structure”, “presentation”,
“metadata” and “behaviour” - each of which has three qualified attributes: “name”, “ref” and
“content”. A dimension element and its three attributes represent a pml declaration, as we can
see in Code 12.
Code 12 An extract from Alice's Adventures in Wonderland with the pml declarations specified in Code 10
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:pml="http://www.essepuntato.it/PML">
<pml:dimensions>
<pml:content
pml:name="Text"
pml:ref="//(p|h1|q)"
pml:content="text()"
/>
<pml:content
pml:name="Text"
pml:ref="//em"
pml:content="text()"
/>
</pml:dimensions>
<head>
<title>Extract from Alice's Adventures in Wonderland</title>
</head>
<body>
<h1>
Extract from <em>Alice's Adventures in Wonderland</em>
</h1>
<p>
Alice was beginning to get very tired of sitting by her sister
on the bank, and of having nothing to do: once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, <q>and what is the use of a
book,</q> thought Alice, <q>without pictures or conversation?</q>
</p>
<p>
So she was considering, in her own mind (as well as she could,
for the hot day made her feel very sleepy and stupid), whether
the pleasure of making a daisy-chain would be worth the trouble
of getting up and picking the daisies, when suddenly a White
Rabbit with pink eyes ran close by her.
</p>
</body>
</html>
To conclude, we introduce the following two auxiliary items of the PML language:
• a qualified attribute called “pid” (PML id) that can be used on any element of the original document;
• a qualified element called “stone”, with an optional “pid” attribute, that can contain any type of node.
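The “pid” attribute simply gives an element a handle that later XPath queries can target (e.g. //p[@pml:pid='p1']). A minimal Python sketch, illustrative only, with ElementTree emulating such a lookup:

```python
import xml.etree.ElementTree as ET

# Clark-notation name of the qualified pid attribute.
PML_PID = "{http://www.essepuntato.it/PML}pid"

doc = ET.fromstring(
    '<html xmlns:pml="http://www.essepuntato.it/PML">'
    '<p pml:pid="p1">target paragraph</p>'
    '<p>another paragraph</p>'
    '</html>')

# Emulate the query //p[@pml:pid='p1'] by scanning for the attribute.
target = next(el for el in doc.iter() if el.get(PML_PID) == "p1")
print(target.text)  # target paragraph
```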
In Section 3.2.2 we will introduce some examples to understand how to segment an XML
document using these elements.
3.2.2 Examples
In order to understand how we can use all the elements presented in Section 3.2.1, we
illustrate two simple segmentation examples. We take into consideration the example in Code 13,
which extends the example in Code 9 with a complete pml set.
Code 13 An extract from Alice's Adventures in Wonderland with a complete pml set
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:pml="http://www.essepuntato.it/PML">
<pml:dimensions>
<pml:content pml:name="Text" pml:ref="//body//element()" pml:content="text()" />
<pml:structure pml:name="Root" pml:ref="/html" pml:content="." />
<pml:structure pml:name="Head" pml:ref="//head" pml:content="." />
<pml:structure pml:name="Title" pml:ref="//title" pml:content="." />
<pml:structure pml:name="Body" pml:ref="//body" pml:content="." />
<pml:structure pml:name="Paragraph" pml:ref="//p" pml:content="." />
<pml:structure pml:name="Heading" pml:ref="//h1" pml:content="." />
<pml:structure pml:name="Emphasis" pml:ref="//em" pml:content="." />
<pml:structure pml:name="Citation" pml:ref="//q" pml:content="." />
<pml:metadata
pml:name="Title"
pml:ref="/"
pml:content="//title/text()|concat(//h1/text()[1],//h1/em/text())" />
</pml:dimensions>
<head>
<title>Extract from Alice's Adventures in Wonderland</title>
</head>
<body>
<h1>
Extract from <em>Alice's Adventures in Wonderland</em>
</h1>
<p>
Alice was beginning to get very tired of sitting by her sister
on the bank, and of having nothing to do: once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, <q>and what is the use of a
book,</q> thought Alice, <q>without pictures or conversation?</q>
</p>
<p>
So she was considering, in her own mind (as well as she could,
for the hot day made her feel very sleepy and stupid), whether
the pleasure of making a daisy-chain would be worth the trouble
of getting up and picking the daisies, when suddenly a White
Rabbit with pink eyes ran close by her.
</p>
</body>
</html>
As we can see in Code 13, the pml declaration referred to the metadata dimension has the
value of its “ref” attribute set to “/”. As we know, this XPath refers to the document root. In
PML, all XPath queries in “ref” that refer to the document root concern the document itself.
Therefore this metadata declaration represents a piece of metadata for the document: in this particular
case it is the title of the document.
The segmentation of the next document, presented in Code 14, is more complex than the
previous one (Code 13).
Code 14 An example written for the PML testing
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Dreaming Pentaformat - Home</title>
</head>
<body style="text-align:center;">
<div class="header">
<h1>
<img src="pentagon.png" alt="A pentagon" title="Logo" />
Home
</h1>
<p>
All the descendant elements of the <q>div <i>class</i> header</q> element aren't content.
They aren't written by S. but they are the result of an automatic process.
</p>
</div>
<div class="content">
<p>
You are in the <a href="whatis.html" title="What is this?">Pentaformat
Project <img src="small_pentagon.png" alt="A small pentagon" /></a> home page.
</p>
</div>
</body>
</html>
Code 14, written for PML testing, can be segmented along all five dimensions,
using the PML auxiliary items to perform a more accurate segmentation. We report an example
of this improved segmentation in Code 15.
Code 15 A segmentation for Code 14
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:pml="http://www.essepuntato.it/PML">
<pml:dimensions>
<pml:content
pml:name="Text"
pml:ref="//p[@pml:pid='p1']/descendant-or-self::element()"
pml:content="text()|@title|@alt" />
<pml:content pml:name="Picture" pml:ref="//p[@pml:pid='p1']/img" pml:content="." />
<pml:structure pml:name="Divider" pml:ref="//div" pml:content="." />
<pml:structure pml:name="Paragraph" pml:ref="//p" pml:content="." />
<pml:structure pml:name="Root" pml:ref="/html" pml:content="." />
<pml:structure pml:name="Head" pml:ref="//head" pml:content="." />
<pml:structure pml:name="Title" pml:ref="//title" pml:content="." />
<pml:structure pml:name="Body" pml:ref="//body" pml:content="." />
<pml:structure pml:name="Heading" pml:ref="//h1" pml:content="." />
<pml:structure pml:name="Image" pml:ref="//img" pml:content="." />
<pml:structure pml:name="Link" pml:ref="//a" pml:content="." />
<pml:structure pml:name="Emphasis" pml:ref="//i" pml:content="." />
<pml:structure pml:name="Citation" pml:ref="//q" pml:content="." />
<pml:presentation pml:name="InnerCSS" pml:ref="//body" pml:content="@style" />
<pml:presentation
pml:name="Header"
pml:ref="//div[@class = 'header']/(.|.//element())"
pml:content="text()|.|@title|@alt"/>
<pml:presentation pml:name="Logo" pml:ref="//img[@title = 'logo']" pml:content="."/>
<pml:metadata pml:name="Title" pml:ref="/" pml:content="//head/title/text()|//h1/text()" />
<pml:metadata pml:name="Title" pml:ref="//img|//a" pml:content="@title" />
<pml:metadata pml:name="Source" pml:ref="//img" pml:content="@src" />
<pml:metadata pml:name="Description" pml:ref="//img" pml:content="@alt" />
<pml:metadata pml:name="Author" pml:ref="/" pml:content="//pml:stone[@pml:pid='s1']/text()" />
<pml:behaviour pml:name="OpenLinkedDocument" pml:ref="//a" pml:content="@href" />
</pml:dimensions>
<head>
<title>Dreaming Pentaformat - Home</title>
</head>
<body style="text-align:center;">
<div class="header">
<h1>
<img src="pentagon.png" alt="A pentagon" title="Logo" />
Home
</h1>
<p>
All the descendant elements of the <q>div <i>class</i> header</q> element aren't content.
They aren't written by <pml:stone pml:pid="s1">S.</pml:stone> but they are the result of
an automatic process.
</p>
</div>
<div class="content">
<p pml:pid="p1">
You are in the <a href="whatis.html" title="What is this?">Pentaformat
Project <img src="small_pentagon.png" alt="A small pentagon" /></a> home page.
</p>
</div>
</body>
</html>
In this example we have used both the “pid” attribute and the “stone” element: the former
is used to refer to a specific element “p” in the first content declaration, in order to exclude the
child element “p” of the element “div class header”; the latter is used to identify the document
author (as we can see in the last metadata declaration).
We think that these two examples clarify the use of PML to segment an XML document.
The output of the new version of elISA [DVV04] is a PML document obtained through the
application of a meta-XSLT [Kay07] together with an XML document that specifies the
rules to identify the roles of the input document elements. In Section 3.3 we will discuss
this new engine, introducing its new features.
3.3 The new Extraction of Layout Information via
Structural Analysis
After the introduction of the Pentaformat model [Dii07] in Section 3.1, which allows us to
segment any document using five different but connected dimensions (content, structure,
presentation, behaviour, metadata), and after the explanation of our language - the Pentaformat
Markup Language or PML - that we use to segment XML documents, in this section we introduce
elISA (Extraction of Layout Information via Structural Analysis) [DVV04], a rule-based engine
to segment XML documents into two main dimensions: content and presentation. Our goal was to
rewrite the engine so that it segments XML documents according to the Pentaformat model,
producing a PML document as output.
First of all we want to introduce the main context, concerning the extraction of data (see
Chapter 2), in which elISA works. Regarding the recognition of structured content, some
studies ([Ven03], [Bag04] and [DVV04]) use elISA to extract all the content of a web page
in order to realize a process of global editability for an ambitious project of the University of
Bologna called ISAWiki [http://tesi.fabio.web.cs.unibo.it/Tesi/IsaWiki] [DV04], a client/server
platform used to create, modify or reuse any web page. We will explain this framework,
clarifying the role of elISA, in Section 3.3.1.
After this brief overview, we will introduce all the features of the new version (2.0) of
elISA in order to clarify what kinds of processing can be performed on XML documents. The
details of these features will be explained in Section 3.3.2.
3.3.1 elISA: a rib of ISAWiki
To understand the context in which elISA [DVV04] - the rule-based engine, presented in
Section 2.4, that segments XML documents according to content and presentation - works, we
must introduce the framework that uses it: ISAWiki [DV04]. It is a client/server platform,
inspired by Ted Nelson's Xanadu project [Nel80], in which every registered user can create, modify
or reuse any web page through a client-side editor, the ISAWiki editor. All the pages created or
modified through this editor are saved on an ISAWiki server in an intermediate language, called
ISAWiki Markup Language or IML [San06]. This language is used to store the structured content
of the document, leaving out the presentation. This process handles version control and local
storage of all the new or modified documents, whereas all the original documents remain on
their respective servers. In addition, when we have modified a web page we can ask
ISAWiki to transform it into one of the following seven formats: HTML, XML, PDF, DOC,
ODF, Wiki and LaTeX.
What a user can modify in a web page was the main issue discussed by the ISAWiki
developers. They chose to allow the editing of the content-related parts - such as the text of
an article, the images concerning an article, and so on - denying any change to the presentation or
the dynamic behaviour of a web page. They therefore needed to identify what in a web page is content
and what is not. For this reason they developed an engine to perform this task: Extraction of Layout
Information via Structural Analysis, or elISA. This engine is able to identify the content of a web
page and some typical layout elements, such as logos, layout tables, advertising banners and so on.
The elISA processing is based on a set of rules (specified by an XML document) that allows the
identification of the roles of the XML document elements through a structural analysis.
The main goal of this version of elISA is to identify the content of a web page. The engine
includes three main components:
• a set of rules (written in compliance with a specified DTD grammar) to identify the role of the document elements on the basis of their structure;
• a meta-XSLT [Cla99] that takes the rule-set document as input and returns a new XSLT to transform the original document into another one;
• a client interface, written in Javascript, that allows the users to run the engine and to see the result that it produces.
The engine can produce two different results: the former is a new document
where every specific part is colored according to its role; the latter is a new IML document, i.e.
a document in which we keep the structured content, leaving out all the other dimensions.
We can see an example of this ISAWiki processing in Picture 14.
Picture 13 An elISA analysis
In this picture you can see a web page from the CNN web site and the
same page scanned by elISA. As you can see, the cyan zones are text areas,
while the orange areas are layout cells and the pink zones are navigation zones.
This version of elISA requires a well-formed XML document to work. Today, as we know,
most web pages are not XML. In fact, many articles of online newspapers use a mixed
language between HTML (which is not XML but SGML) and XHTML, obtaining ugly markup
results, as we can see in Code 16.
Picture 14 ISAWiki process with the old version of elISA
Code 16 Not well-formed markup in the article “Iraq Lacks Plan on the Return of Refugees,
Military Says” (The New York Times)
<a
href="javascript:pop_me_up2('http://www.nytimes.com/imagepages/2007/11/29/
world/20071130_REFUGEES_GRAPHIC.html','439_983','width=439,height=983,location=no,scrollbars=yes,tool
bars=no,resizable=yes')">
<IMG
src="http://graphics8.nytimes.com/images/2007/11/29/world/20071130_REFUGEES_GRAPHIC19.jpg"
height="126"
width="190"
alt="A New, Sectarian Map" border="0">
<span class="mediaType graphic">
Graphic
</span>
</a>
The well-forming process is performed by an external tool in order to prepare a correct input
for elISA. In addition, the use of the Pentaformat model in ISAWiki for document
analysis is not achievable with this specific version of elISA because it handles only the content and
(part of) the presentation dimensions. For this reason we have developed a new elISA engine
(version 2.0), written in Java, in order to use all the capabilities of the Pentaformat model,
extending the old version with a couple of new features. We will discuss them in Section 3.3.2.
3.3.2 New features
In order to allow the use of the Pentaformat model in the ISAWiki context [DV04], we
have developed a new version (2.0) of elISA [DVV04] to segment XML documents using
the language presented in Section 3.2, PML. In this section we analyze the new features that
characterize this new engine.
Like its ancestor, elISA 2.0 allows the specification of a rule-set through an XML document that
complies with a specific grammar written in RelaxNG [Oas01]. This grammar is quite similar to
the old DTD grammar, except that we have added some new elements in order to help users in
writing rules. We can see an example of a rule-set document in Code 17.
Code 17 A (very small) rule-set document for the “div” element only
<?xml version="1.0" encoding="UTF-8"?>
<rules xmlns="http://www.essepuntato.it/Rules">
<rule context="div">
<call name="ancestor.class" select="ancestor-or-self::div[exists(@class)]" />
<check>
<whenever test="empty(text()[normalize-space() != '']) or exists(.//div)">
<setStructure name="Divider" ref="." content="." weight="1.0"/>
</whenever>
<otherwise>
<setStructure name="Paragraph" ref="." content="." weight="1.0"/>
<setContent name="Text" ref="." content="text()[normalize-space() != '']" weight="0.3"/>
</otherwise>
</check>
<check>
<whenever>
<test>
<containsOnly contextString="for $el in $ancestor.class return $el/@class">
<value>post</value>
<value>body</value>
<value>content</value>
</containsOnly>
</test>
<setContent name="Text" ref="." content="text()[normalize-space() != '']" weight="0.5"/>
</whenever>
</check>
<check>
<whenever test="exists(@style)">
<setPresentation name="InnerCSS" ref="." content="@style" weight="1.0" />
</whenever>
<whenever test="exists(@onload)">
<setBehaviour name="OnLoad" ref="." content="@onload" weight="1.0" />
</whenever>
<whenever test="exists(@title)">
<setMetadata name="Description" ref="." content="@title" weight="1.0" />
</whenever>
</check>
</rule>
</rules>
In this small document we have defined a simple (and somewhat naive) rule to handle all the
“div” elements of a (X)HTML document using the element “rule”, specifying as “context” an
XPath 2.0 query that refers to all these elements. Inside this “rule” element, we have made a
variable declaration - the element “call” named “ancestor.class”. It collects, through an XPath
2.0 query, the current element and all the ancestor elements “div” that have a “class” attribute.
Moreover, we have specified some conditional checkpoints. Every checkpoint has one or more if
statements called “whenever”, which allow us to specify (through “setContent”, “setStructure” and
so on) one or more pml declarations for all the elements referred to in “ref” and “content” (where
the “ref” sequence represents the current context for the “content” sequence, as usual). However,
these are not classic pml declarations but probabilistic pml declarations. In
fact, through the attribute “weight”, we associate a value from 0 to 1 with a pml declaration. Two
probabilistic pml declarations are similar if they have the same dimension and “name” and they
refer to the same “ref” and “content” elements. Two similar probabilistic pml declarations
can be rewritten as one probabilistic pml declaration whose “weight” attribute is equal to the
sum of the “weight” attributes of the two previous declarations. To understand this point,
imagine we have three different “div” elements in a (X)HTML document:
• one has some text inside;
• one does not contain text but has a “class” attribute with the value “post”;
• one has both the previous features.
If we take into consideration the rule specified in Code 17, the weight values of the
probabilistic pml declarations (dimension = content, name = “Text”) referred to these three “div”
elements are:
• 0.3 for the “div” with text;
• 0.5 for the “div” without text and with the “class” specified;
• 0.8 for the “div” with text and with the “class” specified.
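The weight summation for similar declarations can be sketched as follows (a Python illustration of the rule's effect, not the actual Java implementation; the dictionary layout is an assumption made for the example):

```python
from collections import defaultdict

def merge_declarations(decls):
    """Sum the weights of 'similar' probabilistic pml declarations,
    i.e. those sharing dimension, name, ref and content."""
    merged = defaultdict(float)
    for d in decls:
        key = (d["dimension"], d["name"], d["ref"], d["content"])
        merged[key] += d["weight"]
    return dict(merged)

# The three hypothetical "div" elements of the example above:
decls = [
    # div #1: has text -> the rule's <otherwise> branch fires (0.3)
    {"dimension": "content", "name": "Text", "ref": "div1", "content": "text()", "weight": 0.3},
    # div #2: class 'post' -> the containsOnly check fires (0.5)
    {"dimension": "content", "name": "Text", "ref": "div2", "content": "text()", "weight": 0.5},
    # div #3: both conditions fire -> two similar declarations to merge
    {"dimension": "content", "name": "Text", "ref": "div3", "content": "text()", "weight": 0.3},
    {"dimension": "content", "name": "Text", "ref": "div3", "content": "text()", "weight": 0.5},
]

weights = merge_declarations(decls)
print(round(weights[("content", "Text", "div3", "text()")], 1))  # 0.8
```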
After we have specified all the probabilistic pml declarations for all the elements of a
document, we can make a choice among these declarations by evaluating their weight values. In
order to make these choices we have designed a simple new language, written in RelaxNG, to define
thresholds. As we can see in Code 18, we have defined a simple threshold for the same “div”
elements used in Code 17. This threshold, referred to the content dimension, takes into
consideration only the pml declarations referring to “div” elements with weight greater
than or equal to 0.75. We can find a complete explanation of the syntax and the semantics of these two
languages in Section 4.1.
Code 18 A possible thresholds document according to Code 17
<?xml version="1.0" encoding="UTF-8"?>
<thresholds xmlns="http://www.essepuntato.it/Thresholds">
<threshold context="div">
<content>
<select>
<weight ge="0.75"/>
</select>
</content>
<structure>
<bestWeight priority="Divider Paragraph"/>
</structure>
<presentation>
<select>
<weight gt="0.8"/>
</select>
</presentation>
<metadata>
<select>
<weight gt="0.6"/>
<name value="Description"/>
</select>
</metadata>
<behaviour>
<select>
<weight ge="0.7"/>
</select>
</behaviour>
</threshold>
</thresholds>
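The effect of the weight-based selection on the content dimension can be sketched like this (again an illustrative Python stand-in for the Java engine, reusing the declaration layout of the previous sketch):

```python
def apply_threshold(declarations, dimension, min_weight):
    """Keep only the declarations of the given dimension whose merged
    weight reaches the threshold (the <select><weight .../> case)."""
    return [d for d in declarations
            if d["dimension"] == dimension and d["weight"] >= min_weight]

# Merged weights for the three hypothetical "div" elements above.
decls = [
    {"dimension": "content", "name": "Text", "ref": "div1", "weight": 0.3},
    {"dimension": "content", "name": "Text", "ref": "div2", "weight": 0.5},
    {"dimension": "content", "name": "Text", "ref": "div3", "weight": 0.8},
]
kept = apply_threshold(decls, "content", 0.75)
print([d["ref"] for d in kept])  # ['div3']
```

Only the third “div”, whose two similar declarations sum to 0.8, passes the 0.75 threshold.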
So far we have described the new features added to elISA 2.0 - such as pml declaration handling and
the thresholds - in comparison with the old version. Moreover, we can now control two other aspects of
the whole process that were handled externally in the previous version of the engine: a
well-former and a plugin loader. The former well-forms a not well-formed document
so that it can be used in the elISA processing. The latter is able to load plugins, written
according to a specific Java interface, in order to add information (new elements, new attributes
and so on) to the input document. For example, if we want to add information about
the dimensions of all the pictures of a document we can write a plugin that:
• gets the source file of each “img” element;
• works out the width and the height of each picture;
• inserts this information as qualified attributes of the respective “img” element.
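These three steps can be sketched as follows (the real plugin interface is a Java interface; this Python sketch, the attribute namespace and the `get_dimensions` lookup are illustrative assumptions):

```python
import xml.etree.ElementTree as ET

PML_NS = "http://www.essepuntato.it/PML"
XHTML = "{http://www.w3.org/1999/xhtml}"

def image_size_plugin(root, get_dimensions):
    """For each img element, look up its source file and record the
    width/height as qualified attributes of that element."""
    for img in root.iter(XHTML + "img"):
        src = img.get("src")
        if src is None:
            continue
        w, h = get_dimensions(src)  # e.g. open the file and read its header
        img.set("{%s}width" % PML_NS, str(w))
        img.set("{%s}height" % PML_NS, str(h))

doc = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<img src="pentagon.png"/></body></html>')
image_size_plugin(doc, lambda src: (64, 64))  # stub lookup for the sketch
img = doc.find('.//' + XHTML + 'img')
print(img.get("{%s}width" % PML_NS))  # 64
```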
We will discuss in depth how to make a plugin for elISA 2.0 in Section 4.1.
As we can see in Picture 15, the result of this three-step process is a PML document
containing all the pml declarations selected in the thresholds step. This is the most important
difference between the old version of elISA and elISA 2.0: while the former returns an IML
document, the new version returns a document with more information than an IML document.
The problem is that PML is not supported by the current version of ISAWiki. In order to use
elISA 2.0 in the ISAWiki platform we must convert PML documents into IML documents.
In this section we have discussed all the new elISA 2.0 features and we have introduced the
complication of using this new engine in ISAWiki (the platform that we have presented in Section
3.3.1). The conversion from PML into IML seems easy but it hides a trap. In fact, any IML
document has another fundamental feature: it complies with seven structural patterns to organize
the content. Obviously PML does not comply with this structural model because it does not force a
hierarchical structuring of the content. For this reason we need a patterning engine that converts
a PML document into a new patterns-compliant document with the same pml declarations as
the original document. We will explain all these matters in depth in Section 3.4.
Picture 15 The three steps of elISA analysis
3.4 From PML to IML
As we have seen in Section 3.3, elISA 2.0 generates a PML document as the result of its analysis. Unfortunately, for a complete integration into the ISAWiki framework [DV04] we need to convert this kind of output into an IML document [San06], because the latter is the language in which ISAWiki stores its documents. This conversion is not easy because IML - but not PML - has another important feature: it structures content according to seven structural patterns [Dii07]. In principle there is another way to make the use of PML in ISAWiki possible: changing the whole structure of ISAWiki so that it handles PML as the intermediary language in which it saves all documents. We do not take this latter proposal into consideration for one main reason: changing ISAWiki to include PML is too complex because of the large size of the platform. We have therefore chosen to develop an automatic mechanism to convert a PML document into an IML document.
In this section we will introduce the issue of structuring the content of XML documents according to some structural patterns, and we will explain the benefits of this approach (Section 3.4.1). Then we will introduce, through some examples, the seven patterns that can be used to structure the content of an XML document (Section 3.4.2). After we have understood what content model characterizes these patterns and after we have recalled (Section 3.4.3) that the patterned structure is the main difference between IML and PML, we will propose four operations to pattern XML elements (Section 3.4.4). In Section 3.4.5 we will introduce a new language, and the related engine, to pattern an XML document while preserving the old unpatterned structure. We have called this language PML patterns.
3.4.1 The issue of structured content
The “PML to IML” conversion seems easy on the surface: we must transform a document with five dimensions into a document with two dimensions only. This is true but, beyond the two dimensions, IML has a further feature: all its structures comply with a specific structural pattern.
As we know, the Pentaformat model (and PML too) does not force a hierarchy among the dimensions or among the elements belonging to a particular dimension. For example, a particular structure for the content is specified by the document authors. [Dii07] suggests a solution to avoid problems in the content structuring of XML documents, through a model used to express and normalize the structured content of any document. Moreover this model, based on seven structural patterns, is useful to capture two of the five Pentaformat dimensions: the content and the structure. The reason for using a pattern approach during content structuring is well grounded in the scientific literature: when we find a group of similar problems, it is useful to find a common approach to solve them. Such a common solution is called a pattern. This approach was suggested by the architect Christopher Alexander in [Ale79] and was reused in the Computer Science field by Erich Gamma et al. [GHJ95].
There are some positive aspects to the use of patterns, among which:
• the possibility to reuse a particular solution in different contexts or projects;
• the possibility to handle the structure of a document easily, providing a clear organization of all its elements;
• the possibility to make complex structures easy and understandable by composing different simple patterns.
On this basis, in Section 3.4.2 we will introduce the seven structural patterns used in IML to structure content. Understanding how IML uses these patterns is fundamental to developing an engine that converts a PML document into a new patterned document with the same pml declarations. Transforming this new patterned document into an IML document is then really easy, and it allows a PML document to be used in ISAWiki.
3.4.2 Seven patterns
As we introduced in Section 3.4.1, IML uses some structural patterns to structure content. This is the most important difference between IML and PML. Our goal is to convert PML documents into IML documents in order to use elISA 2.0 (Section 3.3) in ISAWiki [DV04]. Before realizing this conversion we must understand what these patterns are and what kind of content models they have. We try to clarify these issues in this section.
In [Dii07], [DDD07] and [Gub04] the authors suggest some patterns to structure any XML document through the definitions of the respective content models. In this section we draw a clear distinction between these seven patterns. All the examples in the following descriptions refer to the (X)HTML grammar, and we consider, as usual, that any element is associated with one pattern only.
The first pattern we introduce is the marker, i.e. an empty element that can have zero or more attributes. This pattern is split into two subpatterns according to the context:
• we call milestones all the markers whose most relevant feature is the position they assume in the document. All the attributes associated with this kind of element are metadata of the element itself. E.g., the element “img” (with its attribute “src”) is a perfect example of this subpattern;
• we call meta all the markers whose existence is important while their position is not relevant. These elements assume the same value independently of the position they have in the document. A good example of an element that complies with this subpattern is “meta”.
We can see a use of these two subpatterns in Code 19.
Code 19 Using markers in a (X)HTML document
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="author" content="Silvio Peroni"/>
<meta name="description" content="An example in which we introduce the use of markers"/>
</head>
<body>
<p>
In this paragraph we insert a picture like
<img src="http://www.essepuntato.it/point.png" alt="A picture"/>
to exemplify the use of <em>milestone</em> markers.
</p>
</body>
</html>
An atom is the pattern for all the elements that can contain text only. Two elements that
use this pattern are “title” and “script”, as we can see in Code 20.
Code 20 Using atoms in a (X)HTML document
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>What is an atom?</title>
</head>
<body>
<p>
This document introduces two examples of <em>atoms</em>. One is the element <q>title</q>, the other
one is the element <q><script type="text/javascript">document.write("script")</script></q>.
</p>
</body>
</html>
The next two patterns, inline and block, have the same content model but differ in one aspect: the former can contain itself, while the latter cannot. Generally they contain elements that comply with the milestone marker, atom or inline patterns (all repeatable), and they can also contain text. As we can see in Code 21, many elements comply with these two patterns, for example “p” and “h1” for block and “em”, “i” and “b” for inline.
Code 21 Using inlines and blocks in a (X)HTML document
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Inlines and blocks: an old story</title>
</head>
<body>
<h1>An <em>old</em> story</h1>
<p>
In this example we introduce two different <i>patterns</i>, <b>inline</b> and <b>block</b>,
and some respective elements.
</p>
</body>
</html>
The last three patterns - container, table and record - concern the organization of the content only. In fact they do not contain text, but only elements that comply with the following patterns: meta marker, atom, block, container, table and record. The difference among these three patterns is how they handle element repeatability:
• the container pattern contains elements that are all optional or repeatable. For example, a “div” (used as a tag without text) can contain zero or more paragraphs, zero or more lists and so on;
• the table pattern contains homogeneous and repeatable elements. Good examples of elements that comply with this pattern are “ul” and “ol”;
• the record pattern contains elements that are all optional and not repeatable.
A full example that introduces all these patterns is in Code 22.
Code 22 Using containers, tables and records in a (X)HTML document
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>About the structure of a document</title>
</head>
<body>
<p>
In this example we introduce the three patterns used to structure a document:
</p>
<ul>
<li><p>container;</p></li>
<li><p>table;</p></li>
<li><p>record.</p></li>
</ul>
<p>
Are they enough?
</p>
</body>
</html>
Besides the content models, summarized in Table 1, each of these seven patterns has an associated characterization called behaviour, which modifies, where possible, the content model of an element. There are three possible behaviours:
• the standard behaviour, which does not modify the original content model;
• the additive context behaviour, which allows one or more elements to be added to the element that complies with this behaviour and to its descendants;
• the subtractive context behaviour, which allows one or more elements to be removed from the element that complies with this behaviour and from its descendants.
There are two main benefits in using these seven patterns to structure a document:
• with these seven patterns we can build a document with a clear structure, knowing at any time the real role of any element;
• if we know that a document is built with these patterns, then we can deduce, with the algorithm described in [DDD07], the pattern associated with any element.
| pattern    | EMPTY | text | milestones | meta | atom | inline | block | container | table | record |
|------------|-------|------|------------|------|------|--------|-------|-----------|-------|--------|
| milestones |   X   |      |            |      |      |        |       |           |       |        |
| meta       |   X   |      |            |      |      |        |       |           |       |        |
| atom       |       |  X   |            |      |      |        |       |           |       |        |
| inline     |       |  X   |     X      |      |  X   |   X    |       |           |       |        |
| block      |       |  X   |     X      |      |  X   |   X    |       |           |       |        |
| container  |       |      |            |  X   |  X   |        |   X   |     X     |   X   |   X    |
| table      |       |      |            |  X   |  X   |        |   X   |     X     |   X   |   X    |
| record     |       |      |            |  X   |  X   |        |   X   |     X     |   X   |   X    |
Table 1 Summarizing table for the content models of all patterns
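The content models of Table 1 can also be encoded directly as data. A minimal sketch (the dictionary transcribes the table, while the checking function is our own illustration, not part of IML):

```python
# Allowed children for each pattern, transcribed from Table 1.
# "TEXT" marks patterns whose content model admits text nodes.
CONTENT_MODEL = {
    "milestone": set(),                     # EMPTY
    "meta":      set(),                     # EMPTY
    "atom":      {"TEXT"},
    "inline":    {"TEXT", "milestone", "atom", "inline"},
    "block":     {"TEXT", "milestone", "atom", "inline"},
    "container": {"meta", "atom", "block", "container", "table", "record"},
    "table":     {"meta", "atom", "block", "container", "table", "record"},
    "record":    {"meta", "atom", "block", "container", "table", "record"},
}

def allowed(parent_pattern, child_pattern):
    """True if content of child_pattern may occur inside parent_pattern."""
    return child_pattern in CONTENT_MODEL[parent_pattern]
```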
In this section we have introduced the seven patterns that IML uses to structure the content.
In Section 3.4.3 we will explain how this feature is the most relevant difference between IML
and PML.
3.4.3 PML and IML: what is the difference?
Besides the issue about dimensions, the main difference between IML and PML is that
the former is strongly patterned while in the latter a patterned structure is completely optional.
Moreover a lot of web documents, such as well-formed (X)HTML documents, are not patterned.
We can take into consideration a common container element, such as a table “td”. As we know, a
container does not contain text. For this reason all the “td” elements in Code 23 are not patterned.
Generally speaking, in any (X)HTML document neither tables are patterned.
Code 23 An extract from the article “Table (information)” of Wikipedia
<p>
The following illustrates a simple table with three columns and six rows.
The first row is not counted, because it is only used to display the column
names. This is traditionally called a "header row".
</p>
<p>
<b>Age table:</b>
</p>
<table class="wikitable" border="5">
<tbody>
<tr>
<th>first</th>
<th>last</th>
<th>age</th>
</tr>
<tr>
<td>Nancy</td>
<td>Davolio</td>
<td>33</td>
</tr>
<tr>
<td>Nancy</td>
<td>Klondike</td>
<td>43</td>
</tr>
<tr>
<td>Nancy</td>
<td>Obesanjo</td>
<td>23</td>
</tr>
<tr>
<td>Justin</td>
<td>Saunders</td>
<td>37</td>
</tr>
<tr>
<td>Justin</td>
<td>Timberlake</td>
<td>26</td>
</tr>
<tr>
<td>Amy</td>
<td>Mes</td>
<td>11</td>
</tr>
</tbody>
</table>
The goal is to convert a PML document into an IML document in order to use elISA 2.0 in ISAWiki. To allow this conversion we have developed an engine that can pattern any XML document, on the basis of some rules, while preserving all the pml declarations. Before introducing this new engine, we need to illustrate what kind of operations we can use to pattern XML documents. We will discuss this issue in Section 3.4.4.
3.4.4 Patterning process
We have explained that to use elISA 2.0 in ISAWiki we need to convert PML documents into IML documents. This operation is not easy because IML - but not PML - structures content according to seven structural patterns, as we introduced in Section 3.4.2. The point is to pattern PML documents through a specific engine that we have developed. To understand how we can specify patterning rules for the engine, we need to explain what kind of operations we can use to pattern the elements of any XML document. In this section we answer this question.
After an analysis based on some simple and some complicated examples, we have identified these two main operations:
• the wrap operation on an element inserts one or more elements as its children in order to pattern its structure;
• the unwrap operation on an element removes it from the document.
These two operations represent the minimum set of operations needed to pattern any XML document. Now, to understand what these operations can do, let us look at some examples. As we can see in Code 24 (a small example of a non-patterned document in which “div” is a container and “i” is an inline), the “div” element does not comply with its pattern because it contains text and an inline element.
Code 24 A small example of a non-patterned document
<?xml version="1.0" encoding="UTF-8"?>
<div>
This is a little example to understand the <i>wrap</i> operation.
</div>
To pattern the document we apply a wrap on the “div” with a block element like “p” in order to enclose all its children. Through this simple operation we transform the original document in Code 24 into a patterned document, as we can see in Code 25.
Code 25 A patterned version of the document in Code 24
<?xml version="1.0" encoding="UTF-8"?>
<div>
<p>
This is a little example to understand the <i>wrap</i> operation.
</p>
</div>
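As an illustration, the wrap just applied can be sketched with Python's standard xml.etree library. The real engine is rule-driven and written in Java; this sketch only shows the mechanics of moving an element's content into a new child:

```python
import xml.etree.ElementTree as ET

def wrap(element, tag):
    """Wrap all children and text of `element` into a single new `tag` child."""
    wrapper = ET.Element(tag)
    wrapper.text = element.text        # move the leading text into the wrapper
    for child in list(element):        # move every child, keeping its tail text
        element.remove(child)
        wrapper.append(child)
    element.text = None
    element.append(wrapper)
    return element

div = ET.fromstring(
    "<div>This is a little example to understand the <i>wrap</i> operation.</div>")
wrap(div, "p")
print(ET.tostring(div, encoding="unicode"))
```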
We can also apply specific and multiple wrap operations. In the document presented in Code 26 there is a fake unordered list, because some text appears in an incorrect position. We want to pattern this text in order to make a correct list, and we must apply two different wrap operations to obtain it.
Code 26 Another non-patterned document
<?xml version="1.0" encoding="UTF-8"?>
<ul>
<li><p>Correct nesting;</p></li>
Text without a correct nesting;
<li><p>Another correct nesting.</p></li>
</ul>
Even in this case the solution is simple. We wrap the text children of “ul” with a container element “li”. Then, on this result, we apply another wrap on all the children of the new “li” (in this case, text nodes only) with a block element such as “p”, obtaining a perfectly patterned document, as we can see in Code 27.
Code 27 A patterned version of the document in Code 26
<?xml version="1.0" encoding="UTF-8"?>
<ul>
<li><p>Correct nesting;</p></li>
<li><p>Text without a correct nesting;</p></li>
<li><p>Another correct nesting.</p></li>
</ul>
To illustrate how the unwrap operation works, consider the document in Code 28. In this example we see an incorrect nesting of two paragraphs. This situation is not possible in a patterned document: a block cannot contain another block.
Code 28 A non-patterned document with too many paragraphs
<?xml version="1.0" encoding="UTF-8"?>
<p>
<p>
Too many paragraphs...
</p>
</p>
The solution for this example is easy: we apply an unwrap operation on the first paragraph, i.e. the document element, obtaining a perfectly patterned document, as we can see in Code 29.
Code 29 A patterned version of the document in Code 28
<?xml version="1.0" encoding="UTF-8"?>
<p>
Too many paragraphs...
</p>
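The mechanics of unwrap can be sketched in the same way. Since ElementTree cannot replace the document element itself, the sketch below hosts the nested paragraphs of Code 28 inside a “div”; text directly inside the unwrapped element is ignored for brevity:

```python
import xml.etree.ElementTree as ET

def unwrap(parent, element):
    """Replace `element` with its children inside `parent`."""
    index = list(parent).index(element)
    parent.remove(element)
    for child in reversed(list(element)):  # reinsert children in order
        parent.insert(index, child)
    return parent

doc = ET.fromstring("<div><p><p>Too many paragraphs...</p></p></div>")
unwrap(doc, doc.find("p"))
print(ET.tostring(doc, encoding="unicode"))
```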
How can we solve a situation such as the one in Code 30? In this case, to pattern the document, we must remove the first “p” and add a list element, through a multiple operation on the same element.
Code 30 Where is the list?
<?xml version="1.0" encoding="UTF-8"?>
<p>
<li><p>A list item;</p></li>
<li><p>Another list item;</p></li>
<li><p>But where is the list?</p></li>
</p>
So we apply an unwrap operation on the first “p” (the document element) and a wrap operation on all its children with a list element (for example “ul”). As we can see in Code 31, through this multiple operation we obtain a perfectly patterned document.
Code 31 A patterned version of the document in Code 30
<?xml version="1.0" encoding="UTF-8"?>
<ul>
<li><p>A list item;</p></li>
<li><p>Another list item;</p></li>
<li><p>But where is the list?</p></li>
</ul>
With these two operations (wrap and unwrap) we can pattern any XML document. However, there are contexts in which combining these two operations is not convenient. In Code 30, for instance, instead of using the multiple operation we could use a rename operation to change the element “p” immediately, without any direct wrap or unwrap. Another example of this kind of trouble is introduced in Code 32. In this case we want to pattern the document by changing the position of the element “b” in order to place it in a correct location.
Code 32 A non-patterned document with some text in bold
<?xml version="1.0" encoding="UTF-8"?>
<div>
<b>
<p>
Text with a <i>bold</i> style.
</p>
<div>
<p>
Another text with a <i>bold</i> style.
</p>
</div>
</b>
</div>
A possible solution is to apply an unwrap operation on “b” and a wrap operation on every “p” with a new element “b”. But this may be too complicated: it would be easier to use a specific operation applied only to the element “b”, obtaining the same result, illustrated in Code 33.
Code 33 A patterned version of the document in Code 32
<?xml version="1.0" encoding="UTF-8"?>
<div>
<p>
<b>Text with a <i>bold</i> style.</b>
</p>
<div>
<p>
<b>Another text with a <i>bold</i> style.</b>
</p>
</div>
</div>
To solve these situations we introduce two new operations, obtained by composing wrap and unwrap:
• rename an element in order to change its pattern or its name. This operation is obtained by applying an unwrap and a wrap on the element that we want to rename;
• inject an element in order to remove it and re-inject it into all the descendant elements that can accept it as a child according to their content model.
With the inject alone we can apply multiple wrapping operations, as we have already seen for wrap and unwrap.
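Since rename is defined as an unwrap followed by a wrap on the same content, its net effect on a single element can be sketched as a simple retagging (an equivalent shortcut for the composition, not the engine's actual implementation):

```python
import xml.etree.ElementTree as ET

def rename(element, new_tag):
    # Net effect of unwrap + wrap applied to the same element:
    # the content stays in place, only the element name changes.
    element.tag = new_tag
    return element

doc = ET.fromstring("<p><li>A list item;</li><li>But where is the list?</li></p>")
rename(doc, "ul")
print(ET.tostring(doc, encoding="unicode"))
```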
An engine based on these four operations must have two fundamental features: it must preserve the old unpatterned structure of the document, and it must preserve all the pml declarations even when the document structure changes because of some patterning operation. To provide the former feature we have developed another language, called PML patterns (or PMLp): we will illustrate it in Section 3.4.5, introducing our engine to pattern XML documents. Through this engine we can transform a PML document into a patterned PML+PMLp document and then convert the latter into an IML document, in order to use elISA 2.0 in ISAWiki.
3.4.5 PML patterns (PMLp)
In this section we introduce the language that we use to pattern XML documents and the patterning engine that we have developed. Together they are used to convert a PML document into an IML document: the chained application of elISA 2.0, the patterning engine and a simple meta-XSLT allows the use of elISA 2.0 in the current version of the ISAWiki platform. Let us now see how the patterning engine returns a perfectly patterned XML document with all the pml declarations of the old document.
Following Section 3.4.4, another problem for the patterning operation is how we can associate a pattern with any element of an XML document in order to pattern it. The answer in this case is PML. We can use this language to associate with any element of a document one of the seven patterns that we have seen in Section 3.4.2. For this reason we extend the current PML language with a new assumption: the “name” attribute of any pml declaration concerning the structure dimension has seven basic values - “Pmarker”, “Patom”, “Pinline”, “Pblock”, “Pcontainer”, “Ptable”, “Precord”. Any other possible value (such as “Paragraph”, “Divider” and so on) has a subclass relation with exactly one of the seven basic values.
On the basis of this new PML version, we have developed a Java rule-based patterning engine that solves any patterning issue. The document patterning realized by this engine produces a new document based on a new language called Pentaformat Markup Language and patterns, or simply PMLp. The patterning process carried out by the engine correctly preserves all the pml declarations of the original non-patterned document, even though some new elements may have been added or removed during the process. Now, in order to see how we can define the patterning rules and how the engine works, let us revisit the examples of Section 3.4.4.
To obtain a patterned document from the example in Code 24, we define a wrap rule for all the “div” elements, as we can see in Code 34. The patterning-rule document is an XML document in which we can define local or global variables with the element “variable”, and rules with the “pattern” element. A rule matches the XPath 2.0 expression contained in the attribute “match”. The “pattern” element is composed of some if/else-if statements, defined by a succession of “when” elements. As the child of a “when” we can specify only one of the four operations introduced in Section 3.4.4.
Code 34 A patterning rule-set to solve all the examples in Section 3.4.4
<?xml version="1.0" encoding="UTF-8"?>
<patterns xmlns="http://www.essepuntato.it/Patterns">
<variable name="text.inline" select="text()[normalize-space() != '']|element()[f:isInline(.)]"/>
<pattern match="div">
<choose>
<!-- Rule 1 -->
<when test="exists(element()[f:isInline(.)])">
<wraps>
<wrap pattern="Pinline" select="$text.inline" />
</wraps>
</when>
</choose>
</pattern>
<pattern match="p">
<choose>
<!-- Rule 2 -->
<when test="count(element()[f:isBlock(.)]) = 1 and empty(text()[normalize-space() != ''])">
<unwrap />
</when>
<!-- Rule 3 -->
<when test="count(li) = count(element())">
<rename pattern="Ptable"/>
</when>
</choose>
</pattern>
<pattern match="ul">
<choose>
<!-- Rule 4 -->
<when test="exists(text()[normalize-space() != '']|element()[f:isInline(.)])">
<wraps>
<wrap pattern="Pcontainer" select="$text.inline">
<wrap pattern="Pblock" />
</wrap>
</wraps>
</when>
</choose>
</pattern>
<pattern match="b">
<choose>
<!-- Rule 5 -->
<when test="exists(element()[f:isBlock(.) or f:isContainer(.)])">
<inject />
</when>
</choose>
</pattern>
</patterns>
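Abstracting away the XPath 2.0 expressions, the evaluation of a rule document such as Code 34 reduces to a match/choose/apply loop. A hedged sketch, where plain predicate functions stand in for the “match” and “test” expressions and operations are represented by strings:

```python
# Each rule pairs a match predicate with an if/else-if chain of
# (test predicate, operation) "when" clauses; within a rule, the
# first test that succeeds fires its operation, as in Code 34.
def apply_rules(elements, rules):
    fired = []
    for element in elements:
        for match, whens in rules:
            if match(element):
                for test, operation in whens:
                    if test(element):
                        fired.append((element, operation))
                        break
    return fired

rules = [
    (lambda e: e == "div", [(lambda e: True, "wrap")]),
    (lambda e: e == "p", [(lambda e: False, "unwrap"),
                          (lambda e: True, "rename")]),
]
print(apply_rules(["div", "p"], rules))
```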
Through these few rules we pattern all the examples shown in Section 3.4.4. One of the engine's goals is to preserve, somehow, the original structure of the document. To achieve this goal, the engine uses qualified elements and attributes to record removed or modified elements. Now let us consider the examples in Code 24 and Code 26, where we use a single and some multiple wrap operations. In these cases, as we can see in Code 35 and Code 36, we identify the element that applies the wrap operation with an identifier specified by the qualified “pmlp:wrap” attribute. The “pmlp:wrapped” attribute refers back to the element that applied the wrap operation.
Code 35 How the engine patterns the document in Code 24
<?xml version="1.0" encoding="UTF-8"?>
<div pmlp:wrap="div1" xmlns:pml="http://www.essepuntato.it/PML" xmlns:pmlp="http://www.essepuntato.it/PMLp">
<pml:dimensions>
<pml:structure pml:name="Pcontainer" pml:ref="//div" pml:content="." />
<pml:structure pml:name="Pblock" pml:ref="//p" pml:content="." />
</pml:dimensions>
<p pmlp:wrapped="div1">
This is a little example to understand the <i>wrap</i> operation.
</p>
</div>
Code 36 How the engine patterns the document in Code 26
<?xml version="1.0" encoding="UTF-8"?>
<ul pmlp:wrap="ul1" xmlns:pml="http://www.essepuntato.it/PML" xmlns:pmlp="http://www.essepuntato.it/PMLp">
<pml:dimensions>
<pml:structure pml:name="Ptable" pml:ref="//ul" pml:content="." />
<pml:structure pml:name="Pcontainer" pml:ref="//li" pml:content="." />
<pml:structure pml:name="Pblock" pml:ref="//p" pml:content="." />
</pml:dimensions>
<li><p>Correct nesting;</p></li>
<li pmlp:wrapped="ul1"><p pmlp:wrapped="ul1">Text without a correct nesting;</p></li>
<li><p>Another correct nesting.</p></li>
</ul>
An unwrapped element is removed and replaced by a qualified pmlp element called “old”, with two mandatory qualified attributes: “pmlp:name”, which contains the prefixed name of the old element, and “pmlp:unwrap”, which represents the identifier of the operation. We can see an example of this operation in Code 37.
Code 37 How the engine patterns the document in Code 28
<?xml version="1.0" encoding="UTF-8"?>
<pmlp:old pmlp:name="p" pmlp:unwrap="p1" xmlns:pml="http://www.essepuntato.it/PML"
xmlns:pmlp="http://www.essepuntato.it/PMLp">
<pml:dimensions>
<pml:structure pml:name="Pblock" pml:ref="//p" pml:content="." />
</pml:dimensions>
<p>
Too many paragraphs...
</p>
</pmlp:old>
The rename operation is a composition of the wrap and unwrap operations. For this reason, as we can see in Code 38, its result is a combination of the two: the old element is replaced by a “pmlp:old” element with a qualified attribute “pmlp:rename” (an identifier), while the newly inserted element (the object of the renaming) carries a qualified “pmlp:renamed” attribute referring to the old element.
Code 38 How the engine patterns the document in Code 30
<?xml version="1.0" encoding="UTF-8"?>
<pmlp:old pmlp:name="p" pmlp:rename="p1" xmlns:pml="http://www.essepuntato.it/PML"
xmlns:pmlp="http://www.essepuntato.it/PMLp">
<pml:dimensions>
<pml:structure pml:name="Ptable" pml:ref="//ul" pml:content="." />
<pml:structure pml:name="Pcontainer" pml:ref="//li" pml:content="." />
<pml:structure pml:name="Pblock" pml:ref="//p" pml:content="." />
</pml:dimensions>
<ul pmlp:renamed="p1">
<li><p>A list item;</p></li>
<li><p>Another list item;</p></li>
<li><p>But where is the list?</p></li>
</ul>
</pmlp:old>
Finally, the inject operation has a result similar to the rename. In fact, as we can see in Code 39, the element on which the operation is applied is replaced, as usual, by a “pmlp:old” element (in which the attribute “pmlp:inject” is an identifier). All the new elements created by this operation have a qualified “pmlp:injected” attribute that refers to their “creator”.
Code 39 How the engine patterns the document in Code 32
<?xml version="1.0" encoding="UTF-8"?>
<div xmlns:pml="http://www.essepuntato.it/PML" xmlns:pmlp="http://www.essepuntato.it/PMLp">
<pml:dimensions>
<pml:structure pml:name="Pinline" pml:ref="//(b|i)" pml:content="." />
<pml:structure pml:name="Pcontainer" pml:ref="//div" pml:content="." />
<pml:structure pml:name="Pblock" pml:ref="//p" pml:content="." />
</pml:dimensions>
<pmlp:old pmlp:name="b" pmlp:inject="b1">
<p>
<b pmlp:injected="b1">Text with a <i>bold</i> style.</b>
</p>
<div>
<p>
<b pmlp:injected="b1">Another text with a <i>bold</i> style.</b>
</p>
</div>
</pmlp:old>
</div>
The last issue that we present concerns the difference between syntactic and semantic patterning.
Syntactic patterning patterns a document basing its operations on the document structure only. In this case we have many possible solutions to a patterning problem.
The semantic patterning issue is a little different. In this case the patterning works well if and only if the visualization of the patterned document and that of the original document are the same. We can think of “visualization” as the way a browser displays a web document, keeping in mind that two web browsers can display a web page in different manners. For this reason it is not easy to define what “same visualization” means; this is a hard problem to solve.
The first version of our engine handles syntactic patterning (because it is based on clear rules) and, excluding unusual scenarios, semantic patterning as well. We are now working on a new version of the engine, based on milestones overlapping markup ([AMP03]), to handle as many semantic patterning scenarios as possible.
In this section we have introduced the patterning engine that we have developed in order to pattern any XML document while preserving all its pml declarations. This transformation is made using PML together with the new PMLp language, which we use to pattern the document and to remember the old unpatterned structure. As we know, we can convert a PML+PMLp document into an IML document through a simple meta-XSLT. So, through the application of elISA 2.0, the patterning engine and this meta-XSLT, we can convert any web document into an IML document. This allows us to replace the old version of elISA in ISAWiki with our new version, in order to handle XML documents segmented according to the Pentaformat model [Dii07].
3.5 So what? (take 2)
In this chapter we have illustrated our main work related to the segmentation of web documents so as to allow the extraction of data (the main technological context of this thesis). First of all we have introduced a model for a five-dimensional segmentation of any document, called Pentaformat [Dii07]. According to this model we have developed a language, the Pentaformat Markup Language (PML), that allows some declarations to be specified to segment XML documents. The use of this language is the main innovation of the new version (2.0) of elISA [DVV04]: while the old version identifies the content and some presentational elements of a web document, elISA 2.0 segments it according to the Pentaformat model using PML. In order to replace the old engine version with elISA 2.0 in the ISAWiki platform [DV04], we have developed another rule-based engine that patterns PML documents according to seven structural patterns [Gub04]. The output of this engine is a new XML document based on PML and PML patterns. PMLp is a language developed for the restructuring of XML documents according to some patterning operations. With this latter step we can obtain from a PML+PMLp document - through a meta-XSLT [Kay07] - an IML document. This kind of format is what ISAWiki uses to store its documents.
In Chapter 4 we will return to the two engines developed, elISA 2.0 and the patterning engine, in order to describe their implementation in much more detail. After that we will introduce a web application, built upon the two engines and the Java Servlet Technology [Sun06a], called elISA Server Side. It provides a user interface to segment a specified web document using elISA 2.0 and, optionally, to transform the PML document (obtained in the segmentation phase) into an IML document.
Chapter 4
Features hole and its monsters
In the last chapter, Chapter 3, we introduced all the theory and the two tools that we have developed for this work in order to achieve the goal stated in Chapter 1: to develop a rule-based mechanism to segment XML documents according to a five-dimensional model, called Pentaformat [Dii07], so as to convert them automatically into new documents using one or more of the constituents introduced by the model.
In this chapter we will deepen some concepts concerning elISA 2.0 and the patterning engine - introduced in Section 3.3 and Section 3.4 respectively - namely the specific architectures of these two engines and the specific Relax NG [Oas01] grammars developed for the documents that the engines use to perform their processes. We will illustrate these components in Section 4.1 and Section 4.2.
We achieve the goal of our thesis with an application that combines these two
engines. For this reason we have developed a web application called elISA Server Side that uses
these engines and other meta-XSLT documents [Kay07] to transform a web
document into an IML document via Pentaformat segmentation. This application is developed
in Java 6.0 [Sun06b] with Servlet technology [Sun06a] and has been tested on Tomcat [http://
tomcat.apache.org/] 6.0.16. Through elISA Server Side we can specify the URL of a web document
to be analyzed according to a chosen set of rules and thresholds in order to obtain a PML
document. After this first analysis we can choose whether to download the PML document, to display
it, or to transform it into an IML document. We will explain this process in detail in Section 4.3.
4.1 elISA engine: the infrastructure
The engine that we introduced in Section 3.3, called elISA 2.0, is a software component that
segments any web document according to the Pentaformat [Dii07] - the model that we
illustrated in Section 3.1. This model allows us to extract data by identifying the roles that the
elements of a web document may have, according to the five dimensions: content, structure,
presentation, behavior and metadata. The output of elISA 2.0 is a PML document (Section
3.2), i.e. an XML document containing some pml declarations. As we illustrated in Section 3.3,
this engine needs two fundamental files which specify, respectively, the rules for defining pml
declarations and the thresholds for selecting a particular subset of the former.
In this section we will introduce the infrastructure of elISA 2.0 in detail. In Section 4.1.1
we will describe which modules are part of it and how they work. Then, in Section 4.1.2, we
will describe the structure of the rules document and of the thresholds document in order to
understand how to define them.
4.1.1 Three steps in five phases
Picture 15 describes elISA 2.0 as an engine working through three main steps. Quite
true. But two of these three steps are split in two phases each, for a total of five distinct phases,
as we can see in Picture 16:
1. in the establish phase (step 1) we try to make the input document well-formed through
an external well-former, the HTMLCleaner [http://htmlcleaner.sourceforge.net/]
version 1.6 by Vladimir Nikic. The default well-former can be changed by modifying
a configuration file;
2. the load phase (step 1) loads and executes on the input document an unlimited
number of plugins conforming to a known Java interface. Using these plugins we
can add information to, or remove it from, the well-formed input document;
3. in the indicate phase (step 2) the engine completes the first real analysis of the
input document, producing an intermediary document written in a language called
PML Qualifier. We use this language to specify all the probabilistic pml declarations
deduced from a rule-set document;
4. the next solve phase (step 3) rewrites all the probabilistic pml declarations
of the PML Qualifier document. In this phase we sum the weights, as
illustrated in Section 3.3.2, solving all the XPath 2.0 queries of the declarations in
order to associate these declarations with the relevant elements;
5. in the last acknowledge phase (step 3) the engine chooses the probabilistic
pml declarations that will be kept in the final PML document.
During the first phase our goal is to obtain a well-formed document, whether the input is well-formed or not.
If it is, the establish phase returns the same XML document without changes. Otherwise,
the document is re-formatted to become well-formed. This operation is performed
by a specific plugin in a JAR archive [Sun03] specified in the configuration file of the engine, as
we can see in Code 40. Considering the elISA 2.0 path as context, “basepath” specifies where the
engine can find the file “name” that contains the well-former. To create this plugin,
the engine instantiates the class specified in the attribute “class”. This dynamic loading of JAR files (here and in
the next pieces of code) is performed by means of Reflection [Sun02], a package that allows
a Java application to examine or modify its runtime behaviour.
Picture 16 The five phases of elISA 2.0
Code 40 The extract of the configuration file concerning the well-former
<wellformer
basepath="wf"
name="s-wellformer.jar"
class="it.essepuntato.elisa.wellformer.HtmlCleanerWellFormer"
method="run" />
The method to invoke is specified in the attribute “method”. The goal of this method is to
return an instance of “org.w3c.dom.Document” that represents the XML document obtained
from the input (possibly not well-formed) document. In order to use this external plugin correctly, the
main class of the package must comply with the specific Java interface introduced in Code 41.
As we can see, the class that implements “IElisaWellFormer” has to implement two similar methods
that respectively take a string or the source file of the original document and return a well-formed
XML document.
Code 41 The well-former Java interface of elISA 2.0
package it.essepuntato.elisa.wellformer;
import java.io.File;
import org.w3c.dom.Document;
public interface IElisaWellFormer {
public Document run(String string);
public Document run(File file);
}
We have used an external plugin to specify the well-former because, when we want to
replace it with a new version or with another well-former, we can do so easily
by changing the configuration file only.
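This reflective loading can be sketched as follows. The sketch is illustrative: a JDK class stands in for the well-former class of Code 40, which elISA would actually reach through a class loader pointing at the configured JAR.

```java
import java.lang.reflect.Method;

// Hypothetical sketch of how a configurable component can be instantiated
// and invoked via Reflection, as elISA does with the well-former. The class
// and method names would come from the configuration file (Code 40); here a
// JDK class is used so that the example is self-contained.
public class ReflectiveLoader {

    // Instantiate "className" and invoke its no-argument "methodName",
    // returning the result. Mirrors the "class"/"method" attributes.
    public static Object load(String className, String methodName) {
        try {
            Class<?> clazz = Class.forName(className);
            Object instance = clazz.getDeclaredConstructor().newInstance();
            Method method = clazz.getMethod(methodName);
            return method.invoke(instance);
        } catch (Exception e) {
            throw new RuntimeException("Cannot load " + className, e);
        }
    }

    public static void main(String[] args) {
        // With a real plugin we would first build a URLClassLoader pointing
        // at the JAR in "basepath"; the reflective invocation is identical.
        Object result = load("java.lang.StringBuilder", "toString");
        System.out.println("Loaded and invoked: \"" + result + "\"");
    }
}
```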
The second phase of the engine process - load - works like the first phase. We can specify
some plugins by adding them to the configuration file through the tag “plugin”. All these plugins reside
in the path specified by the attribute “basepath” of the tag “loader”. All the plugins specified in
the configuration file (Code 42) are executed in ascending order. The attributes “name”, “class”
and “method” are used by the engine as in the first phase. The attribute “filespath” specifies the local
directory - relative to the “basepath” of the “loader” - in which we put the files needed by the plugin.
The elements specified by the tags “param” represent the parameters of the plugin.
Code 42 The extract of the configuration file concerning the plugin loader
<loader basepath="plugin">
<plugin
type="jar"
name="node-enumeration.jar"
class="it.essepuntato.elisa.plugin.NodeEnumeration"
method="run"
filespath="NodeEnumeration">
<param><key>xslt</key><value>NodeEnumeration.xsl</value></param>
</plugin>
</loader>
The goal of every plugin is to take an XML document, elaborate it and return a new
XML document with (possibly) some changes. In order to work correctly,
all plugins must comply with the specific Java interface that we can see in Code 43. The method
“run”, invoked by the engine, takes three parameters: the XML document to be processed, the
path in which the files for the plugin reside, and a “Map” that contains all
the parameters.
Code 43 The plugin Java interface of elISA 2.0
package it.essepuntato.elisa.plugin;
import java.io.File;
import java.util.Map;
import org.w3c.dom.Document;
public interface IElisaPlugin {
public Document run(Document dom, File filePath, Map<String,String> params);
}
Through this loading phase we can remove or add data to the input XML
document in order to improve the quality of the role identification. We can add temporary
information to some elements - to be used during the process only - through
qualified attributes: their namespace must be “http://www.essepuntato.it/PMLLoad”, and they
must use the prefix “load”. We use this namespace and prefix in order to remove
these added data from the final PML document.
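To make this concrete, here is a minimal, hypothetical plugin in the spirit of the IElisaPlugin contract (Code 43): it decorates every element with a temporary load:-qualified attribute. The numbering logic is our own illustration, not the actual NodeEnumeration plugin shipped with elISA.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// A sketch of a load-phase plugin: it takes a DOM document and returns it
// with temporary information attached through "load"-prefixed qualified
// attributes, as described above.
public class EnumerationPlugin {

    static final String LOAD_NS = "http://www.essepuntato.it/PMLLoad";

    // Attach a progressive load:index attribute to every element,
    // in document order.
    public static Document run(Document dom) {
        NodeList elements = dom.getElementsByTagName("*");
        for (int i = 0; i < elements.getLength(); i++) {
            ((Element) elements.item(i))
                .setAttributeNS(LOAD_NS, "load:index", String.valueOf(i));
        }
        return dom;
    }

    // Helper to build a DOM document from a string.
    public static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = run(parse("<body><p>Hello</p></body>"));
        Element p = (Element) doc.getElementsByTagName("p").item(0);
        // "body" is element 0, so "p" receives index 1.
        System.out.println(p.getAttributeNS(LOAD_NS, "index"));
    }
}
```

In the real engine the run method additionally receives the plugin file path and the parameter Map, as Code 43 shows.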
The next three phases - indicate, solve and acknowledge - are performed with three different
meta-XSLTs:
• in the third phase we apply a meta-XSLT to a document containing rules in order
to create a new XSLT that indicates the probable roles of the elements of the input XML
document. The output is a document written in an intermediary PML language called
PML Qualifier. Using this language we can attach probabilistic pml declarations to
all elements of the document;
• in the fourth phase we apply a meta-XSLT and then the resulting
XSLT to the input document. The first application creates an XSLT on the basis of the XPath
expressions of all probabilistic pml declarations. The second application solves
the XPath of all probabilistic pml declarations and sums the “weight” values
of identical declarations;
• in the last phase we choose which declarations to keep in the output PML document.
To do so we apply a meta-XSLT to some thresholds in order to produce
another XSLT document, used to transform the input PML Qualifier document into
a PML document.
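The weight summation at the heart of the solve phase can be sketched as follows; the string-keyed, tabular representation of a declaration is our own simplification of the PML Qualifier data, not the engine's actual structures.

```java
import java.util.HashMap;
import java.util.Map;

// A sketch of the core of the solve phase: identical probabilistic pml
// declarations (same target element, dimension and name) are merged by
// summing their weights.
public class SolvePhase {

    // Each declaration is a row: { elementId, dimension, name, weight }.
    // The returned map is keyed by "elementId|dimension|name".
    public static Map<String, Double> sumWeights(String[][] declarations) {
        Map<String, Double> summed = new HashMap<String, Double>();
        for (String[] d : declarations) {
            String key = d[0] + "|" + d[1] + "|" + d[2];
            double weight = Double.parseDouble(d[3]);
            Double old = summed.get(key);
            summed.put(key, old == null ? weight : old + weight);
        }
        return summed;
    }

    public static void main(String[] args) {
        // Two rules asserted "content/Text" for the same element:
        // their weights add up into a single declaration.
        String[][] decls = {
            {"e1", "content", "Text", "0.4"},
            {"e1", "content", "Text", "0.5"},
            {"e1", "structure", "Pblock", "0.9"}
        };
        System.out.println(sumWeights(decls).get("e1|content|Text"));
    }
}
```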
In this section we have analyzed the five phases that compose the elISA 2.0
process. All these phases are needed to transform a web document - well-formed
or not - into a PML document on the basis of some rules and thresholds. The two
documents defining these rules and thresholds are based on specific Relax NG [Oas01] +
Schematron [Jel05] grammars, which we describe in depth in Section 4.1.2.
4.1.2 Rules and thresholds
In Section 4.1.1 we introduced the five phases of the elISA 2.0 process. As we have
seen, in order to complete these phases we need two XML documents in which rules and
thresholds are specified. We have seen two examples of these documents in Section
3.3.2 (Code 17 and Code 18). In this section we analyze some aspects of their grammars in order
to understand how to specify rules and thresholds. We introduce all examples using the
Relax NG compact syntax [Oas01].
The document element of a rules document is “rules”. The structure of this element is
simple: we can define some variables using “call”; we can define some macros (related
to three specific conditional elements: “check”, “whenever” and “test”) using the content model
expressed for statements; and we have to define at least one “rule” element, as we can
see in Code 44.
Code 44 The element “rules”
rules =
element rules {
call*,
CM.statement,
rule+
}
CM.statement = (statementcheck | statementwhenever | statementtest)*
An element “rule” (Code 45) is characterized by two attributes: “context” is an XPath 2.0
expression that defines which elements the rule relates to, while “deep” is a boolean that
specifies whether the analysis must continue for all the children of the context. Besides
the optional and repeatable elements “call” and the statement groups, we can introduce some probabilistic
pml declarations using the set elements. All these elements have four attributes - “name”, “ref”,
“content” and “weight” - used to specify the declarations.
Code 45 The element “rule”
rule =
element rule {
attribute.context,
attribute.deep?,
call*,
CM.statement,
(
setContent |
setStructure |
setPresentation |
setMetadata |
setBehavior
)*,
(check | refcheck)*
}
check =
element check {
(whenever | refwhenever)+,
otherwise?
}
The last elements that we can use to structure a rule, referring to conditional expressions, are
called “check”. Inside these elements we can specify one or more if statements using “whenever”,
with conditions defined by XPath 2.0 expressions. The optional element “otherwise”
represents a sort of else for all the previous “whenever”s: if none of their conditions is satisfied
then the “otherwise” block is applied. The content models of the elements “whenever” and
“otherwise” are the same, except for the conditional parts of the former (the “test” attribute or
the “test” element). Both can contain probabilistic pml declarations - defined by “setContent”,
“setStructure” and so on - and other nested “whenever”/“otherwise” elements.
The grammar to define thresholds is simpler than the rules grammar. A thresholds document,
as we can see in Code 46, begins with an element “thresholds” without attributes. It
contains one or more elements “threshold”. Each of these elements specifies a context
through the attribute “context” - an XPath 2.0 expression - in order to define which elements it
refers to. The optional attribute “priority” defines a priority (XSLT-like [Kay07]) among all the
thresholds referring to the same element. The five interleaved elements, named after the Pentaformat
[Dii07] dimensions, allow us to specify threshold values for all probabilistic pml declarations
referring to the context.
Code 46 The elements “thresholds” and “threshold”
thresholds =
element thresholds { threshold+ }
threshold =
element threshold {
attribute.context,
attribute.priority?,
(
content? &
structure? &
presentation? &
behavior? &
metadata?
)
}
As we can see in Code 47, the content models of these five elements are similar. Each of
them may contain a best-weight threshold or one or more conditional selectors. The first type
of threshold specifies that the declaration with the best weight wins. In case of a tie, we
select the declaration according to the name priority specified in the attribute “priority” of
“bestWeight”.
Code 47 The dimensional elements
content =
element content {
bestWeight.content | select.content+
}
structure =
element structure {
bestWeight.structure | select.structure+
}
presentation =
element presentation {
bestWeight.presentation | select.presentation+
}
behavior =
element behaviour {
bestWeight.behavior | select.behavior+
}
metadata =
element metadata {
bestWeight.metadata | select.metadata+
}
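The bestWeight selection can be sketched as follows; the tabular representation of declarations and the priority list are our own simplification of the engine's data.

```java
import java.util.Arrays;
import java.util.List;

// A sketch of the "bestWeight" threshold: the declaration with the highest
// weight wins, and ties are broken by the position of the declaration name
// in the priority list (the "priority" attribute of "bestWeight").
public class BestWeight {

    // Each declaration is a pair { name, weight }; the first name in
    // namePriority wins a tie.
    public static String select(String[][] declarations, List<String> namePriority) {
        String bestName = null;
        double bestWeight = Double.NEGATIVE_INFINITY;
        for (String[] d : declarations) {
            String name = d[0];
            double weight = Double.parseDouble(d[1]);
            if (weight > bestWeight
                || (weight == bestWeight
                    && namePriority.indexOf(name) < namePriority.indexOf(bestName))) {
                bestName = name;
                bestWeight = weight;
            }
        }
        return bestName;
    }

    public static void main(String[] args) {
        String[][] decls = { {"Text", "0.8"}, {"Title", "0.8"}, {"Caption", "0.5"} };
        // "Title" and "Text" tie on weight; the priority list decides.
        System.out.println(select(decls, Arrays.asList("Title", "Text")));
    }
}
```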
The content model of any element “select” can contain an element “weight” (that specifies
a number and a comparison operator) and an optional element called “name” with a value related
to a dimension. Every element “select” of a threshold refers to a dimension and represents
a conditional expression; all these expressions are connected with the operator “or” in order
to form a conditional XPath 2.0 expression composed of all the sibling elements “select”. To
understand this, we introduce a little example in Code 48. With the threshold referring to all
elements “p” we keep all the probabilistic pml declarations that refer to “p” and that concern the
dimension content. In addition, the conditional expression (weight >= 0.7 and
name = 'Text') or (weight > 0.9) must be true.
Code 48 An example of multiple select for the element “content”
<?xml version="1.0" encoding="UTF-8"?>
<thresholds xmlns="http://www.essepuntato.it/Thresholds">
<threshold context="p">
<content>
<select>
<weight ge="0.7"/>
<name value="Text"/>
</select>
<select>
<weight gt="0.9"/>
</select>
</content>
</threshold>
</thresholds>
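The conditional expression derived from Code 48 can be transcribed directly:

```java
// A direct transcription of the condition generated from the thresholds in
// Code 48: a probabilistic declaration on the dimension content survives if
// (weight >= 0.7 and name = 'Text') or (weight > 0.9).
public class ContentThreshold {

    public static boolean keep(double weight, String name) {
        return (weight >= 0.7 && "Text".equals(name)) || weight > 0.9;
    }

    public static void main(String[] args) {
        System.out.println(keep(0.75, "Text"));  // first select matches
        System.out.println(keep(0.95, "Image")); // second select matches
        System.out.println(keep(0.8, "Image"));  // neither select matches
    }
}
```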
Obviously, all the probabilistic pml declarations that are not associated with any threshold are
not kept in the final PML document.
In this section we have analyzed the two grammars related to rules and thresholds in order
to understand how to define them. The matters explained in Section 3.3, in Section 4.1.1
and in this one conclude the discussion of elISA 2.0. In Section 4.2 we will re-introduce in depth
all the matters regarding the patterning process.
4.2 Pattern engine: the infrastructure
The engine that we introduced in Section 3.4 patterns PML documents
(presented in Section 3.2) according to some patterning rules specified in an XML document.
Patterning PML documents is fundamental if we want to convert them into IML documents
[San06] in order to use elISA 2.0 in the ISAWiki platform (Section 3.3.1). In fact, as we
illustrated in Section 3.4.3, the main difference between PML and IML is the patterned structure
of the latter.
In this section we re-discuss the patterning engine introduced in Section 3.4, deepen
some issues concerning the grammar to define patterning rules (Section 4.2.1) and present
an important configuration file (Section 4.2.2).
4.2.1 How to define a patterning rule-set
The content model of the document element “patterns” has a structure
similar to the grammar that defines rules for elISA 2.0 (Section 4.1.2). As we can see in Code
49, it is formed by a sequence of global variables (XSLT-like), by a sequence of conditional
macros (elements “statement”) and by one or more patterning rules.
Code 49 The element “patterns”
patterns =
element patterns {
cm.patterns
}
cm.patterns =
variable*,
statement*,
pattern+
These rules are defined by the element “pattern”. As we can see in Code 50, a rule is
characterized by the attribute “match” (defining which elements this rule refers to) and by an
optional attribute “priority” that specifies a priority value for it. Besides the two sequences
of variables and conditional macros, the element “pattern” must specify an element “choose”
in order to apply some patterning operations.
Code 50 The element “pattern”
pattern =
element pattern {
cm.pattern
}
cm.pattern =
attribute.match,
attribute.priority?,
variable*,
statement*,
choose
The content model of “choose” is simple: it contains a sequence of if/else-if blocks defined
by some “when” elements or by references to conditional macros. The condition of the elements
“when” is expressed by the attribute “test” as an XPath 2.0 expression [BBC07a]. Besides
the variables, within a “when” element we can either nest another conditional block or
apply some patterning operations.
Code 51 The elements “choose” and “when”
choose =
element choose {
cm.choose
}
cm.choose =
(when | ref)+
when =
element when {
cm.when
}
cm.when =
attribute.test,
variable?,
(
(unwrap | inject | wrap | rename) |
choose
)
As we have seen in Section 3.4.4, we can use one of the four patterning operations: wrap,
unwrap, inject and rename. As we can see in Code 52, all these operations, except rename,
support the use of multiple wrap declarations. Each wrap is based on three attributes, two
of which are optional. The attribute “pattern” - as in rename - specifies which element we use to wrap
the elements selected by the optional attribute “select”. If we do not specify a selection
then all elements are considered as the subject of the operation. The optional attribute “force” is useful if
we want to specify an order between this new element and the inherited elements to be inserted,
specified by previous inject operations.
Code 52 The elements specifying the patterning operations
unwrap =
element unwrap {
cm.operation
}
inject =
element inject {
cm.operation
}
cm.operation =
subwrap*
rename =
element rename {
cm.rename
}
cm.rename =
attribute.pattern
wrap =
element wraps {
cm.wrap
}
cm.wrap =
subwrap+
subwrap =
element wrap {
cm.subwrap
}
cm.subwrap =
attribute.pattern,
attribute.select?,
attribute.force.wrap?,
subwrap*
To understand the semantics of a multiple wrap we present the example in Code
53. In this rule we use three different wraps. Through wrap 1 we wrap all
the sequences of inline and text nodes with a container. All the elements of these sequences are then wrapped
by a block through the wrap 2 operation. Wrap 3 is applied to all elements except those
selected by the previous “wrap” siblings: in this case (element()|text()) except
(element()[f:isInline(.)]|text()).
Code 53 An example of multiple wrap
<?xml version="1.0" encoding="UTF-8"?>
<patterns xmlns="http://www.essepuntato.it/Patterns">
<pattern match="body">
<choose>
<when test="exists(element()|text())">
<wraps>
<!-- wrap 1 -->
<wrap pattern="Pcontainer" select="element()[f:isInline(.)]|text()">
<!-- wrap 2 -->
<wrap pattern="Pblock" />
</wrap>
<!-- wrap 3 -->
<wrap pattern="Pcontainer" />
</wraps>
</when>
</choose>
</pattern>
</patterns>
To understand how this patterning rule works we introduce an example. As we can see in
Code 54, we have a non-patterned document composed of some nodes (text, inline and block).
The goal is to pattern the document so that the “body” document element is patterned by
some containers, such as “div”, each of them patterned by a block, such as “p”.
Code 54 A non-patterned document
<?xml version="1.0" encoding="UTF-8"?>
<body xmlns:pml="http://www.essepuntato.it/PML">
<pml:dimensions>
<pml:structure pml:name="Pcontainer" pml:ref="//body" pml:content="." />
<pml:structure pml:name="Pblock" pml:ref="//p" pml:content="." />
<pml:structure pml:name="Pinline" pml:ref="//(em|q|b)" pml:content="." />
</pml:dimensions>
This is a little <em>example</em> to understand how we can
use a multi wrap.
<p>
We want to obtain the element <q>body</q> as a sequence
of container such as <q>div</q>.
</p>
With one simple <b>patterning rules</b> we can obtain a patterned
new document.
</body>
Applying the rule introduced in Code 53 to the document in Code 54 produces the
patterned document in Code 55. In this new document all the text and inline nodes are patterned
through the application of wrap 1 and wrap 2. The only block element in the old document is
patterned using the wrap 3 operation instead.
Code 55 A patterned version of Code 54 using the patterning rules specified in Code 53
<?xml version="1.0" encoding="UTF-8"?>
<body pmlp:wrap="body1" xmlns:pml="http://www.essepuntato.it/PML" xmlns:pmlp="http://www.essepuntato.it/
PMLp">
<pml:dimensions>
<pml:structure pml:name="Pcontainer" pml:ref="//body|//div" pml:content="." />
<pml:structure pml:name="Pblock" pml:ref="//p" pml:content="." />
<pml:structure pml:name="Pinline" pml:ref="//(em|q|b)" pml:content="." />
</pml:dimensions>
<!-- wrap 1 -->
<div pmlp:wrapped="body1">
<!-- wrap 2 -->
<p pmlp:wrapped="body1">
This is a little <em>example</em> to understand how we can
use a multi wrap.
</p>
</div>
<!-- wrap 3 -->
<div pmlp:wrapped="body1">
<p>
We want to obtain the element <q>body</q> as a sequence
of container such as <q>div</q>.
</p>
</div>
<!-- wrap 1 -->
<div pmlp:wrapped="body1">
<!-- wrap 2 -->
<p pmlp:wrapped="body1">
With one simple <b>patterning rules</b> we can obtain a patterned
new document.
</p>
</div>
</body>
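The leftover selection computed for wrap 3 above can be sketched as follows, with DOM nodes simplified to plain labels for the sake of illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// A sketch of the selection semantics of sibling wraps: a wrap without a
// "select" attribute receives the nodes left over by its previous siblings,
// i.e. (element()|text()) except what wrap 1 already selected.
public class SiblingWrapSelection {

    // Returns allNodes minus the nodes already claimed by previous wraps.
    public static List<String> remainderFor(List<String> allNodes,
                                            List<String> selectedByPrevious) {
        List<String> remainder = new ArrayList<String>(allNodes);
        remainder.removeAll(selectedByPrevious);
        return remainder;
    }

    public static void main(String[] args) {
        // The children of "body" in Code 54: two text runs, two inline
        // elements and one block element.
        List<String> children = Arrays.asList("text-1", "em", "p", "text-2", "b");
        // Wrap 1 selects element()[f:isInline(.)]|text().
        List<String> wrap1 = Arrays.asList("text-1", "em", "text-2", "b");
        // Wrap 3 therefore receives only the block element "p".
        System.out.println(remainderFor(children, wrap1));
    }
}
```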
As we have seen in the examples of documents with patterning rules (Code 34 and
Code 53), we can use some functions, such as f:isInline or f:isBlock, that allow us to
interact with all the pml declarations of the input document. These functions are split in two
categories, respectively concerning patterns and Pentaformat dimensions:
• f:isMarker, f:isAtom, f:isInline, f:isBlock, f:isContainer, f:isTable, f:isRecord. These
functions take an element as input and return a boolean value that indicates whether
the input belongs to the specified pattern or not;
• f:isContent, f:isStructure, f:isPresentation, f:isBehaviour, f:isMetadata. These
functions take an element as input and return a boolean value that indicates
whether the input belongs to the specified dimension or not. We can also use
five further functions - f:hasContentName, f:hasStructureName, f:hasPresentationName,
f:hasBehaviourName, f:hasMetadataName - that take as input an element and a
sequence of names and return a boolean value that indicates whether the input element belongs
to the specified dimension with at least one name of the input sequence. Each
name of the input sequence is matched against the respective values of the attribute
“name” that a pml declaration may have.
In this section we have analyzed the grammar to define patterning rules, introducing some
examples. In Section 4.2.2 we will illustrate the architecture of the patterning engine in order
to introduce some aspects related to its configuration file.
4.2.2 The configuration file
Having understood (Section 4.2.1) how we can specify the patterning rules in order to
pattern a PML document and how all the patterning operations work, in this section we illustrate
the architecture of the patterning engine, analyzing its main features.
The patterning engine that we have developed is a Java application based on a meta-XSLT
document [Kay07]. As we can see in Picture 17, its goal is to take an input PML document and
return a patterned PML+PMLp document. As we introduced in Section 3.4, the
PML+PMLp document returned by the engine is an XML document in which the old non-patterned
structure is expressed by PMLp elements and attributes. Besides the input PML document,
there are two other documents that the engine uses for the transformation. The former is the
document with the patterning rules, written in compliance with the grammar introduced in Section 4.2.1.
Picture 17 The patterning process
The latter, called “definitions.xml”, is used for three reasons:
• to define a translation of all the values of the attribute “name” of a pml declaration
(concerning the structure) into a specific element;
• to specify a sort of ontology in order to understand which pattern any structure “name”
value refers to, as we illustrated in Section 3.4.5;
• to describe the content model of all structural patterns.
As we can see in Code 56, an element is associated with every name value of the structure
dimension. Every element has an attribute “name” that specifies how the structure must be
translated. All elements are arranged in a sort of ontology in which the main classes are
represented by the structure “name” values of the seven patterns. All other structure elements are
children of exactly one of them. We use this part of “definitions.xml”, on the one hand, to specify how
to translate an element referring to a specific name of the Pentaformat structure; on the other
hand, to understand which pattern is associated with a specific element.
Code 56 Definitions for the “name” attribute of the dimension structure
<definitions>
<Pblock name="p">
<Paragraph name="p" />
<Heading name="h1" />
</Pblock>
<Pinline name="span">
<Generic name="span" />
<Link name="a" />
<Strong name="b" />
<Citation name="q" />
<Subscript name="sub" />
<Superscript name="sup" />
<Emphasis name="i" />
</Pinline>
<Patom name="span" />
<Pmarker name="span">
<Image name="img" />
<Meta name="meta" />
</Pmarker>
<Pcontainer name="div">
<Head name="head" />
<Body name="body" />
<Divider name="div" />
<Object name="object" />
<ListItem name="li" />
<TableRow name="tr" />
<TableHeader name="th" />
<TableCell name="td" />
</Pcontainer>
<Ptable name="table">
<Table name="table" />
<List name="ul" />
</Ptable>
<Precord name="div">
<Root name="html" />
</Precord>
<GeneralStructure name="div" />
</definitions>
The second part of “definitions.xml” concerns the content model of every pattern. It is used
to decide, during the patterning process, whether an operation may be applied to
an element or not. For example, let us suppose we apply a rename to a “Divider” to transform it into
a “Paragraph”, i.e. we want to transform a container into a block. This operation is allowed if
and only if the content model of the parent of the “Divider” accepts block elements as children.
The engine handles this issue according to the content models specified in “definitions.xml”
(Code 57). So, if the parent of the “Divider” does not allow a block in its content model, the engine
does not apply the rename.
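This check can be sketched as follows; the hard-coded map is a fragment of the content models of Code 57, and the method name is our own illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// A sketch of the content-model check behind a rename: the operation is
// allowed only if the parent's content model (a fragment of the
// "definitions.xml" content models, hard-coded here) accepts the target
// pattern as a child.
public class RenameCheck {

    static final Map<String, Set<String>> CONTENT_MODELS =
        new HashMap<String, Set<String>>();
    static {
        CONTENT_MODELS.put("Pcontainer", new HashSet<String>(Arrays.asList(
            "Ptable", "Precord", "Pmarker", "Pblock", "Pcontainer", "Patom")));
        CONTENT_MODELS.put("Pblock", new HashSet<String>(Arrays.asList(
            "Pinline", "Patom", "Pmarker")));
    }

    public static boolean canRename(String parentPattern, String targetPattern) {
        Set<String> allowed = CONTENT_MODELS.get(parentPattern);
        return allowed != null && allowed.contains(targetPattern);
    }

    public static void main(String[] args) {
        // Renaming a Divider (container) into a Paragraph (block) is legal
        // only when its parent accepts blocks: a container does, a block
        // does not.
        System.out.println(canRename("Pcontainer", "Pblock"));
        System.out.println(canRename("Pblock", "Pblock"));
    }
}
```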
Code 57 Content model for all the patterns
<contentModels>
<Pblock>
<text />
<comment />
<processing-instruction />
<Pinline />
<Patom />
<Pmarker />
</Pblock>
<Pinline>
<text />
<comment />
<processing-instruction />
<Pinline />
<Patom />
<Pmarker type="milestone" />
</Pinline>
<Patom>
<comment />
<processing-instruction />
<text />
</Patom>
<Pmarker />
<Pcontainer>
<comment />
<processing-instruction />
<Ptable />
<Precord />
<Pmarker type="meta" />
<Pblock />
<Pcontainer />
<Patom />
</Pcontainer>
<Ptable>
<comment />
<processing-instruction />
<Ptable />
<Precord />
<Pmarker type="meta" />
<Pblock />
<Pcontainer />
<Patom />
</Ptable>
<Precord>
<comment />
<processing-instruction />
<Ptable />
<Precord />
<Pmarker type="meta" />
<Pblock />
<Pcontainer />
<Patom />
</Precord>
</contentModels>
In this section we have illustrated how the patterning process works and which documents
are important for the engine process. In Section 4.3 we will explain how the two engines - elISA
2.0 (Section 3.3 and Section 4.1) and the patterning engine (Section 3.4 and Section 4.2) - are
included in a web application, called elISA Server Side, that we use to segment (and optionally
to transform) web documents.
4.3 elISA Server Side
In this section we introduce a web application developed using Java 6.0 [Sun06b] with
Servlet technology [Sun06a]. This application, called elISA Server Side, includes all the
technologies developed for this work and completes the goal introduced in Chapter 1: to provide a
rule-based mechanism that segments XML documents according to a five-dimensional model
called Pentaformat [Dii07], in order to automatically convert them into new documents using one
or more of the constituents introduced by the model.
The goal of this application is to analyze a web document, specified by a URL [CCD01], in
order to return a PML document or another type of document derived from the output of elISA
2.0. As we can see in Picture 18, the first step of this web application is to apply elISA 2.0 to
the URL-specified web document, as we illustrated in Section 3.3 and Section 4.1.
Picture 18 elISA Server Side architecture
To use elISA 2.0 for this analysis we choose the rules and thresholds we want. As we
can see in the first screenshot of the web application in Picture 19, we can choose them among
some XML files provided by elISA Server Side.
After this analysis we can decide what to do with the PML document returned by elISA
2.0. Currently, as we can see in the second screenshot of the application in Picture 20, we can
choose among four different options:
• to get the PML document as is;
• to show the PML document locally in the browser, using an XSLT transformation that
fixes some visualization problems, such as the base path for relative URIs and so on;
• to color some pml declarations in the PML document, showing the result in the
browser through a meta-XSLT transformation that colors all nodes belonging to some
Pentaformat dimensions;
• to transform the PML document into an IML document, using the patterning engine
(Section 3.4 and Section 4.2) to pattern it and a meta-XSLT to transform the PML
+PMLp document into an IML document [San06].
Picture 19 elISA 2.0 engine in elISA Server Side
Picture 20 The four options for returning the PML document
The transformation of the PML document into an IML document is performed in two steps.
First, we pattern the PML document using the patterning engine. Second, using as input the
PML+PMLp document returned by the previous step, we transform it into an IML document
through a simple meta-XSLT document. This stylesheet considers only the elements that contain
some nodes or that are content nodes, in order to translate the elements according to IML. In
Code 58 we can see an extract of the IML document returned by a complete elISA Server
Side process applied to an article [http://en.wikipedia.org/wiki/Web_3.0] of Wikipedia [http://
en.wikipedia.org].
Code 58 An extract from the transformation of the Wikipedia article “Web 3.0” into an IML
document
<iml xmlns="http://www.cs.unibo.it/2006/iml" xml:lang="en">
<body class="mediawiki ns-0 ltr page-Web_3_0">
<div id="globalWrapper">
<div id="column-content">
<div id="content">
<h1 class="firstHeading">Web 3.0</h1>
<div id="bodyContent">
<h3 id="siteSub">From Wikipedia, the free encyclopedia</h3>
<p><b>Web 3.0</b> is a term used to describe the future of the <a
href="/wiki/World_Wide_Web">World Wide Web</a>. Following the
introduction of the phrase "<a href="/wiki/Web_2.0">Web 2.0</a>" as a
description of the recent evolution of the Web, many technologists,
journalists, and industry leaders have used the term "Web 3.0" to
hypothesize about a future wave of Internet innovation.</p>
<p>Views on the next stage of the World Wide Web's evolution vary greatly.
Some believe that emerging technologies such as the <a
href="/wiki/Semantic_Web">Semantic Web</a> will transform the way
the Web is used, and lead to new possibilities in <a
href="/wiki/Artificial_intelligence">artificial intelligence</a>.
Other visionaries suggest that increases in Internet connection speeds,
modular <a class="mw-redirect" href="/wiki/Web_applications">web
applications</a>, or advances in <a href="/wiki/Computer_graphics"
>computer graphics</a> will play the key role in the evolution of
the World Wide Web.</p>
[...]
</div>
</div>
</div>
</div>
</body>
</iml>
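The filtering rule applied by that stylesheet can be sketched in a few lines of Python. This is an illustrative sketch only, with a hypothetical helper `filter_content` and a made-up input snippet, not the actual meta-XSLT used by elISA Server Side: it keeps only elements that contain child elements or non-whitespace text, dropping empty, content-free ones.

```python
# A minimal sketch, NOT the actual meta-XSLT of elISA Server Side: it mimics
# the described rule by keeping only elements that contain child elements or
# non-whitespace text, dropping empty, content-free ones (e.g. spacer markup).
import xml.etree.ElementTree as ET

def filter_content(elem):
    """Return a filtered copy of `elem`, discarding content-free children."""
    copy = ET.Element(elem.tag, elem.attrib)
    copy.text = elem.text
    for child in elem:
        # keep a child only if it has child elements or real textual content
        if len(child) or (child.text and child.text.strip()):
            copy.append(filter_content(child))
    return copy

src = ET.fromstring(
    '<div id="content"><p>Web 3.0 is a term</p><img src="logo.png"/><span>  </span></div>')
out = filter_content(src)
print(ET.tostring(out, encoding="unicode"))
```

Running the sketch on the hypothetical fragment above keeps the `p` paragraph but drops the empty `img` and the whitespace-only `span`, analogous to how the real stylesheet discards presentation-only markup before producing IML.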
In this section we have introduced the web application called elISA Server Side. As we
have seen, this application performs a complete transformation from a web document to an IML
document. It first segments the input document according to the Pentaformat model [Dii07]
and then patterns the PML document returned by the first step according to the seven structural
patterns [DDD07]. These transformations - from web documents to PML documents and from
PML documents to IML documents - represent the main goal of our work introduced in
Chapter 1.
4.4 Summarizing all the infrastructures
In this chapter we have examined in depth the two engines introduced in Chapter 3:
elISA 2.0 and the patterning engine. We have introduced their architectures and we have
illustrated all the Relax NG [Oas01] grammars that we use to write XML documents with
rules and thresholds (for elISA 2.0) and patterning rules (for the patterning engine). We have
introduced some examples in order to understand how these documents can be written, and we
have illustrated some aspects of an important configuration file related to the patterning
engine.
The new tool introduced in this chapter is built on top of our two engines. It is a web
application called elISA Server Side that allows us to segment web documents in order to return
PML documents or other documents in different formats. Through this web application we can
achieve the goal of this thesis: to develop a rule-based mechanism to segment XML documents
according to a five-dimensional model called Pentaformat [Dii07] in order to automatically
convert them into new documents exploiting one or more of the constituents introduced by
the model. In particular, using the patterning engine, we can convert web documents into IML
documents.
In the conclusions of this dissertation (Chapter 5) we will re-discuss all the issues,
theories and tools introduced in this chapter and in all the preceding chapters (Chapter 1,
Chapter 2, Chapter 3) in order to suggest some future developments for these topics.
Chapter 5
Happily ever after (or Conclusions)
As we have seen in Chapter 1, the goal of this thesis is to develop a rule-based mechanism
to segment XML documents according to a five-dimensional model called Pentaformat [Dii07]
in order to convert them automatically into new documents using one or more of the constituents
introduced by the model: content, structure, presentation, behaviour and metadata.
We have gone through several phases to complete this conversion. First of all, in Chapter 2
we have discussed the extraction of data. This is the main context in which we have worked
to develop our tools. In particular we have introduced the concept of content extraction,
clarifying what we intuitively mean by the word “content” in the context of web pages (what
the authors have written, or what users search for when googling). After this first definition
we have explained that not all the elements of a document, such as a web document, belong to
the content. For example, in a common web page we can find layout tables, logos and banners
that we can consider presentational items rather than content.
Besides the separation between content and presentation, we have introduced another
role for the elements of a web document: metadata. Broadly speaking, we define
metadata as assertions about the document. We can specify metadata for a web page using
several techniques, from the (X)HTML tag “meta” to semantic assertions expressed
through Semantic Web [BHL01] technologies such as OWL [BDH04], RDF [BM04],
RDFa [AB07] and microformats [All07].
After this brief introduction about the roles that the elements of a web document can have,
we have illustrated some tools and theories related to content extraction, showing how they
work. The main goal of these tools is to identify whether the elements of a web page are or are
not content, leaving out the recognition of roles for all non-content elements. In our opinion,
this is the main shortcoming of those works. We think the extraction of content is just as
important as the identification of the roles of the remaining non-content elements.
To address this shortcoming, we have developed a rule-based engine, called elISA 2.0
(Extraction of Layout Information via Structural Analysis 2.0), to segment XML documents
according to the Pentaformat model. As we have illustrated in Chapter 3, this engine is based on
a language, called PML (Pentaformat Markup Language), that makes it possible to identify the
roles of web page elements through simple declarations. The specific context in which we want
to use this engine is the ISAWiki platform [DV04]. This is a client/server application that lets
signed-in users edit any web page and store it on an appropriate server. In order to identify which
parts of a web document users can modify, this platform currently uses an old version of elISA
[DVV04] for the segmentation of all web documents according to two main dimensions: content
and (a small set of) presentation. Our goal is to replace this old version of the engine with the
new elISA 2.0 in order to segment XML documents according to the Pentaformat model.
Unfortunately PML - the output of elISA 2.0 - is not the format in which ISAWiki
stores documents. All documents in this platform are stored in an intermediary language called
ISAWiki Markup Language (IML) [San06]. IML is a language based on a specific structural
pattern model [DDD07]. It structures the content of a document according to seven structural
patterns: marker, atom, inline, block, container, table and record. The problem is that PML does
not comply with the structural pattern model used in IML. To allow the transformation of a PML
document into an IML document we needed to develop another rule-based engine. This
tool, called the patterning engine, patterns XML documents according to the specified set of
patterning rules. The output of this engine is a patterned XML document that can easily be
transformed, thanks to a meta-XSLT document [Kay07], into an IML document.
We have analyzed this conversion process in depth in Chapter 4. In that chapter we have
re-discussed the issues concerning the segmentation and the patterning of XML documents in
order to explain the infrastructure of the two engines that we have developed. After this
explanation, we have introduced a web application that performs this conversion: elISA Server
Side. This is a web application - developed in Java 6.0 [Sun06b] using the Servlet technology
[Sun06a] - that includes our engines to transform web documents into IML documents. This
application fulfils the goal proposed in Chapter 1.
A possible future work related to our thesis is to extend the patterning engine in order to
allow semantic patterning. In our context, this type of patterning must:
•
pattern the XML document obtaining a new XML document (syntax constraint);
•
specify in the patterned document the non-patterned structure of the old document
(structure constraint);
•
allow the same visualization for both the patterned and the non-patterned documents
(semantic constraint).
While the first two points are enough to allow syntactic patterning, the last one is fundamental
to guarantee semantic patterning. In most scenarios the syntactic patterning of a document
is enough to obtain a semantic patterning too. However, there are some (very unusual) scenarios,
for example the document in Picture 21, in which the syntactic patterning performed by our
patterning engine changes the visualization between the input and the output document.
Picture 21 A non-patterned XML document
As we have illustrated, a constraint for the PML+PMLp document is to specify the
non-patterned structure of the old document. In order to return a patterned XML document that
complies with this constraint, the patterning engine adds some elements to the original
document, as we can see in Picture 22. Unfortunately the patterning specified in this document
is not a semantic patterning, because the visualization of the old document (Picture 21) and the
new one (Picture 22) is not the same.
One way to specify the old non-patterned structure while patterning the document semantically
is to use an overlapping markup technique called milestones [AMP03]. As we can see in
Picture 23, by encoding the element <pmlp:old name="b"> through this technique we
can obtain a correct semantic patterning for the document in Picture 21, while still storing the
old non-patterned structure.
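To make the milestone idea concrete, the following sketch replaces a wrapper element with a pair of empty marker elements, so the old structure is recorded without imposing a hierarchy that could alter the visualization. The element and attribute names here (a plain old element with a type attribute taking the values start/end) are hypothetical, chosen only for illustration; the actual PMLp vocabulary may differ.

```python
# Illustrative sketch of the milestone technique: a wrapper element is replaced
# by empty start/end markers, and its former content is hoisted between them.
# Element/attribute names ("old", type="start"/"end") are hypothetical.
import xml.etree.ElementTree as ET

def to_milestones(parent, tag):
    """Replace each <tag> wrapper among parent's children with an empty
    start/end milestone pair, hoisting the wrapper's content in between."""
    idx = 0
    while idx < len(parent):
        child = parent[idx]
        if child.tag == tag:
            start = ET.Element(tag, dict(child.attrib, type="start"))
            start.tail = child.text        # text formerly inside the wrapper
            end = ET.Element(tag, type="end")
            end.tail = child.tail          # text formerly after the wrapper
            inner = list(child)            # elements formerly inside it
            parent.remove(child)
            parent.insert(idx, start)
            for pos, g in enumerate(inner, start=idx + 1):
                parent.insert(pos, g)
            parent.insert(idx + 1 + len(inner), end)
            idx += len(inner) + 2
        else:
            idx += 1

doc = ET.fromstring('<p>plain <old name="b">bold <i>both</i></old> rest</p>')
to_milestones(doc, 'old')
print(ET.tostring(doc, encoding='unicode'))
```

After the call, the wrapper no longer nests its content: the paragraph contains an empty start marker, the former content, and an empty end marker, which is exactly what lets overlapping structures coexist with a pattern-conforming hierarchy.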
Picture 22 A syntactic patterning for the document in Picture 21
Picture 23 A semantic patterning for the document in Picture 21
Our goal is to extend the current version of the patterning engine in order to make
semantic patterning possible for any XML document. This goal represents the main future work
for this thesis.
Bibliography
[AB07] B. Adida, M. Birbeck - RDFa Primer: Embedding Structured Data in Web Pages - W3C Working Draft [http://www.w3.org/TR/xhtml-rdfa-primer/] - 26 October 2007
[AHR01] H. Alam, R. Hartono, A. F. R. Rahman - Content Extraction from HTML Documents - International workshop on the Web Document Analysis (WDA 2001) [http://www.csc.liv.ac.uk/
~wda2001/Papers/11_rahman_wda2001.pdf], Seattle, WA, USA - 8 September 2001
[AKK06] W. Abramowicz, T. Kaczmarek, M. Kowalkiewicz, M. E. Orlowska - Robust Web
Content Extraction - 15th International World Wide Web Conference (WWW 2006) [http://
www2006.org/programme/files/pdf/p91.pdf], Edinburgh, Scotland, UK - 23-26 May 2006
[AMP03] L. Arévalo, J. C. Manzano, A. Polo, M. Salas - Multiple Markups in XML Documents
- Springer Press, Proceedings of the International Conference on Web Engineering (ICWE '03),
pp. 222-225, Oviedo, Spain - 14-18 July 2003
[Ale79] C. Alexander - The Timeless Way of Building - Oxford University Press - 1979
[All07] J. Allsopp - Microformats: Empowering Your Markup for Web 2.0 - Friends of ED Press
- March 26 2007
[BBC07a] A. Berglund, S. Boag, D. Chamberlin, M. F. Fernández, M. Kay, J. Robie, J. Siméon
- XML Path Language (XPath) 2.0 - W3C Recommendation [http://www.w3.org/TR/xpath20/]
- 23 January 2007
[BBC07b] M. Banko, M. Broadhead, M. J. Cafarella, O. Etzioni, S. Soderland - Open
Information Extraction from the Web - Proceedings of the 20th International Joint Conference
on Artificial Intelligence (IJCAI) [http://www.ijcai.org/papers07/Papers/IJCAI07-429.pdf], pp.
2670-2676, Hyderabad, India - 6-12 January 2007
[BCC02] M. Branstein, R. Coleman, W. B. Croft, M. King, W. Li, D. Pinto, X. Wei - Quasm:
a system for question answering using semi structured data - ACM, Proceedings of the 2nd
ACM/IEEE-CS joint conference on Digital libraries (JCDL '02), pp. 46-55, Portland, OR, USA
- 14-18 July 2002
[BCH07] B. Bos, T. Çelik, I. Hickson, H. Wium Lie - Cascading Style Sheets Level 2 Revision 1
(CSS 2.1) - W3C Candidate Recommendation [http://www.w3.org/TR/CSS21/] - 19 July 2007
[BCL04] S. Byrne, M. Champion, P. Le Hégaret, A. Le Hors, G. Nicol, J. Robie,
L. Wood - Document Object Model (DOM) Level 3 Core Specification Version 1.0 - W3C Recommendation [http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/] - 07 April 2004
[BDH04] S. Bechhofer, M. Dean, F. van Harmelen, J. Hendler, I. Horrocks, D. L. McGuinness,
P. F. Patel-Schneider, G. Schreiber, L. A. Stein - OWL Web Ontology Language Reference - W3C Recommendation [http://www.w3.org/TR/owl-ref/] - 10 February 2004
[BHL01] T. Berners-Lee, J. Hendler, O. Lassila - The Semantic Web - Scientific American Magazine [http://www.sciam.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21] - 2001
[BHL06] T. Bray, D. Hollander, A. Layman, R. Tobin - Namespaces in XML 1.0 (Second
Edition) - W3C Recommendation [http://www.w3.org/TR/REC-xml-names/] - 16 August 2006
[BM04] D. Beckett, B. McBride - RDF/XML Syntax Specification (Revised) - W3C
Recommendation [http://www.w3.org/TR/rdf-syntax-grammar/] - 10 February 2004
[BP00] S. Brin, L. Page - The Anatomy of a Large-Scale Hypertextual Web Search
Engine - Ph.D. thesis [http://infolab.stanford.edu/~backrub/google.html], Computer Science
Department, Stanford University, Stanford, CA, USA - 2000
[Bag04] M. Bagnasco - Progettazione e implementazione di funzionalità di analisi di strutture
di una pagina all'interno di un editor HTML client-side - Undergraduate thesis [http://
tesi.fabio.web.cs.unibo.it/twiki/pub/Main/ElISA/TesiMatteoBagnasco.pdf], Computer Science
Department, University of Bologna, Bologna, Italy - 2004
[CCD01] T. Coates, D. Connolly, D. Dack, L. Daigle, R. Denenberg, M. Dürst, P. Grosso, S.
Hawke, R. Iannella, G. Klyne, L. Masinter, M. Mealling, M. Needleman, N. Walsh - URIs,
URLs, and URNs: Clarifications and Recommendations 1.0 - Report from the joint W3C/IETF URI Planning Interest Group, W3C Note [http://www.w3.org/TR/uri-clarification/] - 21 September 2001
[CD99] J. Clark, S. DeRose - XML Path Language (XPath) 1.0 - W3C Recommendation [http://
www.w3.org/TR/xpath] - 16 November 1999
[CGG04] M. F. Chiang, P. Grimm, S. Gupta, G. E. Kaiser, J. Starren - Automating Content
Extraction of HTML Documents - Fourth International Conference on Language Resources
and Evaluation (LREC 2004) [http://www.psl.cs.columbia.edu/crunch/WWWJ.pdf], Lisbon,
Portugal - 26-28 May 2004
[CJV99] W. Chisholm, I. Jacobs, G. Vanderheiden - Web Content Accessibility Guidelines 1.0
- W3C Recommendation [http://www.w3.org/TR/WCAG10] - 5 May 1999
[CMO05] S. Cassidy, C. Mantratzis, M. Orgun - Separating xhtml content from navigation
clutter using dom-structure block analysis - ACM, Sixteenth ACM Conference on Hypertext
and Hypermedia (HYPERTEXT '05), pp. 145-147, Salzburg, Austria - 6-9 September 2005
[CNS07] W. Choochaiwattana, W. Niranatlamphong, M. B. Spring - Web image classification
algorithm: a heuristic rule-based approach - Vic Grout, editors, Proceedings of the Second
International Conference on Internet Technologies and Applications (ITA '07), pp. 201-207,
Wrexham, North Wales, UK - 4-7 September 2007
[Car65] L. Carroll - Alice's Adventures in Wonderland and Through the Looking-Glass - Published by the Penguin Group in 1998 (centenary edition) - 1865
[Cla99] J. Clark - XSL Transformations (XSLT) - W3C Recommendation [http://www.w3.org/
TR/xslt] - 16 November 1999
[DDD07] A. Dattolo, A. Di Iorio, S. Duca, A. A. Feliziani, F.
Vitali - Patterns for descriptive documents: a formal analysis - Technical
Report UBLCS-2007-13 [http://www.cs.unibo.it/pub/TR/UBLCS/ABSTRACTS/2007.bib?
ncstrl.cabernet//BOLOGNA-UBLCS-2007-13], Computer Science Department, University of
Bologna, Bologna, Italy - 2007
[DGK02] M. Diligenti, M. Gori, M. Kovacevic, V. Milutinovic - Recognition of Common Areas
in a Web Page Using Visual Information: a possible application in a page classification - IEEE
Computer Society, Proceeding of IEEE International Conference on Data Mining (ICDM 2002),
pp. 250-257, Maebashi City, Japan - 9-12 December 2002
[DV03] A. Di Iorio, F. Vitali - A Xanalogical Collaborative Editing Environment - Proceedings
of the Second International Workshop of Web Document Analysis (WDA 2003) [http://
www.csc.liv.ac.uk/~wda2003/Papers/Section_III/Paper_11.pdf], Edinburgh, Scotland, UK - 3
August 2003
[DV04] A. Di Iorio, F. Vitali - Writing the web - Journal of Digital Information '04 [http://
journals.tdl.org/jodi/article/viewArticle/jodi-139/119] - 5 May 2004
[DV05] A. Di Iorio, F. Vitali - From the Writable Web to the Global Editability - ACM,
Proceedings of the sixteenth ACM conference on Hypertext and hypermedia (HYPERTEXT
'05), pp. 35–45 [http://tesi.fabio.web.cs.unibo.it/twiki/pub/Tesi/IsaWiki/f78-diiorio.pdf], New
York, NY, USA - 2005
[DVV04] A. Di Iorio, E. Ventura Campori, F. Vitali - Rule-based structural analysis of web
pages - In Simone Marinai and Andreas Dengel, editors, Document Analysis VI, volume
3163 of Lecture Notes in Computer Science, pages 425–437 [http://tesi.fabio.web.cs.unibo.it/
twiki/pub/Tesi/MaterialeIsaWiki/elISA.pdf], Springer Verlag - 2004
[Dii07] A. Di Iorio - Pattern-based Segmentation of Digital Documents:
Model and Implementation - Ph.D. thesis [http://www.cs.unibo.it/pub/TR/UBLCS/
ABSTRACTS/2007.bib?ncstrl.cabernet//BOLOGNA-UBLCS-2007-05], Computer Science
Department, University of Bologna, Bologna, Italy - 2007
[EFH02] A. K. Elmagarmid, J. Fan, M. Hacid, X. Zhu - Model-Based Video Classification
toward Hierarchical Representation, Indexing and Access - ACM, Multimedia Tools and
Applications, Volume 17 (Issue 1), pp. 97–120 - 2002
[FKS01] A. Finn, N. Kushmerick, B. Smyth - Fact or fiction: content classification for
digital libraries - Proceedings of the Second DELOS Network of Excellence Workshop
on Personalisation and Recommender Systems in Digital Libraries [http://www.ercim.org/
publication/ws-proceedings/DelNoe02/AidanFinn.pdf], Dublin, Ireland - 18-20 June 2001
[Flo05] L. Floridi - Semantic Conceptions of Information - Edward N. Zalta (ed.), The Stanford
Encyclopedia of Philosophy (Winter 2005 Edition), Available online [http://plato.stanford.edu/
entries/information-semantic/] - 2005
[GHJ95] E. Gamma, R. Helm, R. Johnson, J. Vlissides - Design Patterns: Elements of Reusable
Object-Oriented Software - Addison-Wesley Professional, New York, NY, USA - 1995
[GM02] R. J. Glushko, T. McGrath - Document Engineering for e-Business - ACM, Proceedings
of the ACM symposium on Document Engineering, pp. 42-48, McLean, VA, USA - 8-9
November 2002
[Gar05] J. J. Garrett - Ajax: A New Approach to Web Applications - Web article [http://
www.adaptivepath.com/ideas/essays/archives/000385.php] - 18 February 2005
[Got07] T. Gottron - Evaluating content extraction on HTML documents - Vic Grout, editors,
Proceedings of the Second International Conference on Internet Technologies and Applications
(ITA '07), pp. 123-128, Wrexham, North Wales, UK - 4-7 September 2007
[Gru92] T. R. Gruber - A Translation Approach to Portable Ontology Specifications - Knowledge Systems Laboratory, Technical Report KSL 92-71 [ftp://ftp.ksl.stanford.edu/
pub/KSL_Reports/KSL-92-71.ps.gz], Computer Science Department, Stanford University,
Stanford, California, USA - 1992
[Gub04] D. Gubellini - Linguaggi di schema per XML e modelli astratti di documenti - Master
thesis [http://tesi.fabio.web.cs.unibo.it/twiki/pub/Tesi/FormatoGenerico/Tesi.pdf], Computer
Science Department, University of Bologna, Bologna, Italy - 2004
[JLR99] I. Jacobs, A. Le Hors, D. Raggett - HTML 4.01 Specification - W3C Recommendation
[http://www.w3.org/TR/html401] - 24 December 1999
[Jel05] R. Jelliffe - Schematron Specification - Final Committee Draft [http://
www.schematron.com/iso/dsdl-3-fdis.pdf] - 2005
[KT06] K. Koutroumbas, S. Theodoridis - Pattern Recognition - Academic Press (3rd edition)
- 2006
[Kay07] M. Kay - XSL Transformations (XSLT) Version 2.0 - W3C Recommendation [http://
www.w3.org/TR/xslt20/] - 23 January 2007
[LLY03] X. Li, B. Liu, L. Yi - Eliminating Noisy Information in Web Pages for Data Mining ACM, The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, Washington, DC, USA - 24-27 August 2003
[MRS02] J. Mason, M. Roach, F. Stentiford, L. Xu - Recent trends in video analysis: a taxonomy
of video classification problems - Proceedings of the International Conference on Internet and
Multimedia Systems and Applications (IASTED), St. Thomas, Virgin Islands, USA - 18-20
November 2002
[Nel80] T. Nelson - Literary Machines: The report on, and of, Project Xanadu concerning word
processing, electronic publishing, hypertext, thinkertoys, tomorrow's intellectual... including
knowledge, education and freedom - Mindful Press, Sausalito, CA, USA - 1980
[Nis04] NISO - Understanding Metadata - NISO Press, available on the web [http://www.niso.org/
standards/resources/UnderstandingMetadata.pdf] - 2004
[Oas01] OASIS - RELAX NG Specification - Committee Specification [http://relaxng.org/
spec-20011203.html] - 2001
[Rij79] C. J. van Rijsbergen - Information Retrieval - Second edition, available on web [http://
www.dcs.gla.ac.uk/Keith/Preface.html] - 1979
[San06] G. Sanchietti - L'uso di design pattern nella conversione fra formati testuali: una
proposta e un progetto - Undergraduate thesis [http://tesi.fabio.web.cs.unibo.it/twiki/pub/Tesi/
BetterConverter/TesiGiacomoSanchietti.pdf], Computer Science Department, University of
Bologna, Bologna, Italy - 2006
[Sun02] Sun - Reflection Guide - Guide [http://java.sun.com/j2se/1.5.0/docs/guide/reflection/
index.html] - 2002
[Sun03] Sun - JAR File Specification - Specification [http://java.sun.com/j2se/1.4.2/docs/guide/
jar/jar.html] - 2003
[Sun06a] Sun - Servlet 2.5 Specification - Specification [http://jcp.org/aboutJava/
communityprocess/mrel/jsr154/index.html] - 2006
[Sun06b] Sun - Java™ Platform, Standard Edition 6 - API Specification [http://java.sun.com/
javase/6/docs/api/] - 2006
[Ven03] E. Ventura Campori - Estrazione di informazioni di layout attraverso analisi strutturale
nelle pagine HTML - Master thesis [http://tesi.fabio.web.cs.unibo.it/twiki/pub/Main/ElISA/
tesi.ps], Computer Science Department, University of Bologna, Bologna, Italy - 2003
Acknowledgements (Ringraziamenti)
The problem is the hour. There are about seven hours left before the final deadline for handing in theses at the registrar's office. 3.28 a.m.: it seems like the most appropriate moment to write the acknowledgements.
You will have to forgive me right away: I do not think my Italian (if there ever was one) will hold up for the whole text. After all it is late and I am a little tired, if you will allow me. As if that were not enough, these should be, or rather are, double acknowledgements. In the sense that two and a half years ago, give or take a month, I completely forgot about them. No, no: not to give them (I hope). To put them down in black and white; to write them, that is. So I have decided to gather here the acknowledgements for both the bachelor's and the master's degree, all inclusive.
I had been thinking for a long time about what to write in these acknowledgements. I had even had a nice idea, at least in my opinion. A sort of exposition that could gather everybody into a single word. No, I am not raving. I really did have this idea in mind. It had flashed through my head after a small assessment: if I had to write all the names of the people I had (and have) to thank, twenty pages would not be enough. Secondly, I would surely have wronged somebody: you can bet some name would have slipped my mind.
Hence the idea of enclosing everybody in the word bologna. That was my idea. But I am too tired now to work it out properly.
Just a moment before starting to write this text (let's say half an hour ago) I had thought of being a traditionalist: the classic list-of-names-with-reasons seemed the most appropriate. As tradition dictates, parents first and everybody else below. The problem remains the same: how could I not forget somebody? Six years' worth of people is not a small thing, eh. And if we then considered the fact that my head, at this exact moment, is not particularly clear, things would get quite a bit more complicated.
While I keep on writing (nonsense, you will say), I try to find a good way to thank everybody. I would rule out tear-jerking speeches. I have never been able to make them (they just do not come to me, what can you do).
Here is an idea: thanking everybody by dividing you into categories. Family, Friends from Albiano, Friends from Bologna, University and so on. Mhn... no, this is worse. Already with the list of names I could lose somebody along the way. But if here I lose a whole category I forget to thank at least twenty people in one go. No: I cannot afford that.
Going on like this, a solution will turn up eventually. Ok, let's waste another ten minutes. After all, by now, late is late...
How to thank you all? Actually I would be curious to know how many of you have made it this far: to this very line. Haven't you got even a little bored? Are you just curious? Let me get this straight: do you really believe that between here and the end I will manage to find a way to thank absolutely everybody, one by one, without forgetting anyone? Optimists. Because I, for one, am getting a little tired of writing random things without getting anywhere.
The problem is the hour, I told you. The fact is that I am tired. No: the fact is that I do not have much time left to finish these, by now morbid, acknowledgements. So I would say it is time to wrap up. One sentence. One simple, plain sentence, and then everybody off to bed.
I thank all those who, with words, deeds, or simply with thoughts, have supported (and put up with) me over these last twenty-five years.
That is all.