The preservation of electronic records

Transcript

The preservation of electronic records
Autorità per l'Informatica nella Pubblica Amministrazione
The preservation of electronic records - Organizational and technical aspects
Study seminar
Rome, 30 October 2000
Program
SESSION I: THE PRESERVATION FUNCTION. PROBLEMS, STRATEGIES, EXPERIENCES
9:00
Opening of the seminar
Salvatore Italia, Director of the Ufficio Centrale per i Beni Archivistici
Guido Mario Rey, President of the Autorità per l'Informatica nella Pubblica Amministrazione
9:15
Introductory report
Carlo Batini (Autorità per l'Informatica nella Pubblica Amministrazione)
9:35
The management and preservation of electronic records in the Italian public administrations
Antonio Massari (Autorità per l'Informatica)
10:10
Information Management Architecture for Persistent Object Preservation
Reagan W. Moore (University of California)
10:40
Break
11:00
Strategies for preservation. Projects and perspectives in the US Federal
Government
Ken Thibodeau (National Archives USA)
11:30
The Swedish programs for electronic records preservation and access
Göran Kristiansson (Riksarkivet, Sweden)
12:00
Responding to the challenges and opportunities of ICT: the new records manager
Seamus Ross (University of Glasgow)
12:30
Discussion
13:00
Lunch break
SESSION II: METHODS FOR PRESERVATION, THE ROLE OF METADATA, THE POTENTIAL OF XML
14:00
Introductory report
Maria Guercio (Università di Urbino)
14:30
A logical model for electronic records authentication
Bill Underwood (Georgia Tech Research Institute)
15:00
Metadata: Archival Concept or IT Domain?
Peter Horsman (president of the IT Committee, International Council on Archives)
15:30
The potential of markup languages to support descriptive access to electronic
records: the EAD standard
Anne Gilliland-Swetland (University of California, Los Angeles)
16:00
XML standards for business-to-business and business-to-government
communication
Zachary Coffin (eXtensible Business Reporting Language)
16:30
XML, a standard for electronic records
Daniele Tatti (Autorità per l'informatica)
Contents
Preface
THE EVOLUTION OF PROCESSING PROCEDURES FOR ELECTRONIC RECORDS
ARCHIVAL PRINCIPLES IMBUED IN ELECTRONIC RECORDS PROCESSING
ACCESSIONING
ARCHIVAL PRESERVATION SYSTEM
ARCHIVAL ELECTRONIC RECORDS INSPECTION AND CONTROL SYSTEM
ARCHIVAL MANAGEMENT INFORMATION SYSTEM
GAPS DATABASE
DOCUMENTATION
PRESERVATION
REASONS FOR PAST SUCCESS
EMERGING ISSUES
History of NARA's Electronic Records Program
An Historical Perspective on Appraisal of Electronic Records, 1968-1998
Record/Non-Record, Not Valuable/Valuable
Applying Traditional Archival Theory of Appraisal
Applying Traditional Records Management Techniques
Innovation, trying new approaches
Knowledge-based Persistent Archives
Abstract
1. Introduction
2. Knowledge-based Archives
2.1 Archive Accessioning Process
2.2 Archival Representation of Collections
3. Relationships between NARA and other Agency projects
Acknowledgements
NARA and Electronic Records: A Chronology
The potential of markup languages to support descriptive access to electronic records: The EAD standard
Abstract
Introduction
Describing Electronic Records
Encoded Archival Description
Using EAD to Describe Electronic Records
Conclusion
Preservation and migration of electronic records: the state of the issue
The problem of preserving electronic records
Approaches to the problem of preserving electronic records
The need for an archival approach to preserving electronic records
An architecture for archival preservation
1. the right data was put into storage properly;
2. either nothing happened in storage to change this data or alternatively any changes in the data over time are insignificant;
3. all the right data and only the right data was retrieved from storage;
4. the retrieved data was subjected to an appropriate process, and
5. the processing was executed correctly to output an authentic reproduction of the record.
Collection-based persistent object preservation
Viability of the persistent object preservation method
Conclusion
Responding to the Challenges and Opportunities of ICT: The New Records Manager
XML for the preservation of electronic record-keeping systems
Preface
The seminar, organized by the Autorità and by the Ufficio Centrale per i Beni Archivistici of the
Ministero per i Beni e le Attività Culturali, aims to address the crucial question of the long-term
preservation of electronic records: on the one hand by raising awareness among the public
administrations that have already started or planned the automation of their record-keeping systems,
and on the other by launching, also through international comparison, the analysis and verification of
the methods and tools that allow the integrity and authenticity of digital memories, as well as their
accessibility, to be maintained over time. The topic has become increasingly timely and relevant with
the wide adoption of information and communication technologies for managing administrative and
documentary workflows. It nevertheless requires the solution of numerous organizational and technical
problems that international research groups and national institutions around the world began studying
some years ago with growing commitment, above all in order to identify scalable, low-cost solutions
that make it possible even for small and medium-sized organizations to undertake general automation
of administrative processes and of the transmission and keeping of records.
In particular, in the first session the seminar intends to offer the Italian public administrations -
above all the managers of information systems and of record-keeping systems - an overview of the
specific national situation and of the most interesting projects and operational experiences carried out
in other countries (the United States and Sweden) in close collaboration with broader international
research groups, together with an analysis of the training and retraining needs of the professionals
who will take on the new and demanding role of keeper of digital memories.
The second session examines the more strictly technical questions of preservation, in particular
the various methods available today for dealing with technological obsolescence and, especially, the
potential of markup languages (SGML, XML) for processing and managing the metadata that ensure
the long-term keeping of the documentary heritage, and the development of specific standards
(Encoded Archival Description).
THE EVOLUTION OF PROCESSING PROCEDURES FOR ELECTRONIC RECORDS [1]
Bruce Ambacher
Electronic and Special Media Records Services Division
National Archives and Records Administration
In 1966 Archivist of the United States Robert Bahmer established the Committee on Disposition of
Machine-Readable Records. In January 1968 the Committee presented its report recommending the
National Archives address the records management and archival aspects of Federal machine-readable
records. Bahmer then assigned senior National Archives staff to make the recommendations a reality.
He tasked experienced records managers and archivists such as Ev Alldredge and Meyer Fishbein with
developing an integrated program, which was formally established as the Data Archives Staff late in 1968
by Bahmer's successor, James B. Rhoads. It was natural, therefore, that the newest archival records
program would be imbued with as many aspects of traditional records management and archival theory
and practice as possible and that the practices and procedures developed for and by the new program
would conform to National Archives standards.
ARCHIVAL PRINCIPLES IMBUED IN ELECTRONIC RECORDS PROCESSING
The establishment of a separate Data Archives Staff in the fall of 1968 continued several National
Archives traditions. As early as 1938 the Archives had recognized the practical need for separate
programs for distinct types of records which required special staff, storage, preservation, or reference
services. In 1968 separate programs already existed for photographs, motion pictures, sound recordings,
and cartographic materials as well as separate custodial divisions for civil and military textual records.
Thus, establishing a separate Data Archives Staff in 1968 to perform all archival functions (except
appraisal) for computer generated data reflected National Archives practice. This staff immediately
began adapting and applying traditional records management and archival functions to electronic data.
Establishing the Data Archives Staff also confirmed the National Archives continuing support for the
principle of archival custody. Both the experienced Archives staff responsible for establishing the
program and the archivists hired to implement the program, regardless of professional training and
previous experience, recognized the importance of accessioning and gaining physical and intellectual
control over machine-readable records identified as having enduring value.
In the first year the new staff began to use effective textual records management and textual archival
procedures for machine-readable records. The first step was a records management survey of extant
machine-readable records to help determine which Federal records might be accessioned. The staff also
developed the first of an ongoing series of procedures manuals. This one, “A Procedure for Accepting
Digital and Analog Magnetic Tape for Archival Storage,” dealt with both accessioning and preservation
guidelines. It established basic technical accessioning and preservation criteria such as tape format, initial
readability of the data, storage conditions including temperature and humidity, and routine preservation
activities such as tape rotation and controlled tension rewinding. In June 1973 the more extensive draft,
Recommended Environmental Conditions and Handling Procedures for Magnetic Tape, superseded the
preservation sections of the earlier Procedure. Today, lengthy specifications in the Code of Federal
Regulations regulate agency custody and storage of two categories of electronic records – those that are
either unscheduled or those that have enduring value and are scheduled for transfer to the National
Archives. Once electronic records are transferred to NARA they are controlled by the Code of Federal
Regulations, by the division’s internal Preservation Procedures Handbook, and by NARA Preservation
guidelines. These are reinforced by more generalized environmental care and handling guidelines from
sources such as the National Institute for Standards and Technology and the National Media Laboratory,
as well as guidance published by hardware and media manufacturers.
[1] An earlier version of this paper was presented at the Annual Conference of the Society of American Archivists in Denver, Colorado, August 2000.
The initial inventorying and scheduling activities confirmed another simple, but most important archival
principle – the record nature of computer data. As Federal agencies recognized machine-readable items
as Federal records, records managers began to include them in records control schedules. NARA
processed the first schedule containing a permanent machine-readable series in 1971.
A closely related significant early step was the development and issuance of General Records Schedule
20 covering machine-readable records in April 1972. Linda Henry has reviewed the history of GRS 20. I
wish to emphasize that issuing GRS 20 confirmed another basic National Archives concept –
responsibility for the entire life-cycle of records - identifying, appraising, accessioning, preserving, and
providing reference service for records with enduring value created or received by the Federal
Government, regardless of physical format.
ACCESSIONING
In April 1970 the Data Archives Staff accepted its first electronic records data file transfer, NASA’s Tektite
I, and began the accessioning process. Building on the approach in “A Procedure for Accepting Digital
and Analog Magnetic Tape for Archival Storage” the staff made a master and a backup copy, deposited
the copies in separate storage areas, negotiated the restriction on access statement with the creating
agency, prepared a basic documentation package, and developed a catalog entry. These still are the
basic steps taken at NARA and every repository that has a custodial program for electronic records.
The relatively rapid growth in the volume of accessioned machine-readable holdings and in the number of
staff responsible for them highlighted the need to further develop written standardized procedures. By the
mid-1970s the staff had grown from the initial three to more than fifteen, several with minimal archival
experience. Starting with the first accession in April 1970, the collection had grown to more than 122
series in 18 record groups on 1,000 reels of magnetic tape four years later. In 1973 the fledgling technical
services staff issued the draft Recommended Environmental Conditions and Handling Procedures for
Magnetic Tape, to address some aspects of preservation. More formal written preservation procedures
focusing on in-house procedures came with in-house computing capabilities.
In 1976 the Supervisory Archivist in charge of accessioning assigned the task of developing a
comprehensive Appraisal-Description-Accessioning Procedures Handbook to a new staff member as a
training project. The first draft was available in 1978. A polished comprehensive loose leaf notebook
format including elaborate flowcharts of all procedures; checklists of required elements for each task; and
samples of appraisals, descriptions, and documentation packages was completed in 1980. The division
has continued maintaining this, and other, procedures manuals. The most recent version, completed in
June by Linda Henry, is online on the division’s local area network and contains all of the current forms as
well as up-to-date samples of current reports and descriptions.
The major purpose of the accessioning and preservation procedures, regardless of the version or the
year created, or of the computer equipment used, is to ensure all aspects of accessioning and initial
preservation are completed and that each accession completes all steps consistently, in conformance
with current Archives standards. Obviously, there has been some variation, in the required forms, the
preservation techniques used, the extent of the documentation, and the type of description over the past
three decades.
In 1968 the National Archives and Records Service was a Service within the General Services
Administration (GSA). In the interest of economy and efficiency, GSA had consolidated all computer
services in the GSA computer staff in Region 3. GSA directed the Data Archives Staff to use that staff for
computer services for accessioning, preservation, and reference. This created problems. GSA's technical
staff did not appreciate the meaning of terms such as "archival" or "permanent." They also did not give high
priority to processing older files "inherited" by NARS from other agencies when their own managers
placed greater importance on current projects. This inevitably led to problems. In 1973, as a result of
events that Thomas Brown describes elsewhere, GSA reassigned a computer programmer to the Data
Archives Staff giving the staff direct responsibility for technical services associated with accessioning,
preservation, and reference.
The program, however, continued to be dependent upon outside vendors to provide computer time for all
accessioning, preservation, and reference services for two more decades. Technical Services staff wrote
punch card job decks that contained processing instructions. A courier service took the jobs and the
computer tapes to a vendor computer center for batch processing. This processing generally was
performed overnight in non-prime time when computer time was least expensive. Any minor error in
keypunching or job command language would abort the job. The staff would make the minor adjustment
or correction and resubmit the job.
A first major step toward acquiring in house computing capability was a side benefit of American Friends
Service Committee et al vs. William Webster et al, the 1978 lawsuit against the National Archives and the
Federal Bureau of Investigation. The legal aspects and the archival implications of the FBI appraisal
project have been assessed elsewhere. Its impact on the electronic records program was significant.
The Machine-Readable Division, in addition to detailing three staff full-time to the appraisal project,
provided the Archives team with computer-generated samples, computer analyses, and statistics for team
evaluation and court scrutiny. The need to reduce turn-around time led to the division acquiring a “dumb”
terminal to process computer jobs. It also was a factor in GSA’s decision in 1982 to acquire a computer
for the division.
At the same time that the FBI Project was diverting resources, President Ronald Reagan, who had
campaigned on an anti-big government theme, imposed a budget and hiring freeze on Federal agencies.
GSA went even further and initiated a six percent budget reduction on its Services. Within the National
Archives this translated into Reductions-in-Force (RIFs) which led to dismissing all employees hired within
the previous three years. The Machine-Readable program, with its overwhelmingly newer employees,
took a disproportionate share of the staff cuts. RIFs reduced the staff from twenty in 1981 to just twelve
in 1982.
Custody of the Archives’ first computer was diverted from the greatly reduced and demoted branch to an
administrative program. Some branch staff devoted to technical services also were transferred to that
division, reducing the staff to just seven. The computer was used primarily for administrative and agency-wide
needs. Electronic records preservation and reference copying was a low priority. All aspects of
processing electronic records languished throughout the 1980s.
ARCHIVAL PRESERVATION SYSTEM
The most significant time savings in processing data files comes from introducing in-house automation
into the process. Developing that in-house capability was a key component of the revitalization of the
electronic records program that began in 1989. In addition to automating the accessioning and
preservation processes, automating management control and tracking was also a high priority. The
revised accessioning and preservation strategy recognized that the singular approach of the previous two
decades increasingly had the potential to modify the evidential character of the electronic records being
preserved, to move away from preserving records as received. A more diverse, modular approach was
required to ensure capturing and preserving the essential record characteristics of each file. This strategy
was a major theme in the development of the Archival Preservation System (APS).
NARA designed APS to utilize standard personal computers and magnetic tape drives with unique
software programming to perform four basic functions. First, APS copies electronic records from a variety
of media to the medium chosen for archival preservation. Second, it imprints the archival copies of
physical files with standard specifications for recording and labeling. Third, APS automatically tracks and
captures information on the files copied, the media volumes created and the processes used to assist
both preservation and access. Finally, APS facilitates future migration of files to new media. The key to
APS efficiencies is standardizing the preserved data files.
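The four APS functions can be pictured as a copy-and-track pipeline. The sketch below is illustrative only and is not NARA's implementation; the manifest fields, the volume-labelling convention, and the use of an MD5 checksum as a fixity value for later migration checks are assumptions introduced here for clarity.

    import datetime
    import hashlib
    import json
    import shutil
    from pathlib import Path

    def preserve_file(source: Path, volume_dir: Path, volume_label: str, manifest: Path) -> dict:
        """Hypothetical APS-style step: copy one accessioned file to the preservation
        volume, label it, and capture tracking metadata for verification and migration."""
        target = volume_dir / f"{volume_label}_{source.name}"    # assumed labelling convention (function 2)
        shutil.copy2(source, target)                             # copy to the archival medium (function 1)
        entry = {                                                # tracking information per file (function 3)
            "original_name": source.name,
            "archival_copy": str(target),
            "volume": volume_label,
            "bytes": target.stat().st_size,
            "md5": hashlib.md5(target.read_bytes()).hexdigest(), # assumed fixity value
            "copied_at": datetime.datetime.utcnow().isoformat(timespec="seconds"),
        }
        with manifest.open("a", encoding="utf-8") as log:        # append-only manifest supports later migration (function 4)
            log.write(json.dumps(entry) + "\n")
        return entry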
APS received a real baptism under fire. It arrived in mid-1993, in time to be diverted from archival
preservation processing to duplication of backup tapes transferred from the Executive Office of the
President to NARA under temporary restraining orders imposed by the U.S. District Court as part of the
Armstrong vs. EOP lawsuit. By 1995 APS had succeeded in copying more than 99.98% of all the EOP
records from a variety of physical media.
In the past seven years the electronic records program has used the APS to successfully preserve tens of
thousands of electronic records data files on thousands of physical volumes. Today fifteen APS
workstations in various local area network and stand alone configurations process virtually all of NARA’s
electronic records.
To date NARA has spent $1,845,000 on APS. This includes $500,000 for initial proof of concept,
programming, and subsequent reprogramming of APS code.
Hardware costs, including Y2K
replacements, exceed $900,000. Daily operations and maintenance have cost nearly $400,000. In the
next few years there are plans to expand both APS processing capabilities to include additional types of
computer files such as images and office automation and to enhance its reporting capabilities. Currently
APS is viewed as an essential component of, or delivery system to, the Electronic Records Archive.
ARCHIVAL ELECTRONIC RECORDS INSPECTION AND CONTROL SYSTEM
APS was designed to evolve the way the National Archives accessions and preserves electronic records
as physical files. It did not, however, address the need to verify the contents of a data file automatically
or capture and preserve the logical and conceptual characteristics of the records and the records systems
or databases in which the records were created and used. These goals have been embodied in a second
automated system, the Archival Electronic Records Inspection and Control (AERIC) system. As every
processing archivist and trainee who has had to hand validate a sample dump of computer records will
testify, the process is time-consuming, labor intensive, boring, and prone to error. Further, anyone
validating records could only examine an insignificant number of records, typically ten to fifty.
In 1990 the Center for Electronic Records funded a proof of concept study to determine whether and how
computer technology could be applied to automate the data verification process focusing on the logical
and conceptual structure of data. NARA received the AERIC prototype in 1991. Over the past decade
the AERIC system has evolved from a single workstation to a local area network available simultaneously
to every accessioning archivist. An archivist or technician addresses the question of whether the file
corresponds to what the creating agency claims it is by entering the information about what the structure and
content are supposed to be - the metadata, record structures and code definitions associated with
specific variables. The system then reads the data against the variables and codes and reports on each
nonconforming record. AERIC achieves its greatest efficiencies when multiple files, including periodic
accretions, have the same or similar structures, permitting verification of multiple files based on the same
input.
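In outline, the verification step is a field-by-field comparison of every record against the documented codebook. The following is a minimal sketch of that idea rather than the AERIC system itself; the codebook format (field name, start position, length, allowed codes) and the fixed-width record layout are assumptions chosen for illustration.

    from typing import Dict, List, Tuple

    # Hypothetical codebook: field name -> (start position, length, set of valid codes).
    Codebook = Dict[str, Tuple[int, int, set]]

    def verify_records(lines: List[str], codebook: Codebook) -> List[str]:
        """Report every record whose fields do not conform to the documented codes."""
        problems = []
        for number, line in enumerate(lines, start=1):
            for field, (start, length, valid) in codebook.items():
                value = line[start:start + length].strip()
                if value not in valid:
                    problems.append(f"record {number}: field '{field}' has unexpected value '{value}'")
        return problems

    # Example with an invented two-field fixed-width layout.
    codebook = {"sex": (0, 1, {"M", "F"}), "state": (1, 2, {"MD", "VA", "DC"})}
    for issue in verify_records(["MMD", "XVA", "FZZ"], codebook):
        print(issue)   # flags record 2 ('X') and record 3 ('ZZ')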
To date NARA has invested about $1,300,000 on AERIC. This includes $750,000 for initial proof of
concept, programming, and subsequent reprogramming and significant expansion of AERIC code.
Hardware costs, including Y2K replacements and three new stand-alone workstations, exceed $400,000.
Daily operations and maintenance have cost nearly $150,000. The most recent modifications allow staff
to verify the content of structured text files including e-mail and diplomatic cables.
Planned upgrades to AERIC include expanding the types of electronic records it can process to cover
data from other structured files such as GIS and images, and enhancing its storage capacity so it
can be used to facilitate researcher access to archival databases. A stand-alone AERIC has just been
placed into service to verify the classified electronic records transferred from the National Security Council
and the offices of the Independent Counsels.
ARCHIVAL MANAGEMENT INFORMATION SYSTEM
For the past quarter century NARA’s electronic records program, like other archival programs, has faced
several recurring and related questions. “What is the status of a particular accession?” “How many
accessions have been completed this fiscal year?” “How big are the backlog(s)?” “What is the best way to
manage the backlog?” Throughout the 1970s and 1980s answers to these questions depended upon
supervisors maintaining a variety of manual logs and collecting statistics on a periodic basis. The
revitalization of the program, beginning in 1989, included additions to the staff and increased access to
information management databases. One of the first automated systems developed was the Archival
Management Information System (AMIS). AMIS is a relational database. The first version, developed in
1990, operated on DB2 software maintained at the National Institutes of Health, which provided the
program’s computer support at that time.
The AMIS project manager populated AMIS by entering information from each accession dossier. Each
dossier contained the deed of gift, lists of the data files, accompanying restriction information, and varying
amounts of processing information. While this sounds straightforward and relatively simple, it was not.
Given the number of administrative reorganizations, many of the dossiers could not be located.
AMIS now operates on MS Access on NARA’s administrative support system, NARANET. All current
accessions are entered into AMIS as soon as they are received. AMIS can be used to create reports on
the status of any accession including which accessioning or preservation steps have and have not been
completed and whether all required signatures have been received. AMIS also can be used to calculate
the elapsed time between initial offer and completion. Currently, it takes more than 800 days.
Unfortunately, we still cannot tell you the exact size of the backlog(s).
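AMIS-style tracking can be modelled as a small relational table of accessions with dated milestones. The sketch below uses SQLite only to illustrate the kind of status and elapsed-time queries described above; the table and column names are invented for the example and are not the actual AMIS schema.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE accessions (
            accession_id TEXT PRIMARY KEY,
            offered_on   TEXT,            -- date of initial offer
            completed_on TEXT,            -- NULL while processing is unfinished
            signatures_received INTEGER   -- 1 when all required signatures are in hand
        )
    """)
    db.executemany(
        "INSERT INTO accessions VALUES (?, ?, ?, ?)",
        [("NN-001", "1998-02-10", "2000-06-30", 1),
         ("NN-002", "1999-11-03", None, 0)],
    )

    # Status report: which accessions are still open?
    open_accessions = db.execute(
        "SELECT accession_id FROM accessions WHERE completed_on IS NULL").fetchall()

    # Elapsed days between initial offer and completion for finished accessions.
    elapsed = db.execute(
        "SELECT accession_id, julianday(completed_on) - julianday(offered_on) "
        "FROM accessions WHERE completed_on IS NOT NULL").fetchall()

    print(open_accessions)   # [('NN-002',)]
    print(elapsed)           # [('NN-001', 871.0)]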
GAPS DATABASE
Both Linda and I have mentioned that electronic records scheduling began almost as soon as the
program was established, with the first permanent series scheduled in 1971. The emphasis on
scheduling has continued over the past thirty years. The obvious purpose of scheduling is the ultimate
transfer of the records.
In the late 1980s NARA’s Federal Records Centers began to notice a distinct reduction in the volume of
scheduled textual records being transferred to the centers. Initial investigation determined that one of
several causes was increasing use of office automation and the migration of records from textual series to
electronic series. Following up on the findings, the Deputy Archivist asked the electronic records program
to determine what portion of the records it was accessioning resulted from records control scheduling
efforts and if any of the accessions reflected the evolution from paper to electronic media. This led to the
development of the GAPS database – for gaps in the holdings.
Two archivists examined every schedule that could be identified as scheduling electronic records for
transfer to NARA. Each GAPS record contains the schedule and item numbers, series title, description,
disposition, cutoff instructions and transfer dates. The archivists compared the entries with the
accessions listed in the AMIS database, based on the schedule and item number. Their initial survey
revealed that less than five percent of what should have been transferred actually had been. Follow up
accessioning efforts over the past decade increased that to more than thirty percent. Ongoing solicitation
will result in additional transfers. For contemporary schedules the GAPS database for electronic records
complements the Permanent Authorities database that contains similar information for all other record
forms.
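The comparison described above is essentially a set difference keyed on schedule and item number. A minimal sketch with invented identifiers shows how scheduled-but-never-transferred series could be flagged:

    # Hypothetical (schedule, item) keys; real GAPS and AMIS records carry more fields.
    scheduled = {("N1-412-87-1", "1a"), ("N1-412-87-1", "2"), ("N1-29-90-3", "4")}
    accessioned = {("N1-412-87-1", "1a")}

    gaps = scheduled - accessioned                       # scheduled for transfer but never received
    transfer_rate = len(scheduled & accessioned) / len(scheduled)

    print(sorted(gaps))                                  # items to solicit from the creating agencies
    print(f"{transfer_rate:.0%} of scheduled series actually transferred")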
DOCUMENTATION
A third accessioning processing procedure that is unique to electronic records programs is the
preservation of a separate documentation series. Documentation is crucial to allowing researchers to
understand and use the electronic file. At a minimum it contains the record layout and codes that define
the data. Ideally, documentation may include an overview of the data, the data collection methodology or
framework, a copy of the original input form, analysis of the data and its uses by the creator, definitions of
terms, policy documents on why the study was conducted, and a bibliography of research studies based
on past use of the data. Since each file is unique, no concise definition of adequate documentation
exists. Gaining physical custody of valuable data files means nothing if the creating agency has not
maintained or cannot create adequate and proper documentation to transfer with the file. Occasionally,
otherwise valuable files cannot be accessioned because no documentation existed. Throughout the past
thirty years accessioning archivists have gone to great lengths to locate and assemble appropriate
documentation.
Unfortunately, NARA still receives electronic records files which lack complete
documentation.
Federal agencies’ understanding of the need for adequate documentation varies. One outstanding
example, from my own experience, was the documentation for the Collaborative Perinatal Project. The
National Institute for Neurological Diseases expended $200 million over nearly twenty years at fourteen
different hospitals to study pregnant women and their children from birth through nine years. The series
contains more than six million records. After the project was completed, the sponsoring agency paid a
contractor more than $500,000 to prepare the documentation, including an extensive bibliography of all
research and publication based on the data, for transfer to the Archives.
As the backlog of incompletely processed computer data files grew, the program responded by searching
for ways to streamline processing procedures without compromising researcher understanding of the
data. There has been greater emphasis on using standardized formats, saved as word processing
macros, to identify the verification procedures used, anomalies in the data or the coding, any restrictions
on access to the data, and incorporation of descriptive materials created for other purposes.
PRESERVATION
NARA’s approach to the preservation of electronic records continues to build upon basic tenants and
principles developed more than thirty years ago. They endure because they still apply to the data files for
which they were designed. They also endure because they are being applied successfully – with some
modification – to other forms of electronic records. This basic approach to preservation continues to be
forged through a combination of research, discussion with industry leaders, archival principles and early
practical experiences. The 1969 “A Procedure for Accepting Digital and Analog Magnetic Tape for
Archival Storage” and 1973 draft Recommended Environmental Conditions and Handling Procedures for
Magnetic Tape illustrate the early guidelines.
The first basic principle is that preservation of electronic records does not focus on the preservation of a
physical object. While the physical object is more durable than many believe it to be, it is inevitable that
the entire suite of media, software and hardware will not survive to provide access to the information at
some future date. Preservation of electronic records, therefore, focuses on maintaining the ability to
process the information on contemporary computers through repeated changes to the media, the
enabling software, and the hardware.
NARA maximizes media life by creating two copies of the data on new evaluated stable media. This
basic statement reflects unfortunate early experiences with the quality of the media copies transferred
from the creating agencies. The program quickly determined that the only way to ensure the quality of
the accessioned media was to make both copies on new certified media. Media life is enhanced further
by storage in canisters in a stable environment with appropriate temperature and humidity. Today’s
recommended temperature is 65°F. The recommended humidity is 45%. Both are slightly lower than
those recommended for textual records.
In addition, archival electronic records are subjected to periodic sampling to monitor the continuing
readability of their information. Based on extensive media testing and consultation with computer experts
that indicated an average media life of twelve to fifteen years the staff instituted a ten year media
refreshment program to move the data to new media before the old media deteriorates. Over the past
two decades new media have been added to the list of acceptable transfer media and archival storage
media as their stability and universality has been tested and documented. Today, the program accepts
records on nine track open reel magnetic tape, class 3480 magnetic tape cartridges, and CD-ROM.
Investigation of additional magnetic media, especially class 3590E magnetic tape cartridges, is underway.
Determination of the longevity of non-magnetic media and their use as preservation storage media
appears to be some years in the future.
The second basic approach to preserving access to the information on electronic records was to
standardize the format of the electronic records transferred to the Archives. The staff initially surveyed
the extent and uses of computers in the Federal Government in the early 1970s to determine current best
practices. The Archives’ first regulations required that all computer data to be transferred would have to
be in one of two mainframe character-encoding conventions, the American Standard Code for Information
Interchange (ASCII) or the Extended Binary Coded Decimal Interchange Code (EBCDIC). This created flat
files with no embedded software or controller language. Many criticize this as a “one size fits all”
approach. This approach remains valid for the data files and databases for which it was intended. This
data still represents the overwhelming majority of all electronic records transfers to NARA. The standard
allows NARA to accept and preserve data in standardized formats that permit researchers to use the data
on any hardware platform with any software application.
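Producing such a flat file from an agency transfer amounts to decoding the source character set and re-writing plain text. A minimal sketch, assuming the input is EBCDIC code page 037 (one common variant; actual transfers could use others):

    # Decode an EBCDIC transfer file and re-encode it as an ASCII flat file.
    # "cp037" (EBCDIC US/Canada) is an assumption; agencies used several EBCDIC variants.
    def ebcdic_to_ascii_flat_file(source_path: str, target_path: str) -> None:
        with open(source_path, "rb") as src:
            text = src.read().decode("cp037")
        # Writing with the strict ASCII codec fails loudly if any character has no ASCII
        # equivalent, keeping the preserved copy a plain flat file with no software dependencies.
        with open(target_path, "w", encoding="ascii", errors="strict", newline="\n") as dst:
            dst.write(text)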
This standard format is not applicable to all electronic records. Federal agencies have developed a
myriad of software applications as computing has expanded from the computer room to the desktop.
They also have applied automation to an increasing variety of information forms including the full range of
office automation applications, satellite imaging, geospatial applications, and digital photography and
video. NARA continually seeks solutions to preserve these newer records in formats that will reflect the
full record.
A side effect of these basic preservation principles has been continuing confirmation of a decision not to
become a “Colonial Williamsburg” for information technology. NARA does not maintain any superseded
computer hardware or software to support long-term preservation. The transfer of office automation
products such as e-mail, spreadsheets, and word processing documents, especially in the records of the
Executive Office of the President and the offices of Independent Counsels is prompting NARA to
investigate ways to provide access to such software dependent information without preserving copyright
protected software. One promising solution would utilize viewers or emulators that identify and then mimic the
original software.
The emphasis on information preservation and media disposal means that physical conservation – media
rehabilitation or restoration – is rarely performed. When it is, the object is to rehabilitate the media to the
point that the information can be migrated onto new media. After that the original media is destroyed.
Media conservation was most widely practiced on the electronic records in the PROFS lawsuit. Physical
conservation activities included thermal reconditioning to reverse tape folding and creasing, media repair
to splice split tape and attach or move reflective tape marks, tape baking in a scientific oven, and microphotography to document irreversible physical damage.
REASONS FOR PAST SUCCESS
The reasons for the success of the oldest and largest electronic records custodial program are varied.
One of the most essential has been staffing. For thirty years NARA has had a staff devoted exclusively to
electronic records. While backlogs attest that this staff has never been large enough to both accession
and preserve all of the records transferred, the program has been able to devote some staff to develop
procedures and standards. Others have been able to study the use of emerging technologies in Federal
agencies in order to understand future accessioning, preservation and access issues.
The program achieved an early zenith in 1980 when its staff of twenty was virtually current with
information technology, and was able to “target” agencies that it worked with to secure timely transfers of
especially valuable electronic records. The backlog, relative to the overall volume of records, was also
very small.
The next decade saw a complete reversal. Temporary detailing of staff to the FBI appraisal project,
reductions-in-force, and the loss of staff to promotions in other NARA units and other Federal agencies
reduced the staff to seven in 1983. Since the revitalization which began in 1989 the staffing level has
risen to more than forty-five, but the volume of records to be accessioned and preserved, and the backlog,
also have risen. The range of duties has diminished as scheduling and appraisal, government-wide
standards, and information technology research have been reassigned to other NARA units. This allows
the staff to concentrate on the core functions of accessioning and preservation.
The diversity of the staff also contributes to the program’s success. While archivists, archives specialists,
and archives technicians have always comprised the majority of the staff, management analysts,
computer specialists, computer programmers, and information specialists have also been part of the staff.
Even within any occupational series the mix of education and previous experience has been diverse
ranging from history and political science to geography, library science, social science, computer science,
and education. This diverse mix balanced archival principles and concepts with information technology
needs and requirements to develop procedures for addressing the unique needs for archival electronic
records.
Maintaining a separate program for electronic records ensures that the appropriate sense of mission
exists. Whether it was developing guidelines for environmental storage conditions, appraising records to
determine which would be preserved, or drafting operating procedures, staff are embarked on a
pioneering mission: their work is breaking new ground, and their efforts will make a difference for those who
follow. This sense still exists.
Closely related to the sense of mission is a strong belief in the enduring value of electronic records as a
new form of records. Although “traditional” archivists challenged the value of electronic records
throughout the first decade of the program, the staff persevered because they believed in the enduring
value of the records, of their records. Time has proven them correct.
EMERGING ISSUES
Electronic records custodial programs are at a crossroads. In many ways the current issues seem even
more challenging than those faced when archival programs first addressed this newest form of record.
But are they?
Some archivists question whether archives should establish or maintain a custodial program for electronic
records. Others are re-examining basic archival principles to determine if any modifications are required.
A relatively small number are “doing,” inventorying, scheduling, appraising, accessioning, preserving, and
providing reference services for electronic records with enduring value. This small number works with
their colleagues around the world to address custodial issues and to develop complementary strategies for
mutual benefit. I am pleased that NARA has been doing – for more than three decades.
Certainly electronic records custodians must examine and adjust their policies, procedures, guidance,
standards, and underlying concepts related to emerging issues such as increasing platform
dependencies, exploding volumes of records, increasingly diverse types of electronic records, and the
ever increasing difficulty of ensuring access to the information over time. NARA remains committed to
addressing and solving those challenges and to ensuring the long-term preservation of all Federal
electronic records with enduring value.
History of NARA's Electronic Records Program [2]
Thomas E. Brown
National Archives and Records Administration
College Park, Maryland
What follows is a personal history. Personal for two reasons: Having worked at the National
Archives’ programs for electronic records for nearly two decades, I personally participated in many of
these events. Second, I offer some personal judgements that are my own and not necessarily those of
my colleagues at NARA or the agency itself.
Thirty years ago on April 16, 1970, archivists at the U. S. National Archives and Records Service
(NARS) accessioned their first electronic records. The genesis of this accession into a custodial unit
dedicated exclusively to electronic records had begun several years earlier when the Archivist, Robert H.
Bahmer, established on December 13, 1966, the Committee on the Disposition of Machine-Readable
Records under the chairmanship of Everett O. Alldredge.
The committee made several
recommendations, one of which was that a senior specialist would coordinate the work of the NARS units
involved in electronic records. On February 13, 1968, Bahmer issued a GSA Notice to implement the
recommendations and to detail Alldredge to his office to implement the findings of the committee.
During the spring and summer of 1968, Alldredge proposed a single unit to be responsible for
NARS efforts at electronic records. He wanted a staff of twelve with three senior computer professionals.
This met with adamant opposition in GSA. It had established a centralized computer facility to provide
technical support to all of GSA including the fledgling machine-readable program at NARS. When the new
Archivist, James B. Rhoads, formally established the Data Archives Staff in the fall of 1968, the staff
consisted of only three people including Joseph V. Bradt as the Director. Interestingly, the staff reported
directly to Alldredge as the head of records management after his detail to the Archivist's staff. The Data
Archives Staff had responsibility for all NARS activities regarding machine-readable records except for
appraisal. The latter function remained in the appraisal unit in NARS directed by Meyer Fishbein, a key
member of the original Committee on the Disposition of Machine-Readable Records.
The Data Archives Staff began life as a records management operation. The small staff
developed within two years three documents: a form for inventorying magnetic tape files,
recommendations for proper handling and storage of magnetic tape, and the first issuance of a General
Records Schedule for computerized records. Alldredge struggled to obtain additional resources as he
proposed three separate organizational units within the next two years, ranging in size up to 51
employees. But these plans were never realized. After Alldredge’s retirement in May 1971, the only early
organizational change occurred in 1972 when the Data Archives Staff moved to the Office of the National
Archives and became the Data Archives Branch. With the reorganization, Gerald Rosencrantz, who had
replaced Bradt as Director of the Data Archives Staff in September 1970, became Chief of the Data
Archives Branch.
Alldredge’s last plan presaged one major development. In 1973, the Data Archives Branch broke
GSA’s monopoly on technical support and expertise. Early that year, the Data Archives Branch began to
provide reference service for 600 reels of Civil Aeronautics Board historical data banks. Most requests
came from airlines with petitions before the CAB. NARS had to use the parent agency’s central data
processing center for the reference reproductions. In the summer of 1973, an order from American
Airlines was interminably delayed at GSA's data processing center. In anger over the delay, the Vice-President of American Airlines for Governmental Relations telephoned the Archivist of the United States
at home one evening and wanted to know where his data were. After a flurry of telephone calls between
GSA and NARS officials, a GSA Deputy Commissioner ordered a Senior Systems Analyst to copy the
tape and drive it at midnight to Baltimore-Washington airport where an American Airlines plane was
[2] This paper was presented at the Annual Meeting of the Society of American Archivists in Denver, Colorado, on 2 September 2000.
waiting to whisk the tape to Chicago. The next day, GSA reassigned that same Systems Analyst to the
Data Archives Branch and granted the Branch unique authority within GSA to acquire data processing
services independent of GSA.
By 1974, the Data Archives Branch had slowly grown to 13 with 3 more being recruited. Four of
these were funded by other Federal agencies as a result of partnerships. To that date, the machine-readable program functioned more as a federal data center than as an archives, since the vast
majority of its holdings were primarily files with high current use and with questionable archival value. As
a corrective, NARS raised the bureaucratic stature of the program to the Machine-Readable Archives
Division and recruited Charles Dollar, a history professor, as the new division’s director, to establish a true
custodial program.
The Dollar years did indeed bring a professionalism to the Division. During his first four years,
Dollar recruited ten people for professional archival positions with seven of the ten having Ph.D. degrees.
As professionals, the staff was active in a variety of professional associations with interests in the
preservation and use of machine-readable records, such as SAA, MARAC, IASSIST, AAAS, URISA, and
APDU. For example, in fiscal year 1980, individual staff members spoke on electronic records during 31
professional conferences and attended 35 other professional meetings. The staff also devised and
implemented descriptive standards for machine-readable records both as series of archival materials and
as social science data files. The division also acquired responsibility for the appraisal of machine-readable records and then outlined appraisal criteria. Responsibility for appraisal gave the division an
entree into records management as the division established programs to train agency staff to inventory
and schedule computer data bases, apply the general records schedules, and arrange for the transfer of
permanent records to the Archives. This effort also included a “targeted agency” program through which
staff assisted the inventorying and scheduling of electronic records in those agencies with records of
permanent value. The establishment in 1977 of the Center for Machine-Readable Records provided a
mechanism for NARS to acquire physical custody of electronic records with a high reference demand but
indeterminate or dubious archival value. By the end of 1980, the Machine-Readable Archives Division
had processed 155 accessions. These included computerized information gathered by regulatory
agencies, indexing systems to permanent paper records, a primitive form of an expert system using
artificial intelligence and, probably most important, a rich collection of operational records from the
Vietnam conflict. Several archivists had acquired through on-the-job and classroom training a solid
knowledge of data processing that supplemented the technical staff.
In late 1980, however, the program began to unravel. The first blow came when the U.S. District
Court Judge Harold Greene, who was threatening to send the Deputy Archivist to jail, ordered a re-appraisal of the Federal Bureau of Investigation's field office records. NARS established a re-appraisal task
force that was in fact, if not on paper, under the direction of Dollar and that included two other staff
members from the division. With the Task Force relying on a quantitative analysis of a statistical sample
of case files, three additional staff spent three to six months providing technical support for the Task Force.
With Dollar directing the Task Force and then reassigned, the Machine-Readable Division lacked a full-time, on-site division director for eighteen months.
And these eighteen months coincided with the Reagan Revolution’s goal to reduce the size of the
Federal Government. GSA's Administrator expanded the Government-wide hiring freeze into a reduction-in-force or RIF. With seniority a major determinant of who lost their jobs, NARS eliminated all vacancies
and fired all employees hired in the previous three years. When the dust settled, NARS had lost 98
employees and another 100 vacant positions. As a new program with several new hires, the Machine-Readable Archives Division fell to a staff of 12 during the RIF of February 1982. The following month
testimony before Congress included this lament, “The National Archives machine-readable staff is
decimated; how will data be preserved for historical purposes?”
In April 1982, NARS reorganized because of the loss of personnel agency-wide, and the Machine-Readable Archives Division was reduced in stature to the Machine-Readable Branch, with Trudy
Peterson as the branch chief and as part of the division under Charles Dollar. Part of the reorganization
was to centralize all data processing with an administrative unit. This transferred all staff members in any
computer-related job series from the branch and denied the remaining staff access to computer facilities.
This prompted an incredulous comment from a Canadian during a public meeting at NARS, “How can one
deal with the records of modern technology if one doesn’t have access to the technology?”
Under Peterson’s tenure through August 1983, normal attrition continued to erode the staff until it
fell to seven employees. The number of files accessioned collapsed to 25 for fiscal year 1984. With no
computer support, its automated location register was reduced to an annotated printout from 1982.
Appraisal reports became single sheet forms. In sum, despite Peterson's persistence, forces outside of
the unit had turned the program into a shambles when Dick Myers arrived in January 1984 to become
branch chief.
As the program was reaching its nadir, it came under attack from a group whose intent was to
help the Archives - the NARS Committee on Preservation. In 1983, its Subcommittee C on long-range
planning concluded that the long-term solution for electronic records was to convert them to computer
output microfilm or COM. To answer a series of questions posed by the committee, the branch
conducted in August 1983 "The 1983 Survey of Machine-Readable Records," with its report finalized
in April 1984. The report was essentially a screed against the COM proposal because of the need to
preserve the retrievability and manipulability of the data. On February 9 and 10, 1984, the committee members met
to discuss the COM proposal and the initial results of the survey. One participant reported the meeting
had “an adversarial atmosphere between the committee and NARS staff” and concluded, “NARS’
relationship with Subcommittee C has suffered.” The NARS spokesman during the meeting commented
that the chairman “is openly critical of what he sees as the prevalent passivity of archivists” who would
not embrace COM as the technological solution to electronic records. On July 13, 1984, the Advisory
Committee on Preservation formally transmitted to NARS its recommendation that, “most future
accessions [of electronic records] be in a human comprehensible form on a certifiable archival medium,
thus ensuring that the information remains permanently useable without regard to changing memory
technologies.” On October 22, the Archivist Robert Warner responded politely and deferentially and
called the proposal “interesting and challenging” and “commended [it] for logic and good sense.” While
diplomatically leaving open the door as a possible last resort at some time in the future, NARS's position
was emphatic, “[W]e will continue our current policies and practices of preserving machine-readable
information.” The Committee chairman only sighed, “NARS hasn’t grasped the point.”
While Subcommittee C’s proposed changes were coming to naught, other changes were in store
for NARS. In November 1984, legislation established the National Archives and Records
Administration as an independent agency and triggered a succession of personnel changes. Warner and
his deputy resigned and thus paved the way for Frank Burke to be named Acting Archivist from April 1985
to December 1987. With the loss of most records management functions to GSA, the new NARA wanted
to raise the profile of its relations with agencies by establishing a new office devoted solely to records
issues. This new office got appraisal and other records management responsibilities for electronic
records from the custodial program. Any vestige of an effective records management program in the branch
had ended with the loss of staff beginning in 1981. Thus this transfer of records management was a de
jure acknowledgement of de facto reality. Trudy Peterson, who previously was the head of the Machine-Readable Branch, became responsible for all custodial programs in NARA nationwide.
To re-establish a viable electronic records custodial unit, Myers still had to confront President
Reagan's desire to limit the Federal Government and so could not recruit new staff. Exploiting his long
tenure at the National Archives, he persuaded staff in other units to request reassignment to the Machine-Readable Branch. In this way, he doubled the size of the staff from seven when he took over in January
1984 to fourteen a year and a half later in June 1985. Yet his success was short-lived as the staff, for a
variety of reasons, withered to nine a year later. Early in his tenure, Myers realized that the branch
needed direct access to technology. By the end of 1984, he had secured permission to purchase time on
the IBM computer at the National Institutes of Health and the branch assumed responsibility for
preservation and accessioning processing on January 15, 1985. The branch under Myers still had to rely
on archival staff, rather than computer staff, to apply the homegrown technical expertise they had
acquired on the job or through formal training courses to handle the Branch’s data processing chores.
In January 1986, Myers became head of the Still Pictures Branch. After six months without an on-site branch chief, Edie Hedlin assumed the position in July with a staff of nine. She continued the Myers
efforts to persuade individual NARA staff members to transfer into the Machine-Readable Branch. In
addition, she changed the staffing pattern. Before her arrival, individual staff members were assigned to
a function, such as accessioning, preservation or reference. Hedlin, however, saw that professional staff
received training in all functional areas and then got responsibility for a set of record groups. Hedlin also
weighed in on the decade-old issue about records with high research demand but dubious archival value
and secured approval to disband the Center for Machine-Readable Records. But it would take five years
of negotiations to dispose of the records previously placed in the program.
To establish the infrastructure for a successful electronic records program, Hedlin prepared an
option paper for consideration by top NARA management. After top management approval, Hedlin
generated the reams of administrative paperwork needed for a reorganization: a Center for Electronic
Records with two branches, an Archival Services Branch and a Technical Services Branch. On October
1, 1988, based on Hedlin’s outputs, the Center for Electronic Records was created. When Hedlin opted
for Congressional relations, Peterson hired Ken Thibodeau who became the Center’s Director in
December.
Thibodeau had the support of the new Archivist, Don Wilson, whose “priorities since becoming
Archivist has been to develop for the profession and for Federal agencies, a model Center for Electronic
Records.” To achieve this goal, in July 1989, he directed the appraisal function of electronic records
returned to the custodial program while scheduling computer materials remained part of NARA’s records
management program. Returning to precedents of the Machine-Readable Archives Division, Wilson
stated, “To continue the development of the Center for Electronic Records as a model, I also intend to
add computer experts to [the Center’s] . . . archival staff in order to achieve the proper balance between
technical and archival knowledge and practice.” As part of his Center-building efforts, Wilson directed the
Center to add five additional staff members each year until the year 2000 for an estimated total of 75
staff. In 1992, the Center’s management developed an organizational growth plan involving three
branches to take full advantage of the projected growth in personnel. With this support, the staff grew
from 17 when Thibodeau arrived in December 1988 to 48 in January 1993.
As important as the increase in raw numbers were the diverse backgrounds of the new staff.
Systems analysts and other computer professionals became employees of the Center. Rather than
recruiting archivists with backgrounds in history, the Center began hiring professionals with expertise in
other disciplines, such as geography, sociology, economics and library science. The Center also hired
student intermittents for routine tasks in accessioning, reference and preservation. With the addition of
data processing professionals, the Center began in 1990 to move away from relying solely on outside
service bureaus, such as the NIH computer center, and started to develop in-house systems for both
accessioning and preservation. These in-house systems allowed the Center to increase the number of
accessions processed and preserved. Through a contract with the National Academy of Public
Administration, panels of subject matter experts identified 430 statistical data bases that likely had
records with archival value. The Center staff then undertook a coordinated program to schedule and
arrange for the ultimate transfer of those records. Thus it seemed, at the dawn of 1993, that the electronic
records program was developing into an organization worthy of the National Archives of the United
States.
The bright hope quickly dimmed. In January 1993, President Clinton assumed office and ordered
a 4 per cent reduction in the Federal workforce within three years. The hope of adding five additional staff
for the Center each year until 2000 fell by the wayside. Indeed, to date, the total staff has never equaled
its high mark of 48 in January 1993. As staffing began to decline, the court litigation surrounding
Armstrong et al. v. Executive Office of the President et al., more colloquially known as the PROFS case,
drained the ebbing resources. In January 1989, Scott Armstrong had filed a Freedom of Information Act
(FOIA) request seeking all the information on the Executive Office of the President’s office automation
system known as PROFS. On January 3, 1993, U.S. District Judge Charles R. Richey rejected
the Government’s argument that e-mail messages were not records and ordered “the Archivist to take . . . all
necessary steps to preserve the electronic records . . ..” In the final hours of the Bush Administration,
nearly 6,000 backup tapes and hard drives from White House staff computers were transferred to NARA.
Controversy surrounding the agreement covering the materials led to the resignation of Wilson and the appointment
of Trudy Peterson as Acting Archivist of the United States. In that capacity, on March 25, 1993, Peterson
made the Center for Electronic Records responsible for the preservation of the White House materials.
When the plaintiffs claimed in April 1993 that the Government had not complied with the court’s order,
Judge Richey disregarded the fact that Acting Archivist Peterson had moved immediately on assuming her
position to ensure preservation of the materials and found the government in civil contempt. He ordered
the government to take “all necessary steps to preserve the tapes transferred to the Archivist” within
thirty days or pay a fine of $50,000 a day, doubling every week thereafter. Although the contempt order
was reversed, the Center nonetheless pulled out all the stops to ensure the physical preservation of the files
that were thought to be most at risk. The staff worked seven days a week, sixteen hours a day, for three
weeks to accomplish this. Interestingly, when the Center embarked on this labor, it did not have access to
any technology suitable for the required work. There was no in-house processing capability at all and the
NIH computer center could not be used to copy backup tapes from other computer centers. Fortunately,
the Center had contracted for the development of an in-house system for routine preservation copying
and had received a rudimentary version of the Archival Preservation System (APS). By June 18, 1993, three days
prior to the judge’s deadline under the contempt order, the Center had successfully copied all 609 ‘at-risk’
computer tapes. Over the next two years, the Center continued to preserve the White House materials,
copying more than 99.98% of the media transferred. Besides the obvious impact of an increased
workload from the PROFS case, court-imposed inspections became a continuing, substantial, and
uncontrollable drain on the Center’s resources. By 1998, the Center for Electronic Records had spent
approximately $2.5 million on PROFS-related expenses; 90% of this came from the Center’s operational
budget. Indeed, the Center has spent more on the PROFS materials than on the preservation of all the
permanently valuable electronic records accessioned into the National Archives since the first electronic
records were accessioned in 1970.
The PROFS records consumed most of the Center’s staff and financial resources from 1993 to
1997. In February 1998, NARA once again reorganized its operations in the Washington DC area and
created the Electronic and Special Media Records Services Division from the erstwhile Center for
Electronic Records. Michael Carlson became director of the new division with 44 employees. In line with
the Clinton Administration’s aim to flatten the levels of Government, the branch structure was eliminated,
although the same supervisory structure was retained. The only significant functional change was to transfer the
appraisal of electronic records to the unit responsible for appraising records in other media. During
the past 8 years, the Center and its successor division have continued to develop their in-house systems for
accessioning and preservation and have increased the annual capacity to copy and preserve from about 1,000
files per year to over 70,000 files today. In 1998, the automated accessioning system successfully
verified the intra-office e-mail messages from President Bush’s Office of the United States Trade
Representative and opened the door to accessioning office automation materials. The growth of
processing also included an increase in the media options for accessioning and reference purposes,
beginning with CD-ROMs in the mid-1990s and now including File Transfer Protocol
(FTP). Building on an initial business process improvement (BPI), the electronic records unit launched a
pilot project to merge archival and technical functions into a team responsible for all archival activities
associated with records from the Bureau of the Census. These advances came, as indicated above, in
spite of stagnant staffing levels and the continuing drain on resources from preserving the PROFS materials.
In retrospect, the custodial program for electronic records has had its yin and yang: an
impressive start in the 1970s, a near collapse in the 1980s, a rebound until 1993, followed by
inching forward into the new millennium with a languishing staff. Interestingly, Presidential policies
reverberated through this small unit within a small agency. Reagan’s policies ultimately shrank the staff to
seven; Clinton’s policies cut short the staff increases needed to build a model program for the profession.
This history reveals the necessity of having staff from a variety of backgrounds so as to create the
needed synergy between archival and computer professionals. Since 1968, the range of records
management activities of the custodial program has ebbed and flowed. Its full-scale responsibilities in
the 1970s withered to nothing with the loss of resources in the 1980s and ended formally with the loss of
appraisal in 1985. The Center for Electronic Records reacquired a small piece of the records
management pie, appraisal, in 1989 and held it until the Center’s dissolution in 1998. Despite these impediments, the past
three decades witnessed great progress. From that first accession thirty years ago, the custodial unit
now has custody of over 200 million records. In acquiring this vast collection of historical materials, the
program pioneered records management techniques and practices for electronic records. These 200
million records were accessioned, described, preserved and made available through a range of archival
procedures that the custodial program had to create nearly from scratch. The staff has now taken these
techniques developed for data files and data bases and has applied them to the archival administration of
records from office automation systems. Thus the program now has ways to accession, verify, preserve
and provide reference for e-mail transmissions with attachments, desk-top publishing applications, and
geographical information systems.
When NARS established the Data Archives Staff, the Comptroller of the National Security Agency
wrote, “It is always reassuring to know that NARS program objectives keep pace with the rapid changes
in information handling technology. Such measures . . . do much to sustain continued user confidence in
NARS service.” While NARS and NARA have not always lived up to that optimistic assessment, we can
always hope to do so in the future. And there may be a firm foundation for hope. The current Archivist,
John Carlin, has secured significant increases in funding and has ensured that a large portion contributes
to administering electronic records.
The priority is seemingly the successful development and
implementation of the Electronic Records Archives (ERA). The ERA effort should provide NARA’s
custodial program for electronic records with the tools it needs to expand the successes of its past to
manage the records of the future.
An Historical Perspective on Appraisal of Electronic Records,
1968-1998
Linda Henry
[This paper does not necessarily represent the views of the National Archives and Records
Administration. I’m going to use the term NARA throughout and Electronic Records Program for the
various names of the unit, such as branch, division, etc.]
The appraisal function for electronic records at NARA has been in various units over the years. I will
concentrate on the period 1968-1998, when the function was most often within the electronic records
program. I will explore these appraisal themes, each chronologically:
1. Determining the record character of computerized records, whether they are records or non-records,
and, if they are records, whether they are valuable enough to be accessioned.
2. Applying traditional archival appraisal principles, such as evidential and informational values, and
incorporating other considerations unique to computerized records.
3. Applying records management techniques.
4. Trying new approaches.
Record/ Non-Record, Not Valuable/Valuable
The issue of “recordness” arose almost from the beginning of the National Archives 65 years ago
with punch-card records, which Margaret Adams’ excellent article on the subject called the precursors of
ER. Interestingly, the early NA pioneers seemed more in agreement that punch cards were records than
did their successors with later computerized records. Most of the pioneers argued, however, that punch
cards were “records” but not “archives,” that is, not permanently valuable. The Records Disposition Act of
1939 explicitly mentioned punch cards as records. The Records Disposition Act of 1943 substituted the
phrase “other documentary materials, regardless of physical form or characteristics.” This phrase still
pertains to federal records 57 years later. The Federal Records Act of 1950 as amended in 1976 added
the term, “machine readable material.” (44 U.S.C. 3301), which also still pertains.
NARA established a program for ER in 1968. Despite the legal definition of record, the
arguments continued about whether computerized records were records or non-records and particularly
whether they were permanently valuable records. For example, in one 1976 appraisal, other archivists in
NARA argued that “these tapes are similar to reference and study materials which were disposed of as
non-record material,” and also that the records did not have evidential or informational value, i.e., they
were not valuable. Such objections did not always arise, and other appraisals encountered no opposition.
Federal agencies presented another obstacle. The first Data Archives Staff found that “virtually
all agencies in the Federal Government considered the information on magnetic tapes as ‘non-record.’ ”
In the next decade, a 1975 survey found that 60 percent of federal agency records officers thought that
computerized records were record material. (Dollar, Ann Arbor, p. 80.) This improvement may be
attributed to several years of NARA efforts to educate federal agencies.
Sometimes the doubts of both NARA archivists and federal agencies about the record nature
of ER converged in an appraisal dossier. For example, a 1977 dossier had a records schedule with
dispositions for computer printouts or “computer runs,” but no item or disposition for the computerized
records from which the printouts came. (NC1-151-77-001)
3 An earlier version of this paper was presented at the Society of American Archivists Annual
Meeting, Session #47, Sept. 2, 2000
At the end of the 1980s, the records test arose again. In 1989 Armstrong v Executive Office of
the President, known as the PROFS e-mail case, raised almost every question about the record nature of
e-mail. Were e-mail messages records or non-records? Were e-mail messages in electronic and paper
form both records? Were both forms valuable? How much metadata should be captured in an electronic
message that was printed? The PROFS case settled the issue that transmission and receipt information
was part of the record. Issues about destruction of e-mail were ultimately resolved in the GRS 20 lawsuit,
which I’ll discuss later.
Also in the early 1990s, after 20 years of trying to educate others that computerized records were
indeed records, NARA’s ER program faced still another assault about the record character of
computerized records—from some members of the archival profession. A group of archivists proclaimed
that the very records NARA had been appraising and accessioning for 20 years were not records, but
“merely” data. I have responded to those supporters of a “new paradigm” and their narrow definition of a
record. I mention it here because it seems that disagreements about the record nature of computerized
records will never go away.
Applying Traditional Archival Theory of Appraisal
When NARA began appraising electronic records in 1969, appraisal of permanent records
usually occurred when agencies offered records to NARA. This meant that ER appraisers usually had
custody of the records, and archivists verified and tested the readability of the tapes prior to appraisal. By
the mid-1980s, NARA no longer accepted direct offers of federal records. For the last 15 years, then,
appraisal has occurred when the ER are still in the agency. By federal regulation, agencies must
schedule ER systems within one year of a system’s creation, although this doesn’t always happen. (36
CFR 1228.26) In addition, today transfer, verification and copying of records most often take place after
appraisal.
ER appraisers considered the traditional principles of evidential, informational, administrative and
legal values. Charles Dollar’s writings from the 1970s, however, gave the impression that ER archivists
analyzed only informational value. While that value characterized most of the appraisals then and now, at
least some ER appraisals from the 1970s clearly identified evidential value. For example, the records of the Presidential
Clemency Board include the Consistency Audit Data file, appraised for its evidential value in 1976.
Similarly, records appraisals in the 1970s for Department of Defense records about the conduct of the
Vietnam War and from regulatory agencies, such as the Securities and Exchange Commission, show that
records were being appraised for documenting core mission programs, not just for, or in addition to,
informational value. Thomas Brown has published examples of such records and concluded that 59
percent of NARA ER accessions before 1980 were “programmatic records or records derived from
program operations.”
In the 1980s, ER appraisers gained more experience with appraisals for evidential and
informational value, and for legal value as well. For example, INS records were appraised in part for their
value to immigrants who could use them for legal purposes. In addition, ER appraisers gained some
experience with appraisal of text systems. One early example is the Watergate Special Prosecution
Force records, which included a text system appraised for evidential value. In the 1990s appraisers
gained more experience with electronic text files, such as those from the Executive Office of the
President, and some geographic information systems.
While ER appraisers considered the same values as those applied to paper records, they also
had to consider characteristics unique to computerized records, such as manipulability, volume, linkage,
duplication and evaluating micro-level data. While these attributes do not in themselves justify permanent
appraisal, each can greatly enhance the value of the records. For example, an automated index is greatly
superior to a manual one because of the characteristic of manipulation. Being able to save micro-level
data, as opposed to summary or aggregated data, can be preferable because of reanalysis, which can
serve as a check on the way the agency originally used the data, or accountability—a much touted
concept of the 1990s. Saving computerized micro-level data also solves the volume problem of paper
records.
Linkage permits comparison of data with common attributes such as geographic location,
occupation or age. Other NARA archivists sometimes misunderstood these characteristics. For
example, one complained that an ER appraiser was equating “permanent” with “manipulative.” Another
NARA archivist reluctantly agreed to accessioning “a mere cubic foot in volume,” seemingly
misunderstanding the volume issue, since that cubic foot consisted of approximately 11 reels of magnetic
tape containing thousands of records.
ER appraisers also had to learn to evaluate technical issues, also uncharted territory for
appraisal. The first two technical issues concern readability—whether the records can be used—and
documentation, whether there is sufficient information to use or appraise the records, to process them
and, most importantly, enable researchers to use them. If the records are not readable and the
documentation is insufficient, the records, however potentially valuable, cannot be appraised for
permanent retention. Other technical considerations concern the hardware and software environment. In
the early years NARA sometimes reformatted information into an independent format, but by 1976 NARA
required that agencies transfer permanent records in a hardware and software independent format.
Records appraised after this regulation could sometimes be appraised as temporary because of their
dependent format, even if the records were otherwise valuable. NARA’s current Electronic Records
Archives (ERA) initiative, headed by Ken Thibodeau, offers hope for some solutions to the software
dependency problem and other issues.
In general, however, appraisers first applied the traditional tests of evidential and informational
value before turning to technical considerations, except readability and documentation. After all,
manipulating junk, cutting down on the volume of junk or having junk in a software independent format
really doesn’t matter if the records are junk.
Applying Traditional Records Management Techniques
Almost from the beginning of the ER program in 1968, ER archivists faced the same problems as
other archivists with the proliferation of temporary records. NARA began issuing general records
schedules (GRS) covering disposable paper records in 1945. One of the first actions of the newly created
Data Archives Staff was issuing a general records schedule for machine readable records in 1972, GRS
20. NARA revised GRS 20 in 1977, 1982 and 1988, all covering data processing operations. In 1988 a
separate GRS 23 covered word-processing files, administrative databases, and electronic spreadsheets.
NARA revised GRS 20 again in 1995, combining it with GRS 23, and explicitly included e-mail. In December
1996, the lawsuit Public Citizen v. Carlin challenged the legality of GRS 20 and, among other issues,
continued the arguments begun in the PROFS case about the record nature of e-mail. The lawsuit also
added the issue of what constituted proper recordkeeping. The controversy often seemed to overlook
the requirement that e-mail could not be destroyed unless it had been placed in a recordkeeping system
with records management functionality and had transmission and receipt information. While the
recordkeeping system could be a paper system, agencies could not merely “print out e-mail” as some
opponents misleadingly argued and some professional organizations, such as SAA, misunderstood
(Archival Outlook, July/August 1997, pp. 4-5). After numerous actions, the GRS 20 lawsuit ended on
March 6, 2000, when the Supreme Court refused to reverse a US Court of Appeals Decision of August
1999 that upheld GRS 20.
Innovation, trying new approaches
Almost from the beginning, the ER program tried innovative archival and records management
approaches. One early example in the 1970s was the “targeted agency” program whereby the ER
program worked with selected federal agencies known to create valuable records but which needed
NARA assistance in inventorying, scheduling and transferring records. Two such targeted agencies were
the Bureau of the Census and the Public Health Service. One result was to save the 1960 Census
records. In 1999 and 2000, NARA is once again seeing the value of this approach and is funding
“targeted assistance” positions throughout its nationwide system.
Another innovation, in 1978, was creating a partial records center function for ER which had a
high current research use but an indeterminate permanent value. NARA management initially approved a
procedure to accession such records and reappraise them after 10 years, but other units in the Archives
objected. The compromise was creating a partial records center function within the MRR Division.
NARA assumed physical custody, but the agencies retained legal custody. At the end of a time period,
usually 5 years, NARA and the federal agency would review the agreement and extend it, or schedule the
records for destruction or accessioning. For those records that were ultimately deemed permanently
valuable, having them in the MRR Division gave NARA a “bird in hand” advantage, since it was much
easier to then accession them rather than trying to get agencies to transfer them. One example of
valuable accessioned records emerging from this experiment is the Civil Aeronautics Board’s Origin and
Destination data. Placed in the records center for ER in 1978, the data became a valuable source in
documenting airline deregulation a decade later.
Still another example of archival innovation is the ER program’s use of a study by the National
Academy of Public Administration (NAPA) in 1990 and 1991 of major federal data bases. While NARA’s
ER program knew about numerous federal data bases and had scheduled and accessioned a great
number of them, the problem was identifying the universe. NARA asked NAPA to prepare an inventory of
major data bases, and, to focus on NARA’s interest, to identify those that had potentially permanent
value. The preliminary inventory included approximately 9,000 data bases. NAPA panels culled that
number to 1,789 and recommended that NARA should accession 448 data bases, almost a “Fortune
500.” ER appraisers made some adjustments in the NAPA recommendations, “demoting” some of the
“should transfer” data bases to temporary retention and elevating some of the “un-rated” data bases to
permanent retention. ER appraisers were then able to contact agencies and use the clout of NAPA to get
agencies to schedule and transfer records. The result was schedules for 295 databases, mostly
permanent but some temporary, obtained without having to rely on federal agencies to initiate the
scheduling. A great number of these have been transferred and accessioned to date.
In this presentation I have largely confined my remarks to the period from 1968 through 1998,
when the appraisal function was again transferred from the ER program to the appraisal unit in NARA.
However, I would like to note that in 1999 Archivist John Carlin began “a major project to review and, if
necessary, reinvent the policies and processes for the scheduling and appraisal of federal records in all
media.” The expected conclusion of the project is Sept. 30, 2001.
This quick run through 30 years of appraisal of ER at NARA, and my own 7-year experience in
appraisal of ER, leads me to some personal observations. The first is the consistency over the years in
the importance of applying traditional archival theory and practice to records in a new medium. The ER
program did develop new considerations in appraisal, such as manipulability, volume, linkage, duplication
and evaluating micro-level data. And the staff did have to learn to apply a technical analysis. But the
bottom line remained focused on the content of records and applying the traditional archival appraisal
principles of evidential, informational, legal and administrative value.
My other personal observation has to do with the importance of databases and my reservations
about e-mail. In recent years, records of office automation, such as word processing and particularly e-mail, have concerned archivists more than other types of ER. Certainly office automation records
account for the vast bulk of records that federal agencies are producing. Does our hand-wringing about
e-mail, however, sometimes divert us from the important databases that agencies are creating?
Statistical databases, for example, will always be important government records. Governments at all
levels count things. Almost everything. Federal agencies count the population, crop production, wage
earnings, accidents, and college and university enrollments, among a very long list of things. Federal,
state, and local governments are going to go on counting and creating valuable and permanent records
that reflect agency mission and, usually, impact. At the federal level, such records also give us the only
national information we have on numerous subjects, such as disease and educational achievement.
All those important databases need to be scheduled, transferred and accessioned, so they can be used
by researchers.
The volume of e-mail is probably too enormous to be counted. One 1999 estimate is 36.5 billion
messages per year in the federal government. Before I had a computer for my work, I created few
records. Today, federal workers at some 2 million work stations are creating far more records than they
created before they had PCs. The pie of records is thus vastly larger. How much of the record material
needs to be retained for more than a brief period? For years, NARA has estimated that less than 2% of
federal records are permanently valuable. This estimate seems much too high for e-mail.
More distressing, however, is the lack of organization of the e-mail. We can’t appraise all that
unorganized material on 2 million PCs. There’s too much, most of it is transitory, and it requires
implementation of rigorous records management, still largely non-existent. For several years NARA has
been emphasizing the importance of establishing record-keeping systems, preferably electronic. NARA’s
current web page for federal agencies still stresses this (nara.gov/records/fasttrak/ftprod.html). NARA has
also endorsed the Department of Defense’s Standard 5015.2 for records management software
applications for ER, which offers hope for agencies organizing their e-mail.
Thinking about e-mail reminds me of NARA’s earliest history. The pioneers at NARA in the
1930s and 1940s faced mountains of government records accumulating at approximately a million cu. ft.
annually. As one report noted then, “Caring for these records has been likened to keeping an elephant
for a pet: ‘its bulk cannot be ignored, its upkeep is terrific, and, although it can be utilized, uncontrolled it
is potentially a menace.’” Doesn’t this describe the e-mail mess?
The ER problems NARA faced 30 years ago were daunting. They still are. But today we have an
advantage. We have 30 years of experience in confronting problems with ER. We can apply that
experience to the challenges we now face.
Knowledge-based Persistent Archives
Reagan W. Moore, San Diego Supercomputer Center, moore@sdsc.edu
Abstract
The preservation of digital information for long periods of time is becoming feasible through the
integration of archival storage technology from supercomputer centers, information models from the
digital library community, and preservation models from the archivist’s community. The supercomputer
centers provide the technology needed to store the immense amounts of digital data that are being
created, while the digital library community provides the mechanisms to define the context needed to
interpret the data. The coordination of these technologies with preservation and management policies
defines the infrastructure for a collection based persistent archive [1]. This report discusses the use of
knowledge representations to augment collection-based persistent archives.
1. Introduction
Supercomputer centers, digital libraries, and archival storage communities have common persistent
archival storage requirements. Each of these communities is building software infrastructure to organize
and store large collections of data. An emerging common requirement is the ability to maintain data
collections for long periods of time. The challenge is to maintain the ability to discover, access, and
display digital objects that are stored within the archive, while the technology used to manage the archive
evolves.
We originally implemented a collection-based persistent archive [1] in which a description of the collection
is stored along with the data. The approach focused on the development of infrastructure independent
representations for the information content of the collection, interoperability mechanisms to support
migration of the collection onto new software and hardware systems, and use of a standard tagging
language to annotate the information content. The process used to ingest a collection, transform it into
an infrastructure independent form, and recreate the collection on new technology is shown schematically
in Figure 1.
Figure 1. Persistent Collection Process
Two phases are emphasized: the archiving of the collection, and the retrieval or instantiation of the
collection onto new technology. The diagram shows the multiple steps that are necessary to preserve
digital objects through time. The steps form a cycle that can be used for migrating data collections onto
new infrastructure as technology evolves. The technology changes can occur at the system-level where
archive, file, compute and database software evolves, or at the information model level where formats,
programming languages and practices change. The ultimate goal is to maintain not only the bits
associated with the original data, but also the context that permits the data to be interpreted.
We rely on the use of collections to define the context to associate with digital data. Each digital object is
maintained as a tagged structure that includes the original bytes of data, as well as attributes that have
been defined as relevant for the data collection. A collection-based persistent archive is therefore one in
which the organization of the collection is archived simultaneously with the digital objects that comprise
the collection.
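To make the notion of such a tagged, infrastructure-independent structure concrete, the following sketch (in Python, purely illustrative; the element names, attribute names and sample values are not drawn from the paper) wraps a digital object's original bytes together with its collection-defined attributes in a simple XML envelope.

    # Illustrative sketch only: wrap a digital object and its collection-defined
    # attributes in a tagged, infrastructure-independent structure.
    import base64
    import xml.etree.ElementTree as ET

    def wrap_digital_object(raw_bytes, attributes):
        """Return an XML element holding the original bytes plus collection attributes."""
        obj = ET.Element("DigitalObject")
        for name, value in attributes.items():
            attr = ET.SubElement(obj, "Attribute", {"name": name})
            attr.text = str(value)
        content = ET.SubElement(obj, "Content", {"encoding": "base64"})
        content.text = base64.b64encode(raw_bytes).decode("ascii")
        return obj

    record = wrap_digital_object(
        b"original record bytes ...",
        {"collection": "Example Collection", "recordDate": "1975-06-30"},
    )
    print(ET.tostring(record, encoding="unicode"))

Because both the attributes and the original bytes travel inside the same tagged structure, the representation does not depend on the particular database or archive software holding it at any given time.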
A persistent collection requires the ability to dynamically recreate the collection on new technology.
Scalable archival storage systems are used to ensure that sufficient resources are available for continual
migration of digital objects to new media. The software systems that interpret the infrastructure
independent representation for the collections are based upon generic digital library systems, and are
migrated explicitly to new platforms. In this system, the original representation of the digital objects and
of the collections does not change. The maintenance of the persistent archive is then achieved through
application of archivist policies that govern the rate of migration of the objects and the collection
instantiation software.
2. Knowledge-based Archives
The preservation of the context to associate with digital objects is the dominant issue for knowledge-based persistent archives. The context is traditionally defined through specification of attributes that are
associated with each digital object. The context is also defined through the implied relationships that exist
between the attributes, and the preferred organization of the attributes in user interfaces for viewing the
data collection.
Management of the collection context is made difficult by the rapid change of technology. Software
systems used to manage collections are changing on a five- to ten-year time scale. Of greater concern is
that the information tagging languages used to annotate digital objects are also changing. The persistent
archiving of a collection must also handle the evolution of the information mark-up language.
We have characterized persistent archives in prior publications [1,2] as collection-based repositories. We
now recognize the need to broaden the archive characterization to knowledge-based repositories. Not
only the information content, but also the processing steps used to accession the collection must be
preserved. Conceptually, one can view the accessioning process as the equivalent of the process needed
to instantiate the collection on new technology. If the accessioning process can be captured in an
infrastructure independent representation, the same process can be used to manage the migration of the
collection to new markup languages, archival data repositories, information repositories, and knowledge
repositories.
The archival description of a collection then must include not only contextual information about the digital
objects, but also knowledge about the relationships used to derive the contextual information.
The architecture that is needed to implement a knowledge-based persistent archive is shown in figure 2.
[Figure 2. Knowledge-based Persistent Archive: a three-by-three grid. Columns (process): Ingest, Manage, Access. Rows (infrastructure): Knowledge (relationships between concepts; knowledge repository for rules; knowledge or topic-based query), Information (attributes and semantics; information repository; attribute-based query), Data (fields, containers, folders; storage with replicas and persistent IDs; feature-based query).]
The three columns represent the technologies needed to manage the ingestion process, manage the
persistent archive, and manage the access environment. The three rows represent the infrastructure
needed to manage knowledge, information and data.
Knowledge is represented as relationships between domain concepts. Information is represented as
attributes about digital objects within the collection. The digital objects are “images” of the reality
described by the domain concepts. Ingestion corresponds to the steps of knowledge mining/tagging,
information mining/tagging, and digital object organization/storage. Persistent archive management
requires infrastructure to store the digital objects (archives), information repositories to hold the metadata
(databases), and knowledge repositories to organize the relationships (logic systems). The access
environment provides mechanisms to query the collection at the data level through feature extraction, at
the information level through database queries, and at the knowledge-level through domain concepts.
Just as the data management infrastructure is intended to provide access without having to know data
object names, the knowledge access infrastructure is intended to provide access without having to know
the explicit metadata attribute names used to organize the collection database.
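As a rough illustration of this distinction, the sketch below (hypothetical concept and attribute names, invented for this example and not drawn from any actual collection schema) rewrites a knowledge-level query expressed in domain concepts into an attribute-level query against the collection database.

    # Illustrative mapping from domain concepts to collection schema attributes.
    concept_to_attribute = {
        "place of birth": "BIRTH_STATE_CODE",
        "occupation": "OCC_CODE",
    }

    def rewrite_query(concept_query):
        """Translate a {concept: value} query into a {schema attribute: value} query."""
        return {concept_to_attribute[concept]: value
                for concept, value in concept_query.items()}

    # A user asks at the knowledge level, without knowing the attribute names.
    print(rewrite_query({"place of birth": "CA", "occupation": "archivist"}))
    # {'BIRTH_STATE_CODE': 'CA', 'OCC_CODE': 'archivist'}

The mapping table itself is part of the context that must be preserved; if the schema attribute names change during a migration, only the table needs to be updated for knowledge-level access to keep working.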
The knowledge-based persistent archive requires software infrastructure to support interoperability
between different implementations of ingestion, management, and access infrastructure components.
This is shown in Figure 3. Between “Ingest platforms” and “Management repositories”, standards are
needed to define consistent tagging mechanisms for knowledge (XML Topic Map DTD or XTM DTD), for
information (XML DTD), and for data organization (logical folders and physical containers). Between
“Management repositories” and “Access platforms”, standard query languages are needed for knowledge-based access (Knowledge query language or rule manipulation language), attribute-based access
(EMCAT SGL generator or MIX mediator), and feature-based access (application of procedures within a
computational grid).
Between the “knowledge” and “information” environments, a standard
representation is needed to map from concepts to attributes, such as topic maps or model-based access
systems. Between “information” and “data storage” environments, a data handling system is needed to
map from attributes to storage locations, such as the SDSC Storage Resource Broker.
[Figure 3. Persistent Archive Interfaces: the Figure 2 grid annotated with the interfaces between components, including the XTM DTD and a knowledge query language (KQL) at the knowledge level, the XML DTD and the EMCAT/MIX mediators at the information level, topic maps or model-based access between the knowledge and information levels, and the data handling system (Storage Resource Broker) between the information and data storage levels.]
Persistence is achieved through the infrastructure middleware (shown in Figure 3 as the blue grid) that
links accession platforms, management repositories, and access platforms. The same middleware is
needed to support grid environments (such as computation on distributed data collections) and digital
library environments (such as curricula support in the National Science, Math, Engineering and Technology
Education Digital Library - NSDL). This architecture has been proposed to both the Grid Forum and the
NSDL, and may be the architecture that integrates knowledge management activities from these
communities with the persistent archive community.
2.1 Archive Accessioning Process:
Of interest is the emerging need for knowledge management as well as information management and
data management when ingesting collections. When we look at collections, we see multiple interfaces
where knowledge is required to be able to adequately describe relationships inherent within the
collection. We have been looking at the preservation of relationships that are needed to describe:
- implied knowledge (interpretation of fields)
- structural knowledge (topology associated with digital line graphs)
- domain knowledge (relationships between domain concepts)
- procedural knowledge (workflow creation steps for digital objects)
- presentation knowledge (support for knowledge-based queries).
One way to accomplish the goal of knowledge-based access is to use the ISO 13250 Topic Maps
standard to maintain mappings between domain concepts and the attribute names used in the collection
schema. It is very interesting to note that relationships are implicit between each of the nine infrastructure
components defined in Figure 2. The relationships either define rules that can be applied to the
collection, or quantify associations that can be made between collection elements. Examples are:
• Relationships that quantify rules:
  − Rules for defining collection attributes
  − Rules for organizing attributes into a schema
  − Rules for feature extraction
  − Rules governing data set creation
• Relationships that quantify associations:
  − Organization of concepts into topic maps
  − Ontology mapping between concept maps
  − Mapping of concepts to collection attributes
  − Mapping of concepts to feature extraction rules
  − Mapping between attributes and data fields (semantics)
  − Semantic mapping between collections
  − Mapping between attributes and storage
  − Mapping between attributes and features
  − Clustering of data into containers
The relationships can be separated into four broad classes:
− Semantic/logical relationships. Relationships can be defined to map from the concepts used to
describe the collection to the attribute tags used to annotate the collection. Semantic relationships
can also be defined between the domain specific concepts as knowledge bases or semantic maps.
− Procedural/temporal relationships. The transformations that are applied to the collection to create the
archival form constitute a workflow that represents the ingestion process. The temporal order and
explicit transformations can be represented as a set of states through which the collection is
processed.
− Structural/spatial relationships. The internal organization of digital objects within the collection can be
represented as a structural ordering of the tagged elements. The representation of the structure can
be expressed using the same types of characterization as needed for spatially tagged data.
− Functional relationships. For scientific applications, analysis algorithms are needed to identify
features that might be associated with a digital object. The expression of the relationship between
the named feature and its presence within a digital object will require the ability to archive
mathematical expressions.
In the ingestion process, a major challenge has been the need to be able to differentiate between artifacts
and implied knowledge. Essentially, the steps of refining the description of a collection by including more
attributes must be integrated with the identification of anomalies. To make progress, we apply the
concepts of occurrence tagging and closure to the archived collections. Occurrence tagging is the explicit
annotation of the location of each tagged attribute along with the associated value. This provides a
representation that captures all of the information content, without imposing constraints on permissible
attribute values. Closure is the analysis of the occurrences to identify both completeness and
consistency. Completeness is evaluated by verifying that all attributes are populated, and that the
information content is fully annotated. Consistency checks that all attribute values fall within defined
ranges. Consistency can be checked by construction of inverse indexes that point to all occurrences of
each attribute value.
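The sketch below illustrates occurrence tagging, inverse indexing and a simple consistency check in Python; the sample records and the permitted value ranges are invented for illustration and are not taken from the collections discussed in the paper.

    from collections import defaultdict

    # Sample records; the second contains an anomalous STATE value.
    records = [
        {"STATE": "CA", "YEAR": "1970"},
        {"STATE": "XX", "YEAR": "1970"},
    ]

    # Occurrence tagging: record the location of every attribute/value pair,
    # without constraining the permissible values.
    occurrences = [(i, attr, value)
                   for i, rec in enumerate(records)
                   for attr, value in rec.items()]

    # Inverse index: each attribute value points back to all of its occurrences.
    inverse_index = defaultdict(list)
    for i, attr, value in occurrences:
        inverse_index[(attr, value)].append(i)

    # Consistency check (part of closure): flag values outside the defined ranges.
    valid_ranges = {"STATE": {"CA", "NY"}, "YEAR": {"1970"}}
    for (attr, value), locations in sorted(inverse_index.items()):
        if value not in valid_ranges[attr]:
            print("anomaly:", attr, "=", value, "at records", locations)

Because the inverse index groups every occurrence of a value, an anomalous value found in one record immediately points to all other records sharing it, which is what makes the completeness and consistency analysis tractable.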
It is necessary to iterate between knowledge extraction and attribute mining. We illustrate this through
application of the ingestion process shown in Figure 4.
• Define a representation of the concepts inherent within the collection.
• Build a concept map that identifies all of the possible attributes to associate with each concept
• Tag the collection to identify attributes for each of the possible fields.
• Restructure the concept map to eliminate unused fields, specialize classes, rearrange class
attributes, etc.
• Mine the collection to identify differences between bill versions, identify missing attributes, identify
implicit attributes, and identify invalid data (such as duplicated pages).
[Figure 4. Ingestion Process: the diagram relates the accession template, concept/attribute closure, knowledge generation, information generation, attribute selection, attribute tagging, attribute inverse indexing, occurrence tagging, view management, and data organization to the archived collection.]
At one time, the hope was to be able to ingest a collection in a single pass. Based upon the above steps,
at least three analyses are needed to mine knowledge, mine information, and organize data. Depending upon
the number of iterations used to refine the concept space, additional passes through the data may be
necessary. It is still an area of debate whether it will be possible to differentiate in general between
concept map refinement and error analysis. These steps will have to be done jointly for most collections.
Note that once the data has been wrapped into XML, all integrity checking, knowledge mining, derivation
of a "consolidated version", etc., can be seen as (albeit very elaborate) queries against an XML collection.
The interesting research issue is to find out how well XML query languages (including the UCSD/SDSC
XMAS system) are able to express the analysis queries. Especially for integrity checking, logic-based
XML query languages seem to be a good choice for an ingestion environment.
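As a small illustration of treating an integrity check as a query over XML-wrapped data, the sketch below uses the limited XPath support in Python's standard library as a stand-in for a fuller XML query language such as XMAS; the sample document and the "missing attribute" rule are invented for this example.

    import xml.etree.ElementTree as ET

    # A toy XML-wrapped collection; the second object lacks a recordDate attribute.
    collection = ET.fromstring("""
    <Collection>
      <DigitalObject id="1"><Attribute name="recordDate">1975-06-30</Attribute></DigitalObject>
      <DigitalObject id="2"></DigitalObject>
    </Collection>
    """)

    # Integrity query: report objects whose recordDate attribute is not populated.
    for obj in collection.findall("DigitalObject"):
        if obj.find("Attribute[@name='recordDate']") is None:
            print("object", obj.get("id"), "is missing recordDate")

The same query-based framing extends naturally to the other ingestion analyses, since duplicate detection or consolidation can likewise be phrased as (more elaborate) queries over the tagged collection.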
2.2 Archival Representation of Collections:
One of the results of the analysis of the collections provided by NARA was the realization that multiple
views of a collection may need to be archived. Typical views include:
• Original form as submitted
• XML tagged form
• Occurrence representation (occurrence, attribute, value)
• Knowledge-based representation (recreation of the original form from the occurrence representation).
This view can be thought of as the noise-free representation of the original collection based upon the
knowledge and information content that was created during the accessioning process. This view can
be designed to include white space and all anomalies if desired.
• Consolidated representation (elimination of all duplicated information)
By archiving descriptions of the processing steps needed to go between each of these views, one can
guarantee that the same processing steps could be applied in the future to re-instantiate the collection on
new technology, including new information and knowledge representations.
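One way to picture the archiving of the processing steps themselves is as a declarative, replayable pipeline. The sketch below is only a toy illustration under that assumption (the step names and transformations are invented), not the project's actual migration machinery.

    # Each archived view is derived from the previous one by a named, recorded step.
    pipeline = [
        ("xml_tagged", lambda view: view + " -> XML tagged"),
        ("occurrence", lambda view: view + " -> occurrences"),
        ("consolidated", lambda view: view + " -> consolidated"),
    ]

    def replay(original, steps):
        """Re-derive every view from the original by replaying the recorded steps."""
        views = {"original": original}
        current = original
        for name, transform in steps:
            current = transform(current)
            views[name] = current
        return views

    print(replay("collection as submitted", pipeline))

Because the step descriptions are stored alongside the data, re-instantiation on new technology only requires re-implementing the individual transformations, not rediscovering the overall process.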
3. Relationships between NARA and other Agency projects:
There is a strong synergy between the development of persistent archive infrastructure for NARA, digital
library development for NSF, and data grid development for DOE, NASA, and NLM. All of these research
areas require the ability to manage knowledge, information, and data objects. What has become
apparent is that even though the requirements driving the infrastructure development for each agency
are different, a uniform architecture is emerging that meets all agency requirements. The architecture
shown in Figure 3 provides:
• Validation mechanism for the common data management architecture
• Validation mechanism for the differentiation between knowledge, information, and data and the
choice of representation standards
• Integration vehicle for tying together persistent archives with grid environments
• Integration vehicle for tying together grid environments with digital libraries
• Integration vehicle for tying together digital libraries with persistent archives
It is interesting to note the multiple projects that are building upon the architecture that is being developed
in the NARA collaboration:
• NSF – Digital Library Initiative, Phase 2.
• NSF – National SMET Education Digital Library
• NSF – NPACI data grid for neuroscience brain image federation
• NASA – Information Power Grid distributed data processing
• DOE – ASCI Data Visualization Corridor remote data processing
• DOE – Particle Physics Data Grid object replication
• NLM – Digital Embryo Project data grid for image processing and storage
• NARA – Persistent Archive
It is also interesting to note the iterative technology development cycle that links all of the projects. An
original DARPA project developed the data handling capabilities as part of the Distributed Object
Computation Testbed. The NASA IPG integrated the data handling technology with computational grid
technology (common security environments).
The NSF NPACI project integrated information
management with data handling to support digital libraries. The ASCI PPDG then applied the technology
to support replica management across heterogeneous systems. And the NARA project applied the
technology to manage migration of collections across evolving infrastructure technology.
Acknowledgements:
This work was supported by the National Archives and Records Administration and the Defense Advanced
Research Projects Agency/ITO. The research topics have been investigated by the following members of
the Data Intensive Computing Environment Group at the San Diego Supercomputer Center: Richard
Marciano, Bertram Ludaescher, Ilya Zaslavsky, Amarnath Gupta, and Chaitan Baru.
References:
[1] Moore, R., C. Baru, A. Rajasekar, B. Ludascher, R. Marciano, M. Wan, W. Schroeder, and A. Gupta, “Collection-Based Persistent Digital Archives - Part 1”, D-Lib Magazine, March 2000, http://www.dlib.org/
[2] Moore, R., C. Baru, A. Rajasekar, B. Ludascher, R. Marciano, M. Wan, W. Schroeder, and A. Gupta, “Collection-Based Persistent Digital Archives - Part 2”, D-Lib Magazine, April 2000, http://www.dlib.org/
August 2000
NARA and Electronic Records: A Chronology
Organizational Names:
1934-1949: National Archives (NA), independent agency
1949-1985: National Archives & Records Service (NARS), part of General Services Adm.
1985-present: National Archives & Records Administration (NARA), independent agency

1968-1972: Data Archives Staff
1972-1974: Data Archives Branch
1974-1982: Machine-Readable Archives Division
1982-1988: Machine-Readable Branch
1988-1998: Center for Electronic Records
1998-present: Electronic and Special Media Records Services Division
Dates:
1939 The Records Disposition Act includes punch cards in its definition of records
1943 Records Disposal Act defines records as including “other documentary materials,
regardless of physical form or characteristics,” a phrase included in subsequent federal
records acts
1965 NARS assists Bureau of the Budget in producing an inventory of federal punch cards and
computer tapes
1966 Archivist Bahmer establishes Committee on Disposition of Machine-Readable Records
(Dec)
1967 NARS issues federal records management regulations for ADP records (Feb)
1968 Committee on Disposition of Machine-Readable Records finalizes its report (Jan)
Ev Alldredge begins detail to the Office of the Archivist to implement report (Feb)
NARS hosts conference, “The National Archives and Statistical Research” (May)
Report of Joint Committee on the Status of the National Archives (AHA, OAH, SAA)
calls for an “archives of machine-readable records” (Jul)
Archivist Rhoads establishes Data Archives Staff in Office of Records Management
Joseph V. Bradt becomes Director of the Staff
1969 SAA forms Ad-Hoc Committee on Machine Readable Records and Data Archives with
Alldredge as Chair (Oct)
1970 NARS staff accessions first electronic records (Apr)
Gerald Rosenkrantz becomes Director of the Data Archives Staff (Sept)
1971 Archivist Rhoads signs first records schedule with permanent electronic records
1972 NARS issues General Records Schedule 20 for ADP records (Apr)
Data Archives Staff becomes Data Archives Branch
1973 Branch drafts Recommended Environmental Conditions & Handling Procedures for
Magnetic Tape (Jun)
1974 Branch compiles Directory of Computerized Data Files and Related Software for
publication by the National Technical Information Service (Mar)
NARS upgrades Branch to Machine-Readable Archives Division
Charles Dollar becomes Director of the Division (Aug)
Division assumes responsibility for appraisal of electronic records (Aug)
1975 Division appraises and accessions first operational data from Vietnam War
Division issues Catalog of Machine-Readable Records in the National Archives of the
United States, 1975, and 2nd. Ed., 1977
1976 The amended Records Disposal Act of 1950 specifies “machine readable material” as
records
1977 Division launches "targeted agencies" project
1978 Division establishes Center for Machine-Readable Records for records of high current use
but undetermined long term value (May)
Division issues first Accessioning Procedures Handbook
1980 Division provides computer support and staff for FBI appraisal project, AFSC v Webster
1982 NARS reduction-in-force cuts staff (Feb)
NARS downgrades Division to Machine-Readable Branch
Trudy Peterson becomes Chief of the Branch (Apr)
1984 Branch battles NARS Preservation Advisory Committee about computer-output microfilm
as the solution to electronic records
Richard Myers becomes Chief, Machine-Readable Branch (Jan)
1985 NARA transfers all appraisal to new records management program (Dec)
1986 Edie Hedlin becomes Chief, Machine-Readable Branch (Jul)
1987 Branch closes its Center for Machine-Readable Records (for records of high current use)
1988 NARA upgrades Branch to Center for Electronic Records (Oct)
Ken Thibodeau becomes Director of the Center (Dec)
1989 Plaintiffs file Armstrong v Executive Office of the President (Jan)
NARA transfers appraisal of electronic records to the Center (Oct )
1990 Center begins GAPS project for past-due transfers
1991 Center offers reference service via e-mail (Mar)
The National Academy of Public Administration publishes The Archives of the Future:
Archival Strategies for the Treatment of Electronic Databases (Dec)
1993 Acting Archivist Peterson assigns preservation of Armstrong v EOP media to Center
(Apr)
Center acquires and installs the Archival Preservation System (APS) (May)
Archival Electronic Records Inspection and Control (AERIC) system becomes operational
1994 Center moves to Archives II (Jan)
NARA designates CD-ROM as acceptable transfer media (Jul)
1995 NARA revises GRS 20 to include e-mail (Aug)
1996 Plaintiffs file Public Citizen v Carlin (GRS 20 lawsuit) (Dec)
1997 Center staff takes initial steps toward the Electronic Records Archives (ERA) initiative
Center expands reference media to include CD-ROM and diskettes
Center uses AERIC to comply with E-FOIA amendments
1998 Center becomes Electronic & Special Media Records Services Division (Feb)
Michael Carlson becomes Director of the Division (Apr)
NARA transfers appraisal of electronic records to the Life Cycle Management Division
NARA endorses Dept. of Defense 5015.2 standard for records management applications
for electronic records (May)
1999 Archivist Carlin begins appraisal reinvention project (Apr)
U.S. Supreme Court refuses to reverse Appeals Court decision upholding GRS 20 (Aug)
2000 Division's electronic records holdings exceed 200,000,000 records in 167,000 data files
The potential of markup languages to support descriptive
access to electronic records: The EAD standard
Anne J. Gilliland-Swetland
Abstract:
This paper will review the potential of Encoded Archival Description (EAD), recently adopted as an
American descriptive standard, to provide online descriptive access to electronic records. The paper will
begin by reviewing the current state of electronic records description and the complex relationships
between metadata that are part of the record and metadata that are about the record. It will then describe
the status and scope of EAD, how it relates to other descriptive initiatives that are applying markup
languages, and the potential of EAD to serve as a metadata infrastructure for online archival information
systems. The paper will conclude with a discussion of the extent to which EAD can currently
accommodate, or could be extended to accommodate, description and online delivery of electronic
records.
Introduction
There has been a considerable amount of political and professional rhetoric, stemming from
unprecedented developments over the past decade in technologies supporting the World Wide Web,
about developing online access to unpublished information resources—including archival holdings. The
rhetoric has resulted in the establishment of research and development agendas by major government
funding agencies, private foundations, industry, and professional institutions and associations.4 A number
of major initiatives have resulted from the availability of this funding. As they relate to archival concerns,
these initiatives can be grouped into three primary domains of activity:
•
the development of archival standards that support online access to archival descriptions (Encoded
Archival Description being the most prominent recent example);
•
the development of archival information systems such as American Memory at the Library of
Congress and the Online Archive of California (to cite two American examples) that provide not only
online descriptions but also digitized copies of selected archival holdings; and
•
research projects addressing the archival management of records that are “born digital,” that is, of
electronic records (for example, the Recordkeeping Functional Requirements Project at the University
of Pittsburgh and the International Project on Permanent Authentic Records in Electronic Systems
(InterPARES)).5
While there has been considerable dialog and overlap between archivists involved with the first two of
these areas, until recently archivists grappling with the challenges of creating and preserving electronic
records have not been integrally engaged in broader initiatives to standardize and enhance description for
online access, nor to provide online access to electronic records through archival information systems or
digital libraries. The major exception to this has been the Recordkeeping Metadata Schema (RKMS)
4 For example, the National Science Foundation, the National Endowment for the Humanities, and the
National Historical Publications and Records Commission in the United States, and the Fifth Framework
and the Joint Information Systems Committee in Europe; national archives and libraries in many
countries; and descriptive standards groups within professional associations.
5 Gilliland-Swetland, Anne J. and Philip Eppard. “Preserving the Authenticity of Contingent Digital
Objects: The InterPARES Project” D-Lib Magazine 6 no.7 (2000).
Available at:
http://www.dlib.org/dlib/july00/eppard/07eppard.html (16 October, 2000);
InterPARES Website available at http://www.interpares.org (16 October, 2000).
developed in Australia. The focus for RKMS is the record, regardless of its format, and how it can be
reconstructed and retain its meaning across time and user domains. RKMS provides:
•
A standardized set of structured or semi-structured recordkeeping metadata elements
•
A framework for developing recordkeeping metadata sets in different contexts
•
A framework for mapping recordkeeping metadata sets to establish equivalences and
correspondences that can provide the basis for semi-automated translation between metadata sets.6
In North America, electronic records management evolved to some extent as an area apart from the
mainstream of the archival profession. Its immediate concerns have been creating, identifying, and
accessioning electronic records. In the 1970s and 1980s, electronic records, or “machine-readable
records” as they were initially termed, tended to be managed as software-independent datafiles. More
recently, as electronic records have taken on more complex functionality, there has been an increased
awareness of the need to preserve their value as legal and organizational evidence. As a result, archivists
are now engaged with researchers from computer science, digital library development, and preservation
in several projects to identify how to preserve authentic electronic records with their functionality intact.
One of the most prominent of such projects is that of the National Archives and Records Administration
and the San Diego Supercomputer Center to employ XML in the development of persistent archives. This
concern for evidence requires a more detailed understanding of the characteristics of an
authentic record in and over time, as well as close analysis of the intellectual rationales behind archival
description in terms of how it contributes to ensuring and demonstrating the authenticity of preserved
records.
Indeed, there is a growing convergence of different areas within the archival profession, as well as of
other professional and disciplinary domains relating to description. This convergence arises largely out of
the development of new metadata schema and standards and technological capabilities that provide
structures and crosswalks7 for formalizing and bridging diverse data types (such as image or geospatial
data), metadata semantics, and professional practices.8
Archives play a key and often overlooked role in establishing and demonstrating the authenticity of any
record, regardless of its form, through archival description. In contrast to the key purposes of bibliographic
description which are to manage a physical information object as well as to facilitate its intellectual
retrieval and use, archival description must address that object not only as information, but as evidence.
As a result, archival description must not only describe the content of a fonds or record group, it must also
describe the circumstances of its creation, its chain of custody, its relationships to other records
generated by the same activity, and the impact upon the aggregation of records of any processing or
preservation activity in ways that are and remain meaningful to different kinds of users over time.
Archival description, therefore, has three primary roles. Firstly, it serves as a tool that meets the needs of
the archival materials being described by authenticating and documenting them. Secondly, it is a
collections management tool for use by the archivists. Thirdly, it is an information discovery and retrieval
tool for making the evidence and information contained in archival collections available and
comprehensible by archivists and users alike.
6 See McKemmish, Sue, Glenda Acland, Nigel Ward, and Barbara Reed. “Describing Records in Context
in the Continuum: The Australian Recordkeeping Metadata Schema.” Archivaria 48 (1999): 3-42.
7 A crosswalk is a chart or table that represents the mapping of fields or data elements in one metadata
standard to fields or data elements in other standards that have the same function or meaning.
Crosswalks support the ability to search transparently heterogeneous databases as a single database
(semantic interoperability) and to convert data from one metadata standard to another.
8 See Gilliland-Swetland, Anne J. Enduring Paradigm, New Opportunities: The Value of the Archival
Perspective in the Digital Environment (Washington, D.C.: Council on Library and Information Resources,
2000).
Describing Electronic Records
Ironically, in a world of increasing online access to primary information resources, many of which first
require digitization, electronic records are proving to be among the most intransigent in terms of providing
even basic descriptive access. This intransigence reflects inherent technical problems with the diverse
formats in which electronic records are created and may need to be maintained. Equally, it reflects how
the enormous volume of electronic records requiring processing by a comparatively small staff together
with data archiving practices originally adopted from the social sciences data archives community have
led to idiosyncratic summary archival descriptions and an over-dependence upon the metadata generated
by the creator of the records. Description of electronic records often consists of high level summaries of
data, reports on quality and accuracy of data, scanned or PDF versions of codebooks and data
dictionaries, and customized subject indexes and data extracts.
While the current state of description for electronic records is certainly understandable, it is, nevertheless,
deficient in several respects:
•
There has been insufficient analysis of the actual nature of electronic records. In particular,
there needs to be more examination of the relationship between data content and the metadata that
provide and document its context and structure, and of the various ways in which aspects of data and
metadata in complex systems such as databases might come together to form the intellectual
construct that is a record. Often one of the most difficult aspects of working with electronic records is
to be able to identify and then describe, in the absence of a tangible document, the parameters of that
intellectual construct.
•
Metadata generated by records creators has been viewed as a sufficient substitute for archival
description. For example, in 1993, Margaret Hedstrom proposed that management of metadata
provide an alternative strategy to current descriptive practices in order to support the “need to identify,
gain access, understand the meaning, interpret the content, determine authenticity, and manage
electronic records to ensure continuing access.”9 Subsequently, several projects have resulted in
metadata specifications for electronic records, most notably the Pittsburgh Project and related
implementation projects such as the Indiana University Electronic Records Project. With the
exception of the Australian RKMS project, there has been almost no discussion of the value-added
role that archival description should play in terms of ensuring and documenting authenticity, and
making the records meaningful to users across time and domains.10
•
There has been little emphasis on establishing the documentary relationships between electronic
records and paper records created by the same activity. Lack of standardization and the use of
non-archival descriptive practices have made it difficult to integrate descriptions of electronic records with
standardized descriptive metadata created by archivists and other information, industry, and research
communities. For example, in the mid-1980s, when archivists looked to the use of MARC formats, they
turned to the MARC Machine-Readable Data Format (MRDF) rather than the MARC Archives and
Manuscripts Control Format (AMC) that was developed for the collective description of archival and manuscript
9 Hedstrom, Margaret. “Descriptive Practices for Electronic Records: Deciding What is Essential and
Imagining What is Possible,” Archivaria 36 (Autumn 1993): 53.
10 Bearman, D. and Sochats, K. Metadata Requirements For Evidence. 1996. Available:
http://www.lis.pitt.edu/~nhprc/BACartic.html (October 17, 2000); Bantin, Philip C. “Developing a
Strategy for Managing Electronic Records: The Findings of the Indiana University Electronic
Records Project,” American Archivist 61 (1998): 328-64; Bantin, Philip C. “The Indiana
University Electronic Records Project Revisited,” American Archivist 62 (1999): 153-163; and
McKemmish, Sue, Glenda Acland, and Barbara Reed. “Towards a Framework for Standardising
Recordkeeping Metadata: The Australian Recordkeeping Metadata Schema,” Records
Management Journal 9 (1999): 177-202.
materials. In effect, such an approach treated electronic records as a special format with distinct
descriptive needs, rather than as components of wider archival aggregations.
•
Because management of electronic records has generally been viewed by the rest of the archival
profession as an area that requires distinct technical expertise, developments in archival description
such as EAD have progressed without being strongly informed by the descriptive needs of electronic
records.
It is useful at this point to define more closely what is meant by metadata, since the term is understood
differently by different communities. Metadata refers to a range of structured or semi-structured data
about data that are critical to the development of effective, authoritative, interoperable, scaleable, and
preservable information and record-keeping systems. Until the mid-1990s, metadata was a term most
prevalently used by communities involved with the management and interoperability of geospatial data,
and with data management and systems design and maintenance in general. For these communities,
metadata referred to a suite of industry or disciplinary standards as well as additional internal and
external documentation and other data necessary for the identification, representation, inter-operability,
technical management, performance, and use of data contained in an information system. For archivists,
metadata refers to the value-added information, such as EAD, that they create to identify, authenticate,
arrange, describe, preserve and otherwise enhance access to their holdings.
In contemplating the role of metadata in the description of electronic records, several questions come to
mind:
•
Which metadata are part of the record, which are about the record, and which are neither but are
required to preserve or reconstruct the technological context of the record? And of all these types of
metadata, which must be captured as part of archival description?
•
How can the trustworthiness of these metadata be determined in terms of quality and completeness
in and over time?
•
Are there descriptive needs of electronic records that might be different from those of other types of
records? If so, what are they and how should they best be addressed?
•
Can the metadata generated by the creator of the electronic record somehow be automatically
translated or mapped into a standardized description for archival records?
•
Can the structure and documentary contexts of electronic records be automatically analyzed to
generate specific components of a standardized description for electronic records?
•
Which kinds of contextual documentation do electronic records require in order to be understood and
can a metadata infrastructure facilitate links to that documentation online?
•
How can the links between records and metadata retain their referential integrity over time and in the
face of systems obsolescence, data migration, and evolution of metadata schema?
•
What do users need in order to be able to identify relevant electronic records online? What do users
need to be able to use electronic records disseminated online?
Encoded Archival Description
In the face of such questions, therefore, how might Encoded Archival Description and other markup
initiatives enhance current electronic records description? Simply defined, EAD is a Document Type
Definition (DTD) developed using Standard Generalized Markup Language (SGML) that makes it possible
to develop predictably structured archival description that can be disseminated on the World Wide Web.
That description is most commonly an archival finding aid, but the DTD is flexible enough to
accommodate various other types of archival descriptive tools.
However, the power of EAD is that it can be much more than a structure through which to create a digital
representation of a two-dimensional finding aid. The hierarchical nature of EAD, its explicit delineation of
each data element, and its adherence to standardized metadata conventions and protocols provide it with
the potential to function as a multi-dimensional metadata infrastructure that can interface with other
metadata schema, but that can provide maximum flexibility in describing a diversity of record types. With
such an infrastructure, archivists and software developers have the capabilities and incentives to design a
range of archival information systems that fundamentally re-conceptualize how access to archival
holdings is provided. These archival information systems would not only contain the kinds of archival
description found today in finding aids, but also digitized versions of archival materials, full-text of ancillary
materials, extensive linkages to other online archival and bibliographic information systems, and actual
electronic records and the necessary technical documentation to use them.11
In such information systems, however, EAD would not be the only metadata schema invoked, and one of
the powerful aspects of EAD is its ability to interface or interoperate with other metadata schema and
SGML-based implementations. EAD is fully XML-compliant, meaning not only that EAD-encoded
descriptions can be searched and manipulated over the Web as the Web increasingly supports XML, but
also that electronic records technical documentation, such as database models, workflow rules, and
technical drawings, can be integrated with the archival descriptions in ways not previously possible in a more
manual environment. Similarly, EAD can interface with descriptive metadata created in MARC because
of metadata mapping between the two standards. With the recent release of XMLMARC software, this
mapping will only become easier. EAD also shares header data elements with the Text Encoding
Initiative (TEI) DTD. TEI is a DTD that facilitates the development of digital versions of scholarly texts.
Using EAD to Describe Electronic Records
EAD is currently in its first full release (Version 1.0). It is fully expected that the DTD will be dynamic and
will continue to be extended to accommodate new technological capabilities and metadata schema, as
well as refined based on evaluative feedback from archivists and users. In its current form, what then does
EAD have to offer electronic records description, given that the needs of electronic records have yet to be
integrally addressed by the DTD?
EAD, while it is a data structure and not a data content standard, works to standardize idiosyncratic
descriptive practices. Electronic records descriptive practices are some of the most idiosyncratic in the
field because there is such diversity of types of electronic records, and because electronic records
description is rarely taught in archival education programs, and is primarily learned as institution-specific
practices “on the job.” Descriptive records tend to comprise examples of descriptions of datafiles, rather
than complete descriptions, together with user guides or documentation packages.12 Using EAD would
also integrate electronic records management into the mainstream of archival activities, treating the
records as records, rather than as instances of special formats. Moreover, through collective description,
as well as elements such as <separatedmaterial> and <relatedmaterial>, all records created by the same
activity will be treated as an intellectual whole, regardless of whether they are paper, electronic, or some
other medium.
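To make this concrete, the following is a minimal, hypothetical EAD fragment: the element names (<c01>, <did>, <unittitle>, <odd>, <relatedmaterial>) are drawn from EAD Version 1.0, but the series described and its content are invented for illustration only.

    <c01 level="series">
      <did>
        <unittitle>Case tracking database, extracted datafiles</unittitle>
        <unitdate>1988-1995</unitdate>
        <physdesc>3 datafiles, approximately 1.2 million records</physdesc>
      </did>
      <odd>
        <p>Codebook and record layout accompany the accession.</p>
      </odd>
      <relatedmaterial>
        <p>Paper case dockets generated by the same activity are described as
        series 2 of this record group.</p>
      </relatedmaterial>
    </c01>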
11 Gilliland-Swetland, Anne J. “Popularizing the Finding Aid: Exploiting EAD to Enhance Online Browsing
and Retrieval in Archival Information Systems by Diverse User Groups,” Journal of Internet Cataloging 4
nos. 1/2 (2000) (in press); Gilliland-Swetland, Anne J. "Health Sciences Documentation and Networked
Hypermedia: An Integrative Approach," Archivaria 41 (1995): 41-56.
12 Dryden, Jean E. “Archival Description of Electronic Records: An Examination of Current Practices,”
Archivaria 40 (1995): 99-108.
Electronic records descriptions can be quite flat, consisting mostly of summary information, with
arrangement of the contents of a datafile often being incidental. However, users may wish to have
access at the level of individual records or even data elements. The hierarchy built into EAD has the
potential to support this kind of granularity of access, although commercial software that is currently
available has yet to address much of this potential. Technical documentation accompanying the electronic
records can also be linked in electronic form to the EAD description through elements such as <archref>,
<odd> (other descriptive data) and <add> (adjunct descriptive data). If this documentation is marked up
using SGML, XML, or some other markup language, the possibility exists of additional reconciliation of the
different metadata schema. The well-defined EAD structure also makes possible the use of cross-walks
to interface with other common metadata schema that might be relevant to the records (for example,
geospatial metadata).
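By way of illustration only, a fragment of such a crosswalk might pair a few EAD elements with Dublin Core elements as follows; the pairings below are indicative and do not reproduce any published mapping.

    EAD element        Dublin Core element
    <unittitle>        Title
    <unitdate>         Date
    <origination>      Creator
    <scopecontent>     Description
    <physdesc>         Format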
All this is not to say that EAD is ideal as it stands for describing electronic records. Several limitations
need to be addressed in the next version of EAD if it is truly to accommodate electronic records.
1) EAD is strongest with regard to the description of the records once they are held in the archives. It is
weak in how it supports records management, appraisal, and accessioning processes. More explicit
attention needs to be paid to how records retention schedules, appraisal reports, accessioning
procedures, and data quality reports are captured and tracked, as well as the various agents
associated with those processes.
2) There need to be more closely delineated data elements with which electronic record metadata
can be described, rather than consigning such materials to “bucket” elements such as <odd> and
<add>. These elements and their values should be based upon lists of common types of
documentation that accompany electronic records when they are accessioned. The data elements
also should have attributes that indicate the extent to which the accuracy of each piece of
documentation has been verified.
3) Custodial history is integral to establishing the authenticity of records, and for electronic records it can
be quite complex, especially if the archives takes over intellectual and not physical control of inactive
records. The <custodhist> element needs to be expanded to address this issue, in particular, non-custodial arrangements for electronic records.
4) Preservation and meticulous documentation of preservation processes are integral not only for
providing continued access to electronic records, but also for establishing and demonstrating the
continued authenticity of those records (or of authentic copies of the records). Currently preservation
information is bundled into a single data element <processinfo> (processing information), and as with
<custodhist> this element needs to be expanded and further delineated to track preservation
processes such as migration and emulation and any effects these might have upon the record (a hypothetical sketch of such extensions appears after this list).
5) Even with traditional records, many archivists find it difficult to make the necessary distinction
between intellectual and physical levels of arrangement. Many electronic records can be arranged in
multiple ways and, therefore, the concept of levels of arrangement may not be as relevant as possible
arrangement schema. It needs to be possible through the <arrangement> element for users to
identify the range of potential arrangements and data extracts in order to be able to specify the one
which they would like to use when accessing electronic records online or ordering copies of them.
This is a compelling reason to do more user-based research so that any extensions to EAD are more
user-driven.
6) As with museum objects, additional aspects of physical description may need to be incorporated into
the <physdesc> element to allow for highly technical description. Some of these elements might
correspond to those that were included in MARC MRDF.
7) For EAD in general, there is a need for a companion content standard and a structure for developing
authority files. Work on both of these aspects is currently underway. There is also a need to analyze
the extent to which EAD should be extended to accommodate a range of archival descriptive
traditions and technical requirements for records in specific media, or whether a better approach
would be to concentrate on mapping different types of metadata through processes such as metadata
crosswalks and automatic reconciliation of diverse XML structures.
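As flagged in points 3 and 4 above, the following is a purely illustrative sketch of what such finer-grained custody and preservation markup might look like; the event elements shown here are invented and are not part of EAD Version 1.0.

    <custodhist>
      <p>Records remained on the creating agency's servers under the archives'
      intellectual (non-custodial) control, 1996-1999.</p>
      <!-- hypothetical extension: a dated, agent-attributed custody event -->
      <custodyevent date="1999" agent="archives">Physical transfer of the
      datafiles to the archival repository.</custodyevent>
    </custodhist>
    <processinfo>
      <!-- hypothetical extension: an explicit, dated preservation event -->
      <preservationevent type="migration" date="2000">Migrated from EBCDIC
      flat files to ASCII; record counts and checksums verified.</preservationevent>
    </processinfo>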
Conclusion
There is obviously much work to be done in the area of electronic records description, and EAD provides
one important vehicle to do so. However, given the volume of electronic records already created and
anticipated in future years, there must surely also be an increased emphasis on automating as many
aspects of archival description as possible. This is where research and development such as that
underway at the San Diego Supercomputer Center in partnership with the US National Archives and
Records Administration is likely to make such a strong contribution. One final caveat, however—almost all
developments in archival description to date, even that of EAD, have occurred without systematic
analysis of user needs and capabilities. As archival description, and even the complete archival record
becomes increasingly available online to the general public without any archival reference mediation, it is
going to be critical that we spend time examining the usefulness and usability of the materials we are
providing to our users. Otherwise we may find that we have created a web of metadata and records that
is so complex that it will have become impenetrable to most users.
Preservation and migration of electronic records: the state of
the issue13
Kenneth Thibodeau14
The problem of preserving electronic records
The two-edged sword of continuing progress and rapid obsolescence of information technology is
the most often cited, but perhaps not the most significant challenge archives face in the endeavor to
preserve electronic records. Organizations rely more and more on digital technology to produce, process,
store, communicate, and use information in their activities. Thus, the quantity of records being created in
electronic form increases. In the experience of the National Archives and Records Administration of the
United States, it increases exponentially. The technological challenge is compounded by the continuing
extension of information technology in terms of the types of information objects it produces, and again in
terms of its applicability to different spheres of activity and different types of actions within those spheres.
The resultant records are increasingly diverse and complex. The impact is not only on individual records,
but on the archival fonds as a structured whole.
Approaches to the problem of preserving electronic records
The field of information technology has, by and large, ignored the problems of long term
preservation. If anything, one could say that the market has tended to exacerbate the problem of
preserving electronic records. The pressures of competition have led the industry to obey Moore's law,
replacing both hardware and software at a frequency of two years or less.
In one area, however, there has been some improvement in recent years: that of digital storage
media. From the 1980s there was a trend towards storage media that were more fragile and less stable
over time. In recent years, this trend, if not reversed, has been offset somewhat by the introduction of
more stable and reliable media.15 Current research and development efforts offer the prospect of
improved options for long term storage of digital information, notably in the areas of ion-milling and
holographic media. But archival concern with digital media should not be limited to their durability. The
ICA Guide to Managing Electronic Records sets out seven criteria for media used for preserving
electronic records:
• open standards for digital recording on the medium,
• robust methods for preventing, detecting and reporting errors,
• sufficient market penetration,
• known longevity,
• known susceptibility to degradation or deterioration,
• a favorable cost/benefit ratio, and
• availability of methods for recovering from loss.16
Whatever relief archives may find in the area of digital storage is more than offset by the
increasing diversity, complexity and spread of electronic records. In recent years, increasing attention
has been devoted to problems of digital preservation in a variety of spheres and professions. Several
13 This paper was presented at the XIVth International Congress on Archives, Seville, Spain, September 22, 2000.
The views expressed are the author’s and not necessarily those of the National Archives and Records
Administration.
14 The author is Director of the Electronic Records Archives Program, National Archives and Records
Administration, U.S.
15 Charles M. Dollar. Authentic electronic records: strategies for long-term access. Chicago: Cohasset.
1999. Pp. 58-60.
16 International Council on Archives. Committee on Electronic Records. Guide for managing electronic
records from an archival perspective. Paris. 1997.
different approaches have been proposed. A few have been tried in test mode, fewer in actual practice. In
practice, the experience of archives is largely limited to relatively simple technical formats, such as flat
files. Some institutions have developed computer applications for preserving potentially complex
databases. These include CONSTANCE at the National Archives of France, AERIC at NARA, ERICSON
at the National Archives of Canada, and similar systems in Sweden, the United Kingdom and elsewhere.
Significant preservation projects addressing the actual preservation of digital formats, at various stages of
research or development, include the bundles proposal of the British Standards Institute,17 the CEDARS
project at the University of Leeds, England,18 the Victoria Electronic Records System in Australia,19 the
emulation experiment at the Royal Library in The Netherlands,20 the Universal Preservation Format
sponsored by the WGBH Educational Foundation in Boston,21 and the Highly Integrated Information
Processing and Storage technology being developed at Carnegie-Mellon University in the U.S.22
Current initiatives are pursuing quite a variety of approaches. The proposed solutions can be
categorized into five broad categories:
• preserving the original technology used to create or store the records;
• emulating the original technology on new platforms;
• migrating the software necessary to retrieve, deliver, and use the records;
• migrating the records to up-to-date formats; and
• converting records to standard forms.
These approaches define a spectrum ranging, in broad terms, from no change in the records or the
technological context in which they exist to one in which the original hardware and software have
disappeared and the digital format of the records has changed. Each of these methods has pros and
cons. None of them is entirely satisfactory. On the one hand, in general, one can say that the closer one
stays to the original technology and original digital format of the records, the less the problem of
authenticity; however, it is also obvious that the closer one stays to original technology, the more complex
and more impractical the approach becomes over time. More complex because, as records continue to
accumulate over time, there will be more and more varieties of technology that the archives would have to
maintain. More impractical because, first, support for obsolete technologies will eventually disappear
and, second, the distance and difference between the preserved technology or technical artifacts B
including the records B and the best available technology for preserving, managing, retrieving and
delivering the records will increase continuously. On the other hand, while moving ahead as technology
progresses can eliminate such practical problems, it can entail loss or corruption of records.
The need for an archival approach to preserving electronic records
All of these approaches to preserving electronic records have in common the objective of solving
technological problems related to the passage of time. None of them actually focus on the objective of
preserving records.
This technological orientation is misdirected because success in solving
17 British Standards Institution. Bundles for the perpetual preservation of electronic documents and
associated objects. Public Draft for Comment - IDT/1/4: 99/621800DC. London. 1999.
18 David Holdsworth and Derek M. Sergeant. A blueprint for representation information in the OAIS model.
In: Eighth Goddard Conference on Mass Storage Systems and Technologies, B. Kobler and P.C. Hariharan, editors.
Maryland, Goddard Space Flight Center, 2000. Pp. 413-28.
19 Public Record Office Victoria. Victorian Electronic Records Strategy. Final Report. 2000.
<http://www.prov.vic.gov.au/vers/final/finaltoc.htm>
20 Jeff Rothenberg. An experiment in using emulation to preserve digital publications. Den Haag.
Koninklijke Bibliotheek. 2000.
21 Dave MacCarn, Toward a universal data format for the preservation of media. SMPTE
Journal, July 1997 v106 n7 p477-479. See also <http://info.wgbh.org/upf/>
22 http://www.ece.cmu.edu/research/chips/
technological problems does not necessarily imply any success, or even relevance, in addressing archival
requirements for the preservation of records.
Logically, archival principles and objectives should dictate the requirements that technical
solutions must satisfy. Archival requirements for preservation must be based on the conception of
electronic records, not as the products of computer applications, but as the instruments and by-products
of the practical activity of a records creator. The ultimate criterion for success in the preservation of
electronic records is not whether they remain true to some given technological materialization, but
whether they continue to provide authentic evidence of the activities in which they were created.
An architecture for archival preservation
Clearly, the archival profession needs to determine specific requirements for the preservation of
different types of records, and also to guarantee respect for provenance and the integrity of archival fonds
over time. The InterPARES project, directed by Professor Duranti, brings together archivists from
universities and archival institutions, along with computer and information scientists and engineers, from
around the world in a concerted effort to delineate specific archival requirements for preserving authentic
electronic records. InterPARES is working to define the archival requirements for authenticity on the
basis of archival science and diplomatics.23
Simultaneously, the InterPARES Preservation Task Force is examining technical issues related to
digital preservation and developing a formal model of the preservation function as viewed from the
perspective of the juridical or physical person responsible for preserving electronic records. While this
work is still in progress, there are several ideas which have been proposed that are worth citing at this
time. One key idea is that, strictly speaking, it is not possible to preserve electronic records; it is only
possible to maintain the ability to reproduce electronic records. It is always necessary to retrieve from
storage the binary digits that make up the record and process them through some software for delivery or
presentation. (Analogously, a musical score does not actually store music. It stores a symbolic notation
which, when processed by a musician on a suitable instrument, can produce music.) Presuming the
process is the right process and it is executed correctly, it is the output of such processing that is the
record, not the stored bits that are subject to processing.24
This concept has important consequences. It
shifts priority in preservation of electronic records from their storage over time, to the integral processes of
putting the records into archival storage, getting them out of storage, and delivering them to future
researchers. The recognition that electronic records must inevitably be reproduced accentuates the
importance of being able to demonstrate the integrity and authenticity of the records. This entails
extending the traditional concept of an unbroken chain of custody into one of an unbroken process of
preservation. As defined in the ICA Guide, “An electronic record is preserved if and only if it continues to
exist in a form that allows it to be retrieved, and, once retrieved, provides reliable and authentic evidence
of the activity which produced the record.”25 Demonstrating the authenticity of electronic records
depends on verifying that:
1. the right data was put into storage properly;
2. either nothing happened in storage to change this data or alternatively any changes in the data over
time are insignificant;
3. all the right data and only the right data was retrieved from storage;
4. the retrieved data was subjected to an appropriate process, and
5. the processing was executed correctly to output an authentic reproduction of the record.
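Conditions 1 to 3 are commonly supported in practice by recording and re-checking fixity information. The following is a minimal sketch of such a check, assuming a SHA-256 digest was recorded when the data was put into storage; the function names and the choice of algorithm are illustrative and do not reproduce any NARA or InterPARES specification.

    import hashlib

    def compute_digest(path):
        # Read the stored bit stream in chunks and compute its SHA-256 digest.
        digest = hashlib.sha256()
        with open(path, "rb") as stream:
            for chunk in iter(lambda: stream.read(65536), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def verify_fixity(path, recorded_digest):
        # True only if the retrieved bit stream is identical, bit for bit,
        # to the one whose digest was recorded at the time of storage.
        return compute_digest(path) == recorded_digest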
Parallel to the InterPARES project, the National Archives and Records Administration is sponsoring
research into the development of an information management architecture designed to address archival
23 Anne J. Gilliland-Swetland and Philip B. Eppard. Preserving the authenticity of contingent digital
objects. The InterPARES project. D-Lib Magazine. July-August 2000.
24 Preliminary report from the chair of the Preservation Task Force to the Director of the InterPARES
project, March 30, 2000.
25 ICA. Guide. P. 35.
requirements for the preservation of electronic records. This architecture implements the proposed ISO
standard for an Open Archival Information System (OAIS).26
The architecture extends that general
reference model by articulating archival requirements. To address the basic problem of continuing
change in technology over time, the architecture postulates that archival information systems should be
independent of the particular technology used to implement them at any time. That is, an archival
information system should be built in such a way that it is possible to replace any component of hardware
or software used in the system with minimal impact on the rest of the system and with no impact on the
preserved collections of records.27
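One conventional way of realizing this postulate, offered here only as a sketch and not as a description of the NARA architecture itself, is to have the archival application depend on an abstract storage interface rather than on any particular product or medium, so that a concrete storage technology can be replaced without touching the rest of the system.

    from abc import ABC, abstractmethod

    class ArchivalStore(ABC):
        # Hypothetical interface: the rest of the system sees only put and get,
        # never the storage technology behind them.
        @abstractmethod
        def put(self, object_id: str, data: bytes) -> None: ...

        @abstractmethod
        def get(self, object_id: str) -> bytes: ...

    class TapeStore(ArchivalStore):
        # One concrete technology; swapping it for an optical or disk-based
        # store leaves the preserved collections and the application untouched.
        def put(self, object_id: str, data: bytes) -> None:
            raise NotImplementedError("illustrative stub only")

        def get(self, object_id: str) -> bytes:
            raise NotImplementedError("illustrative stub only")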
Collection-based persistent object preservation
The information management architecture is being developed in the U.S. National Partnership for
Advanced Computational Infrastructure. The Partnership is a collaboration of 46 institutions nationwide,
and 6 foreign affiliates, with the San Diego Supercomputer Center serving as the leading edge technical
resource. The research is addressing archival requirements for preservation of records, including respect
for provenance. Rather than focus on technological problems, the method focuses on the objects that are
to be preserved. In this case, the objects are records and also collections of records, as organized within
archival fonds at all levels of hierarchy.
The method of collection-based persistent object preservation consists of identifying the properties of
the objects to be preserved; expressing those properties in explicit, abstract models; and applying those
models to transform the objects into an independent technological format suitable for long-term
preservation. In the archival domain the development of this method started with the conception of the
essential properties of records expressed in the ICA Guide on electronic records; that is, “A record is
recorded information produced or received in ... an institutional or individual activity and that comprises
content, context and structure sufficient to provide evidence of the activity regardless of the form or
medium.”28 The essential structure of a record is its documentary form. This form may be expressed in
the digital format in which the record is stored, but it is not necessarily identical to the digital format.
Therefore, a transformation of the record which replaces one digital method with another one that is more
suitable to long-term retention, preserves the record so long as it maintains the essential documentary
form of the record. The immediate context of a record is its archival bond: the position of a record with
respect to other records in the archival fonds. In our research, we have extended the list of essential
properties of records beyond content, structure and context to include the appearance of the record. We
are also addressing a special type of content that is unique to electronic records: hyperlinks.
Persistent Object Preservation expresses the structure of records using eXtensible Markup Language
(XML) Document Type Definitions. The method encapsulates records using the metadata defined in
these models, transforming records into a format that is independent of any specific technology. The
research has demonstrated that this method can be applied to collections of records as well as to
individual records. That is, one can construct a Document Type Definition to capture and preserve the
structure of any archival collection, of arbitrary complexity, from individual files through series and classes
to entire archival fonds.
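To fix ideas, the following is a deliberately simplified, hypothetical Document Type Definition sketching how a collection/series/record hierarchy might be modeled; the models actually produced in this research are considerably richer.

    <!ELEMENT collection (provenance, series+)>
    <!ELEMENT series     (title, record*)>
    <!ELEMENT record     (content, context, structure)>
    <!ELEMENT provenance (#PCDATA)>
    <!ELEMENT title      (#PCDATA)>
    <!ELEMENT content    (#PCDATA)>
    <!ELEMENT context    (#PCDATA)>
    <!ELEMENT structure  (#PCDATA)>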
The research is exploring different ways of preserving the appearance of records. One way is to use
a technology known as Multi-Valent Documents to capture and retain a bitmapped image of the
document. MVD enables the image to be retained not as a version of the document, but as a layer of the
26 Consultative Committee on Space Data Systems. Reference Model for an Open Archival Information
System (OAIS). Draft Recommendation for Space Data System Standards, CCSDS 650.0-R-1. Red Book. Issue
1. May 1999. <http://ccsds.org/RP9905/RP9905.html>
27 Kenneth Thibodeau, Reagan Moore, and Chaitanya Baru. Persistent object preservation: advanced
computing infrastructure for digital preservation. European Commission. Proceedings of the DLM-Forum on
electronic records. European citizens and electronic information: the memory of the information society. Brussels,
18-19 October 1999. Luxembourg. Office for Official Publications of the European Commission. 2000. Pp. 113-118.
28 ICA. Guide. P. 22.
document object modeled as an acyclic directed tree.29
Another possible means of preserving
appearance is through the eXtensible Stylesheet Language (XSL) available in the XML family of standards.
Using style sheets to capture the attributes of appearance is especially advantageous for types of
applications, such as databases and geographic information systems, where stored data elements may
participate in many different records. In such systems the records are likely to be expressed as views,
forms, or reports which extract specific subsets of the data and present them in predefined formats. A
different style sheet can be defined for each of these formats.
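For illustration only (a minimal sketch, not a style sheet taken from the project), one such view might be produced by a style sheet along the following lines, with a different style sheet written for each report or form.

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <!-- One presentation of a record; other views would use other style sheets. -->
      <xsl:template match="record">
        <html>
          <body>
            <h2><xsl:value-of select="context"/></h2>
            <p><xsl:value-of select="content"/></p>
          </body>
        </html>
      </xsl:template>
    </xsl:stylesheet>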
The method extends beyond the preservation of archival collections of records over time. It also
addresses the key archival functions; notably, the accessioning of records into the archival repository, the
establishment of intellectual control over the records, and the delivery or dissemination of the records to
researchers. This extension of the persistent object approach is consistent with the basic premise of
object oriented methodology which starts with the recognition that an object has behaviors or methods, as
well as attributes. One of the essential behaviors of a record is that it occupies a specific position in
relation to other records in the archival fonds. This behavior expresses the immediate context of the
record and is the basis for arriving at its significant context; that is, the activity of which the record
provides evidence.30
The transformation of records into a persistent object format not only enables the
records to be preserved indefinitely into the future, it also makes it possible to benefit from advanced
technologies, which have not even been invented yet, to search, access and deliver the records in the
future. This is made possible through the separation of context, structure and appearance in explicit
schemas expressed in simple textual form. Over time, it will not be necessary to migrate the materials
stored in persistent object form to new technologies, but only to interpret the schema metadata so that it
can be used in future technologies.
Viability of the persistent object preservation method
The initiative to develop the collection-based persistent object method for preserving electronic
records is still in the stage of research and development, and will remain in this stage for some time.
Nonetheless, there are substantial reasons, in both the technical and the archival domains, to assume
that it will be successful. In the domain of technology, two facts should be highlighted. First, the research
is not developing any special technologies to suit archival needs. Rather, it is building archival solutions
on the basis of technologies which are seen as essential to the next generation Internet and information
infrastructure and as keys to electronic commerce and electronic government. Archives should benefit,
therefore, from widespread market support for the enabling technologies. Second, while the research
addresses archival requirements specifically, the method has broad application in other areas, such as
digital libraries, museums and collections of scientific data. Thus, archival institutions can collaborate with
organizations in these other domains to develop from the enabling technologies solutions for long-term
preservation and access. In the archival domain, the promise of the persistent object preservation
method has been demonstrated in several empirical tests, applying the method to a variety of collections
across a broad quantitative scale. These demonstrations involved bringing the collections into the
archival information system from external sources; examining the documents, databases, images,
geographic information systems and other digital objects tested in order to generate XML models;
transforming the records and capturing collection organization according to these models; storing the
transformed collections and related meta-data; and retrieving and presenting the preserved records using
technologies completely different from those which had originally been used to create and store the
records.
Conclusion
The persistent object preservation method offers several advantages to archives. It provides a
coherent and comprehensive framework that can be specifically tailored to archival requirements.
29 Thomas A. Phelps and Robert Wilensky. Multivalent documents: anytime, anywhere, any type, every way
user-improvable digital document system. <http://elib.cs.berkeley.edu/>
30 Reagan Moore, Chaitanya Baru, et al. Collection-based persistent digital archives. D-Lib Magazine.
March and April 2000, vol. 6, nos. 3-4. <http://www.dlib.org/dlib/march00/moore/03moore-pt1.html> and
<http://www.dlib.org/dlib/april00/moore/04moore-pt2.html>
Through abstraction of the context, structure and appearance of the contents of digital objects, it provides
a single, but highly adaptable method that serves at once the need for preserving authentic electronic
records over time, for adhering to archival principles, such as provenance, and for performing core
archival functions. Moreover, the persistent object framework permits the simultaneous adoption of other
techniques if the need arises. Clearly a substantial amount of research, analysis, testing and evaluation
needs to be completed before this method reaches its full potential. Nonetheless, the positioning of this
method in the center of major developments in computer science and information technology offers great
potential for making electronic records not so much a problem for preservation as an opportunity for
archives to achieve their objectives to a greater extent and at a higher level than has been possible
before now.
Responding to the Challenges and Opportunities of ICT: The New
Records Manager
Seamus Ross, Director Humanities Computing and Information Management, University of Glasgow
I. Introduction:
The business activities of public and private sector organisations depend upon increasing quantities of
knowledge, information, and data in digital form. Computers, software, and data pervade all aspects of
our lives, from routines embedded in microchips which keep our cars and aeroplanes running, to the
application programmes used to analyse data to establish our credit worthiness (or riskiness) when we
seek mortgages or other loans, to applications which manage environmental systems in large buildings or
control manufacturing equipment. In many of these cases the data collected are used to further refine the
applications which analyse the data themselves in an incremental manner; this is particularly true of
applications in the financial sector. Nearly every organisation is in the data, information, and knowledge
business, a point stressed by Thomas Stewart, a pioneer in the field of intellectual capital. In Intellectual
Capital: The New Wealth of Organisations he argued that:
Every organisation houses valuable intellectual materials in the form of assets and resources, tacit and
explicit perspectives and capabilities, data, information, knowledge, and maybe wisdom. However, you
can't manage intellectual capital--you can't even find the soft forms of it--unless you can locate it in places
in a company that are strategically important and where management can make a difference. (Stewart
1997, 75)
Unfortunately, Stewart does not appear to be aware of professional records managers, or, if he is,
he does not give them a place in his vision of the new organisation. Yet it is records managers who have
the skills and the experience to manage this intellectual capital. It is most likely that Stewart did not
include records managers in his vision because, like most non-records professionals, he views them as
keepers of information resources which are no longer central to the running of the organisation itself:
information at the end of the business life-cycle, basically corporate memory. The root cause of this
problem rests firmly at the door of records managers; a concern voiced by many managers themselves.
For example, in 'At the end of the life cycle: electronic records retention' David Stephens, Director of the
Records Management Consulting Division of Zasio Enterprises, lamented the failure of the records
management community to develop and implement suitable electronic records management strategies
(1997, 108). Records managers both curate the records that ensure regulatory compliance, have
evidential value in the event of litigation and provide competitive advantage through their recurring value,
and manage the storage of records in ways that could help to alleviate the uncontrolled explosion of
records common in most commercial and public sector organisations. Surprisingly, only 32 per cent of the
200 UK companies1
1 UKLOOK Tampere Programme 1998-9, programme sponsored by the British Council and the University of Tampere.
XML for the preservation of electronic recordkeeping systems
Maria Guercio - Università degli Studi di Urbino
Why XML for archives?
For some years now there has been great interest in markup languages on the part of administrators
and custodians of cultural heritage, in particular archivists and librarians, who seem at last to have
found a standard adequate to the complex requirements of managing, communicating and keeping
digital documentary memory.
In the specific case of electronic archives, despite the efforts made by numerous institutional and
academic research centres,31 convincing results have not yet been achieved either in the definition of
strategies or in the elaboration of procedures and technical solutions, above all with regard to the
problems of preservation over time. The preservation function in the digital environment is, after all,
one of the most difficult and demanding tasks, for a series of reasons, the principal one being the
contrasting and apparently irreconcilable nature of the objective itself: to maintain the certain integrity
of electronic records32 and, at the same time, to ensure an accessibility which, because of technological
obsolescence, implies continual interventions of copying, conversion and migration and, therefore,
continual changes in the structure of the bits that constitute the record. For archivists this is a real
challenge, which requires first of all the definition of a solid conceptual and methodological basis and
theoretical frame of reference, but also a great organizational and technological effort, substantial
financial resources, highly skilled technical staff and the development of software products capable of
guaranteeing the creation and handling of records with routine procedures compatible with the
requirement of their permanent preservation.
As regards the theoretical framework for digital archives in particular, after a phase of lively debate
which involved different schools and stimulated the reflection of many, especially in the Anglo-Saxon
world, the panorama of studies carried out in recent years has on the one hand become simpler as far
as research initiatives of international weight are concerned, while on the other it shows a disheartening
level of fragmentation and redundancy,33 owing to the multiplicity of local initiatives which do not for
the moment seem to contribute significantly either to theoretical reflection or to the preparation of
exportable operational tools. On the other hand the question, as has been said, is very demanding and
can find answers only in investigative work that involves different disciplinary fields and adequate
resources. It is therefore no accident that the only two research initiatives active today in this area
(mutually linked and cooperative) are those that have found the support of the North American national
31 In recent years numerous studies have been carried out on electronic records, which however have concentrated,
with interesting results, above all on the problem of the creation of electronic recordkeeping systems. The
investigations of greatest international significance have been, in particular, those conducted by the University of
Pittsburgh and by the University of British Columbia (Vancouver, Canada), both concluded in 1997. The Canadian
research was carried out in agreement with the United States Department of Defense which, at the conclusion of the
work, drew up the rules for the certification of records management software intended for the United States federal
administration. For further information see the materials available at the following addresses:
http://www.lis.pitt.edu/~nhprc/ for the University of Pittsburgh research;
http://www.slais.ubc.ca/users/duranti/ for the University of British Columbia research; and
http://jitc.fhu.disa.mil/recmgt/ for the standard defined by the United States Department of Defense, "Standard
5015.2 - Design Criteria Standard For Electronic Records Management Software Applications".
32 Ken Thibodeau, Reagan Moore, Chaitanya Baru, Persistent object preservation: Advanced computing
infrastructure for digital preservation, in Proceedings of the DLM-Forum on electronic records. European citizens
and electronic information: the memory of the Information Society. Brussels, 18-19 October 1999, Luxembourg,
Office for Official Publications of the European Communities, 2000, pp. 113-120.
33 The outcome of the survey conducted by the European Union on the existence of guidelines for digital
preservation across the whole cultural heritage sector is significant: after a year of work and numerous interviews,
surveys and analyses, the working group established that as of 1998 there were no guidelines capable of addressing
the overall problem of digital preservation in the cultural sector, and that "long-term perspectives on preserving
access to digital archives still require fundamental work". Cf. Marc Fresko, Kenneth Tombs, Digital preservation
guidelines: the state of the art in libraries, museums and archives, Brussels, European Commission, DG XIII/E, 1998.
scientific institutions: the international InterPARES project,34 conducted by the archival studies school
of the University of British Columbia, and the NPACI-NARA project, supported by the National
Archives in Washington and by the University of California.35
As regards the theoretical and methodological aspects, the Canadian project, in which Italy takes part
with its own team of researchers and institutions, is without doubt the most significant research initiative.
The project stems from the conviction that a definitive and comprehensive solution to the problem of
documentary preservation requires a global commitment by the international community and serious
and effective cooperation between different disciplines and professional fields, in order to arrive at a
common definition of:
appraisal and selection strategies for electronic records, identifying the times and methods of the
transfer of responsibility for permanent preservation,
standards and rules for storage media,
principles and procedures for the authentication of electronic records in conversion, copying and
migration activities,
descriptive criteria consistent with the archival nature of the material treated, with the requirements of
scholarly research and with the access needs of non-specialist users,
standards and procedures for the protection of privacy and for copyright.
The research is, as has been said, fully under way, but it has already made it possible to draw up a first
schema (now being validated through surveys conducted on different electronic systems) of the logical
components that form the structure of electronic records.36
Per quanto riguarda, invece, la scelta di metodi sperimentati per organizzare e gestire concretamente
la funzione conservativa, l'incertezza è notevole. Le soluzioni suggerite dagli esperti sono tutt'altro che
consolidate, generalmente molto costose e, per ora, prive di verifiche sul campo. Si orientano verso la
conservazione delle tecnologie hardware e software oppure sostengono l'opportunità di sviluppare
programmi di emulazione delle piattaforme tecnologiche originali. In entrambi i casi si tratta di interventi
che richiedono risorse elevate e non eliminano le rischiose e impegnative attività di migrazione né
riducono le difficoltà dell'utenza costretta a misurarsi con strumenti obsoleti anche dal punto di vista della
presentazione e delle modalità di ricerca. La maggioranza degli esperti considera perciò tali ipotesi
insufficienti e ribadisce l'urgenza di elaborare alternative fattibili ed efficaci [37]. Tra le proposte che hanno
finora ottenuto i consensi maggiori e promettono sviluppi interessanti e utilizzabili in contesti operativi
diversificati anche di piccole dimensioni, la conservazione in formati indipendenti dalle tecnologie - basati
sull'uso di linguaggi di marcatura (SGML/XML) - della rappresentazione originaria dei documenti e dei
metadati di contesto e di relazione sembra destinata - nel medio periodo - a significativi sviluppi.
Anche in questo caso la scarsa letteratura disponibile in materia [38] non offre per il momento
indicazioni univoche e convincenti sulla strada da seguire, limitandosi a individuare vantaggi e svantaggi
di ognuna delle ipotesi formulate. Inoltre, la scarsità di risorse e l'insufficienza di esperienze e
conoscenze finora accumulate dalle istituzioni competenti hanno costituito fino ad oggi ostacoli quasi
insormontabili alla rapida individuazione di soluzioni condivisibili. Non è, quindi, fuori di luogo l'allarme
34
L'indagine costituisce la prosecuzione del lavoro svolto nel corso del precedente e già ricordato progetto sulla
formazione e gestione dei documenti attivi e affronta il problema specifico della conservazione a lungo termine
dell'integrità e autenticità dei documenti elettronici. Alla ricerca partecipano undici Paesi (Australia, Canada, Cina,
Francia, Irlanda, Italia, Olanda, Portogallo, Stati Uniti, Svezia, UK) e un team di industrie farmaceutiche. I materiali
del progetto (che si concluderà nel febbraio 2002) sono disponibili al sito http://www.interpares.org. Alcuni
documenti sono stati recentemente pubblicati sulla rivista "Archivi per la storia", 1999, n. 2.
35
Materiali della ricerca sono disponibili al seguente indirizzo:
http://www.sdsc.edu/NARA/Publications/collections.html.
36
Si vedano in particolare i materiali prodotti dalla Authenticity Task Force del progetto: Research methodology
statement, Template for analysis e Case Study protocol and questionnaire, in "Archivi per la storia", 1999, n. 1-2,
pp. 263-337.
37
Cfr il citato rapporto di Marc Fresko, Kenneth Tombs, Digital preservation guidelines e il documento proposto
come standard ISO dal Consultative Committee for Space Data Systems, Reference Model for an Open Archival
Information System (OAIS).
38
Una bibliografia accurata si trova in Marc Fresko, Kenneth Tombs, Digital preservation guidelines…cit. Si
vedano anche le indicazioni contenute nella Nota bibliografica sul documento elettronico. 1986-1998, pubblicata nel
citato numero di "Archivi per la storia", 1999, n. 1-2, pp. 347-375.
degli archivisti per il futuro delle memorie digitali, anche se l'ultima delle soluzioni proposte, su cui questo
contributo si sofferma in modo particolare, sembra rispondere a molte delle esigenze della conservazione
permanente e offrire qualche risposta alle preoccupazioni più diffuse.
Per comprendere meglio la natura del problema e valutare le aspettative suscitate dal nuovo
standard presso la comunità archivistica nazionale e internazionale, è tuttavia necessario identificare, sia
pure per grandi linee, i nodi teorici e pratici legati alle attività di gestione, uso e conservazione
"archivistica" dei sistemi documentari informatici [39]. Gli oggetti che devono essere identificati e mantenuti
perché si possa parlare di "conservazione archivistica" sono molteplici e strutturalmente articolati. Non
basta, infatti, salvare il flusso di bit che definisce un documento, ma è indispensabile anche conservare le
informazioni che rendono esplicita la sua rappresentazione e i suoi legami nel sistema documentario [40].
Sono inoltre essenziali le modalità di rappresentazione e comunicazione che implicano l'adozione di
parametri uniformi di descrizione e di accesso, tutt'altro che semplici da determinare in un settore che,
non a caso, è arrivato molto tardi e con molte resistenze - rispetto ad altri analoghi campi disciplinari -
all'accettazione di pratiche normalizzate. Assai differenziata è, peraltro, anche l'utenza dei sistemi
archivistici per il grado di conoscenza del sistema documentario, per le modalità di interrogazione, per la
natura stessa delle ricerche. In relazione ad altre aree di applicazione dell'informatica, il mondo degli
archivi si caratterizza per la complessa e stratificata articolazione della produzione documentaria [41], la cui
peculiare natura originaria deve essere rigidamente salvaguardata per garantire la possibilità stessa della
ricerca futura.
A chi progetta interventi di automazione in questo ambito la specificità di tale materiale costituisce
allo stesso tempo un vincolo e un'opportunità: le potenzialità di continua trasformazione offerte dalle
tecnologie dell'informazione sono tali per gli archivisti solo se riferite alla ricchezza informativa e alla
facilità di recupero di contenuti strutturati in modo logico e relazioni significative. Diventano invece gravi
rischi da eliminare o, quantomeno, controllare e limitare se l'obiettivo è quello di garantire l'autenticità e
l'integrità del sistema nel tempo. Per dare solo un'idea dell'articolazione del sistema documentario e della
natura strutturata dei suoi contenuti e relazioni, si ricorda che:
• l'archivio non è un semplice insieme di documenti, ma un insieme complesso di entità a loro volta
costituite di sottopartizioni di diversa tipologia (subfondo, serie, sottoserie, fascicolo, sottofascicolo,
unità documentaria),
§ ciascuna sottopartizione è identificata e descritta mediante informazioni di natura generale [42]
condivisibili anche in un contesto internazionale (segnatura archivistica, denominazione, estremi
cronologici, consistenza, ecc.), integrate da eventuali ulteriori dati significativi,
§ le stesse unità documentarie non sono riducibili a semplice informazione testuale, ma si strutturano in
una serie di componenti [43] riconosciute all'interno di uno schema generale e facilmente identificabili
(autore, destinatario, data, oggetto, testo, indicazione di allegati, ecc.; si veda lo schema esemplificativo
riportato dopo l'elenco),
39
E' indispensabile tenere sempre distinte le attività di memorizzazione da quelle di conservazione, che implicano la
decisione di mantenere il patrimonio nel lungo periodo, la stabilità dei documenti e delle relazioni sia archivistiche
che amministrative e un sistema di accesso e di ricerca elaborato in modo adeguato alle esigenze di
una comunità ampia.
40
Tutti gli autori sottolineano la difficoltà di entrambi questi obiettivi: da un lato la necessità della migrazione, che
si impone inevitabilmente e ripetutamente nell'attività di conservazione, può introdurre cambiamenti anche
significativi nel flusso di bit, dall'altro la sempre più diffusa rappresentazione ad oggetti tende a rendere trasparenti
agli utenti le informazioni "di contesto", che invece devono essere identificate e trattate in modo esplicito. Si veda in
proposito Consultative Committee for Space Data Systems, Reference Model for an Open Archival Information
System (OAIS), cit., sezioni 3.3 e 3.4.
41
Per un'analisi dei sistemi documentari informatici e dei requisiti funzionali che garantiscono una corretta
formazione e gestione, cfr. Maria Guercio, La formazione dei documenti in ambiente digitale, in Gli archivi
del futuro. Il futuro degli archivi. Cagliari, 29-31 ottobre 1998 (numero monografico di "Archivi per la
storia", 1999, n. 1-2), pp. 21-58 .
42
Si fa qui riferimento allo standard per la descrizione archivistica, International Standard for Archival Description,
approvato dal Consiglio internazionale degli archivi nel 1996 e recentemente aggiornato dal Committee for archival
description. Il testo è disponibile al seguente indirizzo: http://www.anai.org.
43
La disciplina che studia il documento e le sue componenti fisiche e logiche è la diplomatica, che negli ultimi anni,
per merito di alcuni archivisti italiani, ha allargato il suo campo di indagine tradizionale, limitato in passato all'età
medievale e moderna, fino a comprendere non solo i documenti contemporanei, ma anche i documenti elettronici.
Cfr. Paola Carucci, Il documento contemporaneo. Diplomatica e criteri di edizione, Roma, Nuova Italia Scientifica,
1987 e, più recentemente, Luciana Duranti, Diplomatics: new uses for an old science, Washington, SAA, 1999.
§ tali unità, a loro volta, possono essere organizzate per tipologie specifiche che condividono un
maggior numero di dati significativi rispetto allo schema generale comune,
§ le informazioni relative al contesto giuridico, amministrativo e organizzativo di un sistema archivistico
sono rilevanti ai fini della intelligibilità stessa delle testimonianze documentarie. Tali informazioni di
contesto (ufficio di assegnazione, ufficio di provenienza, tipo di procedimento/processo
amministrativo, responsabile del procedimento) devono essere, perciò, catturate dal programma
informatico e mantenute in via definitiva sia a fini giuridici che per la comprensione amministrativa e
storica del materiale prodotto.
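A puro titolo esemplificativo, e senza alcuna pretesa di riprodurre uno schema ufficiale, la struttura gerarchica e le componenti appena richiamate potrebbero essere rese in XML con elementi e valori del tutto ipotetici:

   <!-- Schema puramente ipotetico: nomi di elementi e valori non derivano da alcuna DTD ufficiale -->
   <fondo denominazione="Ente produttore X">
      <serie denominazione="Carteggio ordinario" estremiCronologici="1998-2000">
         <fascicolo segnatura="3.2/15" oggetto="Gestione del protocollo">
            <unitaDocumentaria>
               <autore>Servizio Y</autore>
               <destinatario>Ufficio Z</destinatario>
               <data>2000-10-30</data>
               <oggetto>Trasmissione di documentazione</oggetto>
               <testo>...</testo>
               <allegati numero="2"/>
            </unitaDocumentaria>
         </fascicolo>
      </serie>
   </fondo>

Una codifica di questo tipo rende esplicito, nello stesso documento XML, il vincolo archivistico tra le unità e le informazioni identificative di ciascuna sottopartizione.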
Uno dei nodi centrali per la tenuta e l'accesso nel tempo ai documenti d'archivio informatici è,
quindi, quello di assicurare il mantenimento in modo stabile e in ambiente sicuro della molteplicità
delle informazioni strutturate - di attributi e relazioni di metadati -, che abbiano reso funzionale il
sistema documentario nella fase attiva e che, per tale ragione, saranno conservate e messe a
disposizione degli utenti per le future attività di ricerca storica, scientifica, amministrativa.
Non è questa la sede per affrontare in modo approfondito la questione - assai dibattuta in sede
teorica - del ruolo e del trattamento dei metadati archivistici. Il tema - su cui scuole di archivistica
diverse hanno discusso e discutono senza trovare ancora un punto di vista comune - riguarda le
modalità di trattamento di queste meta-informazioni, sia nella fase attiva del sistema documentario
che nel corso del trasferimento dei documenti destinati alla conservazione permanente nelle
istituzioni archivistiche competenti [44]. Nessun autore nega il loro valore strategico in ambito
archivistico, sia per la complessità di strutturazione dei sistemi documentari, sia per l'univocità degli
elementi che li costituiscono. A differenza di altri materiali, ad esempio i beni librari, i documenti
d'archivio e, ancor più, i riferimenti di natura contestuale variano in ogni struttura organizzata e per
ciascuno specifico ordinamento. Anche nel caso in cui si conservino più esemplari del medesimo
documento in diverse aree della stessa organizzazione, è quasi sempre necessario ridefinire e
mantenere i dati di contesto che cambiano in base alle specifiche procedure di classificazione e
ordinamento. Le singole componenti (i documenti), ma soprattutto le reciproche relazioni (il vincolo
archivistico) hanno all'interno di un archivio valori propri che si traducono in un'autonoma, ben definita
gestione di metadati.
La strada più semplice - ma non per questo necessariamente più adeguata - è quella sostenuta da
alcuni ricercatori nordamericani (in particolare da David Bearman) e ripresa successivamente dalla scuola
archivistica australiana, che ritiene possibile incapsulare tutti i metadati all'interno della singola entità
documentaria [45]. In realtà si tratta di una proposta che semplifica senza risolvere e, perciò, impoverisce il
problema della conservazione dei documenti digitali, poiché appiattisce su due soli livelli la struttura
informativa da mantenere: i documenti da un lato, i metadati identificativi del singolo documento (ad
esempio, i dati del profilo documentario) dall'altro. In realtà le componenti informative che si creano nella
vita di un archivio e che sono indispensabili per la sua sopravvivenza come bene culturale sono ben più
ricche di articolazioni e richiedono procedure di gestione più attente alla stratificazione storica delle
attività e dei dati, a prescindere dal metodo e dagli strumenti impiegati per il loro mantenimento.
La questione che deve essere affrontata è, quindi, quella di identificare le strutture e gli schemi logici
di metadati corrispondenti agli oggetti informativi che si intendono salvaguardare a fini storici e alle attività
e funzioni di sistema di cui è necessario tenere traccia storica nel lungo periodo (organigrammi del
soggetto produttore di documenti, piani di classificazione e repertori dei fascicoli, sistemi di registrazione
44
L'intervento sulle meta-informazioni deve essere "precoce", riguardare cioè il sistema attivo, poiché la maggior
parte delle informazioni che garantiscono la continuità dell'accesso sono disponibili esclusivamente nell'archivio
corrente (ad esempio i dati relativi alla struttura amministrativa che produce i documenti, lo schema logico di un
database, la documentazione di un programma applicativo). Interventi tardivi di recupero sono, talvolta, possibili,
ma sono di gran lunga più costosi e impegnativi.
45
David Bearman, Ken Sochats, Metadata Specifications Derived from Functional Requirements: A Reference
Model for Business Acceptable Communications, documento disponibile al seguente indirizzo:
www.lis.pitt.edu/~nhprc.papers/model.html. Sulle conclusioni raggiunte dagli archivisti australiani in materia, cfr.
Sue McKemmish, Australian Research and Development Initiatives, in "Archivi per la storia", 1999, n. 1-2, pp. 197-206.
e autenticazione, elenchi dei procedimenti/processi amministrativi, interventi di copiatura, conversione,
migrazione, ecc.).
In conclusione, le informazioni di riferimento ai documenti destinate ad essere oggetto di
trattamento conservativo sono nel caso degli archivi così numerose che richiedono di essere organizzate
per componenti funzionali (metadati di contesto amministrativo, quali ad esempio quelli relativi alla
struttura organizzativa del soggetto produttore del sistema documentario, metadati di contesto
documentario, ad esempio le informazioni che identificano il sistema di classificazione in uso
storicamente o i dati di registrazione/identificazione dei documenti, ecc.) in modo da assicurare l'integrità
delle singole unità documentarie e archivistiche e delle relazioni di contesto [46], ma anche il mantenimento
nel lungo periodo in forme stabili delle modalità originarie di reperimento dei documenti e della loro
accessibilità, cioè della capacità di comprensione e di elaborazione degli oggetti informatici da parte delle
macchine e degli esseri umani. I requisiti funzionali e tecnologici da implementare per la realizzazione dei
sistemi documentali includono, inoltre, il rispetto dei principi di conformità alle norme che a livello
nazionale stabiliscono i requisiti di validità giuridica dei documenti in forma elettronica e le modalità di
autenticazione [47]. Il nodo principale è, quindi, quello della possibilità di contemperare la garanzia
dell'integrità e l'esigenza di accessibilità e consentire un "riuso flessibile e illimitato" dei documenti [48].
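Per dare concretezza all'idea di un'organizzazione dei metadati per componenti funzionali, si può immaginare - in forma del tutto ipotetica e con nomi di elementi non normativi - un frammento XML come il seguente:

   <!-- Frammento ipotetico: metadati raggruppati per componenti funzionali -->
   <metadatiConservazione>
      <contestoAmministrativo>
         <soggettoProduttore>Ente X</soggettoProduttore>
         <strutturaOrganizzativa periodo="1998-2000">Direzione generale, Servizio Y</strutturaOrganizzativa>
      </contestoAmministrativo>
      <contestoDocumentario>
         <sistemaClassificazione inVigoreDal="1998">Titolario a tre livelli</sistemaClassificazione>
         <registrazione numero="001234" data="2000-10-30"/>
      </contestoDocumentario>
      <interventiConservativi>
         <migrazione data="..." formatoOrigine="..." formatoDestinazione="..."/>
      </interventiConservativi>
   </metadatiConservazione>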
Un elemento vincolante è, infine, quello del contenimento dei costi e della scalabilità delle
soluzioni, tenuto conto della esiguità delle risorse finanziarie che sono normalmente a disposizione delle
istituzioni archivistiche cui è affidato il compito della conservazione permanente delle memorie
documentarie, incluse quelle digitali che le amministrazioni pubbliche e il settore privato hanno già
cominciato a produrre in quantità rilevante. E' evidente che le possibilità di riuso sono legate a uno
sviluppo significativo di standard che determinano, inoltre, un effettivo contenimento dei costi e dei rischi
di perdite (in particolare per quanto riguarda la conversione/migrazione delle applicazioni e la
duplicazione delle informazioni).
XML sembra offrire un metodo diffuso, a basso costo e scalabile per affrontare la diversificazione
e la frammentazione della produzione documentaria e delle sue articolazioni, la sua ricchezza informativa
e il peso, finora insostenibile per i bilanci limitati degli enti culturali, delle innovazioni tecnologiche.
Le potenzialità specifiche di XML in questo ambito riguardano, in particolare, la gestione di
documenti e di meta-informazioni indipendenti dal software sia a fini di scambio e ricerca che a fini di
conservazione. Come si è visto, uno dei nodi è costituito dalla rappresentazione standardizzata dei
documenti [49] indipendentemente dalle piattaforme di lavoro utilizzate e, quindi, in grado di affrontare, il più
46
La definizione dei metadati significativi per assicurare l'integrità a lungo termine dei documenti
informatici e la loro accessibilità è una questione cruciale per gli archivisti e ha una valenza strettamente
teorica, anche se non può prescindere da una attenta valutazione e da un uso adeguato delle tecnologie.
Sul tema della definizione dei requisiti per la gestione elettronica del documento (Model requirements for
the management of electronic records) è al lavoro, finanziato dall'Unione europea nell'ambito del
progetto IDA, un gruppo di ricerca che fa capo alla società londinese Cornwell Affiliates e si avvale di
esperti internazionali. Anche il citato progetto InterPARES avrà tra i suoi risultati - come si è già ricordato
- la identificazione della serie di informazioni necessarie a garantire l'autenticità e l'integrità dei documenti
elettronici.
47
Nel caso delle amministrazioni pubbliche italiane, l'automazione dei sistemi documentari deve tenere conto di
norme europee e di una serie di provvedimenti nazionali molto complessi ancora in via di definizione: dpr 513/97,
dpr 428/98, dpcm 8 febbraio 1999, regole tecniche 24/98. Sono in corso di approvazione le regole tecniche
applicative del dpr 428/98, mentre non è un caso che siano in fase di prima elaborazione (del tutto insoddisfacente)
le disposizioni sulla conservazione dei documenti informatici.
48
Enrico Seta, Digitalizzazione e linguaggi di marcatura, in "Bollettino AIB", 1999, p. 72.
49
Si osservi che nella bozza di regole tecniche predisposte dall'Aipa e dal Dipartimento della funzione pubblica in
applicazione del dpr 428/98 sulla gestione informatica dei documenti e in corso di approvazione (cfr
http://www.aipa.it) l'articolo 16 (leggibilità dei documenti) stabilisce che "ciascuna amministrazione garantisce la
leggibilità nel tempo di tutti i documenti trasmessi o ricevuti adottando i formati previsti all'articolo 6, comma 1,
lettera b) della delibera Aipa 24/98 ovvero altri formati non proprietari". Il citato articolo 6 fa riferimento in modo
specifico ai formati PDF e SGML.
a lungo possibile [50], i rischi e gli oneri che derivano dalla obsolescenza tecnologica. In questi ultimi anni si
è, infatti, assistito a una crescita esponenziale di ambienti applicativi diversi e alla proliferazione dei
formati per la creazione di documenti elettronici che a fini conservativi devono essere necessariamente
convertiti in prodotti standard in grado di garantire connettività. Tra questi sta ottenendo, non a caso,
notevole successo la soluzione fornita dai linguaggi di marcatura, che identificano e mantengono con
strumenti indipendenti dall'hardware e dal software metadati strutturati, predefiniti ma allo stesso tempo
flessibili, condivisibili e, insieme, suscettibili di un trattamento dettagliato. E' naturalmente indispensabile
sviluppare schemi concettuali e grammatiche specifiche per la formazione, gestione e tenuta dei
documenti che identifichino le informazioni necessarie al mantenimento e all'uso dei documenti, dai dati
di contesto organizzativo a quelli relativi all'ordinamento dei documenti, dalla loro organizzazione in serie
e fascicoli al tracciamento degli interventi conservativi, ecc.
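Una grammatica di questo tipo potrebbe assumere, in una bozza puramente indicativa e non corrispondente ad alcuna DTD esistente, una forma come la seguente:

   <!-- Bozza ipotetica di DTD per i metadati di tenuta dei documenti -->
   <!ELEMENT documentoArchiviato (contestoOrganizzativo, classificazione, fascicolo, interventiConservativi?)>
   <!ELEMENT contestoOrganizzativo (soggettoProduttore, ufficioResponsabile)>
   <!ELEMENT soggettoProduttore (#PCDATA)>
   <!ELEMENT ufficioResponsabile (#PCDATA)>
   <!ELEMENT classificazione (#PCDATA)>
   <!ELEMENT fascicolo (#PCDATA)>
   <!ATTLIST fascicolo serie CDATA #REQUIRED>
   <!-- tracciamento degli interventi conservativi (copiatura, conversione, migrazione) -->
   <!ELEMENT interventiConservativi (migrazione | conversione | copiatura)*>
   <!ELEMENT migrazione EMPTY>
   <!ATTLIST migrazione data CDATA #REQUIRED>
   <!ELEMENT conversione EMPTY>
   <!ATTLIST conversione data CDATA #REQUIRED>
   <!ELEMENT copiatura EMPTY>
   <!ATTLIST copiatura data CDATA #REQUIRED>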
L'uso di XML, tuttavia, apre ulteriori e molto significative possibilità per lo sviluppo di sistemi
documentari informatici, soprattutto perché consente, oltre alla gestione dei riferimenti esterni al
documento e alle sue partizioni, anche il trattamento della struttura logica e semantica dei contenuti.
Questi sviluppi si possono tradurre nella decisione di:
§ promuovere, all'interno di un'organizzazione, interventi di razionalizzazione e semplificazione
delle tipologie documentarie mediante la definizione di rappresentazioni specifiche con lo
scopo di ottimizzare l'elaborazione automatica dei documenti, garantire coerenza, qualità e
uniformità dei materiali [51],
§ sviluppare strumenti di recupero e riutilizzo di documenti (o di componenti interne) ai fini di
una distribuzione/condivisione di contenuti destinati a durare nel tempo,
§ gestire formati multipli,
§ utilizzare i sistemi di validazione XML anche a fini di sicurezza e di integrità (si veda l'esempio
riportato dopo l'elenco),
§ controllare e ottimizzare i cicli di gestione dei documenti.
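Un esempio minimo - puramente indicativo, con nomi di file ipotetici - mostra come la dichiarazione di una DTD e l'associazione di un foglio di stile possano sostenere, rispettivamente, il controllo di integrità strutturale mediante validazione e la resa dello stesso contenuto in formati di presentazione diversi:

   <?xml version="1.0" encoding="UTF-8"?>
   <!-- la dichiarazione DOCTYPE consente a un parser validante di verificare
        che il documento rispetti la grammatica dichiarata (controllo di integrità strutturale) -->
   <!DOCTYPE documento SYSTEM "documento.dtd">
   <!-- lo stesso contenuto può essere reso in formati diversi tramite fogli di stile distinti -->
   <?xml-stylesheet type="text/xsl" href="documento-html.xsl"?>
   <documento id="doc-001">
      <oggetto>Esempio di documento validabile</oggetto>
      <testo>...</testo>
   </documento>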
E', tuttavia, importante sottolineare che XML può svolgere una funzione rilevante nei processi
di automazione del settore documentario anche in termini di contenimento dei costi ed efficienza
dei risultati se è accompagnato da un uso diffuso di DTD. Le Document Type Definition - è stato
recentemente ricordato da Charles Goldfarb e Paul Prescod [52] - migliorano, infatti "la
permanenza, la longevità e l'ampio riutilizzo dei propri dati, insieme alla prevedibilità e
all'affidabilità della loro elaborazione". Tuttavia lo sviluppo di DTD è innanzi tutto una questione
che rimette al centro della progettazione i problemi di struttura logica e concettuale che,
naturalmente, richiedono un approccio seriamente interdisciplinare e soprattutto presuppone
un'effettiva volontà di cooperazione per la definizione di regole comuni, se non di veri e propri
standard di settore. Non è un caso, infatti, che gli ambiti di sviluppo più promettenti si concentrino
proprio nella definizione, per ambiti settoriali, di standard internazionali o di procedure
normalizzate a livello nazionale. E', ad esempio, il caso dell'Encoded Archival Description
53
promosso dalla Library of Congress oppure della circolare in corso di elaborazione da parte
dell'Autorità per l'informatica per la definizione di una DTD per lo scambio di documenti in rete tra
54
pubbliche amministrazioni .
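A titolo puramente illustrativo - e senza alcuna corrispondenza con la DTD in corso di elaborazione - un frammento di segnatura con le informazioni richiamate nella nota 54 potrebbe presentarsi così:

   <!-- Frammento esemplificativo e ipotetico: nomi degli elementi non normativi -->
   <segnatura>
      <oggetto>Richiesta di accesso agli atti</oggetto>
      <mittente>Amministrazione X</mittente>
      <destinatario ufficio="Ufficio protocollo">Amministrazione Y</destinatario>
      <!-- informazioni facoltative -->
      <indiceClassificazione>3.2.1</indiceClassificazione>
      <allegati numero="1"/>
      <procedimento>...</procedimento>
   </segnatura>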
50
E' stato osservato che anche gli standard sono destinati a subire processi di evoluzione e, quindi, di obsolescenza e
che per il loro successo e la loro diffusione non è sufficiente l'approvazione da parte degli organismi internazionali,
ma servono notevoli investimenti in campo applicativo, tutt'altro che garantiti da un provvedimento ufficiale di
riconoscimento.
51
L'utilizzo di XML accentua l'importanza della struttura semantica dei documenti anche perché lo standard
consente di distinguere in modo chiaro tra elementi e attributi, cioè - in termini di analisi del documento archivistico
- tra i dati che costituiscono la struttura costitutiva generale per tipi di documento e gli attributi intesi come
informazioni di secondo livello (proprietà degli oggetti e non sue parti). Cfr in proposito Charles F. Goldfarb e Paul
Prescod, XML, Milano, McGraw-Hill Italia, 1999, p. 400.
52
Ibidem
53
L'Encoded Archival Description (EAD) è il risultato di un progetto di ricerca avviato nel 1993 dall'Università di
Berkeley sull'uso dei linguaggi di marcatura (allora SGML) per la pubblicazione di strumenti di ricerca in ambiente
digitale.
54
Il provvedimento concluderà la lunga serie di disposizioni concernenti la gestione informatica dei documenti
amministrativi, avviata con il dpr 428/1998 sulla tenuta del protocollo informatico. Le regole tecniche, applicative
del dpr citato, in corso di approvazione da parte della Presidenza del Consiglio dei ministri, dedicano un'intera
sezione alle modalità di trasmissione e registrazione dei documenti informatici e introducono l'obbligo, per lo scambio
dei dati relativi alla segnatura di protocollo, dell'utilizzo dello standard XML e delle DTD elaborate dal Centro
tecnico per la rete unitaria della p.a. Nella bozza delle regole - che come si è ricordato sono a disposizione sul sito
dell'Autorità (www.aipa.it) - l'articolo 19 stabilisce le informazioni da includere nella segnatura: oggetto, mittente e
destinatario costituiscono le informazioni obbligatorie, cui si possono aggiungere i dati relativi alla persona o
all'ufficio all'interno della struttura destinataria cui si presume sia affidato il trattamento del documento, l'indice di
classificazione, l'identificazione degli allegati, il procedimento e il suo trattamento e tutte le informazioni che le
amministrazioni specifiche vorranno concordare nell'ambito di rapporti reciproci.
Il progetto di partnership Italia-USA sulla metodologia XML per la conservazione e l'accesso ai documenti
elettronici [55]
Le amministrazioni archivistiche dei Paesi tecnologicamente all'avanguardia esprimono da tempo le
loro crescenti preoccupazioni in relazione alla capacità di affrontare adeguatamente il futuro delle
memorie digitali. Nel 1998 il responsabile dell'amministrazione archivistica statunitense, John Carlin,
sottolineava che la crescita esponenziale dei documenti elettronici prodotti dal governo federale (milioni di
file in pochi anni) era ed è incompatibile con le risorse e con gli strumenti disponibili e che il rischio di
perdita definitiva di larga parte del patrimonio documentario contemporaneo richiede uno sforzo
eccezionale, non solo in termini di investimento per le attrezzature tecnologiche ma anche per la ricerca
di soluzioni per la sperimentazione e verifica su larga scala di tecnologie avanzate per la conservazione
di documenti elettronici. Da questa preoccupazione era, quindi, nata la decisione del National Archives di
Washington di partecipare a un impegnativo programma di ricerca avviato dall'Università della California,
il Distributed Object Computation Testbed (DOCT), per valutare soluzioni informatiche avanzate in grado
di gestire grandi quantità di documenti digitali [56]. Uno dei punti di forza del progetto NARA-DOCT/Electronic
Records Management Project è stato proprio lo sviluppo di strumenti basati sullo
standard XML per la migrazione di documenti informatici e dei metadati necessari a garantirne
l'accessibilità e a provarne l'integrità.
La prima fase della ricerca, che è stata condotta a partire dal 1° ottobre 1998 e che, per quanto
riguarda il quadro concettuale di riferimento, è strettamente correlata al progetto InterPARES, ha già dato
alcuni primi risultati significativi:
§ la definizione di un'architettura scalabile per gestire la migrazione dei supporti,
§ l'elaborazione di un modello informativo per trattare la migrazione dei dati di contesto.
La sperimentazione si era, tuttavia, concentrata sul trattamento ai fini della conservazione
permanente (nella ricerca si parla di un arco temporale di 400 anni) di un fondo archivistico costituito da
oltre un milione di messaggi di posta elettronica conservati presso il National Archives.
La fase successiva, che si è aperta alcuni mesi fa grazie a un nuovo finanziamento di 300.000
dollari del National Historical Publications and Records Commission, è destinata ad allargare il campo di
indagine ad almeno tre grandi classi di documenti elettronici (documenti testuali, documenti composti,
documenti GIS ) il cui accesso richieda l'uso di strumenti software. Il nodo centrale della ricerca, che
corrisponde alla questione di fondo della conservazione delle memorie digitali, è quello di:
§ definire un meccanismo per la creazione parzialmente automatica della rappresentazione
digitale dei documenti in forme indipendenti dal software e sostitutive di originali che non
possono essere conservati a lungo termine per ragioni di obsolescenza,
§ predisporre un prototipo di strumento software indipendente dalle piattaforme,
sufficientemente robusto, flessibile e scalabile (Archivists' Workbench Software Package),
basato sull'utilizzo di XML in quanto standard emergente (e promettente) per la
rappresentazione e lo scambio informatico sul web e fondato sui risultati ottenuti nel corso
delle precedenti indagini condotte dalla Università della California relative a sistemi di
55
Il progetto statunitense - NARA-NPACI, "Methodologies for Preservation and Access of Software-dependent
Electronic Records" - è stato promosso nel 1998. Una seconda fase di durata triennale del programma di ricerca che
ha ottenuto nuovi consistenti finanziamenti dal National Historical Publications and Records Commission è stata
approvata nella primavera del 2000 con l'obiettivo specifico di affrontare i problemi di scalabilità delle soluzioni
individuate e della loro utilità per ambienti di (http://www.sdsc.edu/NHPRC).
56
Il progetto è finanziato dall'US Patent and Trademark Office e dalla Defense Advanced Research Projects
Agency. Per maggiori informazioni sul progetto cfr http://www.sdsc.edu/DOCT.
wrapper-mediator (cioè componenti software che operano come traduttori tra i formati nativi
di una fonte informativa e un protocollo comune) anch'essi basati su XML. La scalabilità dei
prodotti riguarda la capacità di rispondere anche alle esigenze di depositi archivistici di
medie e piccole dimensioni. Un ulteriore sviluppo del progetto riguarda l'integrazione di
software esistenti con le funzionalità realizzate con il prototipo.
All'origine di questa scelta c'è la convinzione che i documenti elettronici possano essere
considerati come fonti distribuite di informazione semi-strutturata, costituite da uno schema definito di
componenti informative interne ed esterne al documento e da una serie di elementi passibili di variazione
(il supporto, il contesto tecnologico, ecc.).
Il progetto americano si basa su una serie di presupposti e di pre-condizioni (si veda anche l'esempio riportato dopo l'elenco):
§ la considerazione che la codifica ASCII o Unicode per le informazioni testuali e la codifica bitmap
per le immagini siano indipendenti dalle infrastrutture tecnologiche,
§ l'assunto per cui la rappresentazione di informazione strutturata mediante linguaggi di marcatura
(XML) è indipendente e di facile accesso e consente l'auto-descrizione dei documenti,
§ la definizione di una metodologia per la creazione di fonti informative sostitutive degli originali
basata sullo sviluppo di "contenitori" (wrapper) di prodotti software strutturati in modo che:
§ tutti i metadati che descrivono i contesti documentari abbiano la forma di documenti XML
forniti di specifiche DTD,
§ tutte le informazioni testuali siano convertite in documenti XML,
§ tutte le immagini siano convertite in bitmap,
§ tutti i riferimenti a immagini e ad altri documenti all'interno di un documento archivistico siano
convertiti in collegamenti permanenti a loro volta rappresentati in un formato XML
compatibile.
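Uno schizzo - del tutto ipotetico nei nomi degli elementi, degli attributi e dei riferimenti - di come potrebbe presentarsi un documento "incapsulato" secondo le condizioni appena elencate:

   <?xml version="1.0" encoding="UTF-8"?>
   <!-- Esempio ipotetico di "contenitore" (wrapper): metadati e testo come XML,
        immagini in formato bitmap richiamate tramite collegamenti permanenti -->
   <!DOCTYPE documentoConservato SYSTEM "documentoConservato.dtd">
   <documentoConservato>
      <metadatiContesto>
         <provenienza>Ente X, Servizio Y</provenienza>
         <classificazione>3.2.1</classificazione>
      </metadatiContesto>
      <testo>
         <paragrafo>Testo convertito in XML a partire dal formato nativo...</paragrafo>
      </testo>
      <immagine riferimentoPermanente="archivio:fondoX/serieY/immagine-001.bmp"/>
      <riferimento documento="archivio:fondoX/serieY/documento-002"/>
   </documentoConservato>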
Un aspetto del progetto che merita una specifica riflessione riguarda la necessità di prevedere
modifiche - prodotte anche con procedure automatiche - delle DTD in seguito ad interventi di
conversione, migrazione o copiatura dei materiali digitali da parte delle istituzioni archivistiche cui sono
affidati.
Alcuni risultati sono già stati raggiunti e riguardano, come si è ricordato, la struttura del modello
informativo per la conservazione permanente di materiali archivistici [57]. In particolare, nel progetto si
identificano almeno tre nuclei di elementi che devono essere mantenuti nel sistema (simultaneamente
alle singole entità documentarie) e che sono esemplificati, in forma ipotetica, dopo l'elenco:
§ lo schema logico che organizza gli attributi essenziali, cioè
§ i metadati relativi ai documenti singoli (digital object representation) che ne definiscono la
struttura, il contesto fisico e la provenienza,
§ i metadati che si riferiscono alla organizzazione dell'archivio e includono le diverse informazioni di
contesto (data collection representation), a loro volta organizzati in sotto-insiemi,
§ i metadati di presentazione (presentation representation), che consentono la conservazione di
diverse interfacce utente, in particolare dell'interfaccia originaria,
§ la descrizione fisica degli attributi all'interno del database del deposito archivistico,
§ un dizionario dei dati per le definizioni semantiche degli attributi.
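A titolo puramente indicativo, e senza alcuna corrispondenza con gli schemi effettivamente adottati dal progetto, i tre gruppi di rappresentazione potrebbero essere resi in XML come segue:

   <!-- Schizzo ipotetico dei tre nuclei di metadati descritti sopra -->
   <schemaLogico>
      <digitalObjectRepresentation documento="doc-001">
         <struttura>lettera</struttura>
         <contestoFisico>supporto ottico, deposito Z</contestoFisico>
         <provenienza>Ente X, Servizio Y</provenienza>
      </digitalObjectRepresentation>
      <dataCollectionRepresentation>
         <contestoAmministrativo>...</contestoAmministrativo>
         <contestoDocumentario>...</contestoDocumentario>
      </dataCollectionRepresentation>
      <presentationRepresentation>
         <interfacciaOriginaria foglioDiStile="presentazione-originale.xsl"/>
      </presentationRepresentation>
   </schemaLogico>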
Come emerge anche da questa breve presentazione, i ricercatori sono consapevoli della grande
complessità della struttura informativa dell'archivio e delle meta-informazioni che devono essere
identificate, mantenute e gestite nel tempo per assolvere il compito della conservazione. Le attività più
delicate non riguardano tanto le soluzioni tecnologiche quanto i problemi semantici, ovvero l'individuazione e
l'utilizzo delle componenti logiche e la definizione e articolazione dei sotto-sistemi. Perché si ottengano
risultati di qualità su questo terreno di ricerca sono necessarie una padronanza dei principi e dei metodi
archivistici e una solida esperienza maturata in ambienti, tradizioni e giurisdizioni diverse. E' per questa
ragione che gli studiosi statunitensi hanno accolto positivamente la proposta di collaborazione con quelle
istituzioni italiane che da anni condividono le medesime preoccupazioni sulla conservazione dei
57
Reagan Moore, Chaitan Baru, Arcot Rajasekar, Bertram Ludaescher, Richard Marciano, Michael Wan, Wayne
Schroeder e Amarnath Gupta, Collection-Based Persistent Digital Archives. Part I, in "D-Lib Magazine", 6 (2000), n.
3, disponibile al seguente indirizzo: http://www.dlib.org/march00/moore.
documenti informatici e il medesimo interesse per le potenzialità dei linguaggi di marcatura e che hanno,
comunque, avviato un analogo programma di lavoro [58].
Il progetto italiano, promosso dall'Istituto di studi per la tutela dei beni archivistici e librari
dell'Università di Urbino e sostenuto dall'Ufficio centrale per i beni archivistici, dall'Associazione nazionale
archivistica italiana e dal Consorzio Roma Ricerche, riguarda in particolare la verifica in ambito europeo
dei requisiti funzionali per la conservazione di archivi digitali e la definizione di una metodologia basata
sul trattamento dei metadati mediante l'utilizzo di XML e lo sviluppo di DTD in stretta connessione con la
ricerca InterPARES [59] e la complessa indagine NARA-NPACI. L'impegno più significativo - per il quale è
prevista la diretta collaborazione con il gruppo di lavoro statunitense - riguarda l'individuazione degli
attributi necessari a garantire l'autenticità, l'integrità e l'accessibilità a lungo termine dei documenti
elettronici e la loro strutturazione. La collaborazione si basa sull'analisi dei materiali di indagine, sulla
comune valutazione del metodo sviluppato e sulla organizzazione congiunta di seminari e workshop.
E' presto per valutare gli esiti di un rapporto appena avviato, anche se sin d'ora si può, comunque,
ritenere che l'iniziativa consentirà un confronto molto concreto e operativo su aspetti vitali della ricerca
archivistica. Non si può, tuttavia, tacere una considerazione per quanto riguarda le difficoltà in cui si
svolge oggi la ricerca in Italia soprattutto in settori che offrono una limitata visibilità: a fronte di consistenti
e continui investimenti finanziari da parte delle istituzioni nordamericane, nel nostro Paese il lavoro di
indagine è caratterizzato da iniziative quasi individuali sostenute dalle modestissime risorse delle
istituzioni culturali (in questo caso l'amministrazione archivistica e l'associazione degli archivisti italiani).
Eppure la salvaguardia della memoria documentaria del futuro e i grandi rischi che la minacciano
costituiscono un tema vitale per ogni comunità civile che abbia il senso della propria dimensione storica
e, almeno, quello della sua continuità. E' vero, purtroppo, che gli orizzonti temporali degli individui e delle
amministrazioni si accorciano sempre più e che solo gli specialisti di settore sembrano avere ancora a
cuore i problemi - costosi, impegnativi e assai poco remunerativi - della memoria. Per fortuna le
tecnologie hanno già più volte dimostrato di saper trovare le risposte anche ai problemi di cui sono esse
stesse responsabili. XML è, appunto, uno strumento che apre prospettive incoraggianti per ridurre i
rischi di perdita e corruzione dell'informazione digitale.
58
In particolare, l'amministrazione archivistica e il Consorzio Roma Ricerche conducono da tempo uno studio e
hanno già prodotto alcune realizzazioni sull'uso di SGML/XML per il recupero retrospettivo di strumenti di ricerca
archivistici. Si è, ad esempio, recentemente affrontata la digitalizzazione della Guida generale degli Archivi di Stato
italiani utilizzando il formato XML. Cfr. http://www.maas.ccr.it/cgi-win/h3.exe/aguida/findex.it, con particolare
riferimento alle parti intitolate: "La storia della Guida" e "Il progetto informatico".
59
Chi scrive, oltre a svolgere le funzioni di direttore dell'Istituto di Urbino, è anche il coordinatore del team
italiano che collabora nell'ambito della ricerca InterPARES. Cfr. M. Guercio, La ricerca InterPARES. Lo
stato del progetto, in "Il mondo degli archivi", 1999, 1, pp. 10-14; Id., Il futuro per le memorie digitali, in
"Autorità per l'informatica nella pubblica amministrazione, Notiziario", 2000, 1, pp. 50-55; Id., Qualche
informazione sullo stato di avanzamento del progetto Inter-PARES, in “Il mondo degli archivi”, 2000, pp.
47-48.