The preservation of electronic records
Transcript
Autorità per l'informatica nella Pubblica Amministrazione
The preservation of electronic records: organizational and technical aspects
Study seminar, Rome, 30 October 2000

Programme

SESSION I: THE PRESERVATION FUNCTION. PROBLEMS, STRATEGIES, EXPERIENCES
9:00  Opening of the seminar - Salvatore Italia, Director of the Ufficio Centrale per i Beni Archivistici, and Guido Mario Rey, President of the Autorità
9:15  Introductory report - Carlo Batini (Autorità per l'Informatica nella Pubblica Amministrazione)
9:35  The management and the preservation of electronic records in the Italian public administrations - Antonio Massari (Autorità per l'Informatica)
10:10 Information Management Architecture for Persistent Object Preservation - Reagan W. Moore (University of California)
10:40 Break
11:00 The strategies for the preservation. Projects and perspectives in the US Federal Government - Ken Thibodeau (National Archives USA)
11:30 The Swedish programs for the electronic records preservation and access - Göran Kristiansson (Riksarkivet, Sweden)
12:00 Responding to the challenges and opportunities of ICT: the new record manager - Seamus Ross (University of Glasgow)
12:30 Discussion
13:00 Lunch break

SESSION II: METHODS FOR PRESERVATION, THE ROLE OF METADATA, AND THE POTENTIAL OF XML
14:00 Introductory report - Maria Guercio (Università di Urbino)
14:30 A logical model for the electronic records authentication - Bill Underwood (Georgia Tech Research Institute)
15:00 Metadata: Archival Concept or IT Domain? - Peter Horsman (President, IT Committee, International Council on Archives)
15:30 The potential of markup languages to support descriptive access to electronic records: the EAD standard - Anne Gilliland-Swetland (University of California, Los Angeles)
16:00 XML standards for business-to-business and business-to-government communication - Zachary Coffin (eXtensible Business Reporting Language)
16:30 XML, uno standard per gli archivi informatici (XML, a standard for the electronic records) - Daniele Tatti (Autorità per l'informatica)

Contents

Foreword
The Evolution of Processing Procedures for Electronic Records
  Archival Principles Imbued in Electronic Records Processing
  Accessioning
  Archival Preservation System
  Archival Electronic Records Inspection and Control System
  Archival Management Information System
  GAPS Database
  Documentation
  Preservation
  Reasons for Past Success
  Emerging Issues
History of NARA's Electronic Records Program
An Historical Perspective on Appraisal of Electronic Records, 1968-1998
  Record/Non-Record, Not Valuable/Valuable
  Applying Traditional Archival Theory of Appraisal
  Applying Traditional Records Management Techniques
  Innovation, Trying New Approaches
Knowledge-based Persistent Archives
  Abstract
  1. Introduction
  2. Knowledge-based Archives
    2.1 Archive Accessioning Process
    2.2 Archival Representation of Collections
  3. Relationships between NARA and Other Agency Projects
  Acknowledgements
NARA and Electronic Records: A Chronology
The Potential of Markup Languages to Support Descriptive Access to Electronic Records: The EAD Standard
  Abstract
  Introduction
  Describing Electronic Records
  Encoded Archival Description
  Using EAD to Describe Electronic Records
  Conclusion
Preservation and Migration of Electronic Records: The State of the Issue
  The problem of preserving electronic records
  Approaches to the problem of preserving electronic records
  The need for an archival approach to preserving electronic records
  An architecture for archival preservation
    1. the right data was put into storage properly;
    2. either nothing happened in storage to change this data, or any changes in the data over time are insignificant;
    3. all the right data and only the right data was retrieved from storage;
    4. the retrieved data was subjected to an appropriate process; and
    5. the processing was executed correctly to output an authentic reproduction of the record.
  Collection-based persistent object preservation
  Viability of the persistent object preservation method
  Conclusion
Responding to the Challenges and Opportunities of ICT: The New Records Manager
XML per la conservazione dei sistemi documentari informatici (XML for the preservation of electronic record-keeping systems)

Foreword

The seminar, organized by the Autorità and by the Ufficio Centrale per i Beni Archivistici of the Ministero per i Beni e le Attività Culturali, aims to address the crucial question of the long-term preservation of electronic records: on the one hand, by raising awareness among the public administrations that have already started or planned the automation of their record-keeping systems; on the other, by launching, also through international comparison, the analysis and verification of the methods and tools that allow digital memories to maintain their integrity and authenticity over time, as well as their accessibility. The topic has gained increasing currency and relevance with the wide diffusion of information and communication technologies for the management of administrative and documentary workflows. It requires, however, the solution of numerous organizational and technical problems that international research groups and national institutions all over the world have been studying for some years with growing commitment, above all in order to identify scalable, low-cost solutions that make it possible even for small and medium-sized organizations to undertake general automation of administrative processes and of the transmission and keeping of records.
In particular, in the first session the seminar aims to offer the Italian public administrations - above all those responsible for information systems and for record-keeping systems - an overview of the specific national situation and of the most interesting projects and operational experiences carried out in other countries (the United States and Sweden) in close collaboration with broader international research groups, together with an analysis of the training and retraining needs of the professional profiles destined to take on the new and difficult role of keeper of digital memories. The second session examines the more strictly technical questions of preservation, in particular the different methods available today for dealing with technological obsolescence and, especially, the potential of markup languages (SGML, XML) for the treatment and management of the metadata that guarantee the long-term keeping of the documentary heritage, and the development of specific standards (Encoded Archival Description).

THE EVOLUTION OF PROCESSING PROCEDURES FOR ELECTRONIC RECORDS [1]
Bruce Ambacher
Electronic and Special Media Records Services Division, National Archives and Records Administration

In 1966 Archivist of the United States Robert Bahmer established the Committee on Disposition of Machine-Readable Records. In January 1968 the Committee presented its report recommending that the National Archives address the records management and archival aspects of Federal machine-readable records. Bahmer then assigned senior National Archives staff to make the recommendations a reality. He tasked experienced records managers and archivists, such as Ev Alldredge and Meyer Fishbein, with developing an integrated program, which was formally established as the Data Archives Staff late in 1968 by Bahmer's successor, James B. Rhoads.
It was natural, therefore, that the newest archival records program would be imbued with as many aspects of traditional records management and archival theory and practice as possible, and that the practices and procedures developed for and by the new program would conform to National Archives standards.

ARCHIVAL PRINCIPLES IMBUED IN ELECTRONIC RECORDS PROCESSING

The establishment of a separate Data Archives Staff in the fall of 1968 continued several National Archives traditions. As early as 1938 the Archives had recognized the practical need for separate programs for distinct types of records which required special staff, storage, preservation, or reference services. In 1968 separate programs already existed for photographs, motion pictures, sound recordings, and cartographic materials, as well as separate custodial divisions for civil and military textual records. Thus, establishing a separate Data Archives Staff in 1968 to perform all archival functions (except appraisal) for computer-generated data reflected National Archives practice. This staff immediately began adapting and applying traditional records management and archival functions to electronic data. Establishing the Data Archives Staff also confirmed the National Archives' continuing support for the principle of archival custody. Both the experienced Archives staff responsible for establishing the program and the archivists hired to implement it, regardless of professional training and previous experience, recognized the importance of accessioning and gaining physical and intellectual control over machine-readable records identified as having enduring value. In the first year the new staff began to use effective textual records management and textual archival procedures for machine-readable records. The first step was a records management survey of extant machine-readable records to help determine which Federal records might be accessioned.
The staff also developed the first of an ongoing series of procedures manuals. This one, “A Procedure for Accepting Digital and Analog Magnetic Tape for Archival Storage,” dealt with both accessioning and preservation guidelines. It established basic technical accessioning and preservation criteria such as tape format, initial readability of the data, storage conditions including temperature and humidity, and routine preservation activities such as tape rotation and controlled-tension rewinding. In June 1973 the more extensive draft, Recommended Environmental Conditions and Handling Procedures for Magnetic Tape, superseded the preservation sections of the earlier Procedure. Today, lengthy specifications in the Code of Federal Regulations regulate agency custody and storage of two categories of electronic records: those that are unscheduled and those that have enduring value and are scheduled for transfer to the National Archives. Once electronic records are transferred to NARA they are controlled by the Code of Federal Regulations, by the division's internal Preservation Procedures Handbook, and by NARA preservation guidelines. These are reinforced by more generalized environmental care and handling guidelines from sources such as the National Institute of Standards and Technology and the National Media Laboratory, as well as guidance published by hardware and media manufacturers.

[1] An earlier version of this paper was presented at the Annual Conference of the Society of American Archivists in Denver, Colorado, in August 2000.

The initial inventorying and scheduling activities confirmed another simple but most important archival principle: the record nature of computer data. As Federal agencies recognized machine-readable items as Federal records, records managers began to include them in records control schedules. The first schedule that contained a permanent machine-readable series was processed in 1971.
A closely related significant early step was the development and issuance of General Records Schedule 20, covering machine-readable records, in April 1972. Linda Henry has reviewed the history of GRS 20. I wish to emphasize that issuing GRS 20 confirmed another basic National Archives concept: responsibility for the entire life cycle of records, that is, identifying, appraising, accessioning, preserving, and providing reference service for records with enduring value created or received by the Federal Government, regardless of physical format.

ACCESSIONING

In April 1970 the Data Archives Staff accepted its first electronic records data file transfer, NASA's Tektite I, and began the accessioning process. Building on the approach in “A Procedure for Accepting Digital and Analog Magnetic Tape for Archival Storage,” the staff made a master and a backup copy, deposited the copies in separate storage areas, negotiated the restriction-on-access statement with the creating agency, prepared a basic documentation package, and developed a catalog entry. These are still the basic steps taken at NARA and at every repository that has a custodial program for electronic records. The relatively rapid growth in the volume of accessioned machine-readable holdings and in the number of staff responsible for them highlighted the need to further develop written standardized procedures. By the mid-1970s the staff had grown from the initial three to more than fifteen, several with minimal archival experience. Starting with the first accession in April 1970, the collection had grown four years later to more than 122 series in 18 record groups on 1000 reels of magnetic tape. In 1973 the fledgling technical services staff issued the draft Recommended Environmental Conditions and Handling Procedures for Magnetic Tape to address some aspects of preservation. More formal written preservation procedures focusing on in-house procedures came with in-house computing capabilities.
In 1976 the Supervisory Archivist in charge of accessioning assigned the task of developing a comprehensive Appraisal-Description-Accessioning Procedures Handbook to a new staff member as a training project. The first draft was available in 1978. A polished, comprehensive loose-leaf notebook, including elaborate flowcharts of all procedures, checklists of required elements for each task, and samples of appraisals, descriptions, and documentation packages, was completed in 1980. The division has continued maintaining this and other procedures manuals. The most recent version, completed in June by Linda Henry, is online on the division's local area network and contains all of the current forms as well as up-to-date samples of current reports and descriptions. The major purpose of the accessioning and preservation procedures, regardless of the version, the year created, or the computer equipment used, is to ensure that all aspects of accessioning and initial preservation are completed and that each accession completes all steps consistently, in conformance with current Archives standards. Obviously, there has been some variation in the required forms, the preservation techniques used, the extent of the documentation, and the type of description over the past three decades. In 1968 the National Archives and Records Service was a Service within the General Services Administration (GSA). In the interest of economy and efficiency, GSA had consolidated all computer services in the GSA computer staff in Region 3. GSA directed the Data Archives Staff to use that staff for computer services for accessioning, preservation, and reference. This inevitably created problems. GSA's technical staff did not appreciate the meaning of terms such as “archival” or “permanent,” and they did not give high priority to processing older files “inherited” by NARS from other agencies when their own managers placed greater importance on current projects.
In 1973, as a result of events that Thomas Brown describes elsewhere, GSA reassigned a computer programmer to the Data Archives Staff, giving the staff direct responsibility for technical services associated with accessioning, preservation, and reference. The program, however, continued to depend on outside vendors to provide computer time for all accessioning, preservation, and reference services for two more decades. Technical Services staff wrote punch-card job decks that contained processing instructions. A courier service took the jobs and the computer tapes to a vendor computer center for batch processing. This processing generally was performed overnight in non-prime time, when computer time was least expensive. Any minor error in keypunching or job command language would abort the job; the staff would make the minor adjustment or correction and resubmit it. A first major step toward acquiring in-house computing capability was a side benefit of American Friends Service Committee et al. v. William Webster et al., the 1978 lawsuit against the National Archives and the Federal Bureau of Investigation. The legal aspects and the archival implications of the FBI appraisal project have been assessed elsewhere. Its impact on the electronic records program was significant. The Machine-Readable Division, in addition to detailing three staff full-time to the appraisal project, provided the Archives team with computer-generated samples, computer analyses, and statistics for team evaluation and court scrutiny. The need to reduce turnaround time led the division to acquire a “dumb” terminal to process computer jobs. It also was a factor in GSA's decision in 1982 to acquire a computer for the division. At the same time that the FBI Project was diverting resources, President Ronald Reagan, who had campaigned on an anti-big-government theme, imposed a budget and hiring freeze on Federal agencies.
GSA went even further and initiated a six percent budget reduction across its Services. Within the National Archives this translated into Reductions-in-Force (RIFs), which led to dismissing all employees hired within the previous three years. The Machine-Readable program, with its overwhelmingly newer staff, took a disproportionate share of the cuts: RIFs reduced the staff from twenty in 1981 to just twelve in 1982. Custody of the Archives' first computer was diverted from the greatly reduced and demoted branch to an administrative program. Some branch staff devoted to technical services also were transferred to that division, reducing the staff to just seven. The computer was used primarily for administrative and agency-wide needs. Electronic records preservation and reference copying was a low priority, and all aspects of processing electronic records languished throughout the 1980s.

ARCHIVAL PRESERVATION SYSTEM

The most significant time savings in processing data files comes from introducing in-house automation into the process. Developing that in-house capability was a key component of the revitalization of the electronic records program that began in 1989. In addition to automating the accessioning and preservation processes, automating management control and tracking was also a high priority. The revised accessioning and preservation strategy recognized that the singular approach of the previous two decades increasingly had the potential to modify the evidential character of the electronic records being preserved, that is, to move away from preserving records as received. A more diverse, modular approach was required to ensure capturing and preserving the essential record characteristics of each file. This strategy was a major theme in the development of the Archival Preservation System (APS). NARA designed APS to use standard personal computers and magnetic tape drives with custom software to perform four basic functions.
First, APS copies electronic records from a variety of media to the medium chosen for archival preservation. Second, it imprints the archival copies of physical files with standard specifications for recording and labeling. Third, APS automatically tracks and captures information on the files copied, the media volumes created, and the processes used, to assist both preservation and access. Finally, APS facilitates future migration of files to new media. The key to APS efficiencies is standardizing the preserved data files. APS received a real baptism under fire. It arrived in mid-1993, in time to be diverted from archival preservation processing to duplication of backup tapes transferred from the Executive Office of the President to NARA under temporary restraining orders imposed by the U.S. District Court as part of the Armstrong v. EOP lawsuit. By 1995 APS had succeeded in copying more than 99.98% of all the EOP records from a variety of physical media. In the past seven years the electronic records program has used APS to successfully preserve tens of thousands of electronic records data files on thousands of physical volumes. Today fifteen APS workstations in various local area network and stand-alone configurations process virtually all of NARA's electronic records. To date NARA has spent $1,845,000 on APS. This includes $500,000 for initial proof of concept, programming, and subsequent reprogramming of APS code. Hardware costs, including Y2K replacements, exceed $900,000. Daily operations and maintenance have cost nearly $400,000. In the next few years there are plans both to expand APS processing capabilities to include additional types of computer files, such as images and office automation, and to enhance its reporting capabilities. Currently APS is viewed as an essential component of, or delivery system to, the Electronic Records Archive.
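The four APS functions just named (copy to archival media, imprint standard labels, track what was processed, support migration) can be sketched in miniature. Everything below is a hypothetical stand-in, not NARA's actual software: the function names, the label layout, and the use of SHA-256 as the integrity check are all assumptions made for illustration.

```python
import hashlib
import json
import shutil
from pathlib import Path

def preserve_file(source: Path, archive_dir: Path, catalog: list) -> Path:
    """Copy a record file to archival storage, imprint a standard label,
    and track the transfer -- loosely mirroring APS functions 1-3."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    dest = archive_dir / source.name
    shutil.copy2(source, dest)                       # 1. copy to archival medium
    digest = hashlib.sha256(dest.read_bytes()).hexdigest()
    label = {                                        # 2. standard label specification
        "file": source.name,
        "bytes": dest.stat().st_size,
        "sha256": digest,
    }
    (archive_dir / (source.name + ".label.json")).write_text(json.dumps(label))
    catalog.append(label)                            # 3. track what was copied
    return dest

def migrate(archive_dir: Path, new_dir: Path, catalog: list) -> None:
    """4. migrate: recopy every tracked file to new media, verifying each
    checksum against the catalog before trusting the new copy."""
    new_dir.mkdir(parents=True, exist_ok=True)
    for entry in catalog:
        old = archive_dir / entry["file"]
        assert hashlib.sha256(old.read_bytes()).hexdigest() == entry["sha256"]
        shutil.copy2(old, new_dir / entry["file"])
```

The point of the sketch is the division of labor: the catalog built at accession time is what later makes a media migration verifiable rather than a blind copy.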
ARCHIVAL ELECTRONIC RECORDS INSPECTION AND CONTROL SYSTEM

APS was designed to change the way the National Archives accessions and preserves electronic records as physical files. It did not, however, address the need to verify the contents of a data file automatically or to capture and preserve the logical and conceptual characteristics of the records and of the records systems or databases in which the records were created and used. These goals have been embodied in a second automated system, the Archival Electronic Records Inspection and Control (AERIC) system. As every processing archivist and trainee who has had to hand-validate a sample dump of computer records will testify, the process is time-consuming, labor-intensive, boring, and prone to error. Further, anyone validating records by hand could only examine an insignificant number of them, typically ten to fifty. In 1990 the Center for Electronic Records funded a proof-of-concept study to determine whether and how computer technology could be applied to automate the data verification process, focusing on the logical and conceptual structure of data. NARA received the AERIC prototype in 1991. Over the past decade the AERIC system has evolved from a single workstation to a local area network available simultaneously to every accessioning archivist. An archivist or technician addresses the question of whether the file matches what the creating agency claims it to be by entering the information about what the structure and content are supposed to be: the metadata, record structures, and code definitions associated with specific variables. The system then reads the data against the variables and codes and reports each nonconforming record. AERIC achieves its greatest efficiencies when multiple files, including periodic accretions, have the same or similar structures, permitting verification of multiple files based on the same input. To date NARA has invested about $1,300,000 in AERIC.
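The verification step just described (reading the data against archivist-supplied variables and code definitions and reporting nonconforming records) can be illustrated with a toy sketch. The codebook, field layout, and field names below are invented and do not reflect AERIC's actual design.

```python
# AERIC-style structural verification in miniature: fixed-width data
# records are checked against the metadata an archivist supplies
# (field positions and permitted code values). All values are invented.

CODEBOOK = {
    # field: (start, end, allowed codes)
    "state":  (0, 2, {"NY", "CA", "TX"}),
    "status": (2, 3, {"A", "I"}),
}

def nonconforming(records):
    """Return (record number, field) pairs for every value that falls
    outside the codebook -- the report handed back to the archivist."""
    bad = []
    for n, rec in enumerate(records, start=1):
        for field, (start, end, allowed) in CODEBOOK.items():
            if rec[start:end] not in allowed:
                bad.append((n, field))
    return bad
```

Because the check is driven entirely by the codebook, the same input verifies any number of files with the same structure, which is exactly where the text says AERIC achieves its greatest efficiencies.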
That total includes $750,000 for initial proof of concept, programming, and subsequent reprogramming and significant expansion of AERIC code. Hardware costs, including Y2K replacements and three new stand-alone workstations, exceed $400,000. Daily operations and maintenance have cost nearly $150,000. The most recent modifications allow staff to verify the content of structured text files, including e-mail and diplomatic cables. Planned upgrades to AERIC include expanding the types of electronic records it can process to other structured files, such as GIS data and images, and enhancing its storage capacity so it can be used to facilitate researcher access to archival databases. A stand-alone AERIC has just been placed into service to verify the classified electronic records transferred from the National Security Council and the offices of the Independent Counsels.

ARCHIVAL MANAGEMENT INFORMATION SYSTEM

For the past quarter century NARA's electronic records program, like other archival programs, has faced several recurring and related questions. “What is the status of a particular accession?” “How many accessions have been completed this fiscal year?” “How big are the backlogs?” “What is the best way to manage the backlog?” Throughout the 1970s and 1980s answers to these questions depended on supervisors maintaining a variety of manual logs and collecting statistics on a periodic basis. The revitalization of the program beginning in 1989 included additions to the staff and increased access to information management databases. One of the first automated systems developed was the Archival Management Information System (AMIS). AMIS is a relational database. The first version, developed in 1990, operated on DB2 software maintained at the National Institutes of Health, which provided the program's computer support at that time. The AMIS project manager populated AMIS by entering information from each accession dossier.
Each dossier contained the deed of gift, lists of the data files, accompanying restriction information, and varying amounts of processing information. While this sounds straightforward and relatively simple, it was not: given the number of administrative reorganizations, many of the dossiers could not be located. AMIS now operates on MS Access on NARA's administrative support system, NARANET. All current accessions are entered into AMIS as soon as they are received. AMIS can be used to create reports on the status of any accession, including which accessioning or preservation steps have and have not been completed and whether all required signatures have been received. AMIS also can be used to calculate the elapsed time between initial offer and completion; currently, it takes more than 800 days. Unfortunately, we still cannot tell you the exact size of the backlogs.

GAPS DATABASE

Both Linda and I have mentioned that electronic records scheduling began almost as soon as the program was established, with the first permanent series scheduled in 1971. The emphasis on scheduling has continued over the past thirty years. The obvious purpose of scheduling is the ultimate transfer of the records. In the late 1980s NARA's Federal Records Centers began to notice a distinct reduction in the volume of scheduled textual records being transferred to the centers. Initial investigation determined that one of several causes was increasing use of office automation and the migration of records from textual series to electronic series. Following up on the findings, the Deputy Archivist asked the electronic records program to determine what portion of the records it was accessioning resulted from records control scheduling efforts and whether any of the accessions reflected the evolution from paper to electronic media. This led to the development of the GAPS database, named for gaps in the holdings.
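Both reports described above, AMIS's elapsed time between offer and completion and a GAPS-style comparison of scheduled series against actual accessions, are naturally expressed as relational queries. The sketch below uses SQLite purely for illustration; every table name, column name, and schedule number is invented, and the real systems run on MS Access.

```python
import sqlite3

# Hypothetical miniature of the AMIS and GAPS tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE amis (
    schedule TEXT, item TEXT,
    offered TEXT, completed TEXT   -- ISO dates; completed is NULL if pending
);
CREATE TABLE gaps (
    schedule TEXT, item TEXT, series_title TEXT
);
""")
conn.executemany("INSERT INTO amis VALUES (?,?,?,?)", [
    ("N1-29-86-3", "1a", "1996-01-02", "1998-07-15"),
    ("N1-29-86-3", "2b", "1997-05-01", None),
])
conn.executemany("INSERT INTO gaps VALUES (?,?,?)", [
    ("N1-29-86-3", "1a", "Survey master file"),
    ("N1-29-86-3", "2b", "Follow-up file"),
    ("N1-412-91-7", "1", "Permit tracking data"),   # scheduled, never received
])

# AMIS-style report: average elapsed days for completed accessions.
avg_days = conn.execute(
    "SELECT AVG(julianday(completed) - julianday(offered)) "
    "FROM amis WHERE completed IS NOT NULL"
).fetchone()[0]

# GAPS-style report: scheduled series with no matching accession.
missing = conn.execute(
    "SELECT g.schedule, g.item FROM gaps g "
    "LEFT JOIN amis a ON a.schedule = g.schedule AND a.item = g.item "
    "WHERE a.schedule IS NULL"
).fetchall()
```

The join on schedule and item number is the same matching rule the two archivists applied by hand.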
Two archivists examined every schedule that could be identified as scheduling electronic records for transfer to NARA. Each GAPS record contains the schedule and item numbers, series title, description, disposition, cutoff instructions, and transfer dates. The archivists compared the entries with the accessions listed in the AMIS database, based on the schedule and item number. Their initial survey revealed that less than five percent of what should have been transferred actually had been. Follow-up accessioning efforts over the past decade increased that to more than thirty percent, and ongoing solicitation will result in additional transfers. For contemporary schedules the GAPS database for electronic records complements the Permanent Authorities database, which contains similar information for all other record forms.

DOCUMENTATION

A third accessioning processing procedure that is unique to electronic records programs is the preservation of a separate documentation series. Documentation is crucial to allowing researchers to understand and use the electronic file. At a minimum it contains the record layout and codes that define the data. Ideally, documentation may include an overview of the data, the data collection methodology or framework, a copy of the original input form, analysis of the data and its uses by the creator, definitions of terms, policy documents on why the study was conducted, and a bibliography of research studies based on past use of the data. Since each file is unique, no concise definition of adequate documentation exists. Gaining physical custody of valuable data files means nothing if the creating agency has not maintained, or cannot create, adequate and proper documentation to transfer with the file. Occasionally, otherwise valuable files could not be accessioned because no documentation existed. Throughout the past thirty years accessioning archivists have gone to great lengths to locate and assemble appropriate documentation.
Unfortunately, NARA still receives electronic records files which lack complete documentation; Federal agencies' understanding of the need for adequate documentation varies. One outstanding example, from my own experience, was the documentation for the Collaborative Perinatal Project. The National Institute of Neurological Diseases expended $200 million over nearly twenty years at fourteen different hospitals to study pregnant women and their children from birth through nine years. The series contains more than six million records. After the project was completed, the sponsoring agency paid a contractor more than $500,000 to prepare the documentation, including an extensive bibliography of all research and publication based on the data, for transfer to the Archives. As the backlog of incompletely processed computer data files grew, the program responded by searching for ways to streamline processing procedures without compromising researcher understanding of the data. There has been greater emphasis on using standardized formats, saved as word processing macros, to identify the verification procedures used, anomalies in the data or the coding, and any restrictions on access to the data, and on incorporating descriptive materials created for other purposes.

PRESERVATION

NARA's approach to the preservation of electronic records continues to build upon basic tenets and principles developed more than thirty years ago. They endure because they still apply to the data files for which they were designed. They also endure because they are being applied successfully, with some modification, to other forms of electronic records. This basic approach to preservation continues to be forged through a combination of research, discussion with industry leaders, archival principles, and early practical experience.
The 1969 “A Procedure for Accepting Digital and Analog Magnetic Tape for Archival Storage” and the 1973 draft Recommended Environmental Conditions and Handling Procedures for Magnetic Tape illustrate the early guidelines. The first basic principle is that preservation of electronic records does not focus on the preservation of a physical object. While the physical object is more durable than many believe it to be, it is inevitable that the entire suite of media, software, and hardware will not survive to provide access to the information at some future date. Preservation of electronic records, therefore, focuses on maintaining the ability to process the information on contemporary computers through repeated changes to the media, the enabling software, and the hardware. NARA maximizes media life by creating two copies of the data on new, evaluated, stable media. This basic statement reflects unfortunate early experiences with the quality of the media copies transferred from the creating agencies. The program quickly determined that the only way to ensure the quality of the accessioned media was to make both copies on new certified media. Media life is enhanced further by storage in canisters in a stable environment with appropriate temperature and humidity. Today's recommended temperature is 65°F and the recommended humidity is 45%, both slightly lower than those recommended for textual records. In addition, archival electronic records are subjected to periodic sampling to monitor the continuing readability of their information. Based on extensive media testing and consultation with computer experts, which indicated an average media life of twelve to fifteen years, the staff instituted a ten-year media refreshment program to move the data to new media before the old media deteriorate. Over the past two decades new media have been added to the list of acceptable transfer media and archival storage media as their stability and universality have been tested and documented.
Today, the program accepts records on nine-track open reel magnetic tape, class 3480 magnetic tape cartridges, and CD-ROM. Investigation of additional magnetic media, especially class 3590E magnetic tape cartridges, is underway. Determination of the longevity of non-magnetic media and their use as preservation storage media appears to be some years in the future.

The second basic approach to preserving access to the information on electronic records was to standardize the format of the electronic records transferred to the Archives. The staff initially surveyed the extent and uses of computers in the Federal Government in the early 1970s to determine current best practices. The Archives’ first regulations required that all computer data to be transferred would have to be in one of two mainframe character-encoding conventions, the American Standard Code for Information Interchange (ASCII) or the Extended Binary Coded Decimal Interchange Code (EBCDIC). This created flat files with no embedded software or controller language. Many criticize this as a “one size fits all” approach. This approach remains valid for the data files and databases for which it was intended. These data still represent the overwhelming majority of all electronic records transfers to NARA. The standard allows NARA to accept and preserve data in standardized formats that permit researchers to use the data on any hardware platform with any software application. This standard format is not applicable to all electronic records. Federal agencies have developed a myriad of software applications as computing has expanded from the computer room to the desktop. They also have applied automation to an increasing variety of information forms including the full range of office automation applications, satellite imaging, geospatial applications, and digital photography and video. NARA continually seeks solutions to preserve these newer records in formats that will reflect the full record.
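The flat-file convention is what makes this standard so durable: because there is no embedded software to interpret, an EBCDIC transfer can be re-expressed in another character set by a simple table-driven translation. A sketch in Python (the code page `cp037`, a common U.S. EBCDIC variant, and the fixed-length record layout are illustrative assumptions; actual agency files vary):

```python
def ebcdic_to_text(raw: bytes, codepage: str = "cp037") -> str:
    """Translate an EBCDIC byte stream into text. cp037 is assumed
    here; other EBCDIC variants would need their own code page."""
    return raw.decode(codepage)


def read_flat_file(raw: bytes, record_length: int,
                   codepage: str = "cp037") -> list[str]:
    """A fixed-length flat file is just a sequence of equal-sized
    records; split it on record boundaries and decode each one."""
    return [
        ebcdic_to_text(raw[i:i + record_length], codepage)
        for i in range(0, len(raw), record_length)
    ]
```

The same two functions work unchanged whether the source was written on a 1970s mainframe or a modern system, which is exactly the hardware and software independence the standard was designed to provide.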
A side effect of these basic preservation principles has been continuing confirmation of a decision not to become a “Colonial Williamsburg” for information technology. NARA does not maintain any superseded computer hardware or software to support long-term preservation. The transfer of office automation products such as e-mail, spreadsheets, and word processing documents, especially in the records of the Executive Office of the President and the offices of Independent Counsels, is prompting NARA to investigate ways to provide access to such software-dependent information without preserving copyright-protected software. One promising solution would utilize viewers or emulators that identify and then mimic the original software. The emphasis on information preservation and media disposal means that physical conservation (media rehabilitation or restoration) is rarely performed. When it is, the object is to rehabilitate the media to the point that the information can be migrated onto new media. After that, the original media are destroyed. Media conservation was most widely practiced on the electronic records in the PROFS lawsuit. Physical conservation activities included thermal reconditioning to reverse tape folding and creasing, media repair to splice split tape and attach or move reflective tape marks, tape baking in a scientific oven, and microphotography to document irreversible physical damage.

REASONS FOR PAST SUCCESS

The reasons for the success of the oldest and largest electronic records custodial program are varied. One of the most essential has been staffing. For thirty years NARA has had a staff devoted exclusively to electronic records. While backlogs attest that this staff has never been large enough to both accession and preserve all of the records transferred, the program has been able to devote some staff to developing procedures and standards.
Others have been able to study the use of emerging technologies in Federal agencies in order to understand future accessioning, preservation and access issues. The program achieved an early zenith in 1980 when its staff of twenty was virtually current with information technology and was able to “target” agencies that it worked with to secure timely transfers of especially valuable electronic records. The backlog, relative to the overall volume of records, was also very small. The next decade saw a complete reversal. Temporary detailing of staff to the FBI appraisal project, reductions-in-force, and the loss of staff to promotions in other NARA units and other Federal agencies reduced the staff to seven in 1983. Since the revitalization which began in 1989 the staffing level has risen to more than forty-five, but the volume of records to be accessioned and preserved, and the backlog, also have risen. The range of duties has diminished as scheduling and appraisal, government-wide standards, and information technology research have been reassigned to other NARA units. This allows the staff to concentrate on the core functions of accessioning and preservation. The diversity of the staff also contributes to the program’s success. While archivists, archives specialists, and archives technicians have always comprised the majority of the staff, management analysts, computer specialists, computer programmers, and information specialists have also been part of the staff. Even within any occupational series, the mix of education and previous experience has been diverse, ranging from history and political science to geography, library science, social science, computer science, and education. This diverse mix balanced archival principles and concepts with information technology needs and requirements to develop procedures for addressing the unique needs of archival electronic records.
Maintaining a separate program for electronic records ensures that the appropriate sense of mission exists. Whether it was developing guidelines for environmental storage conditions, appraising records to determine which would be preserved, or drafting operating procedures, staff were embarked on a pioneering mission: their work was breaking new ground, and their efforts would make a difference for those who follow. This sense still exists. Closely related to the sense of mission is a strong belief in the enduring value of electronic records as a new form of records. Although “traditional” archivists challenged the value of electronic records throughout the first decade of the program, the staff persevered because they believed in the enduring value of the records, of their records. Time has proven them correct.

EMERGING ISSUES

Electronic records custodial programs are at a crossroads. In many ways the current issues seem even more challenging than those faced when archival programs first addressed this newest form of record. But are they? Some archivists question whether archives should establish or maintain a custodial program for electronic records. Others are re-examining basic archival principles to determine if any modifications are required. A relatively small number are “doing”: inventorying, scheduling, appraising, accessioning, preserving, and providing reference services for electronic records with enduring value. This small number works with their colleagues around the world to address custodial issues and to develop complementary strategies for mutual benefit. I am pleased that NARA has been doing – for more than three decades.
Certainly electronic records custodians must examine and adjust their policies, procedures, guidance, standards, and underlying concepts related to emerging issues such as increasing platform dependencies, exploding volumes of records, increasingly diverse types of electronic records, and the ever increasing difficulty of ensuring access to the information over time. NARA remains committed to addressing and solving those challenges and to ensuring the long-term preservation of all Federal electronic records with enduring value.

History of NARA’s Electronic Records Program2

Thomas E. Brown
National Archives and Records Administration
College Park, Maryland

What follows is a personal history. Personal for two reasons: having worked at the National Archives’ programs for electronic records for nearly two decades, I personally participated in many of these events. Second, I offer some personal judgements that are my own and not necessarily those of my colleagues at NARA or the agency itself. Thirty years ago, on April 16, 1970, archivists at the U.S. National Archives and Records Service (NARS) accessioned their first electronic records. The genesis of this accession into a custodial unit dedicated exclusively to electronic records had begun several years earlier when the Archivist, Robert H. Bahmer, established on December 13, 1966, the Committee on the Disposition of Machine-Readable Records under the chairmanship of Everett O. Alldredge. The committee made several recommendations, one of which was that a senior specialist would coordinate the work of the NARS units involved in electronic records. On February 13, 1968, he issued a GSA Notice to implement the recommendations and to detail Alldredge to his Office to implement the findings of the committee. During the spring and summer of 1968, Alldredge proposed a single unit to be responsible for NARS efforts at electronic records. He wanted a staff of twelve with three senior computer professionals.
This met with adamant opposition in GSA. It had established a centralized computer facility to provide technical support to all of GSA, including the fledgling machine-readable program at NARS. When the new Archivist, James B. Rhoads, formally established the Data Archives Staff in the fall of 1968, the staff consisted of only three people, including Joseph V. Bradt as the Director. Interestingly, the staff reported directly to Alldredge as the head of records management after his detail to the Archivist’s staff. The Data Archives Staff had responsibility for all NARS activities regarding machine-readable records except for appraisal. The latter function remained in the appraisal unit in NARS directed by Meyer Fishbein, a key member of the original Committee on the Disposition of Machine-Readable Records. The Data Archives Staff began life as a records management operation. Within two years the small staff developed three documents: a form for inventorying magnetic tape files, recommendations for proper handling and storage of magnetic tape, and the first issuance of a General Records Schedule for computerized records. Alldredge struggled to obtain additional resources as he proposed three separate organizational units within the next two years, ranging in size up to 51 employees. But these plans were never realized. After Alldredge’s retirement in May 1971, the only early organizational change occurred in 1972 when the Data Archives Staff moved to the Office of the National Archives and became the Data Archives Branch. With the reorganization, Gerald Rosencrantz, who had replaced Bradt as Director of the Data Archives Staff in September 1970, became Chief of the Data Archives Branch. Alldredge’s last plan presaged one major development. In 1973, the Data Archives Branch broke GSA’s monopoly on technical support and expertise.
Early that year, the Data Archives Branch began to provide reference service for 600 reels of Civil Aeronautics Board historical data banks. Most requests came from airlines with petitions before the CAB. NARS had to use the parent agency’s central data processing center for the reference reproductions. In the summer of 1973, an order from American Airlines was interminably delayed at GSA’s data processing center. In anger over the delay, the Vice President of American Airlines for Governmental Relations telephoned the Archivist of the United States at home one evening and wanted to know where his data were. After a flurry of telephone calls between GSA and NARS officials, a GSA Deputy Commissioner ordered a Senior Systems Analyst to copy the tape and drive it at midnight to Baltimore-Washington airport where an American Airlines plane was waiting to whisk the tape to Chicago. The next day, GSA reassigned that same Systems Analyst to the Data Archives Branch and granted the Branch unique authority within GSA to acquire data processing services independent of GSA. By 1974, the Data Archives Branch had slowly grown to 13, with 3 more being recruited. Four of these were funded by other Federal agencies as a result of partnerships. To that date, the machine-readable program functioned more as a federal data center than as an archives, since the vast majority of its holdings were primarily files with high current use and with questionable archival value. As a corrective, NARS raised the bureaucratic stature of the program to the Machine-Readable Archives Division and recruited Charles Dollar, a history professor, as the new division’s director, to establish a true custodial program. The Dollar years did indeed bring a professionalism to the Division.

[2 This paper was presented at the Annual Meeting of the Society of American Archivists in Denver, Colorado, on 2 September 2000.]
During his first four years, Dollar recruited ten people for professional archival positions, with seven of the ten having Ph.D. degrees. As professionals, the staff was active in a variety of professional associations with interests in the preservation and use of machine-readable records, such as SAA, MARAC, IASSIST, AAAS, URISA, and APDU. For example, in fiscal year 1980, individual staff members spoke on electronic records during 31 professional conferences and attended 35 other professional meetings. The staff also devised and implemented descriptive standards for machine-readable records, both as series of archival materials and as social science data files. The division also acquired responsibility for the appraisal of machine-readable records and then outlined appraisal criteria. Responsibility for appraisal gave the division an entree into records management as the division established programs to train agency staff to inventory and schedule computer data bases, apply the general records schedules, and arrange for the transfer of permanent records to the Archives. This effort also included a “targeted agency” program through which staff assisted the inventorying and scheduling of electronic records in those agencies with records of permanent value. The establishment in 1977 of the Center for Machine-Readable Records provided a mechanism for NARS to acquire physical custody of electronic records with a high reference demand but indeterminate or dubious archival value. By the end of 1980, the Machine-Readable Archives Division had processed 155 accessions. These included computerized information gathered by regulatory agencies, indexing systems to permanent paper records, a primitive form of an expert system using artificial intelligence and, probably most important, a rich collection of operational records from the Vietnam conflict.
Several archivists had acquired, through on-the-job and classroom training, a solid knowledge of data processing that supplemented the technical staff. In late 1980, however, the program began to unravel. The first blow came when U.S. District Court Judge Harold Greene, who was threatening to send the Deputy Archivist to jail, ordered a reappraisal of Federal Bureau of Investigation field office records. NARS established a re-appraisal task force that was in fact, if not on paper, under the direction of Dollar and that included two other staff members from the division. With the Task Force relying on a quantitative analysis of a statistical sample of case files, three additional staff spent three to six months providing technical support for the Task Force. With Dollar directing the Task Force and then reassigned, the Machine-Readable Division lacked a full-time, on-site division director for eighteen months. And these eighteen months coincided with the Reagan Revolution’s goal to reduce the size of the Federal Government. GSA’s Administrator expanded the Government-wide hiring freeze into a reduction-in-force, or RIF. With seniority a major determinant of who lost their jobs, NARS eliminated all vacancies and fired all employees hired in the previous three years. When the dust settled, NARS had lost 98 employees and another 100 vacant positions. As a new program with several new hires, the Machine-Readable Archives Division fell to a staff of 12 during the RIF of February 1982. The following month testimony before Congress included this lament: “The National Archives machine-readable staff is decimated; how will data be preserved for historical purposes?” In April 1982, NARS reorganized because of the loss of personnel agency-wide, and the Machine-Readable Archives Division was reduced in stature to the Machine-Readable Branch, with Trudy Peterson as the branch chief and as part of the division under Charles Dollar.
Part of the reorganization was to centralize all data processing within an administrative unit. This transferred all staff members in any computer-related job series from the branch and denied the remaining staff access to computer facilities. This prompted an incredulous comment from a Canadian during a public meeting at NARS: “How can one deal with the records of modern technology if one doesn’t have access to the technology?” Under Peterson’s tenure through August 1983, normal attrition continued to erode the staff until it fell to seven employees. The number of files accessioned collapsed to 25 for fiscal year 1984. With no computer support, its automated location register was reduced to an annotated printout from 1982. Appraisal reports became single-sheet forms. In sum, despite Peterson’s persistence, forces outside of the unit had turned the program into shambles when Dick Myers arrived in January 1984 to become branch chief. As the program was reaching its nadir, it came under attack from a group whose intent was to help the Archives: the NARS Committee on Preservation. In 1983, its Subcommittee C on long-range planning concluded that the long-term solution for electronic records was to convert them to computer output microfilm, or COM. To answer a series of questions posed by the committee, the branch conducted in August 1983 “The 1983 Survey of Machine-Readable Records,” with its report finalized in April 1984. The report was essentially a screed against the COM proposal because of the need to preserve the retrievability and manipulability of the data. On February 9 and 10, 1984, the committee members met to discuss the COM proposal and the initial results of the survey.
One participant reported the meeting had “an adversarial atmosphere between the committee and NARS staff” and concluded, “NARS’ relationship with Subcommittee C has suffered.” The NARS spokesman during the meeting commented that the chairman “is openly critical of what he sees as the prevalent passivity of archivists” who would not embrace COM as the technological solution to electronic records. On July 13, 1984, the Advisory Committee on Preservation formally transmitted to NARS its recommendation that “most future accessions [of electronic records] be in a human comprehensible form on a certifiable archival medium, thus ensuring that the information remains permanently useable without regard to changing memory technologies.” On October 22, the Archivist Robert Warner responded politely and deferentially, calling the proposal “interesting and challenging” and “commended [it] for logic and good sense.” While diplomatically leaving the door open as a possible last resort at some time in the future, NARS’s position was emphatic: “[W]e will continue our current policies and practices of preserving machine-readable information.” The Committee chairman only sighed, “NARS hasn’t grasped the point.” While Subcommittee C’s proposed changes were coming to naught, other changes were in store for NARS. In November 1984, legislation established the National Archives and Records Administration as an independent agency and triggered a succession of personnel changes. Warner and his deputy resigned and thus paved the way for Frank Burke to be named Acting Archivist from April 1985 to December 1987. With the loss of most records management functions to GSA, the new NARA wanted to raise the profile of its relations with agencies by establishing a new office devoted solely to records issues. This new office got appraisal and other records management responsibilities for electronic records from the custodial program.
Any vestige of an effective records management program in the branch had ended with the loss of staff beginning in 1981. Thus this transfer of records management was a de jure acknowledgement of de facto reality. Trudy Peterson, who previously was the head of the Machine-Readable Branch, became responsible for all custodial programs in NARA nationwide. To re-establish a viable electronic records custodial unit, Myers still had to confront President Reagan’s desire to limit the Federal Government and so could not recruit new staff. Exploiting his long tenure at the National Archives, he persuaded staff in other units to request reassignment to the Machine-Readable Branch. In this way, he doubled the size of the staff from seven when he took over in January 1984 to fourteen a year and a half later in June 1985. Yet his success was short-lived as the staff, for a variety of reasons, withered to nine a year later. Early in his tenure, Myers realized that the branch needed direct access to technology. By the end of 1984, he had secured permission to purchase time on the IBM computer at the National Institutes of Health, and the branch assumed responsibility for preservation and accessioning processing on January 15, 1985. The branch under Myers still had to rely on archival staff, rather than computer staff, to apply the homegrown technical expertise they had acquired on the job or through formal training courses to handle the Branch’s data processing chores. In January 1986, Myers became head of the Still Pictures Branch. After six months without an on-site branch chief, Edie Hedlin assumed the position in July with a staff of nine. She continued the Myers efforts to persuade individual NARA staff members to transfer into the Machine-Readable Branch. In addition, she changed the staffing pattern. Before her arrival, individual staff members were assigned to a single function, such as accessioning, preservation or reference.
Hedlin, however, saw to it that professional staff received training in all functional areas and got responsibility for a set of record groups. Hedlin also weighed in on the decade-old issue about records with high research demand but dubious archival value and secured approval to disband the Center for Machine-Readable Records. But it would take five years of negotiations to dispose of the records previously placed in the program. To establish the infrastructure for a successful electronic records program, Hedlin prepared an option paper for consideration by top NARA management. After top management approval, Hedlin generated the reams of administrative paperwork needed for a reorganization: a Center for Electronic Records with two branches, an Archival Services Branch and a Technical Services Branch. On October 1, 1988, based on Hedlin’s outputs, the Center for Electronic Records was created. When Hedlin opted for Congressional relations, Peterson hired Ken Thibodeau, who became the Center’s Director in December. Thibodeau had the support of the new Archivist, Don Wilson, one of whose “priorities since becoming Archivist has been to develop for the profession and for Federal agencies, a model Center for Electronic Records.” To achieve this goal, in July 1989, he directed that the appraisal function for electronic records be returned to the custodial program, while scheduling computer materials remained part of NARA’s records management program. Returning to precedents of the Machine-Readable Archives Division, Wilson stated, “To continue the development of the Center for Electronic Records as a model, I also intend to add computer experts to [the Center’s] . . . archival staff in order to achieve the proper balance between technical and archival knowledge and practice.” As part of his Center-building efforts, Wilson directed the Center to add five additional staff members each year until the year 2000 for an estimated total of 75 staff.
In 1992, the Center’s management developed an organizational growth plan involving three branches to take full advantage of the projected growth in personnel. With this support, the staff grew from 17 when Thibodeau arrived in December 1988 to 48 in January 1993. Equally important to the increase in raw numbers were the diverse backgrounds of the new staff. Systems analysts and other computer professionals became employees of the Center. Rather than recruiting archivists with backgrounds in history, the Center began hiring professionals with expertise in other disciplines, such as geography, sociology, economics and library science. The Center also hired student intermittents for routine tasks in accessioning, reference and preservation. With the addition of data processing professionals, the Center began in 1990 to move away from relying solely on outside service bureaus, such as the NIH computer center, and started to develop in-house systems for both accessioning and preservation. These in-house systems allowed the Center to increase the number of accessions processed and preserved. Through a contract with the National Academy of Public Administration, panels of subject matter experts identified 430 statistical data bases that likely had records with archival value, and the Center staff undertook a coordinated program to schedule and arrange for the ultimate transfer of those records. Thus it seemed, at the dawn of 1993, the electronic records program was developing into an organization worthy of the National Archives of the United States. The bright hope quickly dimmed. In January 1993, President Clinton assumed office and ordered a 4 percent reduction in the Federal workforce within three years. The hope of adding five additional staff for the Center each year until 2000 fell by the wayside. Indeed, to date, the total staff has never equaled its high mark of 48 in January 1993.
As staffing began to decline, the court litigation surrounding Armstrong et al. v. Executive Office of the President et al., more colloquially known as the PROFS case, drained the ebbing resources. In January 1989, Scott Armstrong had filed a Freedom of Information Act (FOIA) request seeking all the information on the Executive Office of the President’s office automation system known as the PROFS system. On January 3, 1993, U.S. District Judge Charles R. Richey rejected the Government’s argument that e-mail messages were not records and ordered “the Archivist to take . . . all necessary steps to preserve the electronic records . . ..” In the final hours of the Bush Administration, nearly 6,000 backup tapes and hard drives from White House staff computers were transferred to NARA. Controversy surrounding the agreement governing the materials led to the resignation of Wilson and the appointment of Trudy Peterson as Acting Archivist of the United States. In that capacity, on March 25, 1993, Peterson made the Center for Electronic Records responsible for the preservation of the White House materials. When the plaintiffs claimed in April 1993 that the Government had not complied with the court’s order, Judge Richey disregarded the fact that Acting Archivist Peterson had moved immediately on assuming her position to ensure preservation of the materials and found the government in civil contempt. He ordered the government to take “all necessary steps to preserve the tapes transferred to the Archivist” within thirty days or pay a fine of $50,000 a day, doubling every week thereafter. Although the contempt order was reversed, the Center nonetheless pulled out all the stops to ensure the physical preservation of the files that were thought to be most at risk. The staff worked seven days a week, sixteen hours a day, for three weeks to accomplish this. Interestingly, when the Center embarked on this labor, it did not have access to any technology suitable for the required work.
There was no in-house processing capability at all, and the NIH computer center could not be used to copy backup tapes from other computer centers. Fortunately, the Center had contracted for the development of an in-house system for routine preservation copying and got a rudimentary version of the Archival Preservation System (APS). By June 18, 1993, three days prior to the judge’s deadline under the contempt order, the Center had successfully copied all 609 ‘at risk’ computer tapes. Over the next two years, the Center continued to preserve the White House materials, copying more than 99.98% of the media transferred. Besides the obvious impact of an increased workload from the PROFS case, court-imposed inspections became a continuing, substantial, and uncontrollable drain on the Center’s resources. By 1998, the Center for Electronic Records had spent approximately $2.5 million on PROFS-related expenses; 90% of this came from the Center’s operational budget. Indeed, the Center has spent more on the PROFS materials than on the preservation of all the permanently valuable electronic records accessioned into the National Archives since the first electronic records were accessioned in 1970. The PROFS records consumed most of the Center’s staff and financial resources from 1993 to 1997. In February 1998, NARA once again reorganized its operations in the Washington DC area and created the Electronic and Special Media Records Services Division from the erstwhile Center for Electronic Records. Michael Carlson became director of the new division with 44 employees. In line with the Clinton Administration’s aim to flatten the levels of Government, the branch structure was eliminated while the same supervisory structure was retained. The only significant functional change was to transfer the appraisal of electronic records to the unit responsible for appraising the records in other media.
During the past 8 years, the Center and its successor division have continued to develop their in-house systems for accessioning and preservation and increased their annual capacity to copy and preserve from about 1,000 files per year to over 70,000 files today. In 1998, the automated accessioning system successfully verified the intra-office e-mail messages from President Bush’s Office of the United States Trade Representative and opened the door to accessioning office automation materials. The growth of processing also included an increase in the media options for accessioning and reference purposes, beginning with CD-ROMs in the mid-1990s and moving today to include File Transfer Protocol (FTP). Resting on an initial business process improvement (BPI), the electronic records unit launched a pilot project to merge archival and technical functions into a team responsible for all archival activities associated with records from the Bureau of the Census. These advances came, as indicated above, in spite of stagnant staffing levels and the drain on resources from the preservation of PROFS. In retrospect, the custodial program for electronic records has had a yin-and-yang history, from an impressive start in the 1970s to a near collapse in the 1980s, a rebound until 1993, followed by inching forward into the new millennium with a languishing staff. Interestingly, Presidential policies reverberated on this small unit within a small agency. Reagan’s policies ultimately shrank the staff to seven; Clinton’s policies cut short the staff increases needed to build a model program for the profession. This history reveals the necessity of having staff from a variety of backgrounds so as to create the needed synergy between archival and computer professionals. Since 1968, the range of records management activities of the custodial program has ebbed and flowed.
Its full-scale responsibilities in the 1970s withered to nothing with the loss of resources in the 1980s and died formally with the loss of appraisal in 1985. The Center for Electronic Records reacquired a small piece of the records management pie when appraisal returned in 1989, and held it until the Center's dissolution in 1998. Despite these impediments, the past three decades witnessed great progress. From that first accession thirty years ago, the custodial unit now has custody of over 200 million records. In acquiring this vast collection of historical materials, the program pioneered records management techniques and practices for electronic records. These 200 million records were accessioned, described, preserved and made available through a range of archival procedures that the custodial program had to create nearly from scratch. The staff has now taken these techniques developed for data files and databases and has applied them to the archival administration of records from office automation systems. Thus the program now has ways to accession, verify, preserve and provide reference for e-mail transmissions with attachments, desk-top publishing applications, and geographic information systems. When NARS established the Data Archives Staff, the Comptroller of the National Security Agency wrote, "It is always reassuring to know that NARS program objectives keep pace with the rapid changes in information handling technology. Such measures . . . do much to sustain continued user confidence in NARS service." While NARS and NARA have not always lived up to that optimistic assessment, we can always hope to do so in the future. And there may be a firm foundation for hope. The current Archivist, John Carlin, has secured significant increases in funding and has ensured that a large portion contributes to administering electronic records. The priority is seemingly the successful development and implementation of the Electronic Records Archives (ERA).
The ERA effort should provide NARA's custodial program for electronic records with the tools it needs to expand the successes of its past to manage the records of the future.

An Historical Perspective on Appraisal of Electronic Records, 1968-1998
Linda Henry

[This paper does not necessarily represent the views of the National Archives and Records Administration. I'm going to use the term NARA throughout and Electronic Records Program for the various names of the unit, such as branch, division, etc.]

The appraisal function for electronic records at NARA has been in various units over the years. I will concentrate on the period 1968-1998, when the function was most often within the electronic records program. I will explore these appraisal themes, each chronologically:
1. Determining the record character of computerized records: whether they are records or non-records, and, if they are records, whether they are valuable enough to be accessioned.
2. Applying traditional archival appraisal principles, such as evidential and informational values, and incorporating other considerations unique to computerized records.
3. Applying records management techniques.
4. Trying new approaches.

Record/Non-Record, Not Valuable/Valuable
The issue of "recordness" arose almost from the beginning of the National Archives 65 years ago with punch-card records, which Margaret Adams' excellent article on the subject called the precursors of ER. Interestingly, the early NA pioneers seemed more in agreement that punch cards were records than did their successors with later computerized records. Most of the pioneers argued, however, that punch cards were "records" but not "archives," that is, not permanently valuable. The Records Disposition Act of 1939 explicitly mentioned punch cards as records.
The Records Disposition Act of 1943 substituted the phrase "other documentary materials, regardless of physical form or characteristics." This phrase still pertains to federal records 57 years later. The Federal Records Act of 1950, as amended in 1976, added the term "machine readable material" (44 U.S.C. 3301), which also still pertains. NARA established a program for ER in 1968. Despite the legal definition of record, the arguments continued about whether computerized records were records or non-records, and particularly whether they were permanently valuable records. For example, in one 1976 appraisal, other archivists in NARA argued that "these tapes are similar to reference and study materials which were disposed of as non-record material," and also that the records did not have evidential or informational value, i.e., they were not valuable. This did not always happen, and other appraisals encountered no opposition. Federal agencies presented another obstacle. The first Data Archives Staff found that "virtually all agencies in the Federal Government considered the information on magnetic tapes as 'non-record.'" In the next decade, a 1975 survey found that 60 percent of federal agency records officers thought that computerized records were record material (Dollar, Ann Arbor, p. 80). This improvement may be attributed to several years of NARA efforts to educate federal agencies. Sometimes the doubts of both NARA archivists and federal agencies about the record nature of ER converged in an appraisal dossier. For example, a 1977 dossier had a records schedule with dispositions for computer printouts or "computer runs," but no item or disposition for the computerized records from which the printouts came. (NC1-151-77-001)

[An earlier version of this paper was presented at the Society of American Archivists Annual Meeting, Session #47, Sept. 2, 2000.]

At the end of the 1980s, the records test arose again.
In 1989, Armstrong v. Executive Office of the President, known as the PROFS e-mail case, raised almost every question about the record nature of e-mail. Were e-mail messages records or non-records? Were e-mail messages in electronic and paper form both records? Were both forms valuable? How much metadata should be captured when an electronic message was printed? The PROFS case settled the issue that transmission and receipt information was part of the record. Issues about destruction of e-mail were ultimately resolved in the GRS 20 lawsuit, which I'll discuss later. Also in the early 1990s, after 20 years of trying to educate others that computerized records were indeed records, NARA's ER program faced still another assault on the record character of computerized records, this time from some members of the archival profession. A group of archivists proclaimed that the very records NARA had been appraising and accessioning for 20 years were not records, but "merely" data. I have responded to those supporters of a "new paradigm" and their narrow definition of a record. I mention it here because it seems that disagreements about the record nature of computerized records will never go away.

Applying Traditional Archival Theory of Appraisal
When NARA began appraising electronic records in 1969, appraisal of permanent records usually occurred when agencies offered records to NARA. This meant that ER appraisers usually had custody of the records, and archivists verified and tested the readability of the tapes prior to appraisal. By the mid-1980s, NARA no longer accepted direct offers of federal records. For the last 15 years, then, appraisal has occurred while the ER are still in the agency. By federal regulation, agencies must schedule ER systems within one year of a system's creation, although this doesn't always happen (36 CFR 1228.26). In addition, today transfer, verification and copying of records most often take place after appraisal.
ER appraisers considered the traditional principles of evidential, informational, administrative and legal values. Charles Dollar's writings from the 1970s, however, gave the impression that ER archivists analyzed only informational value. While that value characterized most of the appraisals then and now, at least some ER appraisals from the 1970s clearly identified evidential value. For example, the Presidential Clemency Board records include the Consistency Audit Data file, appraised for its evidential value in 1976. Similarly, appraisals in the 1970s of Department of Defense records about the conduct of the Vietnam War, and of records from regulatory agencies such as the Securities and Exchange Commission, show that records were being appraised for documenting core mission programs, not just for, or in addition to, informational value. Thomas Brown has published examples of such records and concluded that 59 percent of NARA ER accessions before 1980 were "programmatic records or records derived from program operations." In the 1980s, ER appraisers gained more experience with appraisals for evidential and informational value, and for legal value as well. For example, INS records were appraised in part for their value to immigrants, who could use them for legal purposes. In addition, ER appraisers gained some experience with appraisal of text systems. One early example is the Watergate Special Prosecution Force records, which included a text system appraised for evidential value. In the 1990s appraisers gained more experience with electronic text files, such as those from the Executive Office of the President, and some geographic information systems. While ER appraisers considered the same values as those applied to paper records, they also had to consider characteristics unique to computerized records, such as manipulability, volume, linkage, duplication and the evaluation of micro-level data.
While these attributes do not in themselves justify permanent appraisal, each can greatly enhance the value of the records. For example, an automated index is greatly superior to a manual one because of the characteristic of manipulability. Being able to save micro-level data, as opposed to summary or aggregated data, can be preferable because it permits reanalysis, which can serve as a check on the way the agency originally used the data, or accountability, a much touted concept of the 1990s. Saving computerized micro-level data also solves the volume problem of paper records. Linkage permits comparison of data with common attributes such as geographic location, occupation or age. Other NARA archivists sometimes misunderstood these characteristics. For example, one complained that an ER appraiser was equating "permanent" with "manipulative." Another NARA archivist reluctantly agreed to accessioning "a mere cubic foot in volume," seemingly misunderstanding the volume issue, since that cubic foot consisted of approximately 11 reels of magnetic tape containing thousands of records. ER appraisers also had to learn to evaluate technical issues, also uncharted territory for appraisal. The first two technical issues concern readability (whether the records can be used) and documentation (whether there is sufficient information to use or appraise the records, to process them and, most importantly, to enable researchers to use them). If the records are not readable and the documentation is insufficient, the records, however potentially valuable, cannot be appraised for permanent retention. Other technical considerations concern the hardware and software environment. In the early years NARA sometimes reformatted information into an independent format, but by 1976 NARA required that agencies transfer permanent records in a hardware- and software-independent format.
Records appraised after this regulation could sometimes be appraised as temporary because of their dependent format, even if the records were otherwise valuable. NARA's current Electronic Records Archives (ERA) initiative, headed by Ken Thibodeau, offers hope for some solutions to the software dependency problem and other issues. In general, however, appraisers first applied the traditional tests of evidential and informational value before weighing technical considerations, except readability and documentation. After all, manipulating junk, cutting down on the volume of junk or having junk in a software-independent format really doesn't matter if the records are junk.

Applying Traditional Records Management Techniques
Almost from the beginning of the ER program in 1968, ER archivists faced the same problems as other archivists with the proliferation of temporary records. NARA began issuing general records schedules (GRS) covering disposable paper records in 1945. One of the first actions of the newly created Data Archives Staff was issuing a general records schedule for machine readable records in 1972, GRS 20. NARA revised GRS 20 in 1977, 1982 and 1988, all revisions covering data processing operations. In 1988 a separate schedule, GRS 23, covered word-processing files, administrative databases, and electronic spreadsheets. NARA revised GRS 20 again in 1995, combining it with GRS 23, and explicitly included e-mail. In December 1996, the lawsuit Public Citizen v. Carlin challenged the legality of GRS 20 and, among other issues, continued the arguments begun in the PROFS case about the record nature of e-mail. The lawsuit also added the issue of what constituted proper recordkeeping. The controversy often seemed to overlook the requirement that e-mail could not be destroyed unless it had been placed in a recordkeeping system with records management functionality and had transmission and receipt information.
While the recordkeeping system could be a paper system, agencies could not merely "print out e-mail," as some opponents misleadingly argued and some professional organizations, such as SAA, misunderstood (Archival Outlook, July/August 1997, pp. 4-5). After numerous actions, the GRS 20 lawsuit ended on March 6, 2000, when the Supreme Court declined to review a US Court of Appeals decision of August 1999 that upheld GRS 20.

Innovation, trying new approaches
Almost from the beginning, the ER program tried innovative archival and records management approaches. One early example in the 1970s was the "targeted agency" program, whereby the ER program worked with selected federal agencies known to create valuable records but which needed NARA assistance in inventorying, scheduling and transferring records. Two such targeted agencies were the Bureau of the Census and the Public Health Service. One result was saving the 1960 Census records. In 1999 and 2000, NARA is once again seeing the value of this approach and is funding "targeted assistance" positions throughout its nationwide system. Another innovation, in 1978, was creating a partial records center function for ER that had high current research use but indeterminate permanent value. NARA management initially approved a procedure to accession such records and reappraise them after 10 years, but other units in the Archives objected. The compromise was creating a partial records center function within the MRR Division. NARA assumed physical custody, but the agencies retained legal custody. At the end of a time period, usually 5 years, NARA and the federal agency would review the agreement and extend it, or schedule the records for destruction or accessioning. For those records that were ultimately deemed permanently valuable, having them in the MRR Division gave NARA a "bird in hand" advantage, since it was much easier to accession them than to get agencies to transfer them.
One example of valuable accessioned records emerging from this experiment is the Civil Aeronautics Board's Origin and Destination data. Placed in the records center for ER in 1978, the data became a valuable source in documenting airline deregulation a decade later. Still another example of archival innovation is the ER program's use of a study by the National Academy of Public Administration (NAPA) in 1990 and 1991 of major federal data bases. While NARA's ER program knew about numerous federal data bases and had scheduled and accessioned a great number of them, the problem was identifying the universe. NARA asked NAPA to prepare an inventory of major data bases and, to focus on NARA's interest, to identify those that had potentially permanent value. The preliminary inventory included approximately 9,000 data bases. NAPA panels culled that number to 1,789 and recommended that NARA should accession 448 data bases, almost a "Fortune 500." ER appraisers made some adjustments in the NAPA recommendations, "demoting" some of the "should transfer" data bases to temporary retention and elevating some of the "un-rated" data bases to permanent retention. ER appraisers were then able to contact agencies and use the clout of NAPA to get agencies to schedule and transfer records. The result was schedules for 295 databases, mostly permanent but some temporary, obtained without having to rely on federal agencies to initiate the scheduling. A great number of these have been transferred and accessioned to date. In this presentation I have largely confined my remarks to the period from 1968 through 1998, when the appraisal function was again transferred from the ER program to the appraisal unit in NARA. However, I would like to note that in 1999 Archivist John Carlin began "a major project to review and, if necessary, reinvent the policies and processes for the scheduling and appraisal of federal records in all media." The expected conclusion of the project is Sept. 30, 2001.
This quick run through 30 years of appraisal of ER at NARA, and my own 7-year experience in appraisal of ER, leads me to some personal observations. The first is the consistency over the years in the importance of applying traditional archival theory and practice to records in a new medium. The ER program did develop new considerations in appraisal, such as manipulability, volume, linkage, duplication and the evaluation of micro-level data. And the staff did have to learn to apply a technical analysis. But the bottom line remained focused on the content of records and applying the traditional archival appraisal principles of evidential, informational, legal and administrative value. My other personal observation has to do with the importance of databases and my reservations about e-mail. In recent years, records of office automation, such as word processing and particularly e-mail, have concerned archivists more than other types of ER. Certainly office automation records account for the vast bulk of records that federal agencies are producing. Does our hand-wringing about e-mail, however, sometimes divert us from the important databases that agencies are creating? Statistical databases, for example, will always be important government records. Governments at all levels count things. Almost everything. Federal agencies count the population, crop production, wage earnings, accidents, and college and university enrollments, among a very long list of things. Federal, state, and local governments are going to go on counting and creating valuable and permanent records that reflect agency mission and, usually, impact. At the federal level, such records also give us the only national information we have on numerous subjects, such as disease and educational achievement. All those important databases need to be scheduled, transferred and accessioned, so they can be used by researchers. The volume of e-mail is probably too enormous to be counted.
One 1999 estimate is 36.5 billion messages per year in the federal government. Before I had a computer for my work, I created few records. Today, federal workers at some 2 million workstations are creating far more records than they created before they had PCs. The pie of records is thus vastly larger. How much of the record material needs to be retained for more than a brief period? For years, NARA has estimated that less than 2% of federal records are permanently valuable. This estimate seems much too high for e-mail. More distressing, however, is the lack of organization of the e-mail. We can't appraise all that unorganized material on 2 million PCs. There's too much, most of it is transitory, and it requires implementation of rigorous records management, still largely non-existent. For several years NARA has been emphasizing the importance of establishing record-keeping systems, preferably electronic. NARA's current web page for federal agencies still stresses this (nara.gov/records/fasttrak/ftprod.html). NARA has also endorsed the Department of Defense's Standard 5015.2 for records management software applications for ER, which offers hope for agencies organizing their e-mail. Thinking about e-mail reminds me of NARA's earliest history. The pioneers at NARA in the 1930s and 1940s faced mountains of government records accumulating at approximately a million cu. ft. annually. As one report noted then, "Caring for these records has been likened to keeping an elephant for a pet: 'its bulk cannot be ignored, its upkeep is terrific, and, although it can be utilized, uncontrolled it is potentially a menace.'" Doesn't this describe the e-mail mess? The ER problems NARA faced 30 years ago were daunting. They still are. But today we have an advantage. We have 30 years of experience in confronting problems with ER. We can apply that experience to the challenges we now face.

Knowledge-based Persistent Archives
Reagan W. Moore
San Diego Supercomputer Center
moore@sdsc.edu

Abstract
The preservation of digital information for long periods of time is becoming feasible through the integration of archival storage technology from supercomputer centers, information models from the digital library community, and preservation models from the archivists' community. The supercomputer centers provide the technology needed to store the immense amounts of digital data that are being created, while the digital library community provides the mechanisms to define the context needed to interpret the data. The coordination of these technologies with preservation and management policies defines the infrastructure for a collection-based persistent archive [1]. This report discusses the use of knowledge representations to augment collection-based persistent archives.

1. Introduction
Supercomputer centers, digital libraries, and archival storage communities have common persistent archival storage requirements. Each of these communities is building software infrastructure to organize and store large collections of data. An emerging common requirement is the ability to maintain data collections for long periods of time. The challenge is to maintain the ability to discover, access, and display digital objects that are stored within the archive, while the technology used to manage the archive evolves. We originally implemented a collection-based persistent archive [1] in which a description of the collection is stored along with the data. The approach focused on the development of infrastructure-independent representations for the information content of the collection, interoperability mechanisms to support migration of the collection onto new software and hardware systems, and use of a standard tagging language to annotate the information content.
The process used to ingest a collection, transform it into an infrastructure-independent form, and recreate the collection on new technology is shown schematically in Figure 1.

Figure 1. Persistent Collection Process

Two phases are emphasized: the archiving of the collection, and the retrieval or instantiation of the collection onto new technology. The diagram shows the multiple steps that are necessary to preserve digital objects through time. The steps form a cycle that can be used for migrating data collections onto new infrastructure as technology evolves. The technology changes can occur at the system level, where archive, file, compute and database software evolves, or at the information model level, where formats, programming languages and practices change. The ultimate goal is to maintain not only the bits associated with the original data, but also the context that permits the data to be interpreted. We rely on the use of collections to define the context to associate with digital data. Each digital object is maintained as a tagged structure that includes the original bytes of data, as well as attributes that have been defined as relevant for the data collection. A collection-based persistent archive is therefore one in which the organization of the collection is archived simultaneously with the digital objects that comprise the collection. A persistent collection requires the ability to dynamically recreate the collection on new technology. Scalable archival storage systems are used to ensure that sufficient resources are available for continual migration of digital objects to new media. The software systems that interpret the infrastructure-independent representation for the collections are based upon generic digital library systems, and are migrated explicitly to new platforms. In this system, the original representation of the digital objects and of the collections does not change.
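The tagged structure described above can be sketched in a few lines: each digital object is wrapped in an infrastructure-independent tagged form carrying its original bytes together with the collection attributes, and the same form is parsed to re-instantiate the object on new technology. This is an illustrative sketch only, not the SDSC implementation; the element names and sample data are hypothetical.

```python
import base64
import xml.etree.ElementTree as ET

def to_archival_form(raw_bytes, attributes):
    """Wrap a digital object in an infrastructure-independent tagged
    structure: the original bytes plus the collection attributes."""
    obj = ET.Element("digitalObject")
    attrs = ET.SubElement(obj, "attributes")
    for name, value in attributes.items():
        field = ET.SubElement(attrs, "attribute", name=name)
        field.text = str(value)
    data = ET.SubElement(obj, "data", encoding="base64")
    data.text = base64.b64encode(raw_bytes).decode("ascii")
    return ET.tostring(obj, encoding="unicode")

def from_archival_form(xml_text):
    """Re-instantiate the object on new technology: parse the tagged
    structure back into its attributes and original bytes."""
    obj = ET.fromstring(xml_text)
    attributes = {a.get("name"): a.text for a in obj.find("attributes")}
    raw = base64.b64decode(obj.find("data").text)
    return raw, attributes

# Round trip: archive an object, then recreate it from the tagged form.
record = to_archival_form(b"1970 census row",
                          {"collection": "Census 1970", "format": "EBCDIC"})
raw, attrs = from_archival_form(record)
```

Because the archival form is plain tagged text, it survives independently of the database or file system that happens to hold it at any given time.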
The maintenance of the persistent archive is then achieved through application of archivist policies that govern the rate of migration of the objects and the collection instantiation software.

2. Knowledge-based Archives
The preservation of the context to associate with digital objects is the dominant issue for knowledge-based persistent archives. The context is traditionally defined through specification of attributes that are associated with each digital object. The context is also defined through the implied relationships that exist between the attributes, and the preferred organization of the attributes in user interfaces for viewing the data collection. Management of the collection context is made difficult by the rapid change of technology. Software systems used to manage collections are changing on a five- to ten-year time scale. Of greater concern is that the information tagging languages used to annotate digital objects are also changing. The persistent archiving of a collection must therefore also handle the evolution of the information mark-up language. We have characterized persistent archives in prior publications [1,2] as collection-based repositories. We now recognize the need to broaden the archive characterization to knowledge-based repositories. Not only the information content, but also the processing steps used to accession the collection must be preserved. Conceptually, one can view the accessioning process as the equivalent of the process needed to instantiate the collection on new technology. If the accessioning process can be captured in an infrastructure-independent representation, the same process can be used to manage the migration of the collection to new markup languages, archival data repositories, information repositories, and knowledge repositories.
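One way to picture capturing the accessioning process in an infrastructure-independent representation is to record the workflow itself as data: an ordered list of named transformation steps that can be replayed whenever the collection is migrated. The following is a minimal sketch under that assumption; the step names and transformations are invented for illustration.

```python
# The accessioning process recorded as data: an ordered list of named
# transformation steps. Replaying the same list against a new target
# re-instantiates the collection on new technology, and the log
# preserves the procedural knowledge (which steps ran, in what order).
PIPELINE = [
    ("parse",    lambda obj: obj.strip()),
    ("tag",      lambda obj: {"text": obj}),
    ("validate", lambda obj: obj if obj["text"] else None),
]

def run_pipeline(obj, log):
    """Apply each recorded step in order, logging the workflow."""
    for name, step in PIPELINE:
        obj = step(obj)
        log.append(name)
    return obj

log = []
result = run_pipeline("  A bill to amend the tariff act  ", log)
# log now records the temporal order of the accessioning workflow
```

The point of the sketch is that the pipeline is inspectable data rather than opaque code: the same list can be serialized alongside the collection and replayed against a future infrastructure.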
The archival description of a collection must then include not only contextual information about the digital objects, but also knowledge about the relationships used to derive the contextual information. The architecture that is needed to implement a knowledge-based persistent archive is shown in Figure 2.

Figure 2. Knowledge-based Persistent Archive (diagram: a three-by-three grid, with columns for Ingest, Manage, and Access, and rows for Knowledge (relationships between concepts), Information (attributes, semantics), and Data (fields, containers, folders, storage with replicas and persistent IDs), supporting knowledge- or topic-based, attribute-based, and feature-based queries)

The three columns represent the technologies needed to manage the ingestion process, manage the persistent archive, and manage the access environment. The three rows represent the infrastructure needed to manage knowledge, information and data. Knowledge is represented as relationships between domain concepts. Information is represented as attributes about digital objects within the collection. The digital objects are "images" of the reality described by the domain concepts. Ingestion corresponds to the steps of knowledge mining/tagging, information mining/tagging, and digital object organization/storage. Persistent archive management requires infrastructure to store the digital objects (archives), information repositories to hold the metadata (databases), and knowledge repositories to organize the relationships (logic systems). The access environment provides mechanisms to query the collection at the data level through feature extraction, at the information level through database queries, and at the knowledge level through domain concepts.
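The layered access model can be illustrated with a toy concept-to-attribute mapping: a knowledge-level query names a domain concept, which is resolved, topic-map style, into the metadata attribute names that the information repository actually indexes, so the user never needs to know the collection schema. All names below are hypothetical.

```python
# Topic-map-style mapping from domain concepts to the attribute names
# used in the collection schema.
CONCEPT_MAP = {
    "place":  ["county_fips", "state_code"],
    "person": ["respondent_id"],
}

# The information repository: metadata attributes per digital object.
CATALOG = [
    {"id": "obj1", "state_code": "CA", "respondent_id": "r17"},
    {"id": "obj2", "state_code": "NY"},
]

def knowledge_query(concept, value):
    """Resolve a domain concept to attribute names, then run an
    ordinary attribute-based query against the catalog."""
    attrs = CONCEPT_MAP.get(concept, [])
    return [rec["id"] for rec in CATALOG
            if any(rec.get(a) == value for a in attrs)]

hits = knowledge_query("place", "CA")   # schema names stay hidden
```

If the schema is later migrated (say, `state_code` becomes `state_abbr`), only the concept map changes; knowledge-level queries are untouched, which is the interoperability property the architecture is after.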
Just as the data management infrastructure is intended to provide access without having to know data object names, the knowledge access infrastructure is intended to provide access without having to know the explicit metadata attribute names used to organize the collection database. The knowledge-based persistent archive requires software infrastructure to support interoperability between different implementations of ingestion, management, and access infrastructure components. This is shown in Figure 3. Between "Ingest platforms" and "Management repositories", standards are needed to define consistent tagging mechanisms for knowledge (the XML Topic Map DTD, or XTM DTD), for information (an XML DTD), and for data organization (logical folders and physical containers). Between "Management repositories" and "Access platforms", standard query languages are needed for knowledge-based access (a knowledge query language or rule manipulation language), attribute-based access (the EMCAT SQL generator or the MIX mediator), and feature-based access (application of procedures within a computational grid). Between the "knowledge" and "information" environments, a standard representation is needed to map from concepts to attributes, such as topic maps or model-based access systems. Between the "information" and "data storage" environments, a data handling system is needed to map from attributes to storage locations, such as the SDSC Storage Resource Broker.

Figure 3. Persistent Archive Interfaces (diagram: the same three-by-three grid, annotated with the interface standards linking the components: XTM DTD and KQL for knowledge, XML DTD and EMCAT/MIX for information, MCAT/HDF and computational grids for data, with topic maps/model-based access and the Storage Resource Broker data handling system bridging the rows)
Persistence is achieved through the infrastructure middleware (shown in Figure 3 as the blue grid) that links accession platforms, management repositories, and access platforms. The same middleware is needed to support grid environments (such as computation on distributed data collections) and digital library environments (such as curricula support in the National Science, Math, Engineering and Technology Education Digital Library, NSDL). This architecture has been proposed to both the Grid Forum and the NSDL, and may be the architecture that integrates knowledge management activities from these communities with the persistent archive community.

2.1 Archive Accessioning Process
Of interest is the emerging need for knowledge management, as well as information management and data management, when ingesting collections. When we look at collections, we see multiple interfaces where knowledge is required to be able to adequately describe relationships inherent within the collection. We have been looking at the preservation of relationships that are needed to describe:
- implied knowledge (interpretation of fields)
- structural knowledge (topology associated with digital line graphs)
- domain knowledge (relationships between domain concepts)
- procedural knowledge (workflow creation steps for digital objects)
- presentation knowledge (support for knowledge-based queries).
One way to accomplish the goal of knowledge-based access is to use the ISO 13250 Topic Maps standard to maintain mappings between domain concepts and the attribute names used in the collection schema. It is very interesting to note that relationships are implicit between each of the nine infrastructure components defined in Figure 2.
The relationships either define rules that can be applied to the collection, or quantify associations that can be made between collection elements. Examples are:

• Relationships that quantify rules:
  • Rules for defining collection attributes
  • Rules for organizing attributes into a schema
  • Rules for feature extraction
  • Rules governing data set creation
• Relationships that quantify associations:
  • Organization of concepts into topic maps
  • Ontology mapping between concept maps
  • Mapping of concepts to collection attributes
  • Mapping of concepts to feature extraction rules
  • Mapping between attributes and data fields (semantics)
  • Semantic mapping between collections
  • Mapping between attributes and storage
  • Mapping between attributes and features
  • Clustering of data into containers

The relationships can be separated into four broad classes:

− Semantic/logical relationships. Relationships can be defined to map from the concepts used to describe the collection to the attribute tags used to annotate the collection. Semantic relationships can also be defined between the domain specific concepts as knowledge bases or semantic maps.

− Procedural/temporal relationships. The transformations that are applied to the collection to create the archival form constitute a workflow that represents the ingestion process. The temporal order and explicit transformations can be represented as a set of states through which the collection is processed.

− Structural/spatial relationships. The internal organization of digital objects within the collection can be represented as a structural ordering of the tagged elements. The representation of the structure can be expressed using the same types of characterization as needed for spatially tagged data.

− Functional relationships. For scientific applications, analysis algorithms are needed to identify features that might be associated with a digital object.
The expression of the relationship between the named feature and its presence within a digital object will require the ability to archive mathematical expressions.

In the ingestion process, a major challenge has been the need to differentiate between artifacts and implied knowledge. Essentially, the step of refining the description of a collection by including more attributes must be integrated with the identification of anomalies. To make progress, we apply the concepts of occurrence tagging and closure to the archived collections. Occurrence tagging is the explicit annotation of the location of each tagged attribute along with the associated value. This provides a representation that captures all of the information content, without imposing constraints on permissible attribute values. Closure is the analysis of the occurrences to identify both completeness and consistency. Completeness is evaluated by verifying that all attributes are populated, and that the information content is fully annotated. Consistency checks that all attribute values fall within defined ranges. Consistency can be checked by construction of inverse indexes that point to all occurrences of each attribute value. It is necessary to iterate between knowledge extraction and attribute mining. We illustrate this through application of the ingestion process shown in Figure 4:

• Define a representation of the concepts inherent within the collection.
• Build a concept map that identifies all of the possible attributes to associate with each concept.
• Tag the collection to identify attributes for each of the possible fields.
• Restructure the concept map to eliminate unused fields, specialize classes, rearrange class attributes, etc.
• Mine the collection to identify differences between bill versions, identify missing attributes, identify implicit attributes, and identify invalid data (such as duplicated pages).
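The occurrence-tagging and closure steps described above can be sketched in a few lines of Python. The sample records, attribute names, and the permissible range assumed for the congress field are invented for illustration:

```python
# Hypothetical mini-collection; in practice these would be XML-tagged records.
records = [
    {"bill_id": "HR101", "congress": "106", "pages": "12"},
    {"bill_id": "HR102", "congress": "106", "pages": ""},    # unpopulated attribute
    {"bill_id": "HR103", "congress": "999", "pages": "7"},   # out-of-range value
]

# Occurrence tagging: annotate the location (here, a record index) of every
# tagged attribute together with its value, constraining nothing.
occurrences = [
    (i, attr, value)
    for i, rec in enumerate(records)
    for attr, value in rec.items()
]

# Closure, completeness: verify that every attribute is populated.
incomplete = [(i, attr) for i, attr, value in occurrences if value == ""]

# Closure, consistency: build an inverse index pointing from each attribute
# value to all of its occurrences, then test values against defined ranges.
inverse_index = {}
for i, attr, value in occurrences:
    inverse_index.setdefault((attr, value), []).append(i)

valid_congress = {str(n) for n in range(93, 107)}  # assumed valid range
inconsistent = [
    (attr, value, where)
    for (attr, value), where in inverse_index.items()
    if attr == "congress" and value not in valid_congress
]
```

Anomalies surfaced this way (the empty pages field, the impossible congress number) feed the next iteration of concept-map refinement, which is why ingestion cannot in general be completed in a single pass.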
[Figure 4. Ingestion Process: the accessioned collection passes through an accession template, occurrence tagging, attribute tagging, attribute selection, and attribute inverse indexing to closure, supporting concept/attribute knowledge generation, information generation, view management, and data organization.]

At one time, the hope was to be able to ingest a collection in a single pass. Based upon the above steps, at least three analyses are needed: to mine knowledge, to mine information, and to organize the data. Depending upon the number of iterations used to refine the concept space, additional passes through the data may be necessary. It is still a matter of debate whether it will be possible in general to differentiate between concept map refinement and error analysis. These steps will have to be done jointly for most collections. Note that once the data has been wrapped into XML, all integrity checking, knowledge mining, derivation of a "consolidated version", etc., can be seen as (albeit very elaborate) queries against an XML collection. The interesting research issue is to find out how well XML query languages (including the UCSD/SDSC XMAS system) are able to express the analysis queries. Especially for integrity checking, logic-based XML query languages seem to be a good choice for an ingestion environment.

2.2 Archival Representation of Collections:

One of the results of the analysis of the collections provided by NARA was the realization that multiple views of a collection may need to be archived. Typical views include:

• Original form as submitted
• XML tagged form
• Occurrence representation (occurrence, attribute, value)
• Knowledge-based representation (recreation of the original form from the occurrence representation). This view can be thought of as the noise-free representation of the original collection based upon the knowledge and information content that was created during the accessioning process. This view can be designed to include white space and all anomalies if desired.
• Consolidated representation (elimination of all duplicated information)

By archiving descriptions of the processing steps needed to go between each of these views, one can guarantee that the same processing steps could be applied in the future to re-instantiate the collection on new technology, including new information and knowledge representations.

3. Relationships between NARA and other Agency projects:

There is a strong synergy between the development of persistent archive infrastructure for NARA, digital library development for NSF, and data grid development for DOE, NASA, and NLM. All of these research areas require the ability to manage knowledge, information, and data objects. What has become apparent is that even though the requirements driving the infrastructure development for each agency are different, a uniform architecture is emerging that meets all agency requirements. The architecture shown in Figure 3 provides:

• Validation mechanism for the common data management architecture
• Validation mechanism for the differentiation between knowledge, information, and data and the choice of representation standards
• Integration vehicle for tying together persistent archives with grid environments
• Integration vehicle for tying together grid environments with digital libraries
• Integration vehicle for tying together digital libraries with persistent archives

It is interesting to note the multiple projects that are building upon the architecture that is being developed in the NARA collaboration:

• NSF – Digital Library Initiative, Phase 2.
• NSF – National SMET Education Digital Library
• NSF – NPACI data grid for neuroscience brain image federation
• NASA – Information Power Grid distributed data processing
• DOE – ASCI Data Visualization Corridor remote data processing
• DOE – Particle Physics Data Grid object replication
• NLM – Digital Embryo Project data grid for image processing and storage
• NARA – Persistent Archive

It is also interesting to note the iterative technology development cycle that links all of the projects. An original DARPA project developed the data handling capabilities as part of the Distributed Object Computation Testbed. The NASA IPG integrated the data handling technology with computational grid technology (common security environments). The NSF NPACI project integrated information management with data handling to support digital libraries. The ASCI PPDG then applied the technology to support replica management across heterogeneous systems. And the NARA project applied the technology to manage migration of collections across evolving infrastructure technology.

Acknowledgements: This work was supported by the National Archives and Records Administration and the Defense Advanced Research Projects Agency/ITO. The research topics have been investigated by the following members of the Data Intensive Computing Environment Group at the San Diego Supercomputer Center: Richard Marciano, Bertram Ludaescher, Ilya Zaslavsky, Amarnath Gupta, and Chaitan Baru.

References:

[1] Moore, R., C. Baru, A. Rajasekar, B. Ludascher, R. Marciano, M. Wan, W. Schroeder, and A. Gupta, “Collection-Based Persistent Digital Archives - Part 1”, D-Lib Magazine, March 2000, http://www.dlib.org/

[2] Moore, R., C. Baru, A. Rajasekar, B. Ludascher, R. Marciano, M. Wan, W. Schroeder, and A.
Gupta, “Collection-Based Persistent Digital Archives - Part 2”, D-Lib Magazine, April 2000, http://www.dlib.org/

August 2000

NARA and Electronic Records: A Chronology

Organizational Names:

1934-1949: National Archives (NA), independent agency
1949-1985: National Archives & Records Service (NARS), part of General Services Adm.
1985-present: National Archives & Records Administration (NARA), independent agency

1968-1972: Data Archives Staff
1972-1974: Data Archives Branch
1974-1982: Machine-Readable Archives Division
1982-1988: Machine-Readable Branch
1988-1998: Center for Electronic Records
1998-present: Electronic and Special Media Records Services Division

Dates:

1939 The Records Disposition Act includes punch cards in its definition of records
1943 Records Disposal Act defines records as including “other documentary materials, regardless of physical form or characteristics,” a phrase included in subsequent federal records acts
1965 NARS assists Bureau of the Budget in producing an inventory of federal punch cards and computer tapes
1966 Archivist Bahmer establishes Committee on Disposition of Machine-Readable Records (Dec)
1967 NARS issues federal records management regulations for ADP records (Feb)
1968 Committee on Disposition of Machine-Readable Records finalizes its report (Jan)
     Ev Alldredge begins detail to the Office of the Archivist to implement report (Feb)
     NARS hosts conference, “The National Archives and Statistical Research” (May)
     Report of Joint Committee on the Status of the National Archives (AHA, OAH, SAA) calls for an “archives of machine-readable records” (Jul)
     Archivist Rhoads establishes Data Archives Staff in Office of Records Management
     Joseph V. Bradt becomes Director of the Staff
1969 SAA forms Ad-Hoc Committee on Machine Readable Records and Data Archives with Alldredge as Chair (Oct)
1970 NARS staff accessions first electronic records (Apr)
     Gerald Rosenkrantz becomes Director of the Data Archives Staff (Sept)
1971 Archivist Rhoads signs first records schedule with permanent electronic records
1972 NARS issues General Records Schedule 20 for ADP records (Apr)
     Data Archives Staff becomes Data Archives Branch
1973 Branch drafts Recommended Environmental Conditions & Handling Procedures for Magnetic Tape (Jun)
1974 Branch compiles Directory of Computerized Data Files and Related Software for publication by the National Technical Information Service (Mar)
     NARS upgrades Branch to Machine-Readable Archives Division
     Charles Dollar becomes Director of the Division (Aug)
     Division assumes responsibility for appraisal of electronic records (Aug)
1975 Division appraises and accessions first operational data from Vietnam War
     Division issues Catalog of Machine-Readable Records in the National Archives of the United States, 1975, and 2nd ed., 1977
1976 The amended Records Disposal Act of 1950 specifies “machine readable material” as records
1977 Division launches "targeted agencies" project
1978 Division establishes Center for Machine-Readable Records for records of high current use but undetermined long term value (May)
     Division issues first Accessioning Procedures Handbook
1980 Division provides computer support and staff for FBI appraisal project, AFSC v Webster
1982 NARS reduction-in-force cuts staff (Feb)
     NARS downgrades Division to Machine-Readable Branch
     Trudy Peterson becomes Chief of the Branch (Apr)
1984 Branch battles NARS Preservation Advisory Committee about computer-output microfilm as the solution to electronic records
     Richard Myers becomes Chief, Machine-Readable Branch (Jan)
1985 NARA transfers all appraisal to new records management program (Dec)
1986 Edie Hedlin becomes Chief, Machine-Readable Branch (Jul)
1987 Branch closes its Center for Machine-Readable Records (for records of high current use)
1988 NARA upgrades Branch to Center for Electronic Records (Oct)
     Ken Thibodeau becomes Director of the Center (Dec)
1989 Plaintiffs file Armstrong v Executive Office of the President (Jan)
     NARA transfers appraisal of electronic records to the Center (Oct)
1990 Center begins GAPS project for past-due transfers
1991 Center offers reference service via e-mail (Mar)
     The National Academy of Public Administration publishes The Archives of the Future: Archival Strategies for the Treatment of Electronic Databases (Dec)
1993 Acting Archivist Peterson assigns preservation of Armstrong v EOP media to Center (Apr)
     Center acquires and installs the Archival Preservation System (APS) (May)
     Archival Electronic Records Inspection and Control (AERIC) system becomes operational
1994 Center moves to Archives II (Jan)
     NARA designates CD-ROM as acceptable transfer media (Jul)
1995 NARA revises GRS 20 to include e-mail (Aug)
1996 Plaintiffs file Public Citizen v Carlin (GRS 20 lawsuit) (Dec)
1997 Center staff takes initial steps toward the Electronic Records Archives (ERA) initiative
     Center expands reference media to include CD-ROM and diskettes
     Center uses AERIC to comply with E-FOIA amendments
1998 Center becomes Electronic & Special Media Records Services Division (Feb)
     Michael Carlson becomes Director of the Division (Apr)
     NARA transfers appraisal of electronic records to the Life Cycle Management Division
     NARA endorses Dept. of Defense 5015.2 standard for records management applications for electronic records (May)
1999 Archivist Carlin begins appraisal reinvention project (Apr)
     U.S. Supreme Court refuses to reverse Appeals Court decision upholding GRS 20 (Aug)
2000 Division's electronic records holdings exceed 200,000,000 records in 167,000 data files

The potential of markup languages to support descriptive access to electronic records: The EAD standard

Anne J. Gilliland-Swetland

Abstract: This paper will review the potential of Encoded Archival Description (EAD), recently adopted as an American descriptive standard, to provide online descriptive access to electronic records. The paper will begin by reviewing the current state of electronic records description and the complex relationships between metadata that are part of the record and metadata that are about the record. It will then describe the status and scope of EAD, how it relates to other descriptive initiatives that are applying markup languages, and the potential of EAD to serve as a metadata infrastructure for online archival information systems. The paper will conclude with a discussion of the extent to which EAD can currently accommodate, or could be extended to accommodate, description and online delivery of electronic records.
Introduction

There has been a considerable amount of political and professional rhetoric, stemming from unprecedented developments over the past decade in technologies supporting the World Wide Web, about developing online access to unpublished information resources—including archival holdings. The rhetoric has resulted in the establishment of research and development agendas by major government funding agencies, private foundations, industry, and professional institutions and associations.4 A number of major initiatives have resulted from the availability of this funding. As they relate to archival concerns, these initiatives can be grouped into three primary domains of activity:

• the development of archival standards that support online access to archival descriptions (Encoded Archival Description being the most prominent recent example);
• the development of archival information systems such as American Memory at the Library of Congress and the Online Archive of California (to cite two American examples) that provide not only online descriptions but also digitized copies of selected archival holdings; and
• research projects addressing the archival management of records that are “born digital,” that is, of electronic records (for example, the Recordkeeping Functional Requirements Project at the University of Pittsburgh and the International Project on Permanent Authentic Records in Electronic Systems (InterPARES)).5

While there has been considerable dialog and overlap between archivists involved with the first two of these areas, until recently archivists grappling with the challenges of creating and preserving electronic records have not been integrally engaged in broader initiatives to standardize and enhance description for online access, nor to provide online access to electronic records through archival information systems or digital libraries.
The major exception to this has been the Recordkeeping Metadata Schema (RKMS) developed in Australia. The focus for RKMS is the record, regardless of its format, and how it can be reconstructed and retain its meaning across time and user domains. RKMS provides:

• A standardized set of structured or semi-structured recordkeeping metadata elements
• A framework for developing recordkeeping metadata sets in different contexts
• A framework for mapping recordkeeping metadata sets to establish equivalences and correspondences that can provide the basis for semi-automated translation between metadata sets6

In North America, electronic records management evolved to some extent as an area apart from the mainstream of the archival profession. Its immediate concerns have been creating, identifying, and accessioning electronic records. In the 1970s and 1980s, electronic records, or “machine-readable records” as they were initially termed, tended to be managed as software-independent datafiles. More recently, as electronic records have taken on more complex functionality, there has been an increased awareness of the need to preserve their value as legal and organizational evidence.

4 For example, the National Science Foundation, the National Endowment for the Humanities, and the National Historical Publications and Records Commission in the United States, and the Fifth Framework and the Joint Information Systems Committee in Europe; national archives and libraries in many countries; and descriptive standards groups within professional associations.

5 Gilliland-Swetland, Anne J. and Philip Eppard. “Preserving the Authenticity of Contingent Digital Objects: The InterPARES Project,” D-Lib Magazine 6 no. 7 (2000). Available at: http://www.dlib.org/dlib/july00/eppard/07eppard.html (16 October, 2000); InterPARES Website available at http://www.interpares.org (16 October, 2000).
As a result, archivists are now engaged with researchers from computer science, digital library development, and preservation in several projects to identify how to preserve authentic electronic records with their functionality intact. One of the most prominent of such projects is that of the National Archives and Records Administration and the San Diego Supercomputer Center to employ XML in the development of persistent archives. This concern for evidence requires a more detailed understanding of the characteristics of an authentic record in and over time, as well as close analysis of the intellectual rationales behind archival description in terms of how it contributes to ensuring and demonstrating the authenticity of preserved records. Indeed, there is a growing convergence of different areas within the archival profession, as well as of other professional and disciplinary domains relating to description. This convergence arises largely out of the development of new metadata schema and standards and technological capabilities that provide structures and crosswalks7 for formalizing and bridging diverse data types (such as image or geospatial data), metadata semantics, and professional practices.8 Archives play a key and often overlooked role in establishing and demonstrating the authenticity of any record, regardless of its form, through archival description. In contrast to the key purposes of bibliographic description, which are to manage a physical information object as well as to facilitate its intellectual retrieval and use, archival description must address that object not only as information, but as evidence.
As a result, archival description must not only describe the content of a fonds or record group, it must also describe the circumstances of its creation, its chain of custody, its relationships to other records generated by the same activity, and the impact upon the aggregation of records of any processing or preservation activity, in ways that are and remain meaningful to different kinds of users over time. Archival description, therefore, has three primary roles. Firstly, it serves as a tool that meets the needs of the archival materials being described by authenticating and documenting them. Secondly, it is a collections management tool for use by the archivists. Thirdly, it is an information discovery and retrieval tool for making the evidence and information contained in archival collections available and comprehensible to archivists and users alike.

6 See McKemmish, Sue, Glenda Acland, Nigel Ward, and Barbara Reed. “Describing Records in Context in the Continuum: The Australian Recordkeeping Metadata Schema,” Archivaria 48 (1999): 3-42.

7 A crosswalk is a chart or table that represents the mapping of fields or data elements in one metadata standard to fields or data elements in other standards that have the same function or meaning. Crosswalks support the ability to search heterogeneous databases transparently as a single database (semantic interoperability) and to convert data from one metadata standard to another.

8 See Gilliland-Swetland, Anne J. Enduring Paradigm, New Opportunities: The Value of the Archival Perspective in the Digital Environment (Washington, D.C.: Council on Library and Information Resources, 2000).

Describing Electronic Records

Ironically, in a world of increasing online access to primary information resources, many of which first require digitization, electronic records are proving to be among the most intransigent in terms of providing even basic descriptive access.
This intransigence reflects inherent technical problems with the diverse formats in which electronic records are created and may need to be maintained. Equally, it reflects how the enormous volume of electronic records requiring processing by a comparatively small staff, together with data archiving practices originally adopted from the social sciences data archives community, has led to idiosyncratic summary archival descriptions and an over-dependence upon the metadata generated by the creator of the records. Description of electronic records often consists of high-level summaries of data, reports on quality and accuracy of data, scanned or PDF versions of codebooks and data dictionaries, and customized subject indexes and data extracts. While the current state of description for electronic records is certainly understandable, it is, nevertheless, deficient in several respects:

• There has been insufficient analysis of the actual nature of electronic records. In particular, there needs to be more examination of the relationship between data content and the metadata that provide and document its context and structure, and of the various ways in which aspects of data and metadata in complex systems such as databases might come together to form the intellectual construct that is a record. Often one of the most difficult aspects of working with electronic records is to be able to identify and then describe, in the absence of a tangible document, the parameters of that intellectual construct.

• Metadata generated by records creators has been viewed as a sufficient substitute for archival description.
For example, in 1993, Margaret Hedstrom proposed that management of metadata provide an alternative strategy to current descriptive practices in order to support the “need to identify, gain access, understand the meaning, interpret the content, determine authenticity, and manage electronic records to ensure continuing access.”9 Subsequently, several projects have resulted in metadata specifications for electronic records, most notably the Pittsburgh Project and related implementation projects such as the Indiana University Electronic Records Project. With the exception of the Australian RKMS project, there has been almost no discussion of the value-added role that archival description should play in terms of ensuring and documenting authenticity, and making the records meaningful to users across time and domains.10

• There has been little emphasis on establishing the documentary relationships between electronic records and paper records created by the same activity. Lack of standardization and the use of non-archival descriptive practices has made it difficult to integrate descriptions of electronic records with standardized descriptive metadata created by archivists and other information, industry, and research communities. For example, in the mid-1980s, when archivists looked to the use of MARC formats, they turned to the MARC Machine-Readable Data Format (MRDF) rather than the MARC Archives and Manuscripts Control Format (AMC) that was developed for the collective description of archival and manuscript materials. In effect, such an approach treated electronic records as a special format with distinct descriptive needs, rather than as components of wider archival aggregations.

• Because management of electronic records has generally been viewed by the rest of the archival profession as an area that requires distinct technical expertise, developments in archival description such as EAD have progressed without being strongly informed by the descriptive needs of electronic records.

It is useful at this point to define more closely what is meant by metadata, since the term is understood differently by different communities. Metadata refers to a range of structured or semi-structured data about data that are critical to the development of effective, authoritative, interoperable, scalable, and preservable information and record-keeping systems. Until the mid-1990s, metadata was a term most prevalently used by communities involved with the management and interoperability of geospatial data, and with data management and systems design and maintenance in general. For these communities, metadata referred to a suite of industry or disciplinary standards as well as additional internal and external documentation and other data necessary for the identification, representation, inter-operability, technical management, performance, and use of data contained in an information system.

9 Hedstrom, Margaret. “Descriptive Practices for Electronic Records: Deciding What is Essential and Imagining What is Possible,” Archivaria 36 (Autumn 1996): 53.

10 Bearman, D. and Sochats, K. Metadata Requirements For Evidence. 1996. Available: http://www.lis.pitt.edu/~nhprc/BACartic.html (October 17, 2000); Bantin, Philip C. “Developing a Strategy for Managing Electronic Records: The Findings of the Indiana University Electronic Records Project,” American Archivist 61 (1998): 328-64; Bantin, Philip C. “The Indiana University Electronic Records Project Revisited,” American Archivist 62 (1999): 153-163; and McKemmish, Sue, Glenda Acland, and Barbara Reed. “Towards a Framework for Standardising Recordkeeping Metadata: The Australian Recordkeeping Metadata Schema,” Records Management Journal 9 (1999): 177-202.
For archivists, metadata refers to the value-added information, such as EAD, that they create to identify, authenticate, arrange, describe, preserve and otherwise enhance access to their holdings. In contemplating the role of metadata in the description of electronic records, several questions come to mind:

• Which metadata are part of the record, which are about the record, and which are neither but are required to preserve or reconstruct the technological context of the record? And of all these types of metadata, which must be captured as part of archival description?
• How can the trustworthiness of these metadata be determined in terms of quality and completeness in and over time?
• Are there descriptive needs of electronic records that might be different from those of other types of records? If so, what are they and how should they best be addressed?
• Can the metadata generated by the creator of the electronic record somehow be automatically translated or mapped into a standardized description for archival records?
• Can the structure and documentary contexts of electronic records be automatically analyzed to generate specific components of a standardized description for electronic records?
• Which kinds of contextual documentation do electronic records require in order to be understood, and can a metadata infrastructure facilitate links to that documentation online?
• How can the links between records and metadata retain their referential integrity over time and in the face of systems obsolescence, data migration, and evolution of metadata schema?
• What do users need in order to be able to identify relevant electronic records online? What do users need to be able to use electronic records disseminated online?

Encoded Archival Description

In the face of such questions, therefore, how might Encoded Archival Description and other markup initiatives enhance current electronic records description?
Simply defined, EAD is a Document Type Definition (DTD) developed using Standard Generalized Markup Language (SGML) that makes it possible to develop predictably structured archival description that can be disseminated on the World Wide Web. That description is most commonly an archival finding aid, but the DTD is flexible enough to accommodate various other types of archival descriptive tools. However, the power of EAD is that it can be much more than a structure through which to create a digital representation of a two-dimensional finding aid. The hierarchical nature of EAD, its explicit delineation of each data element, and its adherence to standardized metadata conventions and protocols give it the potential to function as a multi-dimensional metadata infrastructure that can interface with other metadata schema while providing maximum flexibility in describing a diversity of record types. With such an infrastructure, archivists and software developers have the capabilities and incentives to design a range of archival information systems that fundamentally re-conceptualize how access to archival holdings is provided. These archival information systems would not only contain the kinds of archival description found today in finding aids, but also digitized versions of archival materials, full text of ancillary materials, extensive linkages to other online archival and bibliographic information systems, and actual electronic records and the necessary technical documentation to use them.11 In such information systems, however, EAD would not be the only metadata schema invoked, and one of the powerful aspects of EAD is its ability to interface or interoperate with other metadata schema and SGML-based implementations.
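To illustrate the kind of predictably structured description just described, the sketch below parses a much-simplified, hypothetical finding-aid fragment. The element names follow EAD conventions (<ead>, <archdesc>, <did>, <unittitle>, <dsc>, <c01>), but the fragment is invented and is not validated against the actual DTD:

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified finding-aid fragment; a real EAD instance
# would carry many more elements and be validated against the EAD DTD.
finding_aid = """
<ead>
  <eadheader><eadid>us-xx-0001</eadid></eadheader>
  <archdesc level="recordgrp">
    <did>
      <unittitle>Records of the Example Agency</unittitle>
      <unitdate>1968-1998</unitdate>
    </did>
    <dsc>
      <c01 level="series">
        <did><unittitle>Electronic case files</unittitle></did>
      </c01>
      <c01 level="series">
        <did><unittitle>Paper correspondence</unittitle></did>
      </c01>
    </dsc>
  </archdesc>
</ead>
"""

tree = ET.fromstring(finding_aid)

# Because the markup is predictably structured, descriptive elements can be
# retrieved programmatically rather than by reading a two-dimensional document.
collection_title = tree.findtext("archdesc/did/unittitle")
series_titles = [c.findtext("did/unittitle") for c in tree.findall(".//c01")]
```

This machine-actionability, rather than the mere reproduction of a paper finding aid on screen, is what allows EAD-encoded description to serve as infrastructure for archival information systems.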
EAD is fully XML-compliant, meaning not only that EAD-encoded descriptions can be searched and manipulated over the Web as the Web increasingly supports XML, but also that electronic records technical documentation, such as database models, workflow rules, and technical drawings, can be integrated with the archival descriptions in ways not previously possible in a more manual environment. Similarly, EAD can interface with descriptive metadata created in MARC because of the metadata mapping between the two standards. With the recent release of XMLMARC software, this mapping will only become easier. EAD also shares header data elements with the Text Encoding Initiative (TEI) DTD. TEI is a DTD that facilitates the development of digital versions of scholarly texts.

Using EAD to Describe Electronic Records

EAD is currently in its first full release (Version 1.0). It is fully expected that the DTD will be dynamic and will continue to be extended to accommodate new technological capabilities and metadata schema, as well as refined based on evaluative feedback from archivists and users. In its current form, what then does EAD have to offer electronic records description, given that the needs of electronic records have yet to be integrally addressed by the DTD? EAD, while it is a data structure and not a data content standard, works to standardize idiosyncratic descriptive practices. Electronic records descriptive practices are some of the most idiosyncratic in the field, because there is such diversity of types of electronic records, because electronic records description is rarely taught in archival education programs, and because it is primarily learned as institution-specific practice "on the job." Descriptive records tend to comprise examples of descriptions of datafiles, rather than complete descriptions together with user guides or documentation packages.[12]
Using EAD would also integrate electronic records management into the mainstream of archival activities, treating the records as records, rather than as instances of special formats. Moreover, through collective description, as well as elements such as <separatedmaterial> and <relatedmaterial>, all records created by the same activity will be treated as an intellectual whole, regardless of whether they are paper, electronic, or some other medium.

[11] Gilliland-Swetland, Anne J. "Popularizing the Finding Aid: Exploiting EAD to Enhance Online Browsing and Retrieval in Archival Information Systems by Diverse User Groups," Journal of Internet Cataloging 4, nos. 1/2 (2000) (in press); Gilliland-Swetland, Anne J. "Health Sciences Documentation and Networked Hypermedia: An Integrative Approach," Archivaria 41 (1995): 41-56.
[12] Dryden, Jean E. "Archival Description of Electronic Records: An Examination of Current Practices," Archivaria 40 (1995): 99-108.

Electronic records descriptions can be quite flat, consisting mostly of summary information, with the arrangement of the contents of a datafile often being incidental. However, users may wish to have access at the level of individual records or even data elements. The hierarchy built into EAD has the potential to support this kind of granularity of access, although currently available commercial software has yet to address much of this potential. Technical documentation accompanying the electronic records can also be linked in electronic form to the EAD description through elements such as <archref>, <odd> (other descriptive data) and <add> (adjunct descriptive data). If this documentation is marked up using SGML, XML, or some other markup language, the possibility exists of additional reconciliation of the different metadata schema.
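Reconciliation between metadata schema is usually done with crosswalks: tables that map the elements of one schema onto another. The sketch below is a hypothetical two-entry crosswalk from EAD elements to MARC fields; the pairing of <unittitle> with MARC field 245 follows common crosswalk practice, but the table and code are illustrative only, not a complete EAD-MARC mapping.

```python
# Sketch of a metadata crosswalk: a few EAD descriptive elements mapped
# onto MARC fields. The mapping table is illustrative, not authoritative.
EAD_TO_MARC = {
    "unittitle": "245",     # title statement
    "unitdate": "245$f",    # inclusive dates
    "physdesc": "300",      # physical description / extent
    "scopecontent": "520",  # summary note
}

def crosswalk(ead_fields):
    """Translate a dict of EAD element values into MARC-tagged fields,
    silently dropping anything the crosswalk does not cover."""
    return {EAD_TO_MARC[k]: v for k, v in ead_fields.items() if k in EAD_TO_MARC}

record = crosswalk({
    "unittitle": "Contracts datafile",
    "unitdate": "1990",
    "repository": "(not covered by this sketch)",
})
```

Automated reconciliation of this kind is what makes it plausible to maintain descriptions in EAD while still exchanging them with MARC-based bibliographic systems.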
The well-defined EAD structure also makes possible the use of crosswalks to interface with other common metadata schema that might be relevant to the records (for example, geospatial metadata). All this is not to say that EAD as it stands is ideal for describing electronic records. Several limitations need to be addressed in the next version of EAD if it is truly to accommodate electronic records.

1) EAD is strongest with regard to the description of records once they are held in the archives. It is weak in how it supports records management, appraisal, and accessioning processes. More explicit attention needs to be paid to how records retention schedules, appraisal reports, accessioning procedures, and data quality reports are captured and tracked, as well as the various agents associated with those processes.

2) There need to be more closely delineated data elements with which electronic record metadata can be described, rather than consigning such materials to "bucket" elements such as <odd> and <add>. These elements and their values should be based upon lists of the common types of documentation that accompany electronic records when they are accessioned. The data elements should also have attributes that indicate the extent to which the accuracy of each piece of documentation has been verified.

3) Custodial history is integral to establishing the authenticity of records, and for electronic records it can be quite complex, especially if the archives takes over intellectual but not physical control of inactive records. The <custodhist> element needs to be expanded to address this issue, in particular noncustodial arrangements for electronic records.

4) Preservation and meticulous documentation of preservation processes are integral not only for providing continued access to electronic records, but also for establishing and demonstrating the continued authenticity of those records (or of authentic copies of the records).
Currently, preservation information is bundled into a single data element, <processinfo> (processing information), and as with <custodhist> this element needs to be expanded and further delineated to track preservation processes such as migration and emulation and any effects these might have upon the record.

5) Even with traditional records, many archivists find it difficult to make the necessary distinction between intellectual and physical levels of arrangement. Many electronic records can be arranged in multiple ways and, therefore, the concept of levels of arrangement may not be as relevant as that of possible arrangement schema. It needs to be possible through the <arrangement> element for users to identify the range of potential arrangements and data extracts, in order to be able to specify the one they would like to use when accessing electronic records online or ordering copies of them. This is a compelling reason to do more user-based research, so that any extensions to EAD are more user-driven.

6) As with museum objects, additional aspects of physical description may need to be incorporated into the <physdesc> element to allow for highly technical description. Some of these elements might correspond to those that were included in MARC MRDF.

7) For EAD in general, there is a need for a companion content standard and a structure for developing authority files. Work on both of these aspects is currently underway. There is also a need to analyze the extent to which EAD should be extended to accommodate a range of archival descriptive traditions and technical requirements for records in specific media, or whether a better approach would be to concentrate on mapping different types of metadata through processes such as metadata crosswalks and automatic reconciliation of diverse XML structures.

Conclusion

There is obviously much work to be done in the area of electronic records description, and EAD provides one important vehicle to do so.
However, given the volume of electronic records already created and anticipated in future years, there must surely also be an increased emphasis on automating as many aspects of archival description as possible. This is where research and development such as that underway at the San Diego Supercomputer Center, in partnership with the US National Archives and Records Administration, is likely to make such a strong contribution. One final caveat, however: almost all developments in archival description to date, even that of EAD, have occurred without systematic analysis of user needs and capabilities. As archival description, and even the complete archival record, become increasingly available online to the general public without any archival reference mediation, it is going to be critical that we spend time examining the usefulness and usability of the materials we are providing to our users. Otherwise we may find that we have created a web of metadata and records so complex that it will have become impenetrable to most users.

Preservation and migration of electronic records: the state of the issue[13]
Kenneth Thibodeau[14]

The problem of preserving electronic records

The two-edged sword of continuing progress and rapid obsolescence of information technology is the most often cited, but perhaps not the most significant, challenge archives face in the endeavor to preserve electronic records. Organizations rely more and more on digital technology to produce, process, store, communicate, and use information in their activities. Thus, the quantity of records being created in electronic form increases. In the experience of the National Archives and Records Administration of the United States, it increases exponentially.
The technological challenge is compounded by the continuing extension of information technology, both in terms of the types of information objects it produces and in terms of its applicability to different spheres of activity and different types of actions within those spheres. The resultant records are increasingly diverse and complex. The impact is not only on individual records, but on the archival fonds as a structured whole.

Approaches to the problem of preserving electronic records

The field of information technology has, by and large, ignored the problems of long-term preservation. If anything, one could say that the market has tended to exacerbate the problem of preserving electronic records. The pressures of competition have led the industry to obey Moore's law, replacing both hardware and software on a cycle of two years or less. In one area, however, there has been some improvement in recent years: that of digital storage media. From the 1980s there was a trend towards storage media that were more fragile and less stable over time. In recent years, this trend, if not reversed, has been offset somewhat by the introduction of more stable and reliable media.[15] Current research and development efforts offer the prospect of improved options for long-term storage of digital information, notably in the areas of ion-milling and holographic media. But archival concern with digital media should not be limited to their durability. The ICA Guide to Managing Electronic Records sets out seven criteria for media used for preserving electronic records:

• open standards for digital recording on the medium,
• robust methods for preventing, detecting and reporting errors,
• sufficient market penetration,
• known longevity,
• known susceptibility to degradation or deterioration,
• a favorable cost/benefit ratio, and
• availability of methods for recovering from loss.[16]
Whatever relief archives may find in the area of digital storage is more than offset by the increasing diversity, complexity and spread of electronic records. In recent years, increasing attention has been devoted to problems of digital preservation in a variety of spheres and professions. Several different approaches have been proposed. A few have been tried in test mode, fewer in actual practice. In practice, the experience of archives is largely limited to relatively simple technical formats, such as flat files. Some institutions have developed computer applications for preserving potentially complex databases. These include CONSTANCE at the National Archives of France, AERIC at NARA, ERICSON at the National Archives of Canada, and similar systems in Sweden, the United Kingdom and elsewhere.

[13] This paper was presented at the XIVth International Congress on Archives, Seville, Spain, September 22, 2000. The views expressed are the author's and not necessarily those of the National Archives and Records Administration.
[14] The author is Director of the Electronic Records Archives Program, National Archives and Records Administration, U.S.
[15] Charles M. Dollar. Authentic Electronic Records: Strategies for Long-Term Access. Chicago: Cohasset, 1999, pp. 58-60.
[16] International Council on Archives, Committee on Electronic Records. Guide for Managing Electronic Records from an Archival Perspective. Paris, 1997.
Significant preservation projects addressing the actual preservation of digital formats, at various stages of research or development, include the bundles proposal of the British Standards Institution,[17] the CEDARS project at the University of Leeds, England,[18] the Victoria Electronic Records System in Australia,[19] the emulation experiment at the Royal Library in The Netherlands,[20] the Universal Preservation Format sponsored by the WGBH Educational Foundation in Boston,[21] and the Highly Integrated Information Processing and Storage technology being developed at Carnegie-Mellon University in the U.S.[22] Current initiatives are pursuing quite a variety of approaches. The proposed solutions can be grouped into five broad categories:

• preserving the original technology used to create or store the records;
• emulating the original technology on new platforms;
• migrating the software necessary to retrieve, deliver, and use the records;
• migrating the records to up-to-date formats; and
• converting records to standard forms.

These approaches define a spectrum ranging, in broad terms, from no change in the records or the technological context in which they exist to one in which the original hardware and software have disappeared and the digital format of the records has changed. Each of these methods has pros and cons. None of them is entirely satisfactory. On the one hand, in general, one can say that the closer one stays to the original technology and the original digital format of the records, the smaller the problem of authenticity; however, it is also obvious that the closer one stays to the original technology, the more complex and more impractical the approach becomes over time. More complex because, as records continue to accumulate over time, there will be more and more varieties of technology that the archives would have to maintain.
More impractical because, first, support for obsolete technologies will eventually disappear and, second, the distance and difference between the preserved technology or technical artifacts (including the records) and the best available technology for preserving, managing, retrieving and delivering the records will increase continuously. On the other hand, while moving ahead as technology progresses can eliminate such practical problems, it can entail loss or corruption of records.

The need for an archival approach to preserving electronic records

All of these approaches to preserving electronic records have in common the objective of solving technological problems related to the passage of time. None of them actually focuses on the objective of preserving records. This technological orientation is misdirected because success in solving technological problems does not necessarily imply any success, or even relevance, in addressing archival requirements for the preservation of records.

[17] British Standards Institution. Bundles for the Perpetual Preservation of Electronic Documents and Associated Objects. Public Draft for Comment, IDT/1/4: 99/621800DC. London, 1999.
[18] David Holdsworth and Derek M. Sergeant. "A blueprint for representation information in the OAIS model." In: Eighth Goddard Conference on Mass Storage Systems and Technologies, B. Kobler and P.C. Hariharan, editors. Maryland: Goddard Space Flight Center, 2000, pp. 413-28.
[19] Public Record Office Victoria. Victorian Electronic Records Strategy. Final Report. 2000. <http://www.prov.vic.gov.au/vers/final/finaltoc.htm>
[20] Jeff Rothenberg. An Experiment in Using Emulation to Preserve Digital Publications. Den Haag: Koninklijke Bibliotheek, 2000.
[21] Dave MacCarn. "Toward a universal data format for the preservation of media." SMPTE Journal 106, no. 7 (July 1997): 477-479. See also <http://info.wgbh.org/upf/>
[22] <http://www.ece.cmu.edu/research/chips/>
Logically, archival principles and objectives should dictate the requirements that technical solutions must satisfy. Archival requirements for preservation must be based on the conception of electronic records, not as the products of computer applications, but as the instruments and by-products of the practical activity of a records creator. The ultimate criterion for success in the preservation of electronic records is not whether they remain true to some given technological materialization, but whether they continue to provide authentic evidence of the activities in which they were created.

An architecture for archival preservation

Clearly, the archival profession needs to determine specific requirements for the preservation of different types of records, and also to guarantee respect for provenance and the integrity of archival fonds over time. The InterPARES project, directed by Professor Duranti, brings together archivists from universities and archival institutions, along with computer and information scientists and engineers, from around the world in a concerted effort to delineate specific archival requirements for preserving authentic electronic records. InterPARES is working to define the archival requirements for authenticity on the basis of archival science and diplomatics.[23] Simultaneously, the InterPARES Preservation Task Force is examining technical issues related to digital preservation and developing a formal model of the preservation function as viewed from the perspective of the juridical or physical person responsible for preserving electronic records. While this work is still in progress, several ideas that have been proposed are worth citing at this time. One key idea is that, strictly speaking, it is not possible to preserve electronic records; it is only possible to maintain the ability to reproduce electronic records.
It is always necessary to retrieve from storage the binary digits that make up the record and process them through some software for delivery or presentation. (Analogously, a musical score does not actually store music. It stores a symbolic notation which, when processed by a musician on a suitable instrument, can produce music.) Presuming the process is the right process and it is executed correctly, it is the output of such processing that is the record, not the stored bits that are subject to processing.[24] This concept has important consequences. It shifts the priority in the preservation of electronic records from their storage over time to the integral processes of putting the records into archival storage, getting them out of storage, and delivering them to future researchers. The recognition that electronic records must inevitably be reproduced accentuates the importance of being able to demonstrate the integrity and authenticity of the records. This entails extending the traditional concept of an unbroken chain of custody into one of an unbroken process of preservation. As defined in the ICA Guide, "An electronic record is preserved if and only if it continues to exist in a form that allows it to be retrieved, and, once retrieved, provides reliable and authentic evidence of the activity which produced the record."[25] Demonstrating the authenticity of electronic records depends on verifying that:

1. the right data was put into storage properly;
2. either nothing happened in storage to change this data, or alternatively any changes in the data over time are insignificant;
3. all the right data, and only the right data, was retrieved from storage;
4. the retrieved data was subjected to an appropriate process; and
5. the processing was executed correctly, to output an authentic reproduction of the record.
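In practice, the first three conditions above are commonly supported with fixity information: a cryptographic digest recorded when the bits enter storage and recomputed when they are retrieved. The sketch below only illustrates that general idea; the choice of SHA-256, the function names, and the dictionary standing in for a storage system are assumptions of this example, not part of any system described in the paper.

```python
# Sketch: checksum-based fixity verification supporting conditions 1-3.
# SHA-256, the function names, and the in-memory "store" are illustrative.
import hashlib

def digest(data: bytes) -> str:
    """Compute a fixity value for a stream of stored bits."""
    return hashlib.sha256(data).hexdigest()

def ingest(store: dict, record_id: str, data: bytes) -> str:
    """Put the record's bits into storage and register their digest."""
    store[record_id] = data
    return digest(data)

def retrieve(store: dict, record_id: str, expected: str) -> bytes:
    """Get the bits back, verifying that nothing changed in storage."""
    data = store[record_id]
    if digest(data) != expected:
        raise ValueError(f"fixity check failed for {record_id}")
    return data

store = {}
fixity = ingest(store, "rec-001", b"content of an electronic record")
assert retrieve(store, "rec-001", fixity) == b"content of an electronic record"
```

Conditions 4 and 5, by contrast, concern the correctness of the reproduction process itself, which a checksum on the stored bits cannot demonstrate.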
Parallel to the InterPARES project, the National Archives and Records Administration is sponsoring research into the development of an information management architecture designed to address archival requirements for the preservation of electronic records. This architecture implements the proposed ISO standard for an Open Archival Information System (OAIS).[26] The architecture extends that general reference model by articulating archival requirements. To address the basic problem of continuing change in technology over time, the architecture postulates that archival information systems should be independent of the particular technology used to implement them at any time. That is, an archival information system should be built in such a way that it is possible to replace any component of hardware or software used in the system with minimal impact on the rest of the system, and with no impact on the preserved collections of records.[27]

[23] Anne J. Gilliland-Swetland and Philip B. Eppard. "Preserving the authenticity of contingent digital objects: the InterPARES project." D-Lib Magazine, July-August 2000.
[24] Preliminary report from the chair of the Preservation Task Force to the Director of the InterPARES project, March 30, 2000.
[25] ICA, Guide, p. 35.

Collection-based persistent object preservation

The information management architecture is being developed in the U.S. National Partnership for Advanced Computational Infrastructure. The Partnership is a collaboration of 46 institutions nationwide and 6 foreign affiliates, with the San Diego Supercomputer Center serving as the leading-edge technical resource. The research is addressing archival requirements for the preservation of records, including respect for provenance. Rather than focus on technological problems, the method focuses on the objects that are to be preserved. In this case, the objects are records and also collections of records, as organized within archival fonds at all levels of hierarchy.
The method of collection-based persistent object preservation consists of identifying the properties of the objects to be preserved; expressing those properties in explicit, abstract models; and applying those models to transform the objects into a technology-independent format suitable for long-term preservation. In the archival domain, the development of this method started with the conception of the essential properties of records expressed in the ICA Guide on electronic records; that is, "A record is recorded information produced or received in ... an institutional or individual activity and that comprises content, context and structure sufficient to provide evidence of the activity regardless of the form or medium."[28] The essential structure of a record is its documentary form. This form may be expressed in the digital format in which the record is stored, but it is not necessarily identical to the digital format. Therefore, a transformation of the record which replaces one digital format with another that is more suitable to long-term retention preserves the record, so long as it maintains the essential documentary form of the record. The immediate context of a record is its archival bond: the position of the record with respect to other records in the archival fonds. In our research, we have extended the list of essential properties of records beyond content, structure and context to include the appearance of the record. We are also addressing a special type of content that is unique to electronic records: hyperlinks. Persistent Object Preservation expresses the structure of records using eXtensible Markup Language (XML) Document Type Definitions. The method encapsulates records using the metadata defined in these models, transforming records into a format that is independent of any specific technology. The research has demonstrated that this method can be applied to collections of records as well as to individual records.
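As a toy illustration of the encapsulation idea, the sketch below wraps a record's content together with minimal context and structure metadata into a plain-text XML serialization. The element names and fields are invented for illustration only; they do not reproduce the project's actual DTDs or metadata models.

```python
# Toy sketch of persistent-object encapsulation: a record's content is
# wrapped with context metadata in a self-describing, plain-text XML form.
# Element names are invented; they do not reproduce any project DTD.
import xml.etree.ElementTree as ET

def encapsulate(record_id, fonds, series, content):
    """Wrap content plus its archival context into a technology-neutral form."""
    rec = ET.Element("record", id=record_id)
    ctx = ET.SubElement(rec, "context")  # the archival bond: position in the fonds
    ET.SubElement(ctx, "fonds").text = fonds
    ET.SubElement(ctx, "series").text = series
    ET.SubElement(rec, "content").text = content
    return ET.tostring(rec, encoding="unicode")

persistent_form = encapsulate(
    "rec-001", "Ministry of Works", "Contracts", "text of the record"
)
```

Because the result is simple structured text, any future XML-aware technology can interpret it without recourse to the hardware or software that produced the original record.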
That is, one can construct a Document Type Definition to capture and preserve the structure of any archival collection, of arbitrary complexity, from individual files through series and classes to entire archival fonds. The research is exploring different ways of preserving the appearance of records. One way is to use a technology known as Multi-Valent Documents to capture and retain a bitmapped image of the document. MVD enables the image to be retained not as a version of the document, but as a layer of the document object, modeled as an acyclic directed tree.[29] Another possible means of preserving appearance is through the eXtensible Stylesheet Language (XSL) associated with the XML standard. Using style sheets to capture the attributes of appearance is especially advantageous for types of applications, such as databases and geographic information systems, where stored data elements may participate in many different records. In such systems the records are likely to be expressed as views, forms, or reports which extract specific subsets of the data and present them in predefined formats. A different style sheet can be defined for each of these formats.

[26] Consultative Committee on Space Data Systems. Reference Model for an Open Archival Information System (OAIS). Draft Recommendation for Space Data System Standards, CCSDS 650.0-R-1. Red Book, Issue 1. May 1999. <http://ccsds.org/RP9905/RP9905.html>
[27] Kenneth Thibodeau, Reagan Moore, and Chaitanya Baru. "Persistent object preservation: advanced computing infrastructure for digital preservation." In: Proceedings of the DLM-Forum on Electronic Records. European Citizens and Electronic Information: The Memory of the Information Society, Brussels, 18-19 October 1999. Luxembourg: Office for Official Publications of the European Commission, 2000, pp. 113-118.
[28] ICA, Guide, p. 22.

The method extends beyond the preservation of archival collections of records over time.
It also addresses the key archival functions; notably, the accessioning of records into the archival repository, the establishment of intellectual control over the records, and the delivery or dissemination of the records to researchers. This extension of the persistent object approach is consistent with the basic premise of object-oriented methodology, which starts with the recognition that an object has behaviors or methods, as well as attributes. One of the essential behaviors of a record is that it occupies a specific position in relation to other records in the archival fonds. This behavior expresses the immediate context of the record and is the basis for arriving at its significant context; that is, the activity of which the record provides evidence.[30] The transformation of records into a persistent object format not only enables the records to be preserved indefinitely into the future, it also makes it possible to benefit from advanced technologies, some of which have not even been invented yet, to search, access and deliver the records in the future. This is made possible through the separation of context, structure and appearance in explicit schemas expressed in simple textual form. Over time, it will not be necessary to migrate the materials stored in persistent object form to new technologies, but only to interpret the schema metadata so that it can be used in future technologies.

Viability of the persistent object preservation method

The initiative to develop the collection-based persistent object method for preserving electronic records is still in the stage of research and development, and will remain in this stage for some time. Nonetheless, there are substantial reasons, in both the technical and the archival domains, to assume that it will be successful. In the domain of technology, two facts should be highlighted. First, the research is not developing any special technologies to suit archival needs.
Rather, it is building archival solutions on the basis of technologies which are seen as essential to the next-generation Internet and information infrastructure, and as keys to electronic commerce and electronic government. Archives should benefit, therefore, from widespread market support for the enabling technologies. Second, while the research addresses archival requirements specifically, the method has broad application in other areas, such as digital libraries, museums and collections of scientific data. Thus, archival institutions can collaborate with organizations in these other domains to develop, from the enabling technologies, solutions for long-term preservation and access. In the archival domain, the promise of the persistent object preservation method has been demonstrated in several empirical tests, applying the method to a variety of collections across a broad quantitative scale. These demonstrations involved bringing the collections into the archival information system from external sources; examining the documents, databases, images, geographic information systems and other digital objects tested in order to generate XML models; transforming the records and capturing collection organization according to these models; storing the transformed collections and related metadata; and retrieving and presenting the preserved records using technologies completely different from those which had originally been used to create and store the records.

Conclusion

The persistent object preservation method offers several advantages to archives. It provides a coherent and comprehensive framework that can be specifically tailored to archival requirements.

[29] Thomas A. Phelps and Robert Wilensky. "Multivalent documents: anytime, anywhere, any type, every way user-improvable digital document system." <http://elib.cs.berkeley.edu/>
[30] Reagan Moore, Chaitanya Baru, et al. "Collection-based persistent digital archives." D-Lib Magazine 6, nos. 3-4 (March and April 2000).
<http://www.dlib.org/dlib/march00/moore/03moore-pt1.html> and <http://www.dlib.org/dlib/april00/moore/04moore-pt2.html>

Through abstraction of the context, structure and appearance of the contents of digital objects, it provides a single but highly adaptable method that serves at once the need for preserving authentic electronic records over time, for adhering to archival principles, such as provenance, and for performing core archival functions. Moreover, the persistent object framework permits the simultaneous adoption of other techniques if the need arises. Clearly, a substantial amount of research, analysis, testing and evaluation needs to be completed before this method reaches its full potential. Nonetheless, the positioning of this method at the center of major developments in computer science and information technology offers great potential for making electronic records not so much a problem for preservation as an opportunity for archives to achieve their objectives to a greater extent and at a higher level than has been possible before now.

Responding to the Challenges and Opportunities of ICT: The New Records Manager
Seamus Ross, Director, Humanities Computing and Information Management, University of Glasgow

I. Introduction

The business activities of public and private sector organisations depend upon increasing quantities of knowledge, information, and data in digital form. Computers, software, and data pervade all aspects of our lives, from routines embedded in microchips which keep our cars and aeroplanes running, to the application programmes used to analyse data to establish our credit worthiness (or riskiness) when we seek mortgages or other loans, to applications which manage environmental systems in large buildings or control manufacturing equipment.
In many of these cases the data collected are used to further refine, in an incremental manner, the very applications which analyse the data; this is particularly true of applications in the financial sector. Nearly every organisation is in the data, information, and knowledge business, a point stressed by Thomas Stewart, a pioneer in the field of intellectual capital. In Intellectual Capital: The New Wealth of Organisations he argued that:

Every organisation houses valuable intellectual materials in the form of assets and resources, tacit and explicit perspectives and capabilities, data, information, knowledge, and maybe wisdom. However, you can't manage intellectual capital--you can't even find the soft forms of it--unless you can locate it in places in a company that are strategically important and where management can make a difference. (Stewart 1997, 75)

Unfortunately, Stewart does not appear to be aware of professional records managers, or if he is, he does not give them a place in his vision of the new organisation. Yet it is records managers who have the skills and the experience to manage this intellectual capital. Most likely Stewart did not include records managers in his vision because, like most non-records professionals, he views them as keepers of information resources which are no longer central to the running of the organisation itself: information at the end of the business life-cycle, basically corporate memory. The root cause of this problem rests firmly at the door of records managers, a concern voiced by many records managers themselves. For example, in 'At the end of the life cycle: electronic records retention', David Stephens, Director of the Records Management Consulting Division of Zasio Enterprises, lamented the failure of the records management community to develop and implement suitable electronic records management strategies (1997, 108).
Records managers both curate the records that ensure regulatory compliance, carry evidential value in the event of litigation, and provide competitive advantage through their recurring value, and they manage the storage of records in ways that could help to alleviate the uncontrolled explosion of records common in most commercial and public-sector organisations. Surprisingly only 32 per cent of the 200 UK companies

1 UKLOOK Tampere Programme 1998-9, programme sponsored by the British Council and the University of Tampere.

XML for the Preservation of Electronic Record-Keeping Systems (XML per la conservazione dei sistemi documentari informatici)
Maria Guercio - Università degli Studi di Urbino

Why XML for archives?

For some years now there has been great interest in markup languages on the part of administrators and custodians of cultural heritage, in particular archivists and librarians, who seem finally to have found a standard adequate to the complex requirements of describing, communicating and keeping digital documentary memory. In the specific case of electronic archives, despite the efforts made by numerous institutional and academic research centres [31], convincing results have not yet been achieved either in the definition of strategies or in the development of procedures and software solutions, above all as regards the problems of preservation over time. The preservation function in a digital environment is, after all, one of the most difficult and demanding tasks, for a series of reasons, chief among them the contrasting and apparently irreconcilable nature of the objective itself: to maintain the certain integrity [32] of electronic records and, at the same time, to ensure an accessibility which, because of technological obsolescence, requires continual interventions of copying, conversion and migration, and therefore continual changes in the structure of the bits that make up the record.
For archivists this represents a real challenge, one that requires first of all the definition of a solid conceptual and methodological basis and theoretical frame of reference, but also a major organisational and technological effort, substantial financial resources, highly qualified technical staff, and the development of software products capable of guaranteeing the creation and processing of records through routine procedures compatible with the requirement of their permanent preservation. As regards the theoretical framework on digital archives in particular, after a phase of lively debate that involved different schools of thought and stimulated reflection in many quarters, above all in the English-speaking world, the panorama of studies conducted in recent years has on the one hand simplified as far as research initiatives of international weight are concerned, while on the other it shows a discouraging level of fragmentation and redundancy [33], given the multiplicity of local initiatives which do not yet seem to contribute significantly either to theoretical reflection or to the preparation of exportable operational tools. On the other hand the question, as has been said, is very demanding and can only be answered by investigative work that brings together different disciplinary sectors and adequate resources. It is therefore no accident that the only two research initiatives currently active in this field (mutually linked and cooperating) are those that have obtained the support of the national scientific institutions of North America: the international InterPARES project [34], conducted by the school of archival studies of the University of British Columbia, and the NPACI-NARA project, supported by the National Archives in Washington and by the University of California [35]. As far as the theoretical and methodological aspects are concerned, the Canadian project, in which Italy participates with its own team of researchers and institutions, is without doubt the most significant research initiative. The project grows out of the conviction that a definitive and comprehensive solution to the problem of documentary preservation requires a global commitment by the international community and serious, effective cooperation between different disciplines and professional fields, with a view to arriving at a common definition of: strategies for the appraisal and selection of electronic records, identifying when and how responsibility for permanent preservation is transferred; standards and rules for storage media; principles and procedures for the authentication of electronic records during conversion, copying and migration; descriptive criteria consistent with the archival nature of the material treated, with the requirements of scholarly research and with the access needs of non-specialist users; and standards and procedures for the protection of privacy and for copyright. The research is, as noted, in full progress, but it has already made it possible to draw up a first schema of the logical components that form the structure of electronic records [36], now being validated through surveys conducted on a variety of electronic systems. As regards, instead, the choice of tried and tested methods for organising and managing the preservation function in practice, the uncertainty is considerable. The solutions suggested by the experts are far from consolidated, generally very costly and, for now, without field verification.

31 In recent years numerous studies have been conducted on electronic records, which however have concentrated - with interesting results - above all on the problem of the creation of electronic record-keeping systems. The investigations of greatest international importance have been, in particular, those conducted by the University of Pittsburgh and by the University of British Columbia (Vancouver, Canada), both concluded in 1997. The Canadian research was carried out in agreement with the United States Department of Defense which, at the conclusion of the work, drew up the rules for the certification of records-management software intended for the US federal administration. For further information see the materials available at the following addresses: http://www.lis.pitt.edu/~nhprc/ for the University of Pittsburgh research; http://www.slais.ubc.ca/users/duranti/ for the University of British Columbia research; and http://jitc.fhu.disa.mil/recmgt/ for the standard defined by the US Department of Defense, "Standard 5015.2 - Design Criteria Standard For Electronic Records Management Software Applications".

32 Ken Thibodeau, Reagan Moore, Chaitanya Baru, Persistent object preservation: Advanced computing infrastructure for digital preservation, in Proceedings of the DLM-Forum on electronic records. European citizens and electronic information: the memory of the Information Society. Brussels, 18-19 October 1999, Luxembourg, Office for Official Publications of the European Communities, 2000, pp. 113-120.

33 The outcome of the survey conducted by the European Union on the existence of guidelines for digital preservation across the whole cultural-heritage sector is significant: after a year of work and numerous interviews, surveys and analyses, the working group established that as of 1998 no guidelines existed capable of addressing the overall problem of digital preservation in the cultural sector, and that "long-term perspectives on preserving access to digital archives still require fundamental work". Cf. Marc Fresko, Kenneth Tombs, Digital preservation guidelines: the state of the art in libraries, museums and archives, Brussels, European Commission, DG XIII/E, 1998.
They lean either towards the preservation of the original hardware and software technologies or towards developing programs that emulate the original technological platforms. In both cases these are interventions that demand substantial resources, do not eliminate the risky and demanding activities of migration, and do not reduce the difficulties of users forced to contend with tools that are obsolete also in terms of presentation and search facilities. The majority of experts therefore consider such hypotheses insufficient and stress the urgency of developing feasible and effective alternatives [37]. Among the proposals that have so far gained the widest consensus and promise interesting developments usable in diverse operational contexts, including small-scale ones, the preservation in technology-independent formats - based on the use of markup languages (SGML/XML) - of the original representation of the records and of their context and relationship metadata seems destined, in the medium term, for significant development. Here too the scant literature available on the subject [38] offers for the moment no unambiguous and convincing indication of the road to follow, limiting itself to identifying the advantages and disadvantages of each of the hypotheses put forward. Moreover, the scarcity of resources and the insufficiency of the experience and knowledge so far accumulated by the competent institutions have constituted almost insurmountable obstacles to the rapid identification of broadly acceptable solutions. The alarm of archivists for the future of digital memory is therefore not out of place, even though the last of the solutions proposed, on which this contribution dwells in particular, seems to answer many of the requirements of permanent preservation and to offer some hope against the most widespread concerns. To understand the nature of the problem better and to assess the expectations the new standard has raised in the national and international archival community, it is nevertheless necessary to identify, if only in broad outline, the theoretical and practical issues tied to the "archival" management, use and preservation of electronic record-keeping systems [39].

34 The investigation is the continuation of the work carried out in the course of the previous, already mentioned project on the creation and management of active records, and addresses the specific problem of the long-term preservation of the integrity and authenticity of electronic records. Eleven countries take part in the research (Australia, Canada, China, France, Ireland, Italy, the Netherlands, Portugal, the United States, Sweden, the UK) together with a team of pharmaceutical companies. The project materials (the project will conclude in February 2002) are available at http://www.interpares.org. Some documents have recently been published in the journal "Archivi per la storia", 1999, no. 2.

35 Research materials are available at the following address: http://www.sdsc.edu/NARA/Publications/collections.html.

36 See in particular the materials produced by the project's Authenticity Task Force: Research methodology statement, Template for analysis and Case study protocol and questionnaire, in "Archivi per la storia", 1999, no. 1-2, pp. 263-337.

37 Cf. the cited report by Marc Fresko, Kenneth Tombs, Digital preservation guidelines, and the document proposed as an ISO standard by the Consultative Committee for Space Data Systems, Reference Model for an Open Archival Information System (OAIS).

38 An accurate bibliography can be found in Marc Fresko, Kenneth Tombs, Digital preservation guidelines, cit. See also the indications contained in the Nota bibliografica sul documento elettronico. 1986-1998, published in the cited issue of "Archivi per la storia", 1999, no. 1-2, pp. 347-375.
The objects that must be identified and maintained for one to be able to speak of "archival preservation" are numerous and structurally articulated. It is not enough, in fact, to save the stream of bits that defines a record; it is also indispensable to preserve the information that makes explicit its representation and its links within the record-keeping system [40]. Equally essential are the modalities of representation and communication, which imply the adoption of uniform parameters of description and access - anything but simple to determine in a field which, not by chance, has come very late, and with much resistance compared with other similar disciplines, to the acceptance of standardised practices. The users of archival systems are, moreover, highly differentiated in their degree of knowledge of the record-keeping system, in their modes of querying, and in the very nature of their research. Compared with other areas of application of information technology, the world of archives is characterised by the complex and layered articulation of documentary production [41], whose peculiar original nature must be rigorously safeguarded to guarantee the very possibility of future research. For those who design automation projects in this field, the specificity of such material constitutes at once a constraint and an opportunity: the potential for continuous transformation offered by information technology counts as such for archivists only when it bears on the informational richness and the ease of retrieval of logically structured contents and meaningful relationships. It becomes instead a serious risk to be eliminated or, at least, controlled and limited when the objective is to guarantee the authenticity and integrity of the system over time.
To give just an idea of the articulation of the record-keeping system and of the structured nature of its contents and relationships, it should be recalled that:
• the archive is not a simple set of documents, but a complex whole of entities in turn made up of sub-partitions of different kinds (sub-fonds, series, sub-series, file, sub-file, documentary unit);
§ each sub-partition is identified and described by means of general information shareable also in an international context [42] (archival reference code, title, covering dates, extent, etc.), supplemented by any further significant data;
§ the documentary units themselves cannot be reduced to simple textual information, but are structured into a series of components recognised within a general schema and easily identifiable [43] (author, addressee, date, subject, text, indication of attachments, etc.);
§ such units, in their turn, may be organised by specific types that share a larger number of significant data elements than the common general schema;
§ the information concerning the juridical, administrative and organisational context of an archival system is relevant to the very intelligibility of the documentary evidence. Such contextual information (assignee office, office of provenance, type of administrative procedure/process, official responsible for the procedure) must therefore be captured by the software and retained permanently, both for juridical purposes and for the administrative and historical understanding of the material produced.

39 It is indispensable always to keep the activities of storage distinct from those of preservation, which imply the decision to maintain the holdings over the long term, the stability of the records and of their archival and administrative relationships, and an access and retrieval system designed to meet the research needs of a broad community.

40 All authors underline the difficulty of both of these objectives: on the one hand the necessity of migration, which imposes itself inevitably and repeatedly in preservation activity, may introduce even significant changes in the bit stream; on the other, the increasingly widespread object-oriented representation tends to make "context" information transparent to users, whereas it must instead be identified and treated explicitly. See on this point Consultative Committee for Space Data Systems, Reference Model for an Open Archival Information System (OAIS), cit., sections 3.3 and 3.4.

41 For an analysis of electronic record-keeping systems and of the functional requirements that guarantee their correct creation and management, cf. Maria Guercio, La formazione dei documenti in ambiente digitale, in Gli archivi del futuro. Il futuro degli archivi. Cagliari, 29-31 ottobre 1998 (monographic issue of "Archivi per la storia", 1999, no. 1-2), pp. 21-58.

42 The reference here is to the standard for archival description, the International Standard for Archival Description, approved by the International Council on Archives in 1996 and recently updated by the Committee for Archival Description. The text is available at the following address: http://www.anai.org.

43 The discipline that studies the document and its physical and logical components is diplomatics, which in recent years, thanks to a number of Italian archivists, has widened its traditional field of enquiry, formerly limited to the
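The layered structure just outlined lends itself naturally to a markup representation. As a purely illustrative sketch - the element names (fondo, serie, fascicolo, unita_documentaria) and the sample values are invented for this example, not taken from any official DTD - an archival hierarchy carrying the shared descriptive fields mentioned above could be serialised with Python's standard library as follows:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup of an archival hierarchy: a fonds containing a
# series, a file (fascicolo) and one documentary unit, each carrying the
# shared descriptive fields named in the text (reference code, title,
# covering dates, author, addressee, date, subject).
fondo = ET.Element("fondo", segnatura="IT-ACS-0001")
ET.SubElement(fondo, "denominazione").text = "Carteggio generale"
serie = ET.SubElement(fondo, "serie", segnatura="IT-ACS-0001.3")
ET.SubElement(serie, "estremi_cronologici").text = "1998-2000"
fascicolo = ET.SubElement(serie, "fascicolo", segnatura="IT-ACS-0001.3.12")
unita = ET.SubElement(fascicolo, "unita_documentaria")
for campo, valore in [("autore", "Ufficio protocollo"),
                      ("destinatario", "Direzione generale"),
                      ("data", "2000-10-30"),
                      ("oggetto", "Convocazione seminario")]:
    ET.SubElement(unita, campo).text = valore

# The serialised form is plain text, independent of any platform.
xml_text = ET.tostring(fondo, encoding="unicode")
print(xml_text)
```

The point of the sketch is that both the partitions (the nesting of elements) and the shared descriptive data (child elements and attributes) survive as self-describing text, readable without the software that produced them.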
One of the central issues for keeping and accessing electronic archival records over time is therefore to ensure the stable maintenance, in a secure environment, of the multiplicity of structured information - of metadata attributes and relationships - that made the record-keeping system functional in its active phase and that, for this reason, will be preserved and made available to users for future historical, scientific and administrative research. This is not the place for an in-depth treatment of the much-debated theoretical question of the role and treatment of archival metadata. The theme - on which different schools of archival science have discussed and still discuss without yet finding a common point of view - concerns how these meta-information elements should be handled, both in the active phase of the record-keeping system and during the transfer of records destined for permanent preservation to the competent archival institutions [44]. No author denies their strategic value in the archival field, both for the structural complexity of record-keeping systems and for the uniqueness of the elements that constitute them. Unlike other materials, for example library holdings, archival records and, even more so, their contextual references vary in every organised structure and for each specific arrangement. Even where several copies of the same document are kept in different areas of the same organisation, it is almost always necessary to redefine and maintain the contextual data, which change according to the specific procedures of classification and arrangement. The individual components (the documents), but above all their mutual relationships (the archival bond), have values of their own within an archive which translate into an autonomous, well-defined management of metadata.
The simplest road - though not necessarily the most adequate - is the one advocated by some North American researchers (in particular David Bearman) and later taken up by the Australian archival school, which holds it possible to encapsulate all the metadata within the individual documentary entity [45]. In reality this is a proposal that simplifies without solving and therefore impoverishes the problem of preserving digital records, since it flattens the informational structure to be maintained onto just two levels: the documents on one side, the metadata identifying the individual document (for example, the data of the record profile) on the other. In reality the informational components that are created in the life of an archive, and that are indispensable for its survival as cultural heritage, are far richer in articulation and require management procedures more attentive to the historical layering of activities and data, regardless of the method and the tools employed for their maintenance. The question to be addressed is therefore that of identifying the metadata structures and logical schemas corresponding to the informational objects one intends to safeguard for historical purposes and to the system activities and functions of which a historical trace must be kept over the long term (organisational charts of the records creator, classification plans and file registers, registration and authentication systems, lists of administrative procedures/processes, records of copying, conversion and migration interventions, etc.).

43 (continued) medieval and modern eras, eventually extending its scope to include not only contemporary documents but also electronic records. Cf. Paola Carucci, Il documento contemporaneo. Diplomatica e criteri di edizione, Roma, Nuova Italia Scientifica, 1987 and, more recently, Luciana Duranti, Diplomatics: new uses for an old science, Washington, SAA, 1999.

44 Intervention on meta-information must be "early", that is, it must concern the active system, since most of the information that guarantees continuity of access is available exclusively in the current archive (for example the data on the administrative structure that produces the records, the logical schema of a database, the documentation of an application program). Late recovery interventions are sometimes possible, but they are far more costly and demanding.

45 David Bearman, Ken Sochats, Metadata Specifications Derived from Functional Requirements: A Reference Model for Business Acceptable Communications, available at www.lis.pitt.edu/~nhprc.papers/model.html. On the conclusions reached by Australian archivists on the subject, cf. Sue McKemmish, Australian Research and Development Initiatives, in "Archivi per la storia", 1999, no. 1-2, pp. 197-206.

In conclusion, the reference information about records destined for preservation treatment is, in the case of archives, so extensive that it needs to be organised by functional components (administrative-context metadata, such as those relating to the organisational structure of the creator of the record-keeping system; documentary-context metadata, for example the information that identifies the classification system historically in use or the registration/identification data of the documents; etc.)
so as to ensure the integrity of the individual documentary and archival units and of their contextual relationships [46], but also the long-term maintenance, in stable forms, of the original means of retrieving the records and of their accessibility, that is, of the capacity of machines and human beings to understand and process the informational objects. The functional and technological requirements to be implemented in building record-keeping systems also include compliance with the rules that, at national level, establish the requirements for the juridical validity of records in electronic form and the modalities of authentication [47]. The chief issue is therefore that of reconciling the guarantee of integrity with the requirement of accessibility, allowing a "flexible and unlimited reuse" of the records [48]. A binding constraint, finally, is the containment of costs and the scalability of solutions, given the meagre financial resources normally available to the archival institutions entrusted with the permanent preservation of documentary memory, including the digital memory that public administrations and the private sector have already begun to produce in considerable quantity. It is evident that the possibilities of reuse are tied to a significant development of standards which, moreover, bring about an effective containment of costs and of the risks of loss (in particular as regards the conversion/migration of applications and the duplication of information). XML seems to offer a widespread, low-cost and scalable method for dealing with the diversification and fragmentation of documentary production and its articulations, with its informational richness, and with the burden, so far unsustainable for the limited budgets of cultural institutions, of technological innovation.
The specific potential of XML in this field concerns, in particular, the management of documents and meta-information independently of software, both for exchange and retrieval and for preservation purposes. As we have seen, one of the crucial points is the standardised representation of records [49] independently of the working platforms used, capable therefore of confronting, for as long as possible [50], the risks and burdens deriving from technological obsolescence. Recent years have in fact witnessed an exponential growth of diverse application environments and a proliferation of formats for the creation of electronic documents which, for preservation purposes, must necessarily be converted into standard products capable of guaranteeing connectivity. Among these, not by chance, considerable success is being achieved by the solution provided by markup languages, which identify and maintain, with tools independent of hardware and software, structured metadata that are predefined yet at the same time flexible, shareable and susceptible of detailed treatment. It is of course indispensable to develop conceptual schemas and specific grammars for the creation, management and keeping of records, identifying the information necessary for their maintenance and use: from organisational-context data to data on the arrangement of the records, from their organisation into series and files to the tracking of preservation interventions, and so on.

46 The definition of the metadata significant for ensuring the long-term integrity of electronic records and their accessibility is a crucial question for archivists and has a strictly theoretical dimension, even though it cannot do without a careful assessment and an adequate use of technology. On the theme of defining requirements for electronic records management (Model requirements for the management of electronic records), a research group headed by the London firm Cornwell Affiliates, drawing on international experts and financed by the European Union within the IDA programme, is currently at work. The InterPARES project already cited will also have among its results - as already recalled - the identification of the set of information necessary to guarantee the authenticity and integrity of electronic records.

47 In the case of the Italian public administrations, the automation of record-keeping systems must take account of European rules and of a series of very complex national measures still being finalised: dpr 513/97, dpr 428/98, dpcm 8 February 1999, technical rules 24/98. The applicative technical rules of dpr 428/98 are awaiting approval, while it is no accident that the provisions on the preservation of electronic records are only at a first (wholly unsatisfactory) drafting stage.

48 Enrico Seta, Digitalizzazione e linguaggi di marcatura, in "Bollettino AIB", 1999, p. 72.

49 Note that in the draft technical rules prepared by AIPA and by the Department of the Civil Service in application of dpr 428/98 on the electronic management of records, currently awaiting approval (cf. http://www.aipa.it), article 16 (legibility of documents) establishes that "each administration shall guarantee the legibility over time of all documents transmitted or received, adopting the formats provided for in article 6, paragraph 1, letter b) of AIPA resolution 24/98, or other non-proprietary formats". The cited article 6 refers specifically to the PDF and SGML formats.
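Among the metadata named above, the tracking of preservation interventions (copying, conversion, migration) is itself information that must outlive any single platform. A purely illustrative sketch - the element and attribute names are invented for this example - of such an audit trail kept as hardware- and software-independent XML metadata:

```python
import xml.etree.ElementTree as ET
from datetime import date

# Illustrative audit trail of preservation interventions kept as XML
# metadata alongside a record; element names are hypothetical.
def registra_intervento(storia, tipo, formato_da, formato_a, quando):
    """Append one preservation intervention to the record's history."""
    ev = ET.SubElement(storia, "intervento", tipo=tipo)
    ET.SubElement(ev, "data").text = quando.isoformat()
    ET.SubElement(ev, "formato_origine").text = formato_da
    ET.SubElement(ev, "formato_destinazione").text = formato_a
    return ev

storia = ET.Element("storia_conservativa", documento="prot-2000-01234")
registra_intervento(storia, "conversione", "WordPerfect 5.1", "SGML",
                    date(1999, 6, 1))
registra_intervento(storia, "migrazione", "SGML", "XML", date(2000, 10, 30))

# The trail itself is plain text, readable without the original software.
print(ET.tostring(storia, encoding="unicode"))
```

Because the trail is ordinary marked-up text, each future migration can append to it without breaking the record of earlier interventions.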
The use of XML, however, opens up further and very significant possibilities for the development of electronic record-keeping systems, above all because it permits, beyond the management of references external to the document and to its partitions, also the treatment of the logical and semantic structure of its contents. These developments can translate into the decision to:
§ promote, within an organisation, the rationalisation and simplification of document types by defining specific representations, with the aim of optimising the automatic processing of documents and guaranteeing the consistency, quality and uniformity of the materials [51];
§ develop tools for the retrieval and reuse of documents (or of their internal components) for the distribution/sharing of contents meant to last over time;
§ manage multiple formats;
§ use XML validation mechanisms also for security and integrity purposes;
§ control and optimise document-management cycles.
It is important to stress, however, that XML can play a significant role in the automation of the records field, also in terms of cost containment and efficiency of results, only if it is accompanied by a widespread use of DTDs. Document Type Definitions - as Charles Goldfarb and Paul Prescod recently recalled [52] - improve "the permanence, longevity and broad reuse of one's data, together with the predictability and reliability of its processing". The development of DTDs, however, is above all a matter that puts problems of logical and conceptual structure back at the centre of design; these naturally require a seriously interdisciplinary approach and, above all, presuppose a genuine willingness to cooperate in the definition of common rules, if not of true sector standards.
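A DTD is what makes the agreed document structure machine-checkable. The fragment below is a hypothetical, minimal DTD for the general documentary-unit schema mentioned earlier (author, addressee, date, subject, text, optional attachments) - the element names are invented, not drawn from any official rule. Note that Python's standard-library parser accepts the internal DTD subset but checks well-formedness only; validating an instance against the DTD would require a validating parser such as lxml:

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal DTD for the general documentary-unit schema
# discussed in the text; element names are illustrative only.
DTD = """<!DOCTYPE documento [
  <!ELEMENT documento (autore, destinatario, data, oggetto, testo, allegati?)>
  <!ELEMENT autore       (#PCDATA)>
  <!ELEMENT destinatario (#PCDATA)>
  <!ELEMENT data         (#PCDATA)>
  <!ELEMENT oggetto      (#PCDATA)>
  <!ELEMENT testo        (#PCDATA)>
  <!ELEMENT allegati     (#PCDATA)>
]>"""

istanza = """<documento>
  <autore>Ufficio protocollo</autore>
  <destinatario>Direzione generale</destinatario>
  <data>2000-10-30</data>
  <oggetto>Convocazione</oggetto>
  <testo>Si trasmette la convocazione in oggetto.</testo>
</documento>"""

# The stdlib parser reads the internal DTD subset but does not enforce it;
# full DTD validation needs a validating parser (e.g. lxml).
root = ET.fromstring(DTD + istanza)
print(root.find("oggetto").text)
```

The design point the text makes is visible here: the DTD fixes the order and optionality of components once, so every producing and consuming application can rely on the same structure.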
It is no accident, in fact, that the most promising areas of development are concentrated precisely in the definition, sector by sector, of international standards or of procedures normalised at national level. This is the case, for example, of the Encoded Archival Description [53] promoted by the Library of Congress, or of the circular being drawn up by the Autorità per l'informatica to define a DTD for the exchange of documents over the network between public administrations [54].

50 It has been observed that standards too are destined to undergo processes of evolution and hence of obsolescence, and that their success and diffusion require not only approval by the international bodies but also considerable investment on the application side, which an official act of recognition is far from guaranteeing.

51 The use of XML accentuates the importance of the semantic structure of documents, also because the standard makes it possible to distinguish clearly between elements and attributes, that is - in terms of the analysis of the archival document - between the data that constitute the general structure for each document type and the attributes understood as second-level information (properties of objects, not parts of them). Cf. on this point Charles F. Goldfarb and Paul Prescod, XML, Milano, McGraw-Hill Italia, 1999, p. 400.

52 Ibidem.

53 The Encoded Archival Description (EAD) is the result of a research project begun in 1993 by the University of California, Berkeley on the use of markup languages (at the time SGML) for the publication of finding aids in a digital environment.

54 The measure will conclude the long series of provisions concerning the electronic management of administrative records begun with dpr 428/1998 on the keeping of the electronic protocol register. The technical rules implementing the cited dpr, awaiting approval by the Presidency of the Council of Ministers, devote an entire

The Italy-USA partnership project on XML methodology for the preservation of and access to electronic records [55]

The archival administrations of the technologically most advanced countries have long been expressing growing concern about their capacity to deal adequately with the future of digital memory. In 1998 the head of the United States archival administration, John Carlin, stressed that the exponential growth of the electronic records produced by the federal government (millions of files in a few years) was and is incompatible with the resources and tools available, and that the risk of the definitive loss of a large part of the contemporary documentary heritage demands an exceptional effort, not only in terms of investment in technological equipment but also in the search for solutions for the large-scale experimentation and verification of advanced technologies for the preservation of electronic records. From this concern arose the decision of the National Archives in Washington to take part in a demanding research programme launched by the University of California, the Distributed Object Computation Testbed (DOCT), to evaluate advanced computing solutions capable of managing large quantities of digital records [56]. One of the strong points of the NARA-DOCT/Electronic Records Management Project has been precisely the development of tools based on the XML standard for the migration of electronic records and of the metadata necessary to guarantee their accessibility and to prove their integrity.
La prima fase della ricerca che è stata condotta a partire dal 1° ottobre 1998 e che, per quanto riguarda il quadro concettuale di riferimento, è strettamente correlata al progetto InterPARES, ha già dato alcuni primi risultati significativi: § la definizione di un'architettura scalabile per gestire la migrazione dei supporti, § l'elaborazione di un modello informativo per trattare la migrazione dei dati di contesto. La sperimentazione si era, tuttavia, concentrata sul trattamento ai fini della conservazione permanente (nella ricerca si parla di un arco temporale di 400 anni) di un fondo archivistico costituito da oltre un milione di messaggi di posta elettronica conservati presso il National Archives. La fase successiva che si è aperta alcuni mesi fa grazie a un nuovo finanziamento di 300.000 dollari del National Historical Publications and Records Commission è destinata ad allargare il campo di indagine ad almeno tre grandi classi di documenti elettronici (documenti testuali, documenti composti, documenti GIS ) il cui accesso richieda l'uso di strumenti software. 
Il nodo centrale della ricerca, che corrisponde alla questione di fondo della conservazione delle memorie digitali, è quello di: § definire un meccanismo per la creazione parzialmente automatica della rappresentazione digitale dei documenti in forme indipendenti dal software e sostitutive di originali che non possono essere conservati a lungo termine per ragioni di obsolescenza, § predisporre un prototipo di strumento software indipendente dalle piattaforme, sufficientemente robusto, flessibile e scalabile (Archivists' Workbench Software Package), basato sull'utilizzo di XML in quanto standard emergente (e promettente) per la rappresentazione e lo scambio informatico sul web e fondato sui risultati ottenuti nel corso delle precedenti indagini condotte dalla Università della California relative a sistemi di sezione alle modalità di trasmissione e registrazione dei documenti informatici e introduce l'obbligo per lo scambio dei dati relativi alla segnatura di protocollo dell'utilizzo dello standard XML e delle DTD elaborate dal Centro tecnico per la rete unitaria della p.a. Nella bozza delle regole - che come si è ricordato sono a disposizione sul sito dell'Autorità (www.aipa.it) - l'articolo 19 stabilisce le informazioni da includere nella segnatura: oggetto, mittente e destinatario costituiscono le informazioni obbligatorie, cui si possono aggiungere i dati relativi alla persona o all'ufficio all'interno della struttura destinataria cui si presume sia affidato il trattamento del documento, l'indice di classificazione, l'identificazione degli allegati, il procedimento e il suo trattamento e tutte le informazioni che le amministrazioni specifiche vorranno concordare nell'ambito di rapporti reciproci. 55 Il progetto statunitense - NARA-NPACI, "Methodologies for Preservation and Access of Software-dependent Electronic Records" - è stato promosso nel 1998. 
Una seconda fase di durata triennale del programma di ricerca che ha ottenuto nuovi consistenti finanziamenti dal National Historical Publications and Records Commission è stata approvata nella primavera del 2000 con l'obiettivo specifico di affrontare i problemi di scalabilità delle soluzioni individuate e della loro utilità per ambienti di (http://www.sdsc.edu/NHPRC). 56 Il progetto è finanziato dall'US Patent and Trademark Office e dalla Defense Advanced Research Projects Agency. Per maggiori informazioni sul progetto cfr http://www.sdsc.edu/DOCT. 56 wrapper-mediator (cioè componenti software che operano come traduttori tra i formati nativi di una fonte informativa e un protocollo comune) anch'essi basati su XML. La scalabilità dei prodotti riguarda la capacità di rispondere anche alle esigenze di depositi archivistici di medie e piccole dimensioni. Un ulteriore sviluppo del progetto riguarda l'integrazione di software esistenti con le funzionalità realizzate con il prototipo. All'origine di questa scelta c'è la convinzione che i documenti elettronici possano essere considerati come fonti distribuite di informazione semi-strutturata, costituite da uno schema definito di componenti informative interne ed esterne al documento e da una serie di elementi passibili di variazione (il supporto, il contesto tecnologico, ecc.). 
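The registration-stamp fields prescribed by draft article 19 of the technical rules lend themselves to a small illustration. The following is a minimal sketch using Python's standard library; the element names are invented for the example and are not those of the official DTDs developed by the Centro tecnico per la rete unitaria della p.a.:

```python
# Minimal sketch of a "segnatura di protocollo" interchange record.
# Element names are illustrative only, not the official DTD vocabulary.
import xml.etree.ElementTree as ET

def build_segnatura(subject, sender, addressee, classification=None):
    """Build a segnatura carrying the mandatory items of draft article 19
    (subject, sender, addressee) plus an optional classification code."""
    root = ET.Element("Segnatura")
    ET.SubElement(root, "Oggetto").text = subject
    ET.SubElement(root, "Mittente").text = sender
    ET.SubElement(root, "Destinatario").text = addressee
    if classification:
        ET.SubElement(root, "Classifica").text = classification
    return ET.tostring(root, encoding="unicode")

xml_doc = build_segnatura("Richiesta dati", "Comune di Roma", "AIPA")
print(xml_doc)
```

The optional items listed by the draft (office of the addressee, attachments, proceeding, and so on) would be added in the same way, which is precisely what makes an agreed DTD necessary for interoperability between administrations.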
The American project rests on a series of presuppositions and preconditions:
§ the consideration that ASCII or Unicode encoding for textual information, and bitmap encoding for images, are independent of technological infrastructures,
§ the assumption that the representation of structured information by means of markup languages (XML) is platform-independent, easy to access, and makes records self-describing,
§ the definition of a methodology for creating information sources that substitute for the originals, based on the development of software "containers" (wrappers) structured so that:
§ all the metadata describing the documentary contexts take the form of XML documents with their own DTDs,
§ all textual information is converted into XML documents,
§ all images are converted into bitmaps,
§ all references to images and to other records within an archival document are converted into permanent links, themselves represented in an XML-compatible format.

One aspect of the project that deserves specific reflection is the need to provide for changes to the DTDs (produced even by automatic procedures) following the conversion, migration or copying of the digital materials by the archival institutions entrusted with them. Some results have already been achieved and concern, as noted above, the structure of the information model for the permanent preservation of archival materials [57].
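The wrapper methodology outlined above can be sketched as follows. This is a hedged illustration under invented names (the element vocabulary and the `urn:archive:` link scheme are assumptions for the example, not taken from the NARA-NPACI prototype):

```python
# Sketch of the "wrapper" idea: package a record's metadata and text
# as XML and rewrite internal references as persistent links.
# All names here are illustrative, not the project's own schema.
import xml.etree.ElementTree as ET

def wrap_record(record_id, text, metadata, references):
    wrapper = ET.Element("ArchivalObject", id=record_id)
    meta = ET.SubElement(wrapper, "Metadata")
    for key, value in metadata.items():
        # contextual metadata become XML elements governed by a DTD
        ET.SubElement(meta, key).text = value
    # textual content is carried as XML, independent of the native format
    ET.SubElement(wrapper, "Content").text = text
    links = ET.SubElement(wrapper, "Links")
    for ref in references:
        # internal references become permanent, software-independent links
        ET.SubElement(links, "PersistentLink", target="urn:archive:" + ref)
    return ET.tostring(wrapper, encoding="unicode")

wrapped = wrap_record("r001", "Body of the record.",
                      {"Provenance": "Ministry X", "Date": "1999-05-01"},
                      ["r002", "img-17"])
print(wrapped)
```

In this reading, images would be stored as bitmaps alongside the wrapper and referenced through the same persistent-link mechanism, so that the whole package remains readable even after the producing software is gone.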
In particular, the project identifies at least three clusters of elements that must be maintained in the system (alongside the individual documentary entities):
§ the logical schema that organizes the essential attributes, namely:
§ the metadata on individual records (digital object representation), which define their structure, physical context and provenance,
§ the metadata on the organization of the archive, which include the various kinds of contextual information (data collection representation), in turn organized into subsets,
§ the presentation metadata (presentation representation), which make it possible to preserve different user interfaces, in particular the original one,
§ the physical description of the attributes within the database of the archival repository,
§ a data dictionary for the semantic definitions of the attributes.

As even this brief presentation shows, the researchers are aware of the great complexity of the information structure of the archive and of the meta-information that must be identified, maintained and managed over time in order to fulfil the task of preservation. The most delicate activities concern not so much the technological solutions as the semantic problems, that is, the identification and use of the logical components and the definition and articulation of the subsystems. To obtain high-quality results on this research terrain, a command of archival principles and methods is needed, together with solid experience gained in different environments, traditions and jurisdictions.
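A compact way to see the three nuclei side by side, together with the attribute dictionary, is as a nested structure. The field names and sample values below are a hypothetical reading of the model, not the project's actual schema:

```python
# Hypothetical miniature of the persistent-archive information model:
# three metadata nuclei plus a data dictionary of attribute semantics.
archive_model = {
    "digital_object_representation": {      # per-record metadata
        "structure": "e-mail message",
        "physical_context": "tape copy, ASCII encoding",
        "provenance": "Agency Y, office Z",
    },
    "data_collection_representation": {     # archive-level context, in subsets
        "fonds": "Agency Y correspondence",
        "series": ["incoming", "outgoing"],
    },
    "presentation_representation": {        # preserved user interfaces
        "original_interface": "mail-client folder view",
        "alternatives": ["web listing"],
    },
    "data_dictionary": {                    # semantic definitions of attributes
        "provenance": "the office that created or received the record",
        "fonds": "the whole of the records of a single creator",
    },
}

# every nucleus must be kept alongside the records themselves
required = {"digital_object_representation",
            "data_collection_representation",
            "presentation_representation"}
assert required <= set(archive_model)
print(sorted(archive_model))
```

The point of the sketch is only that the three representations are distinct objects that the repository must carry in parallel with the records, and that the data dictionary is what gives the attribute names a stable meaning over time.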
It is for this reason that the American researchers have welcomed the proposal of collaboration from those Italian institutions that have for years shared the same concerns about the preservation of electronic records and the same interest in the potential of markup languages, and that have in any case started an analogous programme of work [58]. The Italian project, promoted by the Istituto di studi per la tutela dei beni archivistici e librari of the Università di Urbino and supported by the Ufficio centrale per i beni archivistici, the Associazione nazionale archivistica italiana and the Consorzio Roma Ricerche, concerns in particular the verification, in the European context, of the functional requirements for the preservation of digital archives, and the definition of a methodology based on the treatment of metadata through the use of XML and the development of DTDs, in close connection with the InterPARES research [59] and the wide-ranging NARA-NPACI investigation. The most significant commitment, for which direct collaboration with the United States working group is planned, concerns the identification and structuring of the attributes needed to guarantee the authenticity, integrity and long-term accessibility of electronic records. The collaboration is based on the analysis of the research materials, on a shared evaluation of the method developed, and on the joint organization of seminars and workshops. It is too early to assess the outcome of a relationship that has only just begun, although it can already be said that the initiative will allow a very concrete and operational exchange on vital aspects of archival research.

[57] Reagan Moore, Chaitan Baru, Arcot Rajasekar, Bertram Ludaescher, Richard Marciano, Michael Wan, Wayne Schroeder and Amarnath Gupta, Collection-Based Persistent Digital Archives. Part I, in "D-Lib Magazine", 6 (2000), n. 3, available at http://www.dlib.org/march00/moore.
One consideration cannot, however, be passed over in silence, concerning the difficult conditions in which research is carried out in Italy today, above all in fields of limited visibility: against the substantial and continuous financial investment of the North American institutions, the investigative work in our country consists of almost individual initiatives sustained by the very modest resources of the cultural institutions (in this case the archival administration and the association of Italian archivists). And yet safeguarding the documentary memory of the future, and the great risks that threaten it, is a vital theme for any civil community with a sense of its own historical dimension or, at least, of its own continuity. It is true, unfortunately, that the time horizons of individuals and administrations grow ever shorter, and that only specialists in the field still seem to care about the costly, demanding and hardly remunerative problems of memory. Fortunately, technology has already shown more than once that it can find answers even to the problems for which it is itself responsible. XML is precisely a tool that opens encouraging prospects for countering the risks of loss and corruption of digital information.

[58] In particular, the archival administration and the Consorzio Roma Ricerche have long been conducting a study, with some concrete results already produced, on the use of SGML/XML for the retrospective conversion of archival finding aids. Recently, for example, the digitization of the Guida generale degli Archivi di Stato italiani was carried out using the XML format. See http://www.maas.ccr.it/cgi-win/h3.exe/aguida/findex.it, with particular reference to the sections entitled "La storia della Guida" and "Il progetto informatico".
[59] The present writer, besides serving as director of the Urbino institute, is also the coordinator of the Italian team collaborating in the InterPARES research. See M. Guercio, La ricerca InterPARES. Lo stato del progetto, in "Il mondo degli archivi", 1999, 1, pp. 10-14; Id., Il futuro per le memorie digitali, in "Autorità per l'informatica nella pubblica amministrazione, Notiziario", 2000, 1, pp. 50-55; Id., Qualche informazione sullo stato di avanzamento del progetto Inter-PARES, in "Il mondo degli archivi", 2000, pp. 47-48.