What is encoding?

In the context of an electronic document, the word encoding refers to the addition of markup to the document in order to store and transmit information about its structure, content or appearance. The term markup originated within the printing and publishing trades, where it referred to instructions added to copies of texts on paper in order to indicate the layout and typography required for a printed text. More recently, and especially when dealing with electronic documents, a distinction has been made between presentational markup of texts, which is concerned with their visual appearance, and descriptive or logical markup, which is concerned with their logical structure. Encoding a document with presentational markup is undertaken in order to tell a computer what the document should look like (for example, what font should be used in a particular part of the document, or where the margins should be set). Encoding a document with descriptive or logical markup is intended in the first instance to identify the meaning of particular parts of the document (for example, to indicate which words constitute a heading or title).

What is the difference between proprietary and non-proprietary encoding?

Proprietary software products, such as mainstream word-processing applications, use encoding to allow the creators of documents to give instructions on the formatting of each part of the document. Typically such instructions are issued by clicking a button or selecting from a drop-down menu. When such instructions have been given, the visual presentation of the document can be observed on screen but the underlying encoding is normally invisible to the user, although many products allow users the option of viewing at least some of the codes if they wish to do so. Each supplier of proprietary software uses its own set of codes, and in many cases these are unique to a particular software application. When documents are created using proprietary software, proprietary codes are embedded in the documents and it is almost certain that some will be embedded at a deeper level than users can view.

Proprietary encoding causes no difficulties while documents are maintained where they can be viewed or edited using the software that was employed to create them, but if it becomes necessary to transfer them to a different computing environment there is no guarantee that the encoding will be recognised in the new environment. Dependence on proprietary encoding gives rise to significant problems when documents need to be shared between users working in different computing environments, or when documents need to be preserved beyond the life of the software product used to create them. Only too often, attempts to use a document with different software will fail, or the user will be presented with a defective rendering of the document that fails to capture the full characteristics of the original.

One possible solution is to use plain text formats. Conversion to plain text strips out the proprietary codes, but as a result document formatting will be rudimentary or non-existent unless non-proprietary codes are substituted.

Non-proprietary encoding allows plain text documents to be encoded in a way that is platform-independent and can thus be recognised across a range of computing environments. The best known system of non-proprietary encoding is HyperText Markup Language or HTML, which is used for publishing text on the World Wide Web. Documents encoded using HTML can be viewed in almost any web browser, regardless of the computing environment in which the user is working.

All the encoding mechanisms discussed so far are primarily concerned with presentational markup. HTML, like the encoding mechanisms in proprietary word-processors, is in practice used chiefly to control appearance rather than to record the logical structure of documents. In applying such encoding, the main consideration is how the document should appear when viewed on a computer screen or on a printed page.

However, non-proprietary encoding can also be used for descriptive markup based on the meaning of the document content. Standard Generalized Markup Language or SGML was developed in the 1980s as a software and hardware independent metalanguage to support encoding of the meaning of document components as well as their appearance. It was adopted by the International Organization for Standardization (ISO) as ISO 8879. Extensible Markup Language or XML was developed in the late 1990s as a simplified subset of SGML and is increasingly widely used.

These markup languages can be employed at the time of document creation or can be applied retrospectively to existing documents.

What are plain text formats?

A plain text file is one which has no invisible control characters or formatting commands, but only printable text characters such as letters, numbers or symbols. Plain text files can be created using a plain text editor such as Microsoft Notepad. Plain text is portable (it can be transferred from one computing environment to another) because it can be read by almost any software application. However, when imported into an application such as a word processor it normally appears unformatted.

The text characters in a plain text file are often taken from the ASCII character set.

ASCII is an acronym for the American Standard Code for Information Interchange, which was developed by the American National Standards Institute. It represents each character as a binary number (a string of seven digits, each of which is either a 0 or a 1). For example, the ASCII code for the ‘percent’ symbol (%) is 0100101. ASCII defines 128 characters in this way.
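
The seven-digit codes described above can be reproduced with a short Python sketch. The function name ascii_bits is an assumption made for this illustration and is not part of any standard.

```python
def ascii_bits(char):
    """Return the 7-digit binary ASCII code for a single character."""
    code = ord(char)              # numeric code of the character
    if code > 127:                # ASCII defines only codes 0 to 127
        raise ValueError(f"{char!r} is not an ASCII character")
    return format(code, "07b")    # pad with zeros to seven binary digits


print(ascii_bits("%"))  # the percent symbol from the text: 0100101
```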

Because the ASCII character set has been widely used, plain text files are sometimes referred to as ‘ASCII files’.

There are several character sets which offer a larger number of characters than the 128 provided by ASCII. In particular there is a newer standard, called Unicode, which retains the ASCII characters as its first 128 codes and adds many thousands of extra characters to represent non-English letters, accented characters, mathematical symbols and the like. Unicode is used by Microsoft Windows NT, 2000 and XP, and in future plain text files may increasingly employ Unicode rather than ASCII.
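
The relationship between the two character sets can be illustrated with a minimal Python sketch. The sample word and the choice of UTF-8 (one common way of storing Unicode text, not mentioned above) are assumptions made for this example.

```python
text = "caf\u00e9"  # 'cafe' with an accented final letter (U+00E9)

# ASCII cannot represent the accented letter ...
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# ... but a Unicode encoding such as UTF-8 can.
utf8_bytes = text.encode("utf-8")

# Characters in the ASCII range are stored identically either way.
same_percent = "%".encode("ascii") == "%".encode("utf-8")

print(ascii_ok)         # False
print(len(utf8_bytes))  # 5: the accented letter occupies two bytes
print(same_percent)     # True
```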

What is SGML?

Standard Generalized Markup Language or SGML was recognised as an international standard (ISO 8879) in 1986. Strictly speaking (and despite its name), SGML is not a markup language. It is a metalanguage which can be used to define specific markup languages for encoding particular types of electronic text. It provides non-proprietary encoding which is independent of any individual software or hardware platform.

SGML has been used extensively in the publishing industry and has also been adopted by many business and government agencies for the control of their documentation. It has been used in the academic community as the basis for the Text Encoding Initiative (TEI) and in the archival community as the basis for the initial development of Encoded Archival Description (EAD).

There are many web sites and print publications giving further information about SGML. Some of these are listed on the references page on this web-site.

What is HTML?

HyperText Markup Language or HTML is the language chiefly used for the production and publication of web pages. Technically it is an application of SGML (that is, a markup language defined using SGML). It is simple to learn and use but is less rigorous than SGML; its focus is on the visual presentation of documents rather than their logical structure. Its lack of extensibility has limited its use for purposes other than web authoring.

What is XML?

Extensible Markup Language or XML provides a set of rules for non-proprietary encoding of documents and data. Because it is platform independent, XML encoding can be read and processed by many types of software.

In the 1990s it was recognised that Standard Generalized Markup Language (SGML) is often unnecessarily complex and that modifications to the original SGML standard would be beneficial to many potential users. XML was developed from 1996 onwards as a restricted subset of SGML. Originally XML was perceived as a future successor to HyperText Markup Language (HTML) for publishing web pages but it is now recognised as a highly functional, powerful and effective means of structuring, storing and sharing almost any kind of information in textual form. Like SGML, XML supports the encoding of textual content with reference to its meaning: particular components of text can be identified as headings, titles, names, dates, prices, quantities and so on.

Control of the visual appearance of text is secondary to the capture of meaning. Using XML it is possible, though by no means essential, to apply rules for visual presentation that are based on the meaning of particular elements of encoded text (for example: if it is a heading then display it in bold type). It is also possible to have different ways of displaying the same text (for example: display headings in bold for user group A and in italics for user group B).
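
The idea of presentation rules that depend on meaning can be sketched in Python as follows. The element names, the two rule sets and the use of HTML-style bold and italic tags for output are all assumptions made for illustration; in practice such rules would more usually be written in a stylesheet language.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the element names are illustrative only.
SOURCE = "<memo><heading>Message</heading><body>please call John Smith</body></memo>"

# One rule set per user group: element name -> (prefix, suffix).
STYLES = {
    "A": {"heading": ("<b>", "</b>")},  # group A sees headings in bold
    "B": {"heading": ("<i>", "</i>")},  # group B sees headings in italics
}


def render(xml_text, group):
    """Render each child of the root element using the group's rules."""
    rules = STYLES[group]
    lines = []
    for element in ET.fromstring(xml_text):
        prefix, suffix = rules.get(element.tag, ("", ""))  # no rule: plain text
        lines.append(f"{prefix}{element.text}{suffix}")
    return lines


print(render(SOURCE, "A"))  # ['<b>Message</b>', 'please call John Smith']
print(render(SOURCE, "B"))  # ['<i>Message</i>', 'please call John Smith']
```

The encoded text is the same in both cases; only the rule set applied to it differs.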

Because of its focus on encoding the meaning of textual content, XML is suitable for use with information in the form of data as well as documents. Traditionally, a distinction has been drawn between documents, which have been created and manipulated using word-processors and similar applications, and data which have generally been maintained in database environments. The recognition that documents contain elements with specific meanings has led to an understanding that the distinction between documents and data is not absolute. Data elements such as names and addresses may be used in a documentary context or in a database, but their intrinsic meaning is no different wherever they appear. XML is helping to bridge the gap between the management of data and documents in the digital world.

XML documents and data are stored in plain text format and can be displayed by any software which can handle plain text. When displayed on screen, or printed to paper, XML encoding can be read by anyone who knows the Latin alphabet. The structure of a document encoded in XML can also be understood without undue difficulty by a human reader. However XML documents can also be read and processed by a wide variety of software applications, which can enable their contents to be searched, analysed, manipulated, transmitted or re-presented as required.

XML is maintained by the World Wide Web Consortium, a 500-member organisation founded in 1994 to develop protocols which promote the evolution of the World Wide Web. However its use is not restricted to the World Wide Web. XML is increasingly used by both public and private sector organisations for the management of their internal information as well as for external publishing and electronic commerce. It has been adopted by many vendors of business software as a tool for the manipulation, transmission and exchange of commercial data. In the scholarly world XML is widely used for the dissemination of electronic resources, particularly in the humanities. In recent years it has formed the basis of much digital library development. It also offers considerable potential for the preservation of digital archives and for electronic records management.

XML is now the basis for Encoded Archival Description (EAD), which was originally based on SGML. An XML compliant version of the Text Encoding Initiative (TEI) has also been developed. The tools that are being produced by the LEADERS project will use XML.

There are many web sites and print publications giving further information about XML. Some of these are listed on the references page on this web-site.

How are XML documents created?

XML can be used when documents are created or can be applied retrospectively to existing documents, although retrospective conversion may not be as easy as encoding at the time of creation.

A plain text editor such as Microsoft Notepad can be used to create XML documents, but it is generally easier to use a dedicated XML authoring tool. Work on the LEADERS project has used SoftQuad XMetaL 2, but many other XML authoring tools are available.

In future, XML capability may increasingly be found within proprietary word-processing software; but at present the use of conventional word-processors to create XML documents cannot be recommended because of the risk that proprietary encoding will be added to the XML markup.

In order to create XML documents it is necessary to have some knowledge of XML's rules for document structure. A basic understanding of document type definitions is also likely to be required. When an XML authoring tool is used, the application of XML is greatly simplified and documents can be created by someone with only a very limited knowledge of how XML works (although a fuller understanding of its rules may still be beneficial). If a plain text editor is used, detailed knowledge of XML syntax is essential.

What is the structure of an XML document?

Briefly, an XML document is likely to contain a number of different elements. To take a very simple example, the following document

Message
Mary: please call John Smith at the office in London.

can be seen as comprising three elements: a heading ('Message'), an addressee ('Mary') and the body of the text ('please call...').

In XML, each element is composed of tags and content. The tags are surrounded by angle brackets < > and give the name of the element. For example:
<body>please call John Smith at the office in London</body>

Each element has two tags. The first is known as the start tag. The second (the end tag) is distinguished by the presence of an oblique stroke, as in </body>. The content is placed between the start tag and the end tag.

When required, one element can be nested inside another. For example:
<body>please call <person>John Smith</person> at the office in London</body>
<body>please call <person>John Smith</person> at the office in <place>London</place></body>
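
To show how software might read nested elements of this kind, here is a minimal Python sketch using the standard library's ElementTree parser. The choice of parser is an assumption made for illustration; many other XML tools would serve equally well.

```python
import xml.etree.ElementTree as ET

body = ET.fromstring(
    "<body>please call <person>John Smith</person> at the office in "
    "<place>London</place></body>"
)

# Child elements nested directly inside <body>, in document order.
nested = [(child.tag, child.text) for child in body]
print(nested)  # [('person', 'John Smith'), ('place', 'London')]

# The full text of <body>, with the tags stripped out.
plain = "".join(body.itertext())
print(plain)  # please call John Smith at the office in London
```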

Attributes can be used to modify, or provide additional information about, an element. Typically they provide an answer to the question: what sort of <element_name> is it? Attributes are placed inside the start tag. For example:
<place type="city">London</place>
<place type="country">Japan</place>

In these examples, XML uses an attribute called type to indicate what sort of place London is and what sort of place Japan is.
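
A program can read an attribute in much the same way as it reads element content. The following Python sketch is illustrative only; the helper name describe is hypothetical.

```python
import xml.etree.ElementTree as ET


def describe(source):
    """Return the content of a <place> element and the value of its type attribute."""
    place = ET.fromstring(source)
    # .get() returns the attribute value, or None if the attribute is absent.
    return place.text, place.get("type")


print(describe('<place type="city">London</place>'))    # ('London', 'city')
print(describe('<place type="country">Japan</place>'))  # ('Japan', 'country')
```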

Putting all this together, the XML document might look like this:
<body>please call <person>John Smith</person> at the office in <place type="city">London</place></body>

Tags are also needed to indicate the start and the end of the document as a whole:
<memo>
<body>please call <person>John Smith</person> at the office in <place type="city">London</place></body>
</memo>

In this example the whole document is enclosed within the pair of tags <memo> ... </memo>.

XML also requires a declaration at the head of each document, indicating which version of XML is used (in fact, as yet there is only one version) and often providing other information about the document encoding:
<?xml version="1.0"?>
<memo>
<body>please call <person>John Smith</person> at the office in <place type="city">London</place></body>
</memo>

With the addition of the declaration <?xml version="1.0"?> the XML document in this example is now complete.
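
As a rough check that a document of this shape can be read by software, the following Python sketch parses a complete memo, including the declaration and the enclosing <memo> tags, and looks up two of its parts. The use of Python's ElementTree parser is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<?xml version="1.0"?>
<memo>
<body>please call <person>John Smith</person> at the office in <place type="city">London</place></body>
</memo>"""

root = ET.fromstring(DOCUMENT)

print(root.tag)                      # memo: the element enclosing the whole document
print(root.find("body/place").text)  # London
```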

In practice, most XML documents are likely to be longer and to have a more complex structure than this simple memorandum. However, the same basic rules apply to all XML documents.

The structure of a completed XML document looks very similar to the structure of a document that has been marked up in HyperText Markup Language (HTML) for publication as a web page. However, there are some key differences between XML and HTML. In particular:

  • HTML tags are frequently used to indicate how the text should look when viewed in a web browser, but XML tags normally indicate the meaning of the text.
  • HTML specifies what tags and attributes are available and what each of them should be used for, but XML tags and attributes need not be pre-defined. XML allows creators to define their own tags and attributes.

XML tags and attributes are often defined using a Document Type Definition or DTD.

The way in which a completed XML document appears on screen, and the way in which the tags are interpreted by a software application, will depend on the application that is used:

  • In a plain text editor such as Microsoft Notepad, the tags are visible and are not interpreted by the software in any way.
  • An application with full XML functionality can interpret the tags. The tags may be visible or invisible, depending on the application.

More detailed information on XML document structures can be found in R. Eckstein, XML Pocket Reference (O'Reilly & Associates Inc, Sebastopol, California, 1999), and in other print publications and web-sites listed on the references page.

What are Document Type Definitions?

A Document Type Definition or DTD provides one means of defining the elements and attributes that are used in the structure of an XML document.

XML does not specify what elements and attributes are available for use. Instead, it allows creators to define their own elements and attributes. In this way, creators can develop their own encoding schemes that are tailored for particular circumstances. If desired, it would be possible to create an encoding scheme that is unique to a particular document.

More commonly, however, encoding schemes are created for particular types of document. For example, in an accounting environment there would be little point in creating a separate encoding scheme for each invoice that is issued. Instead, a single encoding scheme could be created for all invoices.

An encoding scheme for a particular type of document is usually set out in a Document Type Definition (DTD). Among other things, a DTD defines:

  • which elements may appear in a document
  • which elements are compulsory and which are optional
  • which elements may appear more than once in the same document
  • which attributes may be used with each element
  • which attributes are compulsory and which are optional
  • where in the document each element may appear.

A DTD can also define the rules for nesting one element inside another. For example, it may be appropriate to specify that <chapter>s are always found inside <book>s, and not the other way round.
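
In a DTD, a nesting rule of this kind would typically be written as an element declaration such as <!ELEMENT book (chapter+)>. The Python sketch below is not a DTD processor; it is a hand-written check of the same rule, using a hypothetical rule table, intended only to show what the rule means in practice.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: child element -> the parent it must appear inside.
REQUIRED_PARENT = {"chapter": "book"}


def nesting_errors(xml_text):
    """List elements that break the parent rules in REQUIRED_PARENT."""
    root = ET.fromstring(xml_text)
    errors = []
    # The root element has no parent, so it is checked separately.
    if root.tag in REQUIRED_PARENT:
        errors.append(f"<{root.tag}> must be inside <{REQUIRED_PARENT[root.tag]}>")
    for parent in root.iter():
        for child in parent:
            wanted = REQUIRED_PARENT.get(child.tag)
            if wanted is not None and parent.tag != wanted:
                errors.append(f"<{child.tag}> found inside <{parent.tag}>")
    return errors


print(nesting_errors("<book><chapter>One</chapter></book>"))  # []
print(nesting_errors("<chapter><book>One</book></chapter>"))
# ['<chapter> must be inside <book>']
```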

Each type of document can have its own DTD. Alternatively, a single DTD can be created for a number of related document types. For example, in an accounting system:

  • one DTD could be created for invoices and another for receipts
  • a single DTD could encompass both invoices and receipts.

The advantages of DTDs are that they:

  • provide tighter control over XML documents and help to eliminate errors
  • allow creators to encode documents to an agreed standard, rather than simply using a personal markup system
  • ensure consistency in markup between one document and another document of a similar type.

The use of DTDs also allows encoded documents to be validated, by using a program called a parser. The parser reads both the encoded document and the DTD and checks that the structure of the document conforms fully to the encoding scheme set out in the DTD.
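
The kind of checking that a validating parser performs can be sketched in Python as follows. This is a simplified stand-in, not real DTD validation: the SCHEME table is a hypothetical encoding scheme for the memo example, written as a Python dictionary rather than as a DTD, and it ignores element order and repetition. Python's built-in ElementTree parser does not itself validate against DTDs, so a separate validating parser would be needed for that.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoding scheme: for each element, the child elements it may
# contain, the children that are compulsory, and the attributes it may carry.
SCHEME = {
    "memo":   {"children": {"body"}, "required": {"body"}, "attributes": set()},
    "body":   {"children": {"person", "place"}, "required": set(), "attributes": set()},
    "person": {"children": set(), "required": set(), "attributes": set()},
    "place":  {"children": set(), "required": set(), "attributes": {"type"}},
}


def validate(xml_text):
    """Return a list of rule violations; an empty list means the document conforms."""
    problems = []
    for element in ET.fromstring(xml_text).iter():
        rules = SCHEME.get(element.tag)
        if rules is None:
            problems.append(f"unknown element <{element.tag}>")
            continue
        child_tags = {child.tag for child in element}
        for tag in sorted(child_tags - rules["children"]):
            problems.append(f"<{tag}> not allowed inside <{element.tag}>")
        for tag in sorted(rules["required"] - child_tags):
            problems.append(f"<{element.tag}> is missing <{tag}>")
        for name in sorted(set(element.attrib) - rules["attributes"]):
            problems.append(f"attribute '{name}' not allowed on <{element.tag}>")
    return problems


print(validate('<memo><body>call <place type="city">London</place></body></memo>'))  # []
print(validate('<memo><place size="big">London</place></memo>'))
```

The second call reports three problems: a misplaced <place>, a missing <body> and an attribute that the scheme does not allow.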

DTDs are often used by specialist communities who share common interests beyond the boundaries of particular organisations or localities. One example of this is the use of Encoded Archival Description as a DTD for the compilation of archival finding-aids. Individual creators of archival finding-aids do not have to create their own DTDs because they can use one that already exists and is recognised by other members of the archival community. The use of a common DTD by a group of like-minded people is optional but widely accepted because it provides a basis for the sharing of information.

Within the structure of an XML document, the use of a DTD is declared using a document type declaration. In the case of a DTD which is external to the document, the document type declaration specifies the location at which the DTD can be found.

Thus an archival finding-aid using Encoded Archival Description (EAD) will contain a declaration such as:
<!DOCTYPE ead SYSTEM "www.loc.gov/ead/ead.dtd">
which indicates that the EAD DTD is being used.
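
Software can read the information carried by such a declaration. The Python sketch below uses the standard library's minidom module to do so; the empty <ead/> element is a placeholder added for illustration, and in this sketch the parser only records the declaration without retrieving the DTD or validating against it.

```python
from xml.dom import minidom

# The declaration from the text, followed by a placeholder <ead/> element
# (a real finding-aid would have content here).
SOURCE = '<!DOCTYPE ead SYSTEM "www.loc.gov/ead/ead.dtd"><ead/>'

document = minidom.parseString(SOURCE)
doctype = document.doctype

print(doctype.name)      # ead: the document element that the DTD describes
print(doctype.systemId)  # www.loc.gov/ead/ead.dtd: the stated location of the DTD
```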

In this way, each document carries with it a description of its own logical format.

As an alternative to the use of a DTD, elements and attributes can be defined using an XML Schema. Schemas are potentially more powerful, but as yet they are less widely used than DTDs. The LEADERS project employs DTDs but does not make use of schemas at present.

What other techniques and technologies are related to XML?

XML supports the use of a number of related technologies. Those most relevant to LEADERS are:

  • XSL (Extensible Stylesheet Language) and XSLT (XSL Transformations), used to transform XML documents into different output formats
  • XLink (XML Linking Language) and XPointer (XML Pointer Language), used to create links between XML documents.

Further information on these and other techniques and technologies in the 'XML family' can be found in R. Eckstein, XML Pocket Reference (O'Reilly & Associates Inc, Sebastopol, California, 1999), and in other print publications and web-sites listed on the references page.

What are archives?

The word archives is used by different groups of people to mean different things. To information technologists it usually means computer files that have been removed to offline storage. To members of the digital library community it means a collection of back issues of a published journal or other electronic resource. To office managers and administrators it often means old files or papers which are rarely used for current business, or the storage area where such papers are kept. To historians, archives are seen as unpublished documents preserved for research purposes, or an institution where such documents are held.

The LEADERS project uses the word archives in a sense which is more precise than any of these. It follows the terminology and assumptions of the archives and records community in seeing archives as a sub-set of records. Records are created by organisational employees in the course of their business activities, and by individuals in the course of their personal and professional lives, and form evidence of the activities which gave rise to them. They may be created in any medium (including electronic media as well as paper). Archives are those records which have been formally identified as having long-term value to their creator, to an organisation or to the wider society.

Many archives are preserved in publicly-funded archival institutions at national, regional or local level, in universities or in the archive departments of libraries or museums. Others are maintained by the organisations or families which were responsible for their creation. In most countries, archives of public-sector bodies are available for use by members of the public (though exceptions may be made for archives created in the recent past, or those which are exceptionally sensitive or confidential). Archival institutions such as the UK Public Record Office or the National Archives of the United States receive many thousands of enquiries each year. Archives of business and other private-sector organisations, and archives of individuals and families, are not always open to the public, although many of them are publicly accessible.

Records and archives can be used for a number of purposes:

  • They are used for business purposes when they are used to support administration, public service, professional activity, marketing, trade or other economic activity; or to support dealings between individuals and organisations or other individuals in the course of personal or professional life.
  • They are used to support accountability when there is a need to discover whether organisations or individuals have complied with legal or regulatory requirements or recognised best practice.
  • They are used for cultural purposes when they are used as a means of gaining or augmenting an understanding of an organisation, family or individual, or of aspects of society or the wider world. Academic research is one example of a cultural purpose.

In the case of older archives (particularly those created by previous generations) the most prominent of these is cultural use. Archives provide a means of understanding human history, a basis for corporate and public memory and a source of community and personal identity. They also offer pathways to learning and social inclusion for people of all ages and backgrounds (UK National Council on Archives, Changing the Future of our Past, 2002).

Users turn to archives for three broad kinds of value:

  • They may be used because they form evidence of the activity in which they were created. They are used in this way when proof is required that a particular activity took place or that it took place in a particular manner.
  • They may be used because they are sources of information. They are used in this way when the user seeks facts or knowledge about the operations or working methods of their creator or about other subjects, persons or places.
  • They may be used because they are physical artefacts or objects. They are used in this way when the user is interested in their aesthetic qualities, their associations, their tangibility, or their physical form.

The LEADERS project seeks to recognise these different values of archives and to provide tools which respect the needs of different users.

What are archival finding-aids? Why are they needed?

An archival finding-aid is a tool for users or potential users of archives, and also for their owners or custodians. An effective finding-aid is one which provides:

  • a summary description of particular archival materials
  • a statement of the context in which they were created (or a pointer to a statement of this kind)
  • other information to support their management, discovery, use or interpretation.

Thus archival finding-aids serve a number of purposes. They indicate what archives exist and how they were created, how archives are organised and how they can be accessed. Besides describing the nature, content and origin of archival materials, a typical finding-aid provides information about the date of their creation or accumulation, their extent, quantity or size, and the conditions under which they may be used.

Archives are rarely self-explanatory. Because they are created in the course of organisational business, or in the course of personal or professional activity, they can only be fully understood by those who have adequate knowledge of the context of their creation. The people who have created archival materials are likely to have been only dimly aware of the possibility that they might eventually be used by others, perhaps many years after their creation or for purposes unconnected with the activity which gave rise to them. In this respect archives differ from reference books or other published materials which are consciously created to disseminate information or ideas on a particular subject. To support the effective use of archives, an archival finding-aid needs to carry considerably more contextual information than (for example) a library catalogue.

However, archival finding-aids are similar to library catalogues insofar as they also provide a means of discovering or locating the resources which users need. By browsing or searching a finding-aid, users can seek to ascertain what archival materials are available and to identify particular materials that may be relevant to their needs. Finding-aids are (or should be) structured and presented in ways that assist users to do this. They may also contain elements such as index terms which are specifically intended to facilitate searching.

Finding-aids also provide information about the structure of the archives which they describe. Structural information is necessary because archives can be viewed at a number of levels. At item level, there are usually numerous individual documents (e.g. letters, memoranda, invoices). Each item has an internal structure, while related items are often aggregated in files or folders. These in turn usually aggregate to form series of archives. A series is composed of files and/or items which have something in common, usually because they were created in the course of interrelated activities. When the various items, files and series created or accumulated by a single organisation, family or individual are viewed as a whole, the resulting aggregation of archives is known to archivists as a fonds.

Thus archival materials, and the finding-aids which describe and represent them, have a hierarchical structure. While other models are possible (see for example A. Cunningham, Dynamic descriptions: Australian strategies for the intellectual control of records and recordkeeping systems, 1998), the hierarchy is usually seen as a kind of pyramid. A single fonds is seen as containing many series, while each series may contain many files or items. A hierarchical model of this kind is used for the majority of archival finding-aids and is enshrined in ISAD(G), the General International Standard [for] Archival Description published by the International Council on Archives.

Because XML is designed to support hierarchically structured information, it can easily represent the kind of hierarchical structure employed in most archival finding-aids. When finding-aids are encoded using XML, they normally make use of Encoded Archival Description (EAD) to provide a common standard. The finding-aids used in the LEADERS model system are encoded in EAD, and the LEADERS toolset will be designed to work with any EAD-encoded finding-aid.
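
The fit between XML nesting and the archival hierarchy can be illustrated with a minimal Python sketch. The element names and titles below are invented for readability and are not the EAD tag set; the function name outline is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names and titles, NOT taken from EAD.
FINDING_AID = """
<fonds title="Records of an example company">
  <series title="Correspondence">
    <file title="Letters received, 1901">
      <item title="Letter from J. Smith"/>
    </file>
  </series>
  <series title="Accounts">
    <item title="Ledger, 1900-1905"/>
  </series>
</fonds>
"""


def outline(element, depth=0):
    """Return one indented line per level of description, from the top down."""
    lines = ["  " * depth + f"{element.tag}: {element.get('title')}"]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


lines = outline(ET.fromstring(FINDING_AID))
print("\n".join(lines))
```

Each level of nesting in the XML corresponds to one level of the hierarchy, so the printed outline is indented one step further for each of fonds, series, file and item.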

What is Encoded Archival Description (EAD)?

EAD is a standard for the encoding of archival finding-aids. It was originally devised between 1993 and 1995 at the University of California at Berkeley, but is now maintained by the US Library of Congress in conjunction with the Society of American Archivists. EAD was originally based on the Standard Generalized Markup Language (SGML), but with the subsequent development of Extensible Markup Language (XML) it has been made fully XML compliant. It is widely used both in the United States and in the United Kingdom.

The EAD standard is represented as an XML document type definition (or DTD). It can be obtained free of charge from the EAD official web site, which also provides background information on EAD, an overview of its structure and guidelines for its implementation. Further web-sites and print publications giving information about EAD are listed on the LEADERS references page.

What is the Text Encoding Initiative (TEI)?

The TEI is a set of standards for the encoding of electronic texts. It was launched in 1987 with the support of the Association for Computers and the Humanities, the Association for Literary and Linguistic Computing and the Association for Computational Linguistics. A new non-profit TEI Consortium was established in December 2000 to maintain and develop the TEI.

The TEI has become the de facto international standard for scholarly work with electronic texts. Its use enables electronic documents to be searched, sorted and presented to users in a variety of formats, and allows elements to be extracted from them for analysis. The TEI can handle many types of texts, but the focus is on primary sources of interest to scholars working in the humanities and social sciences. It can also be used for the textual material which must surround digital images in order to make them searchable.

The TEI guidelines were originally based on the Standard Generalized Markup Language (SGML). A new, updated version of the guidelines published in 2002 is fully compliant with Extensible Markup Language (XML).

The TEI guidelines are expressed as a modular document type definition (or DTD) which covers the encoding of most kinds of electronic texts. The modules making up the TEI DTD can be configured as either an SGML or an XML DTD. They can be obtained free of charge from the TEI web-site, which also provides background information on the TEI and links to projects worldwide which have used the TEI to provide resources for literary, linguistic or historical studies. Further web-sites and print publications giving information about the TEI are listed on the LEADERS references page.

What is Encoded Archival Context (EAC)?

Encoded Archival Context (EAC) is a developing standard for structuring and exchanging information about creators of archival materials. In 2001 a group of archivists met in Toronto, Canada to develop a model for holding this information and to develop a strategy to test it. This model was called Encoded Archival Context (EAC) to emphasize its relationship with Encoded Archival Description (EAD). There is currently an early draft version of an EAC DTD available, which is being used by LEADERS. The final published version may take the form of a DTD or a Schema, but no time frame has been set for its release.

The records that can be created using EAC work in a similar way to library authority files but with an emphasis on the provision of biographical or administrative histories for the person or organisation concerned.

Author: Geoffrey Yeo, July 2002
This page was last modified on 24 June 2003 by Anna Sexton
Copyright © UCL 2002