What is encoding?
In the context of an electronic document, the word encoding
refers to the addition of markup to the document in order to store
and transmit information about its structure, content or appearance.
The term markup originated within the printing and publishing trades,
where it referred to instructions added to copies of texts on paper
in order to indicate the layout and typography required for a printed
text. More recently, and especially when dealing with electronic
documents, a distinction has been made between presentational
markup of texts, which is concerned with their visual appearance,
and descriptive or logical markup, which is concerned with
their logical structure. Encoding a document with presentational
markup is undertaken in order to tell a computer what the document
should look like (for example, what font should be used in a particular
part of the document, or where the margins should be set). Encoding
a document with descriptive or logical markup is intended in the
first instance to identify the meaning of particular parts of the
document (for example, to indicate which words constitute a heading
or title).
What is the difference between proprietary and non-proprietary encoding?
Proprietary software products, such as mainstream word-processing
applications, use encoding to allow the
creators of documents to give instructions on the formatting of
each part of the document. Typically such instructions are issued
by clicking a button or selecting from a drop-down menu. When such
instructions have been given, the visual presentation of the document
can be observed on screen but the underlying encoding is normally
invisible to the user, although many products allow users the option
of viewing at least some of the codes if they wish to do so. Each
supplier of proprietary software uses their own set of codes, and
in many cases these are unique to a particular software application.
When documents are created using proprietary software, proprietary
codes are embedded in the documents and it is almost certain that
some will be embedded at a deeper level than users can view.
Proprietary encoding causes no difficulties while documents are
maintained where they can be viewed or edited using the software
that was employed to create them, but if it becomes necessary to
transfer them to a different computing environment there is no guarantee
that the encoding will be recognised in the new environment.
Dependence on proprietary encoding gives rise to significant problems
when documents need to be shared between users working in different
computing environments, or when documents need to be preserved beyond
the life of the software product used to create them. Only too often,
attempts to use a document with different software will fail, or
the user will be presented with a defective rendering of the document
that fails to capture the full characteristics of the original.
One possible solution is to use plain text formats.
Conversion to plain text strips out the proprietary codes, but as
a result document formatting will be rudimentary or non-existent
unless non-proprietary codes are substituted.
Non-proprietary encoding allows plain text documents to
be encoded in a way that is platform-independent and can thus be
recognised across a range of computing environments. The best known
system of non-proprietary encoding is HyperText Markup Language
or HTML, which is used for publishing text
on the World Wide Web. Documents encoded using HTML can be viewed
in almost any web browser, regardless of the computing environment
in which the user is working.
All the encoding mechanisms discussed so far are primarily concerned
with presentational markup. HTML, like the encoding mechanisms
in proprietary word-processors, is not concerned with the logical
structure of documents. In applying such encoding, the only consideration
is how the document should appear when viewed on a computer screen
or on a printed page.
However, non-proprietary encoding can also be used for descriptive
markup based on the meaning of the document content. Standard
Generalized Markup Language or SGML was
developed in the 1980s as a software and hardware independent metalanguage
to support encoding of the meaning of document components as well
as their appearance. It was adopted by the International Organization
for Standardization as ISO 8879. Extensible Markup Language or XML
was developed in the late 1990s as a simplified subset of SGML and
is increasingly widely used.
These markup languages can be employed at the time of document
creation or can be applied retrospectively to existing documents.
What are plain text formats?
A plain text file is one which has no formatting commands or invisible
control characters (other than basic ones such as line breaks and
tabs), but only printable text characters such
as letters, numbers or symbols. Plain text files can be created
using a plain text editor such as Microsoft Notepad. Plain text
is portable (it can be transferred from one computing environment
to another) because it can be read by almost any software application.
However, when imported into an application such as a word processor
it normally appears unformatted.
The text characters in a plain text file are often taken from the
ASCII character set.
ASCII is an acronym for the American Standard Code for Information
Interchange, which was developed by the American National Standards
Institute. It represents each character as a binary number (a string
of seven digits, each of which is either a 0 or a 1). For example,
the ASCII code for the percent symbol (%) is 0100101.
ASCII defines 128 characters in this way.
Because the ASCII character set has been widely used, plain text
files are sometimes referred to as ASCII files.
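As a small illustration of the seven-digit representation described above, the following sketch prints the ASCII code of a character as a binary string. Python is used here purely for illustration, and the function name ascii_bits is a hypothetical one chosen for this example.

```python
def ascii_bits(ch: str) -> str:
    """Return the 7-bit ASCII code of a single character as a binary string."""
    code = ord(ch)
    if code > 127:
        # ASCII defines only the 128 codes from 0 to 127.
        raise ValueError(f"{ch!r} is not in the ASCII character set")
    return format(code, "07b")


print(ascii_bits("%"))  # 0100101, as in the example above
print(ascii_bits("A"))  # 1000001
```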
There are several character sets which offer a larger number of
characters than the 128 provided by ASCII. In particular there is
a newer standard, called Unicode,
which is an extension of ASCII that provides extra characters to
represent non-English letters, accented characters, mathematical
symbols and the like. Unicode is used by Microsoft Windows NT, 2000
and XP, and in future plain text files may increasingly employ Unicode
rather than ASCII.
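The relationship between ASCII and Unicode can be sketched as follows. The example assumes UTF-8, one common way of storing Unicode text as bytes, which is not discussed above; the sample word is likewise only an illustration.

```python
text = "caf\u00e9"  # the word "café", whose last letter is accented

# Characters in the ASCII range keep the same code numbers in Unicode.
print(ord("c"))  # 99 in both ASCII and Unicode

# The accented letter lies outside ASCII's 128 characters.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# In UTF-8 (an assumption of this sketch) the accented letter takes two bytes.
utf8_bytes = text.encode("utf-8")
print(ascii_ok, len(utf8_bytes))  # False 5
```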
What is SGML?
Standard Generalized Markup Language or SGML was recognised
as an international standard (ISO 8879) in 1986. Strictly speaking
(and despite its name), SGML is not a markup language. It is a metalanguage
which can be used to define specific markup languages for encoding
particular types of electronic text. It provides non-proprietary
encoding which is independent of any individual software or
hardware platform.
SGML has been used extensively in the publishing industry and has
also been adopted by many business and government agencies for the
control of their documentation. It has been used in the academic
community as the basis for the Text Encoding Initiative
(TEI) and in the archival community as the basis for the initial
development of Encoded Archival Description (EAD).
There are many web sites and print publications giving further
information about SGML. Some of these are listed on the references
page on this web-site.
What is HTML?
HyperText Markup Language or HTML is the language chiefly used
for the production and publication of web pages. Technically it
is an application of SGML, that is, a markup language defined using
SGML. It is simple to learn and use but is less rigorous
than SGML; its focus is on the visual presentation of documents
rather than their logical structure. Its lack of extensibility has
limited its use for purposes other than web authoring.
What is XML?
Extensible Markup Language or XML provides a set of rules
for non-proprietary encoding of documents and
data. Because it is platform independent, XML encoding can be read
and processed by many types of software.
In the 1990s it was recognised that Standard Generalized Markup
Language (SGML) is often unnecessarily complex
and that modifications to the original SGML standard would be beneficial
to many potential users. XML was developed from 1996 onwards as
a restricted subset of SGML. Originally XML was perceived as a future
successor to HyperText Markup Language (HTML)
for publishing web pages but it is now recognised as a highly functional,
powerful and effective means of structuring, storing and sharing
almost any kind of information in textual form. Like SGML, XML supports
the encoding of textual content with reference to its meaning: particular
components of text can be identified as headings, titles, names,
dates, prices, quantities and so on.
Control of the visual appearance of text is secondary to the capture
of meaning. Using XML it is possible, though by no means essential,
to apply rules for visual presentation that are based on the meaning
of particular elements of encoded text (for example: if it is a
heading then display it in bold type). It is also possible to have
different ways of displaying the same text (for example: display
headings in bold for user group A and in italics for user group
B).
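The idea of presentation rules that depend on the meaning of an element, and that differ between user groups, might be sketched like this. The STYLES table, the render function and the use of bold and italic HTML-style markers are all assumptions made for the example; in practice such rules would more usually be expressed in a style sheet.

```python
import xml.etree.ElementTree as ET

# Hypothetical presentation rules: for each user group, the markers to
# place around the content of a given element.
STYLES = {
    "A": {"heading": ("<b>", "</b>")},  # group A sees headings in bold
    "B": {"heading": ("<i>", "</i>")},  # group B sees headings in italics
}


def render(xml_text: str, group: str) -> str:
    """Render one element according to the rules for the given user group."""
    element = ET.fromstring(xml_text)
    prefix, suffix = STYLES[group].get(element.tag, ("", ""))
    return f"{prefix}{element.text or ''}{suffix}"


print(render("<heading>Message</heading>", "A"))  # <b>Message</b>
print(render("<heading>Message</heading>", "B"))  # <i>Message</i>
```

The encoded text is the same in both cases; only the rules applied to it differ.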
Because of its focus on encoding the meaning of textual content,
XML is suitable for use with information in the form of data as
well as documents. Traditionally, a distinction has been drawn between
documents, which have been created and manipulated using word-processors
and similar applications, and data which have generally been maintained
in database environments. The recognition that documents contain
elements with specific meanings has led to an understanding that
the distinction between documents and data is not absolute. Data
elements such as names and addresses may be used in a documentary
context or in a database, but their intrinsic meaning is no different
wherever they appear. XML is helping to bridge the gap between the
management of data and documents in the digital world.
XML documents and data are stored in plain text
format and can be displayed by any software which can handle
plain text. When displayed on screen, or printed to paper, XML encoding
can be read by anyone who knows the Latin alphabet. The structure
of a document encoded in XML can also be understood without
undue difficulty by a human reader. However, XML documents can also
be read and processed by a wide variety of software applications,
which can enable their contents to be searched, analysed, manipulated,
transmitted or re-presented as required.
XML is maintained by the World Wide
Web Consortium, a 500-member organisation founded in 1994 to
develop protocols which promote the evolution of the World Wide
Web. However, its use is not restricted to the World Wide Web. XML
is increasingly used by both public and private sector organisations
for the management of their internal information as well as for
external publishing and electronic commerce. It has been adopted
by many vendors of business software as a tool for the manipulation,
transmission and exchange of commercial data. In the scholarly world
XML is widely used for the dissemination of electronic resources,
particularly in the humanities. In recent years it has formed the
basis of much digital library development. It also offers considerable
potential for the preservation of digital archives and for electronic
records management.
XML is now the basis for Encoded Archival Description
(EAD), which was originally based on SGML. An XML compliant
version of the Text Encoding Initiative (TEI)
has also been developed. The tools that are being produced by the
LEADERS project will use XML.
There are many web sites and print publications giving further
information about XML. Some of these are listed on the references
page on this web-site.
How are XML documents created?
XML can be used when documents are created or
can be applied retrospectively to existing documents, although retrospective
conversion may not be as easy as encoding at the time of creation.
A plain text editor such as Microsoft Notepad can be used to create
XML documents, but it is generally easier to use a dedicated XML
authoring tool. Work on the LEADERS project has used SoftQuad XMetaL
2, but many other XML authoring tools are available.
In future, XML capability may increasingly be found within proprietary
word-processing software; but at present the use of conventional
word-processors to create XML documents cannot be recommended because
of the risk that proprietary encoding will be
added to the XML markup.
In order to create XML documents it is necessary to have some knowledge
of XML's rules for document structure.
A basic understanding of document type definitions
is also likely to be required. When an XML authoring tool is used,
the application of XML is greatly simplified and documents can be
created by someone with only a very limited knowledge of how XML
works (although a fuller understanding of its rules may still be
beneficial). If a plain text editor is used, detailed knowledge
of XML syntax is essential.
What is the structure of an XML document?
Briefly, an XML document is likely to contain
a number of different elements. To take a very simple example, the
following document
Message
Mary: please call John Smith at the office in London.
can be seen as comprising three elements: a heading ('Message'),
an addressee ('Mary') and the body of the text ('please call...').
In XML, each element is composed of tags and content. The tags
are surrounded by angle brackets < > and give the name of
the element. For example:
<heading>Message</heading>
<addressee>Mary:</addressee>
<body>please call John Smith at the office in London</body>
Each element has two tags. The first is known as the start tag.
The second (the end tag) is distinguished by the presence
of an oblique stroke, as in </body>. The content
is placed between the start tag and the end tag.
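The rule that every start tag needs a matching end tag can be checked by any XML parser. The following minimal sketch uses the parser in Python's standard library; the function name is_well_formed is a hypothetical one chosen for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<body>please call John Smith</body>"))     # True
print(is_well_formed("<body>please call John Smith"))            # False: end tag missing
print(is_well_formed("<body>please call John Smith</heading>"))  # False: tags do not match
```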
When required, one element can be nested inside another. For example:
<body>please call <person>John Smith</person>
at the office in London</body>
Or:
<body>please call <person>John Smith</person>
at the office in <place>London</place></body>
Attributes can be used to modify, or provide additional
information about, an element. Typically they provide an answer
to the question: what sort of <element_name> is it?
Attributes are placed inside the start tag. For example:
<place type="city">London</place>
Or:
<place type="country">Japan</place>
In these examples, XML uses an attribute called type to
indicate what sort of place London is and what sort of place Japan
is.
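Once elements are nested and given attributes, software can pick out individual components by name. This sketch, which again assumes Python's standard library parser, reads the nested elements and the type attribute from the example above.

```python
import xml.etree.ElementTree as ET

body = ET.fromstring(
    "<body>please call <person>John Smith</person> "
    'at the office in <place type="city">London</place></body>'
)

# find() returns the first nested element with the given name.
person = body.find("person")
place = body.find("place")

print(person.text)         # John Smith
print(place.text)          # London
print(place.get("type"))   # city
print(person.get("type"))  # None: this element has no type attribute
```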
Putting all this together, the XML document might look like this:
<heading>Message</heading>
<addressee>Mary:</addressee>
<body>please call <person>John Smith</person>
at the office in <place type="city">London</place></body>
Tags are also needed to indicate the start and the end of the document
as a whole:
<memo>
<heading>Message</heading>
<addressee>Mary:</addressee>
<body>please call <person>John Smith</person>
at the office in <place type="city">London</place></body>
</memo>
In this example the whole document is enclosed within the pair
of tags <memo> ... </memo>.
An XML document should also begin with a declaration, indicating
which version of XML is used (in fact, as yet there is only one
version) and often providing other information about the document
encoding:
<?xml version= "1.0"?>
<memo>
<heading>Message</heading>
<addressee>Mary:</addressee>
<body>please call <person>John Smith</person>
at the office in <place type="city">London</place></body>
</memo>
With the addition of the declaration <?xml version= "1.0"?>
the XML document in this example is now complete.
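To show how software might read the completed memo, here is a minimal sketch using Python's standard library parser. The variable names are assumptions for the example, and the whitespace tidying in the last step is simply one way of joining the two lines of the body.

```python
import xml.etree.ElementTree as ET

MEMO = """<?xml version="1.0"?>
<memo>
<heading>Message</heading>
<addressee>Mary:</addressee>
<body>please call <person>John Smith</person>
at the office in <place type="city">London</place></body>
</memo>"""

root = ET.fromstring(MEMO)

# The root element encloses the whole document.
print(root.tag)  # memo

# Child elements are returned in document order.
child_names = [child.tag for child in root]
print(child_names)  # ['heading', 'addressee', 'body']

# itertext() gathers the text of an element and of everything nested in it;
# split() and join() then collapse the line break into a single space.
body_text = " ".join("".join(root.find("body").itertext()).split())
print(body_text)  # please call John Smith at the office in London
```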
In practice, most XML documents are likely to be longer and to
have a more complex structure than this simple memorandum. However,
the same basic rules apply to all XML documents.
The structure of a completed XML document looks very similar to
the structure of a document that has been marked up in HyperText
Markup Language (HTML) for publication as a
web page. However, there are some key differences between XML and
HTML. In particular:
- HTML tags are frequently used to indicate how the text should
look when viewed in a web browser, but XML tags normally indicate
the meaning of the text.
- HTML specifies what tags and attributes are available and what
each of them should be used for, but XML tags and attributes need
not be pre-defined. XML allows creators to define their own tags
and attributes.
XML tags and attributes are often defined using a Document
Type Definition or DTD.
The way in which a completed XML document appears on screen, and
the way in which the tags are interpreted by a software application,
will depend on the application that is used:
- In a plain text editor such as Microsoft Notepad, the tags are
visible and are not interpreted by the software in any way.
- An application with full XML functionality can interpret the
tags. The tags may be visible or invisible, depending on the application.
More detailed information on XML document structures can be found
in R. Eckstein, XML Pocket Reference (O'Reilly & Associates
Inc, Sebastopol, California, 1999), and in other print publications
and web-sites listed on the references
page.
What are Document Type Definitions?
A Document Type Definition or DTD provides one means of
defining the elements and attributes that are used in the structure
of an XML document.
XML does not specify what elements and attributes are available
for use. Instead, it allows creators to define their own elements
and attributes. In this way, creators can develop their own encoding
schemes that are tailored for particular circumstances. If desired,
it would be possible to create an encoding scheme that is unique
to a particular document.
More commonly, however, encoding schemes are created for particular
types of document. For example, in an accounting environment there
would be little point in creating a separate encoding scheme for
each invoice that is issued. Instead, a single encoding scheme could
be created for all invoices.
An encoding scheme for a particular type of document is usually
set out in a Document Type Definition (DTD). Among other
things, a DTD defines:
- which elements may appear in a document
- which elements are compulsory and which are optional
- which elements may appear more than once in the same document
- which attributes may be used with each element
- which attributes are compulsory and which are optional
- where in the document each element may appear.
A DTD can also define the rules for nesting one element inside
another. For example, it may be appropriate to specify that <chapter>s
are always found inside <book>s, and not the other
way round.
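As an illustration, a DTD for the memo example used earlier might look like the one embedded in the sketch below. This DTD is hypothetical and was written for this example only. The Python code uses the expat parser from the standard library to list the element and attribute declarations it finds; expat reads the declarations but does not validate the document against them.

```python
import xml.parsers.expat

# A hypothetical DTD for the memo example, placed inside the document itself.
DOCUMENT = """<!DOCTYPE memo [
  <!ELEMENT memo (heading, addressee, body)>
  <!ELEMENT heading (#PCDATA)>
  <!ELEMENT addressee (#PCDATA)>
  <!ELEMENT body (#PCDATA | person | place)*>
  <!ELEMENT person (#PCDATA)>
  <!ELEMENT place (#PCDATA)>
  <!ATTLIST place type CDATA #IMPLIED>
]>
<memo><heading>Message</heading><addressee>Mary:</addressee>
<body>please call <person>John Smith</person></body></memo>"""

declared_elements = []
declared_attributes = []

parser = xml.parsers.expat.ParserCreate()
# Called once for each <!ELEMENT ...> declaration.
parser.ElementDeclHandler = lambda name, model: declared_elements.append(name)
# Called once for each attribute declared in an <!ATTLIST ...> declaration.
parser.AttlistDeclHandler = (
    lambda elname, attname, type_, default, required:
        declared_attributes.append((elname, attname, bool(required)))
)
parser.Parse(DOCUMENT, True)

print(declared_elements)
print(declared_attributes)
```

In this DTD, #IMPLIED marks the type attribute as optional, and the content model for memo states that heading, addressee and body must each appear once, in that order.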
Each type of document can have its own DTD. Alternatively, a single
DTD can be created for a number of related document types. For example,
in an accounting system:
- one DTD could be created for invoices and another for receipts
- a single DTD could encompass both invoices and receipts.
The advantages of DTDs are that they:
- provide tighter control over XML documents and help to eliminate
errors
- allow creators to encode documents to an agreed standard, rather
than simply using a personal markup system
- ensure consistency in markup between one document and another
document of a similar type.
The use of DTDs also allows encoded documents to be validated,
by using a program called a parser. The parser reads both the encoded
document and the DTD and checks that the structure of the document
conforms fully to the encoding scheme set out in the DTD.
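The kind of check a validating parser performs can be sketched in a few lines. The code below is not a DTD parser: the ALLOWED and REQUIRED tables are hypothetical stand-ins for the rules a DTD would express, and the validate function tests only nesting and compulsory elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules of the kind a DTD expresses: which children each
# element may contain, and which of them are compulsory.
ALLOWED = {
    "memo": {"heading", "addressee", "body"},
    "heading": set(),
    "addressee": set(),
    "body": {"person", "place"},
    "person": set(),
    "place": set(),
}
REQUIRED = {"memo": {"heading", "body"}}


def validate(element):
    """Return a list of problems found in an element and its descendants."""
    if element.tag not in ALLOWED:
        return [f"unknown element <{element.tag}>"]
    problems = []
    children = [child.tag for child in element]
    for tag in children:
        if tag not in ALLOWED[element.tag]:
            problems.append(f"<{tag}> not allowed inside <{element.tag}>")
    for tag in sorted(REQUIRED.get(element.tag, set()) - set(children)):
        problems.append(f"<{element.tag}> is missing <{tag}>")
    for child in element:
        problems.extend(validate(child))
    return problems


good = ET.fromstring(
    "<memo><heading>Message</heading>"
    "<body>call <person>John Smith</person></body></memo>"
)
bad = ET.fromstring("<memo><body><memo/></body></memo>")

print(validate(good))  # []
print(validate(bad))
```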
DTDs are often used by specialist communities who share common
interests beyond the boundaries of particular organisations or localities.
One example of this is the use of Encoded Archival
Description as a DTD for the compilation of archival
finding-aids. Individual creators of archival finding-aids do
not have to create their own DTDs because they can use one that
already exists and is recognised by other members of the archival
community. The use of a common DTD by a group of like-minded people
is optional but widely accepted because it provides a basis for
the sharing of information.
Within the structure of an XML document,
the use of a DTD is declared using a document type declaration.
In the case of a DTD which is external to the document, the document
type declaration specifies the location at which the DTD can be
found.
Thus an archival finding-aid using Encoded Archival Description
(EAD) will contain a declaration such as:
<!DOCTYPE ead SYSTEM "www.loc.gov/ead/ead.dtd">
which indicates that the EAD DTD is being used.
In this way, each document carries with it a description of its
own logical format.
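Software can read the document type declaration to discover which DTD a document claims to follow. The sketch below uses the minidom module from Python's standard library, which records the declaration but does not fetch or apply the external DTD. The empty ead element is a placeholder, not a real finding-aid.

```python
from xml.dom import minidom

# The declaration from the example above, followed by a placeholder document.
FINDING_AID = '<!DOCTYPE ead SYSTEM "www.loc.gov/ead/ead.dtd"><ead></ead>'

doctype = minidom.parseString(FINDING_AID).doctype

print(doctype.name)      # ead
print(doctype.systemId)  # www.loc.gov/ead/ead.dtd
```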
As an alternative to the use of a DTD, elements and attributes
can be defined using an XML Schema. Schemas are potentially more
powerful, but as yet they are less widely used than DTDs. The LEADERS
project employs DTDs but does not make use of schemas at present.
What other techniques and technologies are related to XML?
XML supports the use of a number of related
technologies. Those most relevant to LEADERS are:
- XSL (Extensible Style Sheet Language) and XSLT (XSL Transformation),
used to transform XML documents into different output formats
- XLink (XML Linking Language) and XPointer (XML Pointer Language),
used to create links between XML documents.
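To give a flavour of what a transformation does, the following sketch converts a shortened version of the memo example into HTML-style markup. It is written in plain Python rather than XSLT, so it illustrates only the general idea of mapping one set of elements onto another; the TAG_MAP table and the to_html function are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from memo elements to HTML elements.
TAG_MAP = {"memo": "div", "heading": "h1", "addressee": "p",
           "body": "p", "person": "b", "place": "i"}


def to_html(element):
    """Copy an element tree, renaming each element according to TAG_MAP."""
    html = ET.Element(TAG_MAP.get(element.tag, "span"))
    html.text, html.tail = element.text, element.tail
    for child in element:
        html.append(to_html(child))
    return html


memo = ET.fromstring(
    "<memo><heading>Message</heading>"
    "<body>please call <person>John Smith</person></body></memo>"
)
output = ET.tostring(to_html(memo), encoding="unicode")
print(output)
# <div><h1>Message</h1><p>please call <b>John Smith</b></p></div>
```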
Further information on these and other techniques and technologies
in the 'XML family' can be found in R. Eckstein, XML Pocket Reference
(O'Reilly & Associates Inc, Sebastopol, California, 1999), and in
other print publications and web-sites listed on the references
page.
What are archives?
The word archives is used by different groups of people to mean
different things. To information technologists it usually means
computer files that have been removed to offline storage. To members
of the digital library community it means a collection of back issues
of a published journal or other electronic resource. To office managers
and administrators it often means old files or papers which are
rarely used for current business, or the storage area where such
papers are kept. To historians, archives are seen as unpublished
documents preserved for research purposes, or an institution where
such documents are held.
The LEADERS project uses the word archives in a sense which is
more precise than any of these. It follows the terminology and assumptions
of the archives and records community in seeing archives as a sub-set
of records. Records are created by organisational employees in the
course of their business activities, and by individuals in the course
of their personal and professional lives, and form evidence of the
activities which gave rise to them. They may be created in any medium
(including electronic media as well as paper). Archives are those
records which have been formally identified as having long-term
value to their creator, to an organisation or to the wider society.
Many archives are preserved in publicly-funded archival institutions
at national, regional or local level, in universities or in the
archive departments of libraries or museums. Others are maintained
by the organisations or families which were responsible for their
creation. In most countries, archives of public-sector bodies are
available for use by members of the public (though exceptions may
be made for archives created in the recent past, or those which
are exceptionally sensitive or confidential). Archival institutions
such as the UK Public Record Office or the National Archives of
the United States receive many thousands of enquiries each year.
Archives of business and other private-sector organisations, and
archives of individuals and families, are not always open to the
public, although many of them are publicly accessible.
Records and archives can be used for a number of purposes:
- They are used for business purposes when they are used to support
administration, public service, professional activity, marketing,
trade or other economic activity; or to support dealings between
individuals and organisations or other individuals in the course
of personal or professional life.
- They are used to support accountability when there is a need
to discover whether organisations or individuals have complied
with legal or regulatory requirements or recognised best practice.
- They are used for cultural purposes when they are used as a
means of gaining or augmenting an understanding of an organisation,
family or individual, or of aspects of society or the wider world.
Academic research is one example of a cultural purpose.
In the case of older archives (particularly those created by previous
generations) the most prominent of these is cultural use. Archives
provide a means of understanding human history, a basis for corporate
and public memory and a source of community and personal identity.
They also offer pathways to learning and social inclusion for people
of all ages and backgrounds (UK National Council on Archives, Changing
the Future of our Past, 2002).
Users of archives employ them for three broad values which archives
can provide:
- They may be used because they form evidence of the activity
in which they were created. They are used in this way when proof
is required that a particular activity took place or that it took
place in a particular manner.
- They may be used because they are sources of information. They
are used in this way when the user seeks facts or knowledge about
the operations or working methods of their creator or about other
subjects, persons or places.
- They may be used because they are physical artefacts or objects.
They are used in this way when the user is interested in their
aesthetic qualities, their associations, their tangibility, or
their physical form.
The LEADERS project seeks to recognise these different values of
archives and to provide tools which respect the needs of different
users.
What are archival finding-aids? Why are they needed?
An archival finding-aid is a tool for users or potential users
of archives, and also for their owners or
custodians. An effective finding-aid is one which provides:
- a summary description of particular archival materials
- a statement of the context in which they were created (or a
pointer to a statement of this kind)
- other information to support their management, discovery, use
or interpretation.
Thus archival finding-aids serve a number of purposes. They indicate
what archives exist and how they were created, how archives are
organised and how they can be accessed. Besides describing the nature,
content and origin of archival materials, a typical finding-aid
provides information about the date of their creation or accumulation,
their extent, quantity or size, and the conditions under which they
may be used.
Archives are rarely self-explanatory. Because they are created
in the course of organisational business, or in the course of personal
or professional activity, they can only be fully understood by those
who have adequate knowledge of the context of their creation. The
people who have created archival materials are likely to have been
only dimly aware of the possibility that they might eventually be
used by others, perhaps many years after their creation or for purposes
unconnected with the activity which gave rise to them. In this respect
archives differ from reference books or other published materials
which are consciously created to disseminate information or ideas
on a particular subject. To support the effective use of archives,
an archival finding-aid needs to carry considerably more contextual
information than (for example) a library catalogue.
However, archival finding-aids are similar to library catalogues
insofar as they also provide a means of discovering or locating
the resources which users need. By browsing or searching a finding-aid,
users can seek to ascertain what archival materials are available
and to identify particular materials that may be relevant to their
needs. Finding-aids are (or should be) structured and presented
in ways that assist users to do this. They may also contain elements
such as index terms which are specifically intended to facilitate
searching.
Finding-aids also provide information about the structure of the
archives which they describe. Structural information is necessary
because archives can be viewed at a number of levels. At item level,
there are usually numerous individual documents (e.g. letters, memoranda,
invoices). Each item has an internal structure, while related items
are often aggregated in files or folders. These in turn usually
aggregate to form series of archives. A series is composed of files
and/or items which have something in common, usually because they
were created in the course of interrelated activities. When the
various items, files and series created or accumulated by a single
organisation, family or individual are viewed as a whole, the resulting
aggregation of archives is known to archivists as a fonds.
Thus archival materials, and the finding-aids which describe and
represent them, have a hierarchical structure. While other models
are possible (see for example A.
Cunningham, Dynamic descriptions: Australian strategies for the
intellectual control of records and recordkeeping systems, 1998),
the hierarchy is usually seen as a kind of pyramid. A single fonds
is seen as containing many series, while each series may contain
many files or items. A hierarchical model of this kind is used for
the majority of archival finding-aids and is enshrined in ISAD(G),
the General International Standard [for] Archival Description
published by the International Council on Archives.
Because XML is designed to support hierarchically structured information,
it can easily represent the kind of hierarchical structure employed
in most archival finding-aids. When finding-aids are encoded using
XML, they normally make use of Encoded Archival Description (EAD)
to provide a common standard. The finding-aids used in the LEADERS
model system are encoded in EAD, and the LEADERS toolset will be
designed to work with any EAD-encoded finding-aid.
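The fit between XML nesting and the archival hierarchy can be sketched as follows. The element name unit, its level and title attributes, and the titles themselves are invented for this example; they are not EAD tags, and a real finding-aid would follow the EAD DTD instead.

```python
import xml.etree.ElementTree as ET

# Generic, hypothetical names: one <unit> element per level of description.
fonds = ET.Element("unit", level="fonds", title="Example Company")
series = ET.SubElement(fonds, "unit", level="series", title="Correspondence")
file_ = ET.SubElement(series, "unit", level="file", title="Letters 1901")
ET.SubElement(file_, "unit", level="item", title="Letter to supplier")
ET.SubElement(fonds, "unit", level="series", title="Accounts")


def outline(unit, depth=0):
    """Return one indented line per unit, following the hierarchy downwards."""
    lines = [f"{'  ' * depth}{unit.get('level')}: {unit.get('title')}"]
    for child in unit:
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(fonds)))
```

Each level of description is nested inside the level above it, so the pyramid of fonds, series, files and items is carried by the structure of the encoding itself.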
What is Encoded Archival Description (EAD)?
EAD is a standard for the encoding of archival finding-aids. It
was originally devised between 1993 and 1995 at the University of
California at Berkeley, but is now maintained by the US Library
of Congress in conjunction with the Society of American Archivists.
EAD was originally based on the Standard Generalized Markup Language
(SGML), but with the subsequent development of Extensible Markup
Language (XML) it has been made fully XML compliant. It is widely
used both in the United States and in the United Kingdom.
The EAD standard is represented as an XML document type definition
(or DTD). It can be obtained free of charge from the EAD
official web site, which also provides background information
on EAD, an overview of its structure and guidelines for its implementation.
Further web-sites and print publications giving information about
EAD are listed on the LEADERS references
page.
What is the Text Encoding Initiative (TEI)?
The TEI is a set of standards for the encoding of electronic texts.
It was launched in 1987 with the support of the Association for
Computing and the Humanities, the Association for Literary and Linguistic
Computing and the Association for Computational Linguistics. A new
non-profit TEI Consortium was established in December 2000 to maintain
and develop the TEI.
The TEI has become the de facto international standard for scholarly
work with electronic texts. Its use enables electronic documents
to be searched, sorted and presented to users in a variety of formats,
and allows elements to be extracted from them for analysis. The
TEI can handle many types of texts, but the focus is on primary
sources of interest to scholars working in the humanities and social
sciences. It can also be used for the textual material which must
surround digital images in order to make them searchable.
The TEI guidelines were originally based on the Standard Generalized
Markup Language (SGML). A new, updated version of the guidelines
published in 2002 is fully compliant with Extensible Markup Language
(XML).
The TEI guidelines are expressed as a modular document type definition
(or DTD) which covers the encoding of most kinds of electronic texts.
The modules making up the TEI DTD can be configured as either an
SGML or an XML DTD. They can be obtained free of charge from the
TEI web-site, which also provides
background information on the TEI and links to projects worldwide
which have used the TEI to provide resources for literary, linguistic
or historical studies. Further web-sites and print publications
giving information about the TEI are listed on the LEADERS references
page.
What is Encoded Archival Context (EAC)?
Encoded Archival Context (EAC) is a developing standard for structuring and exchanging information about creators
of archival materials. In 2001 a group of archivists met in Toronto, Canada to develop a model for holding this information and to develop a strategy to test it.
This model was called Encoded Archival Context (EAC) to emphasize its relationship with Encoded Archival Description (EAD). There is currently an early draft version of an
EAC DTD available, which is being used by LEADERS. The final published version may take the form of a DTD or a Schema, but no time frame has been set for its release.
The records that can be created using EAC work in a similar way to library authority files but with an emphasis on the provision
of biographical or administrative histories for the person or organisation concerned.
Author: Geoffrey Yeo, July 2002
This page was last modified on 24 June 2003 by Anna Sexton
Copyright © UCL 2002