An introduction to databases

The Celtic Inscribed Stones Project (CISP) is jointly run between the Department of History, UCL, and the Institute of Archaeology, UCL, under the direction of Prof. Wendy Davies in collaboration with Prof. James Graham-Campbell. The project currently (as of October 18, 2000) employs three full-time staff (Dr Kris Lockyear, Dr Mark Handley and Dr Paul Kershaw). The database structure described in this manual was constructed with by Dr Kris Lockyear and former research fellow Dr Katherine Forsyth. The first three years of the Project was funded by the HRB/HEFCE via their institutional fellowship scheme.

CISP's aim is to undertake a collaborative, interdisciplinary study of Early Medieval Celtic inscriptions. One of its main objectives is the compilation of a comprehensive and authoritative database of all known inscriptions from Great Britain, Ireland and Brittany. By bringing this material together in one place and making it readily available our goal is to turn what is a largely untapped resource into usable material.

Further details of the Project are available on the Project's web pages
(http://www.ucl.ac.uk/archaeology/cisp).

This guide and manual is intended both as a general introduction to the CISP database, and as a detailed guide for data entry. Chapter

contains an introduction to databases, database management systems, and data structures (terms which are discussed below). The subsequent chapters discuss the contents of the CISP database, and provide a detailed table by table, field by field guide to the database including allowed terms and definitions of fields and entries, and a short guide to the CISP data entry application. Appendices provide a glossary of terms and list major changes to the database since the first version of this manual.

Database concepts

This section discusses a number of database concepts and is primarily intended for those who have had little or no experience of computer-based databases.

Databases

A database is structured collection of data. Thus, card indices, printed catalogues of archaeological artefacts and telephone directories are all examples of databases. Databases may be stored on a computer and examined using a program. These programs are often called `databases', but more strictly are database management systems (DMS). Just as a card index or catalogue has to be constructed carefully in order to be useful, so must a database on a computer. Similarly, just as there are many ways that a printed catalogue can be organised, there are many ways, or models, by which a computerised database may be organised. One of the most common and powerful models is the `relational' model (discussed below), and programs which use this model are known as relational database management systems (RDMS).

Computer-based databases are usually organised into one or more tables. A table stores data in a format similar to a published table and consists of a series of rows and columns. To carry the analogy further, just as a published table will have a title at the top of each column, so each column in a database table will have a name, often called a field name. The term field is often used instead of column. Each row in a table will represent one example of the type of object about which data has been collected. Table 1(a) (p.

) is a an example of a table from a database of English towns. Each row, in this case a town, is an entity, and each column represents an attribute of that entity. Thus, in this table `population' is an attribute of `town.'

One advantage of computer-based tables is that they can be presented on screen in a variety of orders, formats, or according to certain criteria, all the towns in Hertfordshire, or all towns with a cathedral.

Specific purpose vs. resource databases

Databases often fall into one of two broad categories. The first comprises specific purpose, limited databases. In academia, these often contain data gathered to perform a relatively limited rôle only in a particular project. The database may be intended to provide the researcher with a particular set of data, but have no particular function or rôle at the conclusion of the project. For example, Lockyear's Coin Hoards of the Roman Republic (CHRR) database included only data necessary for the project in hand [LockyearLockyear1996, chapter 5].

The second category comprises general purpose, resource databases. A good example of a resource database are county archaeological sites and monuments records (SMRs), or national monuments records (Hansen:1993). These databases are not project specific but are intended to be of use to a wide variety of users. Resource databases usually attempt to be comprehensive within their `domain of discourse', are maintained and updated, and are made available to interested parties. As these databases attempt to be comprehensive in order to accomodate unpredicted enquiries and research, they include a wide variety of data which in turn requires a complex `data structure', or way of storing the information.

The CISP database is intended to be a resource database and as a result has a complex data structure (discussed below). This structure, however, provides great power and flexibility both for the retrieval and for the handling of the data, but also for future expansion of the database to include other information and materials.

Relational databases

A common and powerful method for organising data for computerisation is the relational data model. Use of this model often results in a database with many tables, and a common question is why such a complex structure should be necessary. Table 1(b) is an example of bad table design with the same towns as in Table 1(a) but with some additional information--the population and the area of the counties--added. We can see from this table that the size and population of Hertfordshire is repeated three times. This duplication is called data redundency. Data redundency is a problem for several reasons:

Table 1: Example tables from a geographical database

[A table of English towns]

town	county	population	county town?	cathedral?
Welwyn Garden City	Hertfordshire	40,570	no	no
St. Albans	Hertfordshire	123,800	no	yes
Hertford	Hertfordshire	2,023	yes	no
Durham	Durham	29,490	yes	yes

[A badly designed table]

town	county	population	county town?	cathedral?	county population	county size
Welwyn Garden City	Hertfordshire	40,570	no	no	937,300	631
St. Albans	Hertfordshire	123,800	no	yes	937,300	631
Hertford	Hertfordshire	2,023	yes	no	397,300	631
Durham	Durham	29,490	yes	yes	132,681	295
	Essex				1,426,200	1,528

[A table of counties]

county	population	size (square miles)
Hertfordshire	937,300	631
Durham	132,681	295
Essex	1,426,200	1,528

A second problem with the table can be seen in the last row. We have information about the population of Essex as a whole but none about any individual town. To accomodate this information we have had to create a row of data with only partial information. As well as these problems, a poor data structure can lead to inflexibility in the use of the database, and possibly problems in retrieving data in the form required. Examples of poor database design are all too common.

To solve these problems, the data should be split into several tables. To follow the town example through, we could have a table of towns as given in Table 1(a). Each item of information stored in this table is an attribute of a town. The information about counties is then stored in a second, separate table of counties as shown in Table 1(c). In this table every item of information is an attribute of a county. This process of breaking data down into a series of tables is called normalisation and is the first and most important step in designing a relational database.

Normalisation is the process of identifying entities and their attributes, and defining the relationship between the entities. In our example we have two entities--towns and counties, and we have recorded various attributes (Tables 1(a) & 1(c)). There are three types of relationship between entities: one-to-one, one-to-many, and many-to-many. Figure 1 shows the different types of relationship in a diagramatic form which are discussed in detail below. This type of diagram is known as an entity relationship diagram.

One-to-one relationships

A special case of a one-to-one relationship is where particular pieces of information only exist, or are only applicable, to some of the entries in a table. In our geographical example we may wish to record the length of coast line or other attributes which only relate to counties which border the sea. In these cases one can create a separate table for this information. This helps to save disk space on the computer, minimise data entry time, and break down potentially large tables. This type of relationship is shown in Figure 1(b).

One-to-many relationships

Many-to-many relationships

This type of relationship necessitates the use of the third table. This effectively creates two one-to-many relationships. These intermediate tables can be called linking tables. These tables often only contain two columns which act as a link between the two main tables. In our geographical example, the linking table would contain the names of counties, and the names of rivers only. This solution to modelling many-to-many relationships is illustrated in Figure 1(e).

**Figure 1:** Entity relationship diagrams.
[A one-to-one relationship] $\includegraphics[height=1.4cm]{rel1.eps}$ [A one-to-one relationship for some entities only] $\includegraphics[height=1.4cm]{rel2.eps}$ [A one-to-many relationship] $\includegraphics[height=1.4cm]{rel3.eps}$ [A many-to-many relationship] $\includegraphics[height=1.4cm]{rel4.eps}$ [Splitting a many-to-many relationship into two one-to-many relationships] $\includegraphics[height=1.4cm]{rel5.eps}$

Primary and foreign keys

Every row in a table in a relational database must be unique, there must not be two identical rows. One or more columns are therefore designated the primary key (sometimes called the unique identifier) for the items contained within it. Thus, in Table 1(a) the column `town' could act as the primary key, and in Table 1(c) column `county' can act as that table's primary key. This concept has been used in paper-based (published) databases, each inscribed stone catalogued in R. A. S. Macalister's Corpus Inscriptionum Insularum Celticarum Macalister:1945,Macalister:1949 has an unique identifying number, as does each hoard in Crawford's Roman Republican Coin Hoards Crawford:1969:b.

In our geographical example, however, there can be more than one town with the same name, Newcastle or Newport for example. In this case we could designate the `town' and the `county' columns together as the primary key.

Foreign keys are columns in a table which provide a link to another table. In our geographical example, the county column in our table of towns provides a link to the table of counties, and is thus a key field in that relationship. It is very important therefore to ensure that entries in the both tables are identical, that both tables use the full county name (Hertfordshire) or an abbreviation (Herts) but not a mixture of the two.

There is one final complexity which must be addressed. What could we do in the case where there are two towns with the same name in the same county? Although in our example it is unlikely, in databases of other information this could happen. We could use a combination of name, county and population as the primary key for the table of towns. If we had a table of shops, we would have to include the town name, county and the population to provide a link between the two tables. This, however, will re-introduce the problem of data redundancy. A better course of action is to assign a unique code to each town, and to use this code as the link to the table of shops. The use of codes has other advantages: it can be quite short and thus save time during data entry and disk space. These codes can be assigned by the user, WGC for Welwyn Garden City, or could be a sequential number created automatically by the program.

Data types and definition

The data stored in tables can be classified into types. In Table 1(a) the first column can contain any letter, number, or other character (such as {, or &). This is an alphanumeric data type, also known as a string or character field. The third column for population contains a number and is a numeric data type. The last two columns are `logical' and can only contain yes or no. There are other data types such as date or even images and sounds.

The type of data is important as different types of data behave in different ways. A good example is the sorting order of a series of numbers. If we store 1, 22, 3, 10, 2 and 15 in a numeric column, and ask the program to sort the rows of the table on this column, we will get 1, 2, 3, 10, 15, 22 as we might expect. If that column was defined as an alphanumeric data type, the result would be 1, 10, 15, 2, 22, 3, a rather different result! Different DMSs have different ways of handling different types of data (see below).

Each column of data also has to be defined. This can be quite simple, `the county column will contain the full county name'. We also have to decide what the entries mean, in the table of counties we have a column for area--we have to decide if this is the area in square miles or square kilometers.

We may wish to restrict the possible entries in a column. We can do this to prevent errors, we may decide that the maximum allowed population in a town is 10,000,000 as no town in Britain has a population larger than that. We may also wish to restrict entries to a limited list of terms. If, for example, we had `type' as an attribute of town, we could have market town, small town, county town, village, small village, hamlet and so on. If any term was allowable, this attribute would not be very useful for retrieving groups of settlements in any meaningful way. We might, therefore, create a list of allowed terms which are precisely defined and which would therefore allow meaningful data retrieval.

Look-up tables

In the previous section restricted data fields were discussed. How, in practice, are the entries in fields to be restricted? The first method is for the allowed terms to be listed in a manual such as this one, and for every user to be disciplined enough only to use those terms, and to check that they have used the correct ones. There are advantages, however, in storing these terms on the computer along with the main tables of data. There are thus two further methods. The first is to include the definitions in a database application (see below), or in the way the table is defined within the DMS. This has the disadvantage that the information is dependent on the software being used, and if the data is transferred (`ported') to another program this information will be lost. It is also difficult to add new terms to the list. The second alternative method is to use look-up tables, of which there are two types, simple and hierarchical.

Simple look-up tables typically consist of one or two columns. In a one column example, the list of allowed terms is stored in the table; in a two column example the first column stores the allowed term, often in the form of a code, and the second column stores the definition of that term or code. A good example of simple look-up tables are the POSIT1, POSIT2 and POSIT3 tables discussed on page

Hierarchical look-up tables are very similar in that one column contains a series of unique terms or codes. The remaining columns then contain definitions of that code, but in different levels of detail. Using our geographical example, we might wish to classify the rivers. The look-up table would contain a column of codes. Another column could then contain some broad classification such as `major river', `minor river' and `stream.' A third column could then further subdivide the classification, major rivers might be divided into `tidal' and `non-tidal', and a fourth column could divide `tidal' into `estuarine' and `non-estuarine'. The SITETYPE table discussed in section 10.6 is a good example of a hierarchical look-up table.

Hierarchical look-up tables have a dual function--to restrict the entries in a second table (sometimes called a parent table), and to provide a mechanism by which complex queries can be simplified. Both types of look-up table can be used to create printed output from the database which is more readily understood, by replacing a series of possibly obscure codes with more descriptive pieces of text.

Database applications

Conclusions

The following chapters examine the content and structure of the CISP database in general, and then provide a data definition guide to all tables and fields.