CAVA Repository

A Human Communication Audio-Visual Archive for UCL

British Sign Language Corpus Project (BSLCP) Collection

The CAVA (human Communication: an Audio-Visual Archive) Repository is a digital video repository to support the work of the international human communication research community. This resource enhances the discoverability and re-usability of expensively-created, specialist video content, for the benefit of the community of human communication researchers worldwide.

The CAVA repository houses data from, among others:

young typically-developing children
children with hearing impairments
children with language impairments
people with aphasia
the British Sign Language Corpus

How to access data

Access to the repository content is restricted to bona fide researchers. In order to view any files, you will need to register. Information on how to join can be found in the FAQs below.

CAVA is a collaboration between several UCL Departments and Research Centres and the UK Data Archive (UKDA).

UCL Library Services
The UCL Research Department of Developmental Science
The UCL Research Department of Language and Communication
The UCL Centre for Human Communication

The original CAVA project was co-funded by UCL and the JISC (Joint Information Systems Committee, UK). CAVA ran from 01 April 2009 to 31 August 2010. For more information, see "About CAVA", below, or read the final project report.

About CAVA

Background

The objective of the CAVA project is to establish a repository for audio-visual data on real-life human communication for spoken and signed languages. In order to investigate human communication and interaction, researchers need hours of audio-visual data, sometimes recorded over periods of months or years. The process of collecting, cataloguing and transcribing such valuable data is time-consuming and expensive. Once it is collected and ready to use, it makes sense to get the maximum value from it by reusing it and sharing it among the research community. Historically, the study of communication has been based on highly-controlled experimental data, but a better understanding comes from examining natural audio-visual data. Such work, both qualitative and quantitative, involves in-depth study of video- and audio-recorded data of conversations and clinical encounters.

But unlike highly-controlled experimental data, natural audio-visual data tends to defy easy classification, and may lead to idiosyncratic solutions to preservation, metadata and access issues. It is not uncommon for vital and unique data to languish on VHS tapes in personal collections; researchers across the discipline waste time battling with increasingly inaccessible media and finding individual solutions to the challenges of editing and analysis. The resources of funders, researchers and subjects are wasted on the collection of new data rather than on the re-use of existing data. Natural data can often be used for more than the purpose its collector intended. Researchers may be able to save time and money, or improve the depth of their observations and conclusions, by reusing existing data instead of collecting their own.

Despite its usefulness, data in personal collections does not lend itself to being shared between researchers and institutions on a large scale. This is largely due to the absence, until now, of a centralised data archive to support such research and to offer opportunities for collaborative work.

History and aims

CAVA was the product of a UCL Research Challenges grant which began in November 2007. This allowed the team to investigate the feasibility of centrally archiving data held by the Centre for Applied Interaction Research (CAIR), an interdisciplinary grouping largely based in the UCL Division of Psychology and Language Sciences. The CAIR project investigated a discipline-specific metadata standard, and archived a pilot sample of data for dissemination through UCL's Moodle virtual learning environment. A feasibility study conducted as part of the CAIR project also found considerable support among the research community for a comprehensive and accessible repository.

CAVA will create a repository of re-usable video material to support the work of UCL and of the international human communication research community. The UCL team already holds a large body of content (over 600 hours of material), in suitable formats, with appropriate permissions secured. The data mainly comprises videotaped interactions - conversations, interviews, assessments - between a person who has atypical communication (due to disabilities such as stroke, deafness, autism etc) and their spouse, teacher, parent or another 'typical' communicator, filmed in the home, at school or in the clinic. The duration of the videos ranges from 10 minutes to more than an hour per client. Some of the data is longitudinal with regular sessions filmed over a period of time.

Preservation

An important goal of CAVA is to centralise data held in personal collections and make it easily searchable by uniting the disparate sets of information held by the original researcher. The initial dataset will come from CAIR members based at UCL, though researchers at other institutions have expressed interest in contributing. The majority of this data is stored on portable media or cassettes, which run using proprietary or hard-to-find codecs, meaning that they are difficult to share and are at risk of obsolescence. The data will be captured from its original medium, whether CD, VHS or Mini-DV cassette, and transcoded into low-compression AVI format for preservation. The preservation file will maintain the quality of the original to the highest possible extent. This represents an acceptable compromise between fully uncompressed, but unwieldy, preservation files and the risk of the original formats becoming obsolete. The objective at this point is to achieve uniformity of data in order to aid long-term management of the dataset.
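As a rough illustration of the transcoding step, the sketch below builds a command line for the ffmpeg tool. It is an assumption throughout: the project does not state which tool or codec it uses, so the MJPEG video and uncompressed PCM audio settings, the function name and the file paths are all hypothetical stand-ins for "low-compression AVI".

```python
from pathlib import Path


def build_transcode_command(source, out_dir):
    """Return an ffmpeg command that writes a low-compression AVI copy.

    Assumption: MJPEG at a high quality setting plus PCM audio is used here
    only as an example; the codec CAVA actually chose is not documented.
    """
    source = Path(source)
    target = Path(out_dir) / (source.stem + ".avi")
    return [
        "ffmpeg",
        "-i", str(source),        # captured file from CD, VHS or Mini-DV
        "-c:v", "mjpeg",          # intra-frame video codec (assumed choice)
        "-q:v", "2",              # low quantiser value = light compression
        "-c:a", "pcm_s16le",      # uncompressed 16-bit audio
        str(target),
    ]


if __name__ == "__main__":
    # Hypothetical capture file; the command is printed, not executed.
    print(" ".join(build_transcode_command("capture/tape12.dv", "preservation")))
```

Building the command as a list, rather than running it directly, makes it easy to log exactly how each preservation file was produced.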

Discovery

It is not enough, though, just to collect and standardise the quality of the data; it must be readily searchable. CAVA uses a modified metadata standard based on the ISLE MetaData Initiative (IMDI), a schema designed for language resources. The nature of the data presents some crucial challenges to the creation of metadata. Implementing the full IMDI standard would be too time-consuming and costly, for both the project team and depositors. The key issue to address is that of multiple participants. Based on various modifications of the IMDI standard, principally the UCL Deafness, Cognition and Language Research unit subset, the CAVA subset presents a pragmatic solution. A mapping between UCL's IMDI instance and the Dublin Core standard will be written. A full metadata schema and best-practice guides for capturing data (for both users and potential depositors) will be placed on this website.
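To give a sense of what a mapping between an IMDI-style record and Dublin Core might look like, here is a minimal sketch. The Dublin Core element names (title, date, contributor and so on) are standard, but the IMDI-style field names, the record layout and the function name are hypothetical placeholders and not the actual CAVA subset.

```python
# Hypothetical IMDI-style field names mapped to Dublin Core elements.
IMDI_TO_DC = {
    "Session.Name": "identifier",
    "Session.Title": "title",
    "Session.Date": "date",
    "Session.Description": "description",
    "Content.Genre": "type",
    "Content.Language": "language",
    "Access.Rights": "rights",
}


def to_dublin_core(record):
    """Convert one flat IMDI-style record into repeatable Dublin Core fields."""
    dc = {}
    for field, element in IMDI_TO_DC.items():
        if field in record:
            dc.setdefault(element, []).append(record[field])
    # Multiple participants: each actor becomes its own dc:contributor value.
    for actor in record.get("Actors", []):
        dc.setdefault("contributor", []).append(
            f"{actor['code']} ({actor['role']})"
        )
    return dc


if __name__ == "__main__":
    example = {
        "Session.Title": "Home conversation 1",
        "Session.Date": "2009-05-01",
        "Actors": [
            {"code": "P01", "role": "person with aphasia"},
            {"code": "P02", "role": "spouse"},
        ],
    }
    print(to_dublin_core(example))
```

Because every Dublin Core element is repeatable, storing each one as a list is one simple way to cope with recordings that have several participants.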

Storage and dissemination

As the data is collected it will be stored using the UCL Library Services Digital Collections service. The team will devise and test ingest processes so that video clips, transcripts (where available) and descriptive metadata can be uploaded to the repository in batches, in a way which maintains the relationships between the one or more versions of each video recording, its transcript, and the metadata which applies to each. The final ingest process will include the automatic generation of technical metadata and the creation of appropriate access restrictions.
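The batch ingest described above could be prepared along the following lines. This is a sketch under stated assumptions, not the process the team devised: the file-naming convention (a recording ID, optionally followed by an underscore and a version label), the file extensions and all names in the code are hypothetical.

```python
from dataclasses import dataclass, field
from pathlib import PurePath
from typing import Optional

VIDEO_SUFFIXES = {".avi", ".mp4"}  # assumed video extensions


@dataclass
class IngestItem:
    """Keeps every file belonging to one recording together."""
    recording_id: str
    videos: list = field(default_factory=list)
    transcript: Optional[str] = None   # transcripts exist only for some data
    metadata: Optional[str] = None


def group_batch(filenames):
    """Group a batch of files by recording ID (the text before any '_')."""
    items = {}
    for name in sorted(filenames):
        path = PurePath(name)
        recording_id = path.stem.split("_")[0]
        item = items.setdefault(recording_id, IngestItem(recording_id))
        suffix = path.suffix.lower()
        if suffix in VIDEO_SUFFIXES:
            item.videos.append(name)
        elif suffix == ".txt":
            item.transcript = name
        elif suffix == ".xml":
            item.metadata = name
    return items


def incomplete(items):
    """List recordings that lack a video or a metadata record."""
    return [
        item.recording_id
        for item in items.values()
        if not item.videos or item.metadata is None
    ]
```

Checking a batch with `incomplete()` before upload would flag any recording whose metadata is missing, so that relationships between versions, transcript and metadata are not lost at ingest.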

The data will be made available in several dissemination formats. All data accepted by the archive will have appropriate permissions for the various types of dissemination. Users will be able to download compressed video or uncompressed audio-only files. All dissemination formats will be prepared in order to operate with minimal codec and system requirements.

Access management

In order for a researcher to benefit from access to the data, they must be able to manipulate the files at their leisure. However, in order to encourage researchers to use the archive, and primarily to request access, the dissemination will operate on a tiered basis. Our key concerns at this stage are to ensure good procedures for data protection, identity and ethical issues. The metadata will be searchable through the DigiTool front page, although none of the data itself will be viewable at this point. A login will be supplied to researchers to permit them to view streamed excerpts of the data which potentially interest them. The researcher would then request full access to downloadable versions of their selected datasets. By these means, CAVA takes all reasonable precautions to prevent the often-sensitive data from being used inappropriately.
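The three tiers described above (open metadata search, streamed excerpts after login, and downloads on request) can be modelled as a simple check. The sketch below is illustrative only; the tier names, action names and function are assumptions and do not describe how DigiTool itself enforces access.

```python
from enum import IntEnum


class Tier(IntEnum):
    PUBLIC = 0       # anyone: search the metadata only
    REGISTERED = 1   # login supplied: view streamed excerpts
    LICENSED = 2     # full access granted for selected datasets


# Minimum tier needed for each (hypothetical) action.
REQUIRED_TIER = {
    "search_metadata": Tier.PUBLIC,
    "stream_excerpt": Tier.REGISTERED,
    "download": Tier.LICENSED,
}


def is_allowed(user_tier, action, dataset_id=None, granted_datasets=frozenset()):
    """Return True if a user at this tier may perform the action."""
    if user_tier < REQUIRED_TIER[action]:
        return False
    if action == "download":
        # Downloads are approved per dataset, not across the whole archive.
        return dataset_id in granted_datasets
    return True
```

For example, `is_allowed(Tier.REGISTERED, "download", "TSA2007/05")` returns `False`, because a registered user must still request full access to each dataset they wish to download.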

The project team will work with the UKDA to design the application process for prospective users, implement procedures to verify and authorise requests, and register and authenticate users. The retention and presentation of rights information will be implemented within the IMDI record. Hard-copy or click-through licensing agreements will be associated with particular tiers of access, indicating clearly and unambiguously what an authorised user may and may not do with the material.

Frequently asked questions

What kind of data does CAVA accept?

The CAVA repository is aimed at researchers looking at human communication and interaction. This may include any rights-cleared primary audio and video recordings, especially raw data featuring use of natural language. If you think your data might be suitable for inclusion in the repository, please contact the CAVA Project Officer.

How do I get access to the data in the repository?

To access CAVA you need to apply for a user licence by contacting the Project Officer. The CAVA team will issue you with a login that will give you access to all the data that does not have a further access restriction.

What will I need to do to submit data to the repository?

The most important thing we require, aside from the data itself, is information that makes it searchable. You will need to complete the metadata spreadsheet (a user guide can be found in the documents), and sign a depositor's licence, giving us permission to store and manage the data. This also confirms to us that the data you provide has appropriate permission to be used in the repository. Once the form and the licence are complete, the CAVA team will begin uploading your data to the repository. If you have data that you are interested in storing with CAVA, please contact the Project Officer.

Will there be a cost for storing the data?

There is currently no charge for the services that CAVA offers. At present CAVA is accepting data which will be made available through the repository, and will not accept preservation-only versions of recordings. However, the team are investigating the possibility of long-term offline storage of preservation-quality data.

What formats does CAVA accept?

You can see the preferred file formats in the report. However, the repository is designed to be flexible. If you have a query about a particular format you would prefer to use, please contact the Project Officer.

Can I use data from CAVA for teaching purposes?

All the data that CAVA accepts has appropriate permission for reasonable use in teaching and research. Do not use CAVA data in a situation where you would not use data you collected yourself. If you have a specific question about the use of a particular dataset, please contact the researcher who submitted it (found in the metadata record under 'Project Contact'), or alternatively contact the Project Officer.

Can students access the repository?

Due to the consent arrangements for data in the CAVA repository, MSc students must discuss their need for membership with the CAVA team. If you wish to use data from the archive in an MSc project, please contact Dr Suzanne Beeke (if your research focuses on adult subjects) or Dr Merle Mahon (children), with details of your project and supervisor.

Researchers (including PhDs) and staff should contact the Project Officer to request membership.

Where can I find a form of words for consent to store and use my data?

Most recent consent forms provide the necessary permission to archive and store data. The CAVA team has produced pro forma permissions forms, available in documents.

How should I cite and acknowledge CAVA data?

All users of our data must acknowledge and cite data sources correctly in any publications and outputs. Details of how to cite data can be derived from the catalogue records.

An acknowledgement is a general statement giving credit to the source and distributor and includes copyright information. It can be given at the start of, or within, the text, or at the end of the article before the bibliographic references/citations. You can find this information (e.g. depositor, sponsor) in the metadata records. Please include:

The project ID and title
The sponsor, funder or owner (if different from the institution at which the research was conducted)
The name of the CAVA repository and its web address (https://www.ucl.ac.uk/ls/cava)
Copyright information.

A suggested format for acknowledging data, using the example of research based on the TSA2007/05 (The evaluation of a novel conversation-focused therapy for agrammatism) project, is:
"The data in this article was collected for the following project: Evaluation of a novel conversation-focused therapy for agrammatism (TSA2007/05), 2010, at University College London, funded by the Stroke Association, and supplied by the CAVA repository (https://www.ucl.ac.uk/ls/cava). The data are copyright."
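If you acknowledge several datasets, it may help to generate the statement from the metadata record. The short sketch below reproduces the suggested wording; the function and parameter names are hypothetical, and the values must be taken from the catalogue record for the dataset you used.

```python
def format_acknowledgement(title, project_id, year, institution, funder,
                           repository_url="https://www.ucl.ac.uk/ls/cava"):
    """Build an acknowledgement in the suggested CAVA wording."""
    return (
        "The data in this article was collected for the following project: "
        f"{title} ({project_id}), {year}, at {institution}, "
        f"funded by {funder}, and supplied by the CAVA repository "
        f"({repository_url}). The data are copyright."
    )


if __name__ == "__main__":
    print(format_acknowledgement(
        title="Evaluation of a novel conversation-focused therapy for agrammatism",
        project_id="TSA2007/05",
        year=2010,
        institution="University College London",
        funder="the Stroke Association",
    ))
```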

A citation is more formal than an acknowledgement. It follows a standard format and should include enough information so that the exact version of the data being cited can be located. New standards for citing data are emerging, and will depend on the preferred style of the journal or publisher. For further information, refer to the UKDA's guidance.

Documents