The primary objective of the MERLIN project is to demonstrate and evaluate the use of off-the-shelf text mining and thesaurus tools to derive descriptive subject classification from full text repository deposits, at minimum cost, and to use such derived terms to enhance the discovery and navigability of repository content. The testbed for MERLIN will be the SHERPA-LEAP Consortium's LASSO aggregation service. Later, assuming that MERLIN's work in the aggregator is successful, a platform-neutral tool allowing repositories to implement the MERLIN metadata enhancement solution will be made generally available.
SHERPA-LEAP and LASSO
SHERPA-LEAP (London E-prints Access Project, a partner in SHERPA) is a Consortium of London-based Higher Education Institutions, led by UCL. SHERPA-LEAP helps London universities to develop and maintain their institutional repositories. Within the LEAP partnership there is substantial diversity of institutional size and mission, ranging from the large, multi-disciplinary and research-led, to the smaller and highly-specialised, and a substantial range of research interests.
These differences are reflected in the content of the Consortium's repository cross-searching service, LASSO (LEAP Aggregated Search Service On-line), making it an ideal testbed in which to expose and examine issues relating to the application of text mining techniques across institutions and disciplines. LASSO is a simple OAI-PMH-based aggregation service which was developed in 2008 as a demonstrator by UCL Library Services; it offers cross-searching of the institutional repositories of several SHERPA-LEAP member institutions.
Subject description in the SHERPA-LEAP repositories
The LEAP partners, at the outset of the SHERPA-LEAP project, recognised that any shared taxonomy for subject description would have to be so large and unwieldy - supporting research into specialist subjects ranging from clinical biomedicine to ancient South-East Asian cultures - as to be entirely off-putting to depositors, and unworkable by administrators. It was agreed that the Consortium members should make their own arrangements for subject description. In fact, few SHERPA-LEAP repositories use any subject authority in describing their records. In common with arrangements at many UK HEIs, repository staffing resources at the partner institutions have tended to be piecemeal, and are often still unestablished. The primary focus of the partners has been the rapid population of their repositories, in order to achieve a critical mass of content. In many institutions, it is felt that the application of externally-authorised subject classification would turn the process of repository population from a data entry task into a much more time-consuming, highly-skilled and therefore expensive cataloguing task. Even in those repositories which employ authorised subject description, resource constraints often mean that description is at such a high level as to bring little additional practical benefit to users. Subject description, therefore, is not standardised to any degree in the LASSO cross-searching service.
The MERLIN approach to subject description
Away from their own search pages, the content of the LEAP repositories is discoverable either through the somewhat indiscriminate indexing of Google and other search engines, or the rather narrow, lowest-common-denominator search offered by LASSO and other metadata aggregation services. The MERLIN project will bridge the gap between these two extremes by integrating automatic subject description with a metadata aggregation service, without additional resource implications for the participating repositories. The MERLIN approach brings several benefits:
- improves the discoverability of repository content.
- allows cost-effective subject description.
- uses researchers' own vocabularies.
- caters for interdisciplinarity.
- goes beyond simple full-text indexing by bringing selectivity and weight to index terms.
- provides weighted keywords that may be used as the basis for structured navigation.
The MERLIN project
MERLIN will use off-the-shelf text mining techniques to enrich the functionality of LASSO. LASSO offers search across aggregated, normalised metadata which is collected from London-based institutional repositories using OAI-PMH harvesting. MERLIN will use the TerMine term extraction tool to derive terms from full text digital objects held at LASSO's source repositories and, after a weighting process, enrich the LASSO database with derived keywords. The derived terms will be exposed at various points in the LASSO interface as 'clouds', and will also be queryable through the LASSO search forms, to support discovery.
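The weighting and enrichment step described above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: it assumes a term extraction tool has already returned (term, score) pairs for each full-text item, and the function names, record fields, normalisation rule and `top_n` cut-off are all hypothetical.

```python
from collections import defaultdict


def weight_terms(scored_terms, top_n=5):
    """Scale raw term scores to the range 0..1 and keep the top_n terms.

    scored_terms is a list of (term, score) pairs; the input shape is an
    assumption, not the documented output format of any particular tool.
    """
    if not scored_terms:
        return []
    max_score = max(score for _, score in scored_terms)
    if max_score <= 0:
        return []
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scored_terms, key=lambda pair: (-pair[1], pair[0]))
    return [(term, round(score / max_score, 3)) for term, score in ranked[:top_n]]


def enrich_records(records, extracted, top_n=5):
    """Attach weighted keywords to each harvested metadata record."""
    enriched = []
    for record in records:
        keywords = weight_terms(extracted.get(record["id"], []), top_n)
        enriched.append({**record, "keywords": keywords})
    return enriched


def build_cloud(enriched):
    """Sum keyword weights across records, e.g. to size terms in a cloud."""
    totals = defaultdict(float)
    for record in enriched:
        for term, weight in record["keywords"]:
            totals[term] += weight
    return dict(totals)


# Hypothetical harvested records and extracted term scores.
records = [
    {"id": "oai:example:1", "title": "Mining repository full text"},
    {"id": "oai:example:2", "title": "Thesauri for subject browsing"},
]
extracted = {
    "oai:example:1": [("text mining", 8.0), ("term extraction", 4.0), ("repository", 2.0)],
    "oai:example:2": [("thesaurus", 6.0), ("text mining", 3.0)],
}

enriched = enrich_records(records, extracted, top_n=2)
cloud = build_cloud(enriched)
```

Scaling scores per document keeps one long deposit from dominating the cloud; a real service might prefer a different weighting scheme.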
In a supplementary strand of the project, MERLIN aims to use the multi-subject terminological cross-searching aids developed by the HILT project to pilot a hierarchical, browsable subject tree based on the weighted keywords. Text-mined terms in LASSO will be matched against the HILT collections to retrieve a set of related terms and relationships and create a navigable service for end users.
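As a rough illustration of how matched terms might be arranged into a browsable tree, the sketch below walks broader-term links upwards from each keyword. The `broader` dictionary is a hypothetical in-memory stand-in for whatever relationships a terminology service returns; the HILT interface itself is not described here, and none of these names come from it.

```python
def build_subject_tree(weighted_terms, broader):
    """Group matched terms under their chain of broader terms.

    weighted_terms: list of (term, weight) pairs.
    broader: dict mapping a term to its broader term, or None at the top.
    Returns (tree, unmatched), where tree is a nested dict of labels.
    """
    tree = {}
    unmatched = []
    for term, _weight in weighted_terms:
        if term not in broader:
            # No match in the thesaurus: keep the term out of the tree.
            unmatched.append(term)
            continue
        # Walk up the broader-term links to the top of the hierarchy.
        path = [term]
        while broader.get(path[-1]) is not None:
            parent = broader[path[-1]]
            if parent in path:
                break  # guard against cyclic relationships
            path.append(parent)
        # Insert the path into the tree from the broadest term downwards.
        node = tree
        for label in reversed(path):
            node = node.setdefault(label, {})
    return tree, unmatched


# Hypothetical thesaurus relationships and text-mined keywords.
broader = {
    "computer science": None,
    "information retrieval": "computer science",
    "text mining": "information retrieval",
    "thesauri": "information retrieval",
}
keywords = [("text mining", 1.0), ("thesauri", 0.5), ("lasso", 0.2)]

tree, unmatched = build_subject_tree(keywords, broader)
```

Terms with no thesaurus match are returned separately so that they can still be offered through the keyword clouds and search forms.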
A formative evaluation involving end-users will assess the accuracy, usability and efficiency of the automated enhancements to the LASSO aggregator.
An open source, re-usable web application will be created to allow the MERLIN metadata enrichment technology to be incorporated into any repository on any platform. Although the project will showcase the exploitation of text mining in an aggregator service, the MERLIN tool will stand alone, and will be available for integration with any repository software, offering the potential for subject description to be applied to the content of any repository with few of the resource overheads traditionally associated with subject cataloguing.
A summative evaluation report on all the outputs of the MERLIN project will be compiled and made available through the project Web site.
For more information, contact the MERLIN Project Team.