Library Services

The text and data mining (TDM) exception

This section introduces text and data mining (TDM), focussing on copyright considerations.

Introduction

This guide complements TDM guidance available on the Library Skills page. If you have a question not covered on these pages, please contact copyright@ucl.ac.uk.

What is Text and Data Mining (TDM)?

The UK Intellectual Property Office (IPO) defines text and data mining (TDM) as ‘the use of automated analytical techniques to analyse text and data for patterns, trends and other useful information’.

In a scholarly research context, TDM methods are used to extract information from content (e.g., journal articles, books, websites, social media platforms, images, audio files) using computational methods.

What are the benefits of text and data mining?

TDM allows large amounts of information to be searched and analysed for patterns and relationships that human readers would be unlikely to detect, and that would take much longer to extract and organise manually.

TDM can lead to new insights in many research areas, including biomedical research, linguistic analysis, big data projects and machine learning. TDM therefore has the potential to lead to breakthroughs that inform decision making and generate innovative applications. Examples drawn from the Kluwer copyright blog include making literature reviews faster and more accurate; numerous medical applications, including discovering genes that help develop new treatments; tracking patterns of behaviour during the Covid-19 pandemic; recognising fake news in social networks; and building tools to help decolonise research literature and inform educational interventions.

A 2012 JISC report on the value and benefits of text mining highlighted the scholarly, economic and societal benefits of TDM and helped pave the way to a copyright exception supporting it for non-commercial uses in the UK.

What does text and data mining involve?

Text and data mining combines several processes that work through large amounts of unstructured data to extract explicit knowledge. A general outline of a TDM process is:

  • Searching and retrieving content relevant to the research, e.g. published research articles, book chapters, preprints.
  • Converting the content to machine-readable format and creating a structured dataset.
  • Extracting valuable information from the structured dataset.
  • Applying mining tools to the dataset, to identify new patterns, trends or relationships.
  • Producing explicit knowledge, which may be published/shared as a new output. This may include extracts from the original content.

A TDM process therefore involves making copies of large amounts of materials for the purpose of structuring and analysing them. It may also involve publishing any outcomes from the analysis. This, as discussed in the next section, may have copyright implications.
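The outline of TDM steps above can be illustrated with a minimal sketch. It is not a recommended toolchain: the sample documents, the stopword list and the function names are all hypothetical, and real projects would normally use dedicated text-mining libraries. It assumes the content has already been retrieved lawfully, and it reports only derived counts rather than extracts from the sources.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical sample texts standing in for lawfully accessed content (step 1).
DOCUMENTS = {
    "doc1": "Gene expression patterns inform new treatment options.",
    "doc2": "New treatment options depend on gene discovery.",
    "doc3": "Social networks spread news quickly.",
}

# Illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"on", "new", "the", "a", "of", "in", "and"}


def to_structured(doc_id, text):
    """Step 2: convert raw text into a structured, machine-readable record."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
    return {"id": doc_id, "tokens": tokens}


def document_frequencies(records):
    """Step 3: extract information, here the number of documents containing each term."""
    counts = Counter()
    for record in records:
        counts.update(set(record["tokens"]))
    return counts


def cooccurrences(records):
    """Step 4: mine patterns, here pairs of terms that appear in the same document."""
    pairs = Counter()
    for record in records:
        for a, b in combinations(sorted(set(record["tokens"])), 2):
            pairs[(a, b)] += 1
    return pairs


records = [to_structured(doc_id, text) for doc_id, text in DOCUMENTS.items()]
doc_freq = document_frequencies(records)
pairs = cooccurrences(records)

# Step 5: the output to share is derived data only, with no source extracts.
top_pairs = [(pair, n) for pair, n in pairs.most_common() if n > 1]
for pair, n in top_pairs:
    print(pair, n)
```

Note that the `records` list is itself a copy of the source material, which is why the copyright considerations discussed in this guide apply to the intermediate dataset as well as to the published results.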

Copyright and TDM

How are text and data mining activities related to intellectual property?

Extracting and mining large amounts of data for analysis involves copying the data.

If the sources used are just facts and data, they are less likely to be protected by copyright. Any other sources – texts, images, music – are much more likely to be protected by copyright: copying them would require permission or a licence from the copyright owner or would have to be otherwise permitted in the legislation. Other rights such as database rights, and neighbouring rights (e.g. performance rights in music or film) may apply. Data protection issues may also arise.

If the results from a TDM analysis are published, either formally or through sharing with others, copyright may still protect parts of these works if, for example, extracts from the original materials are included in the new work.

UK copyright legislation includes an exception that allows copying for the purposes of computational analysis under certain conditions (CDPA, s 29A). Other exceptions may be relevant to publishing the outcomes of a TDM analysis. These exceptions only apply if you are carrying out the TDM within the UK.

What does section 29A in UK copyright law say about text and data mining activities?

UK copyright legislation has a number of copyright exceptions, also known as ‘permitted acts’, which allow the copying and reuse of copyright-protected materials for certain purposes and under certain conditions, without the need for permission or a licence from the rights owner. One such exception is section 29A, which allows you to make ‘copies for text and data analysis for non-commercial research’, provided that:

  • You acknowledge the materials you copy, unless it is impractical to do so. For example, if including a large number of articles in your analysis, you could reference the whole database containing the articles instead.
  • You may not share the copy with anyone else, unless this has been authorised by the copyright owner.
  • The person doing the copying is based in the UK at the time of the TDM activity, where this exception applies.
  • The person doing the copying has lawful access to the work. This may include materials the University subscribes to (e.g. e-journals), personal subscriptions and materials that have been made lawfully available to the public, including materials under an open licence.
  • The purpose is to carry out a computational analysis of anything recorded in the work, solely for a non-commercial purpose. No other purposes are permitted without permission. This also means that you cannot sell or lend these copies without permission.

Crucially, if a contract is in place that prevents or restricts you from making a copy under this exception, the term is unenforceable.

What counts as commercial use?

It is the purpose, rather than the person or organisation doing the analysis, that usually determines whether the activity is commercial or non-commercial. For example, a commercially-sponsored project may include a TDM activity for a non-commercial purpose: this would be covered by the exception.

In theory, a distinction can also be made between the purpose of the TDM activity itself and any later commercialisation of the resulting products. In practice, where results are exploited commercially at a later date, it may be difficult to prove that the original intention was indeed non-commercial.

The UK Intellectual Property Office guidance advises that:

‘there are no restrictions on how or where outputs of text and data mining can be published, including journals published for profit by academic publishers and under licences that permit commercial research, such as CC BY. Other commercialisation of the research outputs is not restricted either. But it is important to be scrupulous in assessing whether the original purpose of carrying out the text and data mining analysis is solely non-commercial; if it isn’t, then researchers are very likely to be infringing copyright’.

The UK government had plans to broaden the TDM exception to allow text and data mining for any purpose. Following consultation with several stakeholders, these plans were withdrawn in March 2023. Instead of broadening the exception, the government is preparing a Code of Practice for TDM activities. In November 2023, the UK Libraries and Archives Copyright Alliance (LACA) distributed a letter, signed by several organisations including Research Libraries UK, calling for inclusion of clarifications in the code of practice which would facilitate TDM and AI training activities in the UK. One key point asks for clarification that access to publicly available online information remains ‘available for analysis, including text and data mining, without the need for licensing’.

What if I want to share copies of materials made for TDM purposes with other researchers I am working with?

TDM activities are often carried out as part of a research project involving collaboration between different teams. This may be problematic for several reasons:

  • The TDM exception states that copies of works made for the purposes of computational analysis may not be shared with others, unless such transfer is authorised by the copyright owner. This means that anyone who is not an authorised user under a supplier’s/publisher’s licence may not access the copies without permission.
  • The TDM exception only applies in the UK, and so does the quotation exception. Other countries may or may not have similar exceptions, but their terms will differ across jurisdictions. This means that: (a) even if you are a UCL researcher with access to the sources, you may only rely on the UK exceptions if you are based in the UK and (b) if your project involves collaborations with overseas partners, they may not rely on the exception.
  • Collaboration with industry partners may need clarification as to whether the purpose of the TDM activity is non-commercial.

It is advisable to contact copyright@ucl.ac.uk early on in the project for further help.

What about sharing or publishing the outcomes from a TDM analysis?

While the TDM exception explicitly says that initial copies made to perform the analysis may not be shared with anyone else, the outcomes of the analysis may be shared with others and published. Many publishers’ licences that address TDM activities also clarify that any copyright and database rights arising from any computational analysis would be the property of the researcher (or the researcher’s institution, depending on the institution’s IP policy).

However, an outcome of a TDM activity (e.g. a literature review article or an analysis report) may still include parts of the original copyright works used in the analysis: for instance, it could contain extracts from original articles or book chapters. In this case, copyright still applies: including extracts from these sources would normally require permission or a licence from the copyright owner. At this stage, you may not rely on the TDM exception to include them, but you may consider relying on the criticism, review and quotation exception (s 30, CDPA). As with the TDM exception, the quotation exception cannot be overridden by the terms of a licence.

The quotation exception allows you to use copyright materials without permission, as long as:

  • The work has been made available to the public.
  • The amount quoted is only as much as required for the specific purpose.
  • The sources are acknowledged where possible.
  • The use is fair dealing.

Relying on the exception is a matter of judgement. If you want to discuss your options with the copyright team, please contact copyright@ucl.ac.uk.

Can I use papers and data from open access repositories in TDM?

If a source has been made available lawfully – which is the case with institutional repositories such as UCL Explore – then it can be included in TDM activities. CORE aggregates millions of articles from open access journals and repositories that can be used for computational analysis. Other large datasets include PubMed Central, which offers around six million papers, and PLOS, which has around 200,000. Both can be downloaded without permission.

The content of open access repositories will vary in terms of licensing: many articles and datasets will be available under one of the Creative Commons (CC) attribution licences, which allow copying and reuse as long as the authors are attributed. Other content may be publicly available, but without a specific licence.

While Creative Commons licences are in place to make resources open for access, sharing and reuse, in a TDM context it could be claimed that they are restrictive, as attribution of every source is required. This may not be possible if a very large number of sources are used. However, as discussed on the CC website, exceptions would have primacy over the terms of a CC licence:

“the licenses explicitly state that they in no way restrict uses that are under a limitation or exception to copyright. This means that users do not have to comply with the license for uses of the material permitted by an applicable limitation or exception (…) or uses that are otherwise unrestricted by copyright law, such as text and data mining in many jurisdictions.”

It is still advisable to acknowledge any sources used in the TDM process, but if this is not practical, CC licences do not impose a barrier even if they normally require attribution.

Works that are either not protected by copyright/out of copyright or materials where the copyright owners have waived copyright, for example by applying a CC0 waiver, can be copied for TDM purposes. Many Wikimedia images and some datasets, for example, are public domain under CC0. Attribution is not required with CC0 materials – although it is good practice to attribute the sources – and this facilitates the TDM process further.

What is lawful access for TDM purposes?

Materials (e.g. articles, books, images) available via an institutional or personal subscription, open access materials and public domain materials are all resources to which access is lawful. Materials that are publicly available on websites need more careful consideration: they may include materials that are shared illegally on certain platforms, in which case access is not lawful.

TDM and publisher licences

The TDM exception cannot be overridden by contract. Yet, some publishers restrict my TDM activities. What can I do?

In theory, the TDM exception has primacy over any terms of a contract that may restrict copying for this purpose. In practice, several suppliers and publishers try to prevent, control or restrict TDM activities. Restrictions may include asking that you contact them in advance, requiring you to use their own tools/APIs, and putting technological measures in place to stop you from downloading large amounts of data. In some cases, publishers may even restrict access to their site.

Publishers do have the right to apply technological measures controlling download activity, for instance to ensure their systems remain stable and usable by others. In such cases, it is essential that you don’t try to circumvent any technological measures in place. Instead, please contact copyright@ucl.ac.uk. We discuss such cases with publishers and other suppliers with the aim of finding a solution that does not hinder your TDM research.

UK Intellectual Property Office guidance states that ‘any such measures should not stop or unreasonably restrict any researcher’s ability to benefit from the exception’. It is important that you make the library aware of any cases where your TDM activity has been hindered. This will not only help us support your project, but will also inform our ongoing negotiations with publishers when we sign relevant agreements. Working with publishers and other suppliers is essential, as they are the ones who can offer technical support for TDM activities.

What if a publisher asks me to use their own tools to carry out TDM activities?

Some publishers provide their own tools (e.g. APIs) to support TDM. While using publishers’ tools has some advantages – their tools are compatible with their platforms and products, they have been tried and tested, and the process is controlled to avoid disruption for other users – it is worth being aware of additional terms and conditions that may be imposed by publishers.

Agreeing to use a publisher’s API may involve accepting additional terms of use in a click-through licence. These may include further restrictions on using the data. Additional functionality may be offered, but this may come at additional cost.

We advise you to contact us if you are presented with these options. We can also offer advice before the start of your project as to what different publishers offer or require.

Is there a way of knowing what a publisher’s licence says in terms of text and data mining?

Accessing scholarly articles for TDM usually needs to be done on a publisher-by-publisher basis. This is because publishers differ as to how they approach TDM.

  • Some publishers include terms in their licence that formalise what is already permitted in the legislation, explicitly stating that authorised users (anyone who can access and read the sources under the licence) can download and make copies of the materials for the purposes of TDM for non-commercial research purposes. It is usually required that any copies made for computational analysis be deleted as soon as the work is completed. Some licences also specify that outcomes from TDM analysis belong to the researcher or the researcher’s institution, but that permission is required to reproduce substantial amounts of the original sources in these results.
  • Other publishers do not mention TDM, but use of the exception is covered by a general clause stating that the licence is meant to complement and extend rights that users have under copyright legislation.
  • Some publishers do not include a reference, direct or indirect, to TDM; or they may explicitly prohibit it. Additional requirements may apply, including using an API provided by the publisher.

It is very important to work with the publisher processes as they might otherwise act to limit any bulk download of papers, which could disrupt access for other users. Elsevier, Springer-Nature and Wiley all offer APIs that researchers can self-register for. These three publishers together represent ~40% of all papers. Other publishers will need to be checked on a case-by-case basis and may need to be contacted directly.
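As an illustration of keeping download activity within a supplier's limits, the sketch below spaces out successive requests. It is only a sketch under stated assumptions: the `Throttle` class, the `fetch_all` helper and the `fetch` callable are hypothetical, no real publisher API is shown, and the appropriate interval must come from the terms of the API or licence you are using.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between successive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable so the behaviour can be tested
        self._sleep = sleep
        self._last = None    # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_all(ids, fetch, throttle):
    """Call the hypothetical fetch(item_id) for each id, pausing between calls."""
    results = {}
    for item_id in ids:
        throttle.wait()
        results[item_id] = fetch(item_id)
    return results
```

In use, `fetch` would be whatever function retrieves one item through the route the publisher has approved, and `Throttle(min_interval=...)` would be set to the limit stated in the publisher's terms.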

Please also note that open access papers may be downloaded for TDM purposes without permission. Large open access datasets that can be used include CORE, PubMed Central and PLOS. The content of all of these can be downloaded without permission.

How are TDM licences affected by the rise of AI training tools?

TDM has come to the forefront of discussions with the rise of generative AI. The fact that, in the UK, TDM is permitted in the legislation for non-commercial research purposes only can add extra complexity to how publishers approach TDM, amid concerns that their data may be used to train AI models.

Some suppliers have started contacting higher education institutions to tighten their terms of use in view of this. If your research involves copying large amounts of data to train AI models, please contact copyright@ucl.ac.uk for advice.

Putting it all together: your TDM checklist

Before you start your project, it may be worth considering the following:

What is the purpose of your TDM?

Consider:

  • Final outcome
  • Plans to publish/disseminate the results. Will they include substantial amounts from the original works?
  • Types of sources needed (e.g. articles, websites, images)
  • Tools to be used

Can you rely on the TDM exception?

Consider:

  • Do you have lawful access to the resources, e.g. via a subscription or open access?
  • Different considerations for using raw data/facts, subscribed resources, open access resources, public domain resources.
  • Do others in your project need access to the copies but lack lawful access, or are they unable to carry out the TDM activities (e.g. international partners based elsewhere, partners who don’t subscribe to the same resources)?
  • Is the intended purpose at the time of the TDM non-commercial?
  • Can you acknowledge the sources? If not, at least acknowledge the general database used.
  • Do you need to store the data for transparency and reproducibility reasons? If this is not possible, document your methods in detail.

Legal considerations throughout the project

  • Can the TDM exception apply, or do you need additional permissions, e.g. to share the copies or for commercial use?
  • If you intend to include parts of the original resources when you share the outcome of your TDM analysis, consider getting permission or relying on the quotation exception/fair dealing.
  • Beyond copyright: do database rights or data protection issues apply, e.g. if the sources include personal data?

Technical considerations

  • Likely to need technical support?
  • Need advice on TDM tools?