
Data integration challenges faced by academic publishers

by Martin Dransfield

Academic journal publishers face many challenges when integrating data from multiple sources. Here's a breakdown of those challenges.


Each of the items below is a challenge that journal publishers must solve in order to deliver articles to the people who want to read them. Some are trivial; others merely appear trivial.



Identifiers

Academic publishing attracts identifiers like picnics attract ants. Journals have them: International Standard Serial Numbers (ISSNs), usually two (one each for the print and electronic versions), and often a publisher-internal identifier as well.

Articles are assigned them on acceptance.

Usually an article is also assigned a Digital Object Identifier (DOI); issues, volumes, and sometimes whole journals may have DOIs too. Increasingly, authors have identifiers of their own: Open Researcher & Contributor IDs (ORCIDs).
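
Each of these identifier schemes has a recognisable shape, which makes light validation useful at integration time. Here is a minimal sketch in Python; the field names and patterns are illustrative, not any publisher's actual schema:

```python
import re
from dataclasses import dataclass, field

ISSN_RE = re.compile(r"^\d{4}-\d{3}[\dX]$")               # e.g. 1234-5678
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")                 # e.g. 10.1234/abc.5678
ORCID_RE = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")  # e.g. 0000-0002-1825-0097

@dataclass
class ArticleIdentifiers:
    publisher_id: str                  # internal identifier, assigned on acceptance
    doi: str | None = None
    print_issn: str | None = None
    electronic_issn: str | None = None
    author_orcids: list[str] = field(default_factory=list)

    def format_problems(self) -> list[str]:
        """Return any identifiers that fail their basic format check."""
        problems = []
        if self.doi and not DOI_RE.match(self.doi):
            problems.append(f"bad DOI: {self.doi}")
        for issn in (self.print_issn, self.electronic_issn):
            if issn and not ISSN_RE.match(issn):
                problems.append(f"bad ISSN: {issn}")
        problems += [f"bad ORCID: {o}" for o in self.author_orcids
                     if not ORCID_RE.match(o)]
        return problems
```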

Production

Articles do not arrive at a publisher in exactly the form in which they are published.

In the olden days, authors would send paper manuscripts, which would have to have their images prepared by draughtsmen, be typeset, proof-read, corrected, assembled into issues, paginated, and printed.

Nowadays, a lot of these steps are compressed: authors send electronic documents and final-form images, which are converted, typeset and paginated by computers.

Authors also may send supplementary material (for example, large result datasets, or the computer programmes used to analyse them).

However, production still takes time, and journals can have large backlogs of articles to assemble into issues, some of which are still printed and mailed to subscribers.

As a result, some articles may never be printed; they become online-only articles (although they will still appear in printed tables of contents).

To reduce the time between submission and final printing, the output of intermediate steps is made available online: typically the accepted manuscript, the typeset paper before it's assigned to an issue, and the final version as printed.
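
One practical consequence is that integration code has to know which of these versions it is handling. A small sketch, using our own illustrative names for the three versions:

```python
from enum import Enum

class ArticleVersion(Enum):
    ACCEPTED_MANUSCRIPT = 1  # as accepted, before typesetting
    ONLINE_FIRST = 2         # typeset, but not yet assigned to an issue
    VERSION_OF_RECORD = 3    # the final version as printed

def most_mature(available: set[ArticleVersion]) -> ArticleVersion:
    """Pick the most mature version a delivery platform should serve."""
    return max(available, key=lambda v: v.value)

# e.g. while an article waits in the issue backlog:
assert most_mature({ArticleVersion.ACCEPTED_MANUSCRIPT,
                    ArticleVersion.ONLINE_FIRST}) is ArticleVersion.ONLINE_FIRST
```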

Funding

The world’s leading research funding organizations increasingly mandate that publications they fund be made freely available to the public (Open Access).

Consequently, traditional publishers must keep records of whether individual articles should be open access, and, for audit purposes, which funders and grants apply.
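
In data terms, that is a per-article record linking an open-access flag to the funders and grants behind it. A minimal sketch, with illustrative field names and one hypothetical consistency rule for audits:

```python
from dataclasses import dataclass, field

@dataclass
class Grant:
    funder: str        # e.g. "US National Institutes of Health"
    grant_number: str

@dataclass
class FundingRecord:
    article_doi: str
    open_access: bool                     # should the article be free to read?
    grants: list[Grant] = field(default_factory=list)

    def audit_ready(self) -> bool:
        """Illustrative rule: a mandated open-access article should have
        at least one funder and grant on file for audit purposes."""
        return not self.open_access or bool(self.grants)
```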

Repository deposit

Most academic publishers are required to deposit some data with external repositories.

The most obvious one of these is CrossRef, the publishing industry consortium which administers the DOI system. CrossRef members must deposit metadata for everything to which they assign a DOI.

This is often individual articles, but some publishers might assign DOIs to issues or volumes.

Certain funders might require that all articles they have funded be deposited in repositories.

A prime example of this is that any paper generated by funding from the US National Institutes of Health (NIH) must be deposited with the PubMed Central repository.
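
Whatever the target repository, the deposit step means pulling identifier, journal, and contributor data together into a single record. The sketch below shows the general shape only; real CrossRef deposits are XML conforming to their deposit schema, and these field names are our own illustration:

```python
def deposit_record(article: dict) -> dict:
    """Map internal article fields onto the shape a metadata deposit needs."""
    return {
        "doi": article["doi"],
        "journal_issns": [article["print_issn"], article["electronic_issn"]],
        "title": article["title"],
        "contributors": [
            {"name": a["name"], "orcid": a.get("orcid")}
            for a in article["authors"]
        ],
        "publication_year": article["publication_year"],
    }

example = deposit_record({
    "doi": "10.1234/example.5678",   # illustrative values throughout
    "print_issn": "1234-5678",
    "electronic_issn": "2345-678X",
    "title": "An Example Article",
    "authors": [{"name": "A. Author", "orcid": "0000-0002-1825-0097"}],
    "publication_year": 2024,
})
```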

Other feeds

Some customers (intermediaries, or large corporates) buy article data outright rather than buying licenses. The data is delivered to them periodically so they can populate their own delivery platforms, which often aggregate content from multiple publishers.

Licenses

There are a bewildering number of ways in which content can be licensed. These are some of the most common (a code sketch for checking them follows the list):

  • The content might be freely available because of an Open Access mandate of the type discussed above.
  • You might buy a personal subscription to a journal, or you might join a society that gives a journal subscription as a benefit of membership. This sort of license usually lets you read any of the articles in the journal for one year and then expires (unless you buy another).
  • You might only want to read a single paper, so you might buy a Pay-Per-View (PPV) license which lets you read that article for a limited time. It might let you download a copy of the article, but increasingly it does not.
  • Institutional libraries don’t get either of these sorts of license. They pay extra for what is called Perpetual Access. This gives them permanent access to the articles published in the year for which they have paid even after the license has expired.
  • At institutions that don’t have complete subscription coverage, librarians might buy tokens which can be used to grant (usually time-limited) access to individual articles.
  • The final common type is not really a license. The publisher may decide to make articles freely available. This is usually for marketing reasons (often an article has been widely reported in the media).
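
Here is the promised sketch of an entitlement check across these license types. The License shape and the rules are deliberately simplified; real entitlement engines are considerably more involved:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class License:
    kind: str                        # "open_access", "subscription", "ppv",
                                     # "perpetual", "token" or "free_promo"
    start: date | None = None
    end: date | None = None          # expiry for subscriptions, PPV and tokens
    covered_year: int | None = None  # publication year bought under perpetual access

def entitled(lic: License, pub_year: int, today: date) -> bool:
    if lic.kind in ("open_access", "free_promo"):
        return True                            # free to everyone
    if lic.kind == "perpetual":
        return lic.covered_year == pub_year    # survives the license term
    # subscriptions, PPV and tokens all lapse outside their dates
    return (lic.start is not None and lic.end is not None
            and lic.start <= today <= lic.end)
```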

Customers

Everybody who buys a license is a customer.

You or I, buying an individual PPV license or a yearly subscription, might be identified by our names and email addresses.

Institutional customers have groups of librarians who might not be the people responsible for paying the license fee but who nevertheless need to administer the account: auditing subscriptions, and spending tokens to give users access to non-subscribed articles.

Perhaps most importantly, institutions often insist that their users have seamless access when they are working on the institution’s networks or via a remote access proxy.

Network access relies on the delivery platform knowing the institution’s IP addresses (which can be an arbitrarily complex set of ranges, some overlapping, some not).
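
Python's standard ipaddress module handles this kind of matching directly, overlapping ranges included, since each range is tested independently. A sketch using example-only addresses:

```python
from ipaddress import ip_address, ip_network

# Example-only ranges (documentation addresses); real institutions may
# register arbitrarily many, including overlapping ones.
INSTITUTION_RANGES = [
    ip_network("192.0.2.0/24"),
    ip_network("198.51.100.0/25"),
    ip_network("198.51.100.64/26"),  # overlaps the /25 above; harmless
]

def on_institution_network(ip: str) -> bool:
    """True if the request IP falls inside any registered range."""
    addr = ip_address(ip)
    return any(addr in net for net in INSTITUTION_RANGES)
```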

Proxy access relies on the platform knowing details about the proxy platform.

Moreover, institutions sometimes insist that their branding is displayed to their users when they access articles. This typically comprises text, images and links.

Users

Users are those people who read articles.

They might be accessing seamlessly on a university network, or they might have had to log in to access personal subscriptions.

Conclusion

The data and content for all these items is typically stored in disparate systems.

Content, in all its various production stages, is stored in a Content Management System.

The customer, subscription, and license data will be in a relational database, often controlled by a financial system; IP address ranges for seamless access, customer branding, and user details may well be stored in a different relational database.

In order to deliver correctly branded pages to the users entitled to read them, all this data must come together.
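
In code, that "coming together" is ultimately a join across systems. A toy sketch, with every name and structure illustrative:

```python
def assemble_page(doi: str, user_ip: str, cms: dict,
                  ip_to_institution: dict, entitlements: set,
                  branding: dict) -> dict:
    """Join content, entitlement and branding into one delivery record."""
    institution = ip_to_institution.get(user_ip)       # access database
    allowed = (institution, doi) in entitlements       # license database
    article = cms[doi]                                 # content management system
    return {
        "title": article["title"],
        "body": article["body"] if allowed else None,  # abstract-only fallback elided
        "branding": branding.get(institution, {}),     # institution's text, images, links
    }
```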

Datavid’s consultants have the expertise to guide you through the details of integration and delivery.


Frequently asked questions

What is an example of data integration?

An example of data integration is combining data from multiple sources, such as databases, spreadsheets, and external APIs, into a unified and cohesive view.

Is ETL a data integration process?

Yes, ETL (Extract, Transform, Load) is a commonly used data integration process. It involves extracting data from various sources, transforming and modifying the data to meet the target system's requirements, and loading it into a destination system or data warehouse.
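
As a toy illustration of that pattern, with in-memory lists standing in for real source and target systems:

```python
def extract(source_rows: list[dict]) -> list[dict]:
    return list(source_rows)                 # pull rows from a source system

def transform(rows: list[dict]) -> list[dict]:
    # reshape to the target's requirements: trim titles, upper-case ISSNs
    return [{"issn": r["issn"].upper(), "title": r["title"].strip()}
            for r in rows]

def load(rows: list[dict], warehouse: list[dict]) -> None:
    warehouse.extend(rows)                   # write into the destination

warehouse: list[dict] = []
load(transform(extract([{"issn": "2345-678x", "title": " An Example "}])),
     warehouse)
```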

What are the three main issues in data integration?

The three main issues in data integration are data quality, data mapping and transformation, and data governance and security.