ACS: Building a safe place to store, search and enrich data
Learn how a large non-profit research organization leveraged its data to accelerate R&D across departments.
About the customer
The American Chemical Society (ACS) is one of the world’s largest non-profit scientific organizations. It supports scientific inquiry in the field of chemistry to improve people’s lives through the transforming power of chemistry.
2500 - 5000
- Improved efficiency and cost savings through better storage infrastructure
- Higher performance with the ability to search and retrieve content
- Data enrichment through data and text mining processes
- Data lakes
- Data architecture
- Data engineering
- Knowledge management
- Scientific research
- Cognitive search
Handling and leveraging a huge volume of data.
ACS aims to sustain scientific research in the field of chemistry by collecting and sharing knowledge.
Over more than 140 years, ACS has accumulated a huge amount of data, including more than 890 books and 2,400 posters.
The biggest challenge for ACS was managing this data, because it had no general storage repository for inserting and storing structured or unstructured content. ACS needed a solution that could ingest content, function as a repository, and process and export data.
The goal was to create a content management system that makes it easy to store, archive, and search for data, and that can also enrich assets.
Developing a content lake.
To meet the storage, archiving and retrieval needs of current and future content, Datavid proposed to build a content lake for structured and unstructured content, both ACS and non-ACS.
Before building the content lake, Datavid documented the technical and business requirements, identified solutions to meet them, selected the best options, and then implemented them.
The idea developed by Datavid was to have a content lake that acts as:
- A suitable “authority” repository solution
- Storage and management of content not produced by GPO content production processes, such as partner network backfiles (e.g., third-party journals), ACS historic archives, magazine backfile, backfile, marketing/sales materials, ACS posters, conference proceedings, video, audio, and other raw data (e.g., research data)
- Query, search, and processing of non-production content such as supporting ad hoc content analysis requests or bulk data preparation and transformation
- Handling of future, as yet unknown types of content, such as binary content files that need indexing and retrieval in support of future ACS product or service offerings
- A “preservation” repository: an archive for published versions of materials, important non-published milestone versions of materials (initial, post-conversion, post-TE, post-corrections, etc.), and other milestone content versions that may be required for business analysis, TDM, or other content delivery needs
- A platform for expanding the capabilities of the ACS Publications business: developing new products, performing content research and analysis, and meeting efficiency and cost containment goals
- A way to address the business capacity for non-journal, non-book content (posters, partner network backfiles)
The business uses of this content lake include:
- a preservation repository for ACS and non-ACS, non-WIP journal and book content,
- standardized storage and retrieval of structured and unstructured non-journal content,
- a source for future text and data mining (TDM).
The objective of this repository was not only to store data as is, but also to transform it into JATS/TDM and other formats and make it available to downstream product applications.
The content lake provides functionality such as ingestion and updating, content management, asset enrichment, linking, and mass export to TDM.
In particular, the ACS content lake has these features:
- Ingestion: Fast loading of XML & Binaries of any schema/DTD.
- Search: Powerful full-text search that uses facets to aid discoverability, with a published search grammar for constraining queries to common fields.
- Document Library Services: MarkLogic content management tools for authors/editors.
- TDM: Lets business users define and test TDM queries and save them for later.
- REST: All functionality has an associated REST interface for integration into other applications - a content hub approach.
- Data Hub Framework: It utilizes DHF conventions to allow future use of DHF features.
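To make the search feature above more concrete, here is a minimal in-memory sketch of full-text search with facet counts and a field constraint. It is only an illustration of the idea, not the ACS implementation or the MarkLogic API: the sample documents, the field names (`type`, `year`), and the helper functions `search` and `facets` are all hypothetical.

```python
from collections import Counter

# Hypothetical sample records; URIs and field names are illustrative only.
DOCS = [
    {"uri": "/posters/p1.xml", "type": "poster", "year": 2019,
     "text": "catalysis of polymer synthesis"},
    {"uri": "/books/b1.xml", "type": "book", "year": 2019,
     "text": "polymer chemistry handbook"},
    {"uri": "/books/b2.xml", "type": "book", "year": 2021,
     "text": "organic synthesis methods"},
]


def search(docs, query, constraints=None):
    """Return docs that contain every query term and match all field constraints."""
    terms = query.lower().split()
    constraints = constraints or {}
    hits = []
    for doc in docs:
        words = set(doc["text"].lower().split())
        has_terms = all(term in words for term in terms)
        matches_fields = all(doc.get(f) == v for f, v in constraints.items())
        if has_terms and matches_fields:
            hits.append(doc)
    return hits


def facets(hits, fields):
    """Count how many hits fall under each value of each facet field."""
    return {f: dict(Counter(doc[f] for doc in hits)) for f in fields}


# Full-text search, then facet counts a user could click on to narrow results.
results = search(DOCS, "polymer")
facet_counts = facets(results, ["type", "year"])

# Constraining the same query to a common field, as a search grammar might allow.
narrowed = search(DOCS, "polymer", {"type": "book"})
```

A production system would rely on the database's own indexes rather than scanning documents, but the shape of the result is the same: a list of hits plus per-field counts that help users discover and narrow content.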
Enhanced performance, efficiency, availability and capacity.
Datavid built a content lake tailored to ACS's needs, which delivered results and benefits such as:
Higher performance:
- Rapid search and retrieval of content (based on taxonomy, metadata, version, and where possible the full text of the content files)
- Ability to scale up from initial implementation needs
- TDM extract benchmark of 0.324834 seconds per megabyte and 0.11727902 seconds per file (average)
- Transformation of 1 document per second or faster (e.g., removing the body of the XML, watermarking)
- Efficient storage, processing, search, and retrieval of large volumes of structured and unstructured content
- Reduction of the overall costs for labor, licensing, storage, management, infrastructure, and discovery of content
- Tag content as limited / restricted use/access
- Ingestion, splitting, refinement, and evolution of content
- Leave original content intact during the first ingestion
- Quickly ingest and retrieve large amounts of data
- Index on the elements to query against to meet required performance needs
- Configurable taxonomies and metadata for tagging, labeling, and organizing content (for TDM, posters, partner network, e-learning)
- Access client-side APIs with configurable UI for future needs
- Handle efficient bulk operations, including transformations of structured content and exporting for bulk delivery.
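As a rough worked example, the TDM extract benchmark figures quoted in the list above can be turned into back-of-the-envelope estimates. The sketch below assumes extract time scales linearly with volume, which the case study does not state, and the workload sizes (10,000 files, 1,000 MB) are hypothetical.

```python
# Average benchmark figures quoted in the case study.
SECONDS_PER_MB = 0.324834
SECONDS_PER_FILE = 0.11727902


def estimate_seconds_by_size(total_mb):
    """Estimated extract time for a given volume, assuming linear scaling."""
    return total_mb * SECONDS_PER_MB


def estimate_seconds_by_count(file_count):
    """Estimated extract time for a given number of files, assuming linear scaling."""
    return file_count * SECONDS_PER_FILE


# Dividing the two averages gives the implied average file size in megabytes.
implied_avg_file_mb = SECONDS_PER_FILE / SECONDS_PER_MB

# Hypothetical workloads, chosen only for illustration.
minutes_for_10k_files = estimate_seconds_by_count(10_000) / 60
minutes_for_1000_mb = estimate_seconds_by_size(1_000) / 60
```

Under these assumptions, the two averages imply a typical file of roughly a third of a megabyte, and an extract of 10,000 files would take on the order of 20 minutes.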
Datavid built a single place where ACS can archive, manage, and process all its data and leverage it for better performance.
The work is ongoing, and these results are being maintained over time.