From disconnected repositories to a unified content lake: Modernizing scientific content

Industry
Publishing
Challenges
ACS struggled with outdated, disconnected content systems that slowed research workflows, made data hard to find, and limited advanced search or analytics. At the business level, this led to underused assets, missed monetization opportunities, and slower product delivery, weakening competitive advantage.
Solution and Results
Datavid partnered with ACS to create a cloud-native Content Lake that unified storage, enabled semantic enrichment, and made decades of diverse content ready for AI, text, and data mining. The transformation delivered faster processing, greater efficiency, improved governance, and a scalable foundation to support future innovation.
Technologies
Progress MarkLogic, REST API, Apache NiFi, AWS Cloud Services, Progress Semaphore
“Datavid has proven to be an invaluable partner, mirroring our organizational culture and values. Their unparalleled expertise and proficiency in the realm of data management have truly set them apart.”
Mitchell Bakos
Director of product development & solutions architecture, ACS

About ACS
The American Chemical Society (ACS) is the world’s largest scientific society and publisher, dedicated to advancing chemistry and improving lives. As a global leader in research and publishing, it connects scientists, educators, and industry professionals around the world.
Setting the Scene
For ACS, content is one of the most valuable assets. However, decades of accumulated knowledge had become difficult to leverage due to fragmented systems, inconsistent formats, and limited discoverability.
With no unified storage repository for structured and unstructured content, ACS faced challenges in:
- Managing large volumes of data across disconnected repositories.
- Supporting modern research workflows with faster, enriched access.
- Unlocking the potential of historical archives for new product development.
ACS recognized that the solution required more than an upgrade. It needed a content lake: a governed, scalable repository capable of storing, enriching, and transforming content for multiple downstream applications.
Who will benefit?
- Chief Data Officers (CDOs): See how legacy repositories can be consolidated into a governed, AI-ready Content Lake that improves operational efficiency, reduces costs, and provides a foundation for future innovation.
- Product Managers: Learn how a modern content platform reduces technical debt, accelerates delivery of content-powered features, and creates a scalable base for competitive digital products.
The Challenges
The organization’s legacy content systems could no longer keep pace with modern R&D and product needs.
ACS had amassed more than 890 books, over 2,400 scientific posters, and large volumes of additional assets such as partner archives, conference videos, marketing materials, and raw research datasets. Many of these were stored in inconsistent formats across disconnected repositories, making them difficult to find, search, and reuse.
Technical challenges
- Outdated technology stack that could not support advanced search, enrichment, or analytics.
- Large content downloads that took excessive time, slowing research workflows and frustrating users.
- Complex data models that made ingestion, linking, and retrieval inefficient.
- Competitive pressure from rivals offering advanced text and data mining products.
Strategic challenges
Beyond the technical barriers, there were business-critical concerns:
- Underutilized historical content that was not easily discoverable or AI-ready.
- Missed monetization opportunities from potential subscription-based, value-added content services.
- Slower product delivery, as teams faced delays developing content-powered features, reducing agility and competitive differentiation.
Addressing these technical and strategic obstacles was essential for ACS to unlock the full value of its content, accelerate innovation, and strengthen its position in an increasingly competitive research landscape.
The Solution
Developing a content lake
Datavid partnered with the American Chemical Society to reimagine its content architecture from the ground up, creating a scalable, cloud-native Content Lake. The platform was designed to serve as a single source of truth for all structured and unstructured content, whether produced by ACS or sourced from external partners.
Collaborative design and requirements gathering
Before building the Content Lake, Datavid worked closely with ACS stakeholders to document technical and business requirements, evaluate options, and select the best approach. The design process ensured that the platform would not only meet immediate needs for storage, archiving, and retrieval, but also provide a flexible foundation for long-term innovation and new digital initiatives.
Core capabilities of the content lake
The content lake acts as both an authoritative repository and a preservation archive for ACS’s vast and diverse scientific content:
- Authoritative repository for content outside of traditional publishing workflows, such as partner network backfiles, ACS historic archives, marketing and sales materials, scientific posters, conference proceedings, videos, audio, and raw research data.
- Preservation archive for published and non-published milestone versions of materials, including those needed for business analysis, text and data mining (TDM), and other delivery needs.
- Future-proof architecture designed to accommodate emerging content types such as binary files that require indexing and retrieval.
What the Content Lake enables
- Standardized storage and retrieval for both structured and unstructured content
- Automated transformation into industry-standard formats such as JATS, enabling text and data mining (TDM) and AI-readiness
- Semantic enrichment and linking using tools like Progress Semaphore to improve discoverability and reuse of content
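To make the transformation step more concrete, here is a minimal sketch of wrapping a simple record in a JATS-style skeleton. It is not the ACS pipeline: the record layout and the `to_jats_skeleton` helper are hypothetical, and the output uses only a handful of common JATS elements (`article`, `front`, `article-meta`, `title-group`, `article-title`, `body`, `sec`, `p`), so it is not schema-valid JATS on its own.

```python
import xml.etree.ElementTree as ET


def to_jats_skeleton(record: dict) -> str:
    """Wrap a flat record in a minimal JATS-style <article> skeleton.

    Assumption: `record` has a "title" and an optional list of
    (heading, text) pairs under "sections". This layout is illustrative only.
    """
    article = ET.Element("article", {"article-type": record.get("type", "other")})
    front = ET.SubElement(article, "front")
    meta = ET.SubElement(front, "article-meta")
    title_group = ET.SubElement(meta, "title-group")
    ET.SubElement(title_group, "article-title").text = record["title"]

    body = ET.SubElement(article, "body")
    for heading, text in record.get("sections", []):
        sec = ET.SubElement(body, "sec")
        ET.SubElement(sec, "title").text = heading
        ET.SubElement(sec, "p").text = text

    # encoding="unicode" returns a str instead of bytes
    return ET.tostring(article, encoding="unicode")


# Hypothetical poster record used purely for illustration
poster = {
    "type": "other",
    "title": "Example poster title",
    "sections": [("Abstract", "Placeholder abstract text.")],
}
jats_xml = to_jats_skeleton(poster)
print(jats_xml)
```

Once content shares a common structure like this, downstream mining and enrichment tools can address titles, sections, and paragraphs the same way across very different source formats.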
Advanced functionalities powered by key technologies
The Content Lake was built on a modern, scalable architecture that combines Progress MarkLogic, Apache NiFi, and AWS cloud services. Progress MarkLogic, a NoSQL database optimized for metadata-rich content, provides powerful full-text and faceted search across large and diverse datasets. Apache NiFi orchestrates ingestion pipelines for XML, PDFs, images, videos, and other formats, enabling rapid and reliable content loading from any schema or DTD. The platform runs on AWS infrastructure, with EC2 hosting core services and ECS Fargate delivering a fully managed, serverless environment to ensure scalability, resilience, and long-term sustainability.
Other capabilities include:
- Document library services within MarkLogic for editorial workflows.
- TDM query tools allowing business users to define, test, and save mining queries.
- REST APIs for seamless integration with downstream applications.
- Data Hub Framework (DHF) conventions to ensure future interoperability and feature expansion.
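As an illustration of what REST-based integration might look like, the sketch below builds request URLs for the `/v1/search` and `/v1/documents` endpoints of MarkLogic's REST Client API. The case study does not say which endpoints ACS exposes, so this is an assumption; the host, port, collection name, and document URI are all hypothetical, and authentication is left out.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumption: hypothetical host and port, not taken from the case study
BASE_URL = "http://localhost:8000"


def search_url(query: str, collection: Optional[str] = None, page_length: int = 10) -> str:
    """Build a full-text search URL against the /v1/search endpoint."""
    params = {"q": query, "format": "json", "pageLength": page_length}
    if collection:
        # Restrict results to one collection (name is hypothetical)
        params["collection"] = collection
    return f"{BASE_URL}/v1/search?{urlencode(params)}"


def document_url(uri: str) -> str:
    """Build a URL that addresses a single document by its database URI."""
    return f"{BASE_URL}/v1/documents?{urlencode({'uri': uri})}"


print(search_url("green chemistry", collection="posters"))
print(document_url("/posters/p-001.xml"))
```

A downstream application would send these requests with its own HTTP client and credentials; keeping URL construction in small functions makes the integration easy to test without a running server.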
Business alignment
Datavid’s approach was not only technical but also strategic. The Content Lake was designed to align with ACS’s long-term vision of making decades of scientific content more accessible, discoverable, and reusable. By creating a governed, scalable foundation, the platform enabled ACS to preserve valuable assets, improve internal efficiency, and prepare its content for emerging technologies such as text and data mining and AI. This alignment ensured the Content Lake could serve both immediate operational needs and future innovation initiatives.
The Outcomes
The transformation delivered clear, measurable value within months, boosting processing speed, efficiency, scalability, and governance.
Faster access to critical content
Researchers can now search and retrieve content in seconds using taxonomy, metadata, version history, or full-text search.
- Preparation of 2,400+ scientific posters and roughly 890 books was reduced from hours to minutes.
- Document transformation tasks, such as watermarking and XML cleanup, now run at one per second or faster.
- Text and data mining (TDM) extractions run consistently in under a second per file and per megabyte, averaging 0.324834 seconds per megabyte and 0.11727902 seconds per file.
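To put the TDM averages in perspective, the following back-of-the-envelope sketch estimates how long an extraction run might take. It assumes the reported averages scale linearly with volume, which the case study does not claim, and it treats the per-file and per-megabyte figures as two separate estimates rather than adding them together. The 1,000 MB batch size is hypothetical.

```python
# Averages reported in the case study
SECONDS_PER_MB = 0.324834
SECONDS_PER_FILE = 0.11727902


def estimate_by_size(total_megabytes: float) -> float:
    """Estimated extraction time in seconds, based on data volume."""
    return total_megabytes * SECONDS_PER_MB


def estimate_by_count(file_count: int) -> float:
    """Estimated extraction time in seconds, based on number of files."""
    return file_count * SECONDS_PER_FILE


# 2,400 files matches the poster count; 1,000 MB is a hypothetical batch size
print(f"{estimate_by_count(2400):.1f} s for 2,400 files")
print(f"{estimate_by_size(1000):.1f} s for 1,000 MB")
```

Under these assumptions, a batch the size of the poster collection would take under five minutes by the per-file average.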
Scalable for future growth
The Content Lake was designed to expand well beyond the initial scope, supporting new content types and growing volumes without the need for re-architecture. Flexible infrastructure ensures the platform can evolve with business priorities and emerging technologies.
Operational efficiency and cost savings
By moving to AWS, ACS achieved:
- a 50% increase in data processing speed
- a 30% reduction in storage and management costs
These gains freed up resources for innovation.
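A quick worked example shows what these percentages mean in practice. The baseline figures below are hypothetical and chosen only to make the arithmetic easy to follow. Note that a 50% increase in speed does not halve processing time: the same workload takes 1 / 1.5, or about two thirds, of the original time.

```python
def time_after_speedup(baseline_time: float, speed_increase: float) -> float:
    """Time for the same workload when throughput rises by `speed_increase` (0.5 = 50%)."""
    return baseline_time / (1.0 + speed_increase)


def cost_after_reduction(baseline_cost: float, reduction: float) -> float:
    """Cost after a fractional reduction (0.3 = 30%)."""
    return baseline_cost * (1.0 - reduction)


# Hypothetical baselines, not figures from the case study
baseline_minutes = 60.0
baseline_cost_units = 100.0

new_minutes = time_after_speedup(baseline_minutes, 0.50)
new_cost = cost_after_reduction(baseline_cost_units, 0.30)
print(f"{baseline_minutes:.0f} min job now takes about {new_minutes:.0f} min")
print(f"{baseline_cost_units:.0f} cost units drop to about {new_cost:.0f}")
```

With these baselines, a 60-minute job finishes in about 40 minutes, and every 100 units of storage and management spend falls to about 70.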
Ingestion pipelines preserve original content to maintain integrity while enabling ongoing enrichment and refinement.
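One common way to preserve originals while still allowing enrichment is to store the source bytes unchanged alongside a checksum and keep annotations in a separate, mutable layer. The sketch below illustrates that pattern only; it is not a description of how the ACS pipelines work, and the class names, fields, and example URI are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class OriginalAsset:
    """Immutable copy of the source content plus its checksum."""
    uri: str
    payload: bytes
    sha256: str


@dataclass
class EnrichedRecord:
    """Mutable enrichment layer that sits beside the untouched original."""
    original: OriginalAsset
    annotations: dict = field(default_factory=dict)


def ingest(uri: str, payload: bytes) -> EnrichedRecord:
    """Store the original bytes as-is and record a SHA-256 digest."""
    digest = hashlib.sha256(payload).hexdigest()
    return EnrichedRecord(OriginalAsset(uri, payload, digest))


def verify(record: EnrichedRecord) -> bool:
    """Check that the stored original still matches its recorded digest."""
    return hashlib.sha256(record.original.payload).hexdigest() == record.original.sha256


# Hypothetical asset used purely for illustration
record = ingest("/posters/example.xml", b"<poster>original bytes</poster>")
record.annotations["topics"] = ["catalysis"]  # enrichment never touches the original
print(verify(record))
```

Because enrichment is written to `annotations` and the original is frozen, the taxonomy or tagging rules can be refined and rerun later without risk to the source content.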
Improved governance and product enablement
Role-based tagging and configurable taxonomies strengthened compliance, controlled access, and ensured content could be organized for multiple use cases. Bulk operations for transformation and export streamlined large-scale licensing and delivery, while REST APIs allowed seamless integration with downstream applications.
Business impact
By creating a scalable Content Lake, ACS transformed decades of dispersed, underutilized content into a governed and AI-ready foundation. This shift was more than a technology upgrade: it provided the strategic infrastructure needed to preserve valuable assets, improve efficiency, and prepare for future innovation in research and publishing.