Tens of millions of documents & no centralized search
Our client is one of the world’s leading agricultural companies.
They employ more than 49,000 people in over 100 countries. This global reach enables them to conduct thorough research, but it also brings many challenges.
As their R&D function globalized, our client realized that the volume of research documents was growing while their ability to learn from them was shrinking.
The client faced several problems:
- Documents dated back over 70 years and yet were still relevant;
- Information was stored in multiple distributed and diverse data silos;
- With no centralized way to search through their libraries, discovering pre-existing research was becoming a difficult task.
Each time they began a new project, they spent time and money searching multiple silos for relevant documents. Standard text search returned unfocused results, so employees had to do real detective work to find information.
This resulted in the research phase becoming longer and more difficult.
Often, studies were duplicated by accident.
This meant more time and money spent on research projects when the results were already known somewhere in their library.
A solution had to be found. Otherwise, the data silos would continue to grow, making these problems more frequent and duplicate research more costly.
Advanced cognitive search on integrated legacy data
The client decided to create a searchable information asset for the global R&D operation from their legacy documents. They chose Datavid, a niche software company focused on data intelligence solutions, as their partner in this journey.
The first step was to focus on data integration:
- For this, Datavid built components to pull data from different internal file systems (e.g. SharePoint, Documentum) and external data sources (e.g. regulatory websites, EPA).
- The next step was making sure the system could detect and eliminate duplicates across databases.
- The sections listed in each document’s table of contents were also detected and extracted, which helps classify information with greater granularity, reducing search times and increasing the relevance of results.
- The documents were then stored in a MarkLogic database.
- The final step was building the solution that would allow cognitive search across the client’s centralized knowledge base. Tagging and metadata enrichment were implemented to support searches on specific topics.
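The duplicate-detection step in the list above can be illustrated with a minimal sketch. It assumes that duplicates are documents whose text is identical after lowercasing and collapsing whitespace; the source does not describe the actual matching rule, and the document IDs and texts here are hypothetical. It is plain Python, not the MarkLogic-based implementation.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Return a SHA-256 hash of the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(documents: dict[str, str]) -> dict[str, str]:
    """Map each duplicate document ID to the first ID seen with the same fingerprint."""
    first_seen: dict[str, str] = {}
    duplicates: dict[str, str] = {}
    for doc_id, text in documents.items():
        key = fingerprint(text)
        if key in first_seen:
            duplicates[doc_id] = first_seen[key]
        else:
            first_seen[key] = doc_id
    return duplicates


# Hypothetical documents pulled from three different sources.
docs = {
    "sharepoint/report-001": "Field trial results for compound A.",
    "documentum/rpt-77": "Field  trial results\nfor Compound A.",
    "epa/notice-12": "Regulatory notice on compound B.",
}
duplicates = find_duplicates(docs)
```

Exact hashing only catches copies that differ in case or spacing. Detecting near-duplicates, such as a rescanned report, would need a fuzzier technique.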
Synonym search is another part of the project that helps narrow down results and makes the research process more effective.
For instance, a substance may be registered in a database in various forms: under its commercial name, its scientific name, or various product codes.
With Datavid’s synonym search solution, using any of these names or codes as a query will return all instances from the database that refer to the same substance.
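As a rough illustration of the idea, and not Datavid’s actual implementation, the sketch below expands a query to every known name of the same substance before matching. The substance names, the product code, the documents, and the simple substring matching are all assumptions made for the example.

```python
# Hypothetical synonym groups: each set lists the names and codes of one substance.
SYNONYM_GROUPS = [
    {"ExampleBrand", "examplium chloride", "PC-0001"},
]


def build_index(groups):
    """Map every lowercased name to the full set of names for its substance."""
    index = {}
    for group in groups:
        names = frozenset(name.lower() for name in group)
        for name in names:
            index[name] = names
    return index


def search(query, documents, index):
    """Return IDs of documents that mention the query or any of its synonyms."""
    q = query.lower()
    terms = index.get(q, frozenset({q}))
    return [
        doc_id
        for doc_id, text in documents.items()
        if any(term in text.lower() for term in terms)
    ]


# Hypothetical documents referring to the same substance in different ways.
docs = {
    "doc-1": "Efficacy study of ExampleBrand on wheat.",
    "doc-2": "Toxicology of examplium chloride.",
    "doc-3": "Batch PC-0001 stability report.",
    "doc-4": "Unrelated soil survey.",
}
index = build_index(SYNONYM_GROUPS)
hits = search("PC-0001", docs, index)
```

Here a query for the product code also returns the documents that only use the commercial or scientific name. A production system would more likely apply the expansion inside the search engine’s own query layer than in application code.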
The solution also included a user-friendly interface that allowed client staff to focus on their work.
Better discoverability, faster research & fewer expenses
The resulting system allows our client to conduct its research quickly and easily. They now have over 16,132,045 internal and external documents from 22 sources that are searchable, with more being loaded every month.
Now, when starting a new R&D project, searching through internal and external sources takes minutes.
Previously, this task took 2 to 3 weeks on average. Thanks to Datavid’s solution, they no longer need to allocate time and resources to going through all the databases, comparing past results, and compiling everything in a new spreadsheet.
The client experienced several other benefits:
- Duplicates have been either marked as such or completely removed;
- Data is classified into 16 categories, with 30+ types of extracted concepts;
- The client went from having no idea how many documents they had to a single system in which every document can easily be searched.
The cognitive search allows them to do this research in a very short amount of time within one integrated system and interface. The system also has an easy way to export findings and share them with the entire team.
Our client is now saving thousands of dollars on start-up research before every new R&D project. They can now spend their time focusing on their overarching business goals instead of going through tedious documentation.