How to implement a data ingestion pipeline correctly 
Data ingestion is a complex process that requires multiple steps. Here's how you can implement a large-scale ingestion pipeline correctly.
The purpose of a data ingestion pipeline is to continuously move data—either batched or streaming—from various sources (such as databases) to a target destination (such as a data lake), making it available to the business for further processing and analysis.
But the data ingestion process is complex—it demands a clear vision of the business goals you want to reach and requires technical skills to achieve a pipeline architecture that works in your favor.
Developing your own pipeline from scratch presents a lot of unknowns:
- Real-time or batch processing?
- ETL tools or modern ingestion tools?
- Cloud services or on-premise?
To lower the operational overhead that modern data pipelines demand, we’ve put together a guide with all the best practices we’ve learned over the years.
Understanding data ingestion pipelines
A “data pipeline” is a set of steps that data has to go through to reach its destination. “Ingestion” refers to the funneling of data from multiple sources into a single destination.
During its journey through the pipeline, your data undergoes complex transformations to fit the needs of your business, such as changing its schema or redacting information that’s not needed.
The outcome is based on your requirements as a business.
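As a rough illustration of the two transformations mentioned above (changing the schema and redacting unneeded information), here is a minimal Python sketch. The field names and the `FIELD_RENAMES` and `REDACTED_FIELDS` mappings are hypothetical and stand in for whatever your own sources contain.

```python
from typing import Any

# Hypothetical mapping from source field names to the target schema.
FIELD_RENAMES = {"cust_name": "customer_name", "amt": "amount"}
# Hypothetical fields the business does not need downstream.
REDACTED_FIELDS = {"ssn", "internal_notes"}


def transform(record: dict[str, Any]) -> dict[str, Any]:
    """Rename fields to the target schema and drop redacted ones."""
    result = {}
    for key, value in record.items():
        if key in REDACTED_FIELDS:
            continue  # redact: the field never reaches the destination
        result[FIELD_RENAMES.get(key, key)] = value
    return result


source_record = {"cust_name": "Ada", "amt": 42.5, "ssn": "000-00-0000"}
transformed = transform(source_record)
print(transformed)
```

Real pipelines usually drive these rules from configuration rather than hard-coded constants, but the shape of the step is the same.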
In fact, data ingestion (particularly batch ingestion) is traditionally linked to the specifications-driven ETL process:
- Extract – Identifying all sources and extracting data from them
- Transform – Ensuring that all data is transformed into a usable format
- Load – Making the data available to various business units for further analysis
However, with the advent of real-time ingestion tools and an ever-faster business landscape, ETL has fallen out of favor against its close cousin:
» ELT (Extract, Load, Transform)
The idea with ELT is simple:
To make data available to the business immediately, and only transform it as needed.
This prevents a mismatch between business goals and technical constraints, because you no longer have to go back in time to recover data that was never loaded.
It also significantly reduces time to value for data-intensive projects.
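To make the ELT idea concrete, here is a small sketch that loads records untouched and only extracts fields when they are asked for. It assumes an in-memory SQLite database as a stand-in for the target destination; the `raw_events` table, the sample orders, and the `read_field` helper are all hypothetical.

```python
import json
import sqlite3

# Load: store every source record untouched, as raw JSON.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (payload TEXT NOT NULL)")

source_records = [
    {"order_id": 1, "total": 20.0, "coupon": "SPRING"},
    {"order_id": 2, "total": 35.5, "coupon": None},
]
conn.executemany(
    "INSERT INTO raw_events (payload) VALUES (?)",
    [(json.dumps(r),) for r in source_records],
)


def read_field(field: str) -> list:
    """Transform on demand: pull one field out of the raw payloads."""
    rows = conn.execute("SELECT payload FROM raw_events ORDER BY rowid")
    return [json.loads(payload).get(field) for (payload,) in rows]


# The first report only needed totals...
totals = read_field("total")
# ...but a later requirement for coupons needs no re-extraction,
# because the raw records were loaded in full.
coupons = read_field("coupon")
```

Because nothing was discarded at load time, the later request for `coupon` is answered from data that is already in the destination.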
Real-time data pipelines vs batch processing
An important thing to consider before making a decision on ETL/ELT is whether your project’s requirements and/or data structures are likely to change on a regular basis.
ETL can work for time-critical applications, but it lacks flexibility.
For example, after developing the pipeline you might realize that your business analytics solution is missing an important data point. If its original source was never integrated in the first place, that data is hard to recover.
Traditional ETL tools will make the process of going back in time tedious. Check out our guide on real-time data ingestion to learn more about which approach is best for you.
Four steps for proper data pipeline development
Implementing your data pipeline correctly involves a number of steps, from identifying your business goals to developing the pipeline itself:
Step #1 – Identifying expected business outcomes
Your data pipeline should be designed against the expected business outcomes, not the other way around. Expected outcomes can change over time, but establishing a baseline is important. Ask yourself these questions:
- What business outcome do you want to achieve?
- Where does your data reside? Cloud or on-premise?
- How many data sources does your business currently have?
- Which of these are relevant to achieving the business outcome?
- Do you know what information you need to extract beforehand?
- Does quick data availability affect the business outcome?
- How often will data be accessed in its final form?
These seven questions will help narrow down whether traditional ETL is a good approach for you, or whether you need to employ a more flexible ELT pipeline, perhaps with real-time ingestion. In ETL, we refer to this stage as the “blueprinting” stage.
It demands that the business know what data they need to extract beforehand.
If you don’t know, or if you are unsure, then ELT is a better choice.
It allows gathering as much data as possible without dealing with expensive transformations until they are needed by the user.
Step #2 – Designing the pipeline’s architecture
The design stage of your data pipeline is where all the information gathered in step 1 is put to use. Here, a team of data engineers brainstorms an ideal architecture to meet your business requirements. There are many factors at play:
- Which data sources are we working with? (SaaS, databases, files, etc.)
- What are the performance requirements of the pipeline? (the “rate”)
- Which tools are best suited to achieve the expected business outcome?
- How much will it cost to implement and maintain the pipeline?
All of these considerations must be made prior to developing the pipeline.
This is done with the help of a project manager who will look at the financial viability of the project, as well as the timeline.
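One lightweight way to keep these design decisions explicit is to write them down as a structured record before any development starts. The sketch below is only an illustration: the `PipelineDesign` class, its field names, and the validation rules are hypothetical, not part of any specific tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineDesign:
    """Hypothetical record of the design decisions made before development."""
    sources: list[str]                # e.g. SaaS, databases, files
    target: str                       # e.g. a data lake
    mode: str                         # "batch", "streaming", or "hybrid"
    expected_records_per_second: int  # the required "rate"
    monthly_budget: float             # implementation and maintenance ceiling

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means nothing is missing."""
        problems = []
        if not self.sources:
            problems.append("no data sources listed")
        if self.mode not in {"batch", "streaming", "hybrid"}:
            problems.append(f"unknown mode: {self.mode}")
        if self.expected_records_per_second <= 0:
            problems.append("rate must be positive")
        if self.monthly_budget <= 0:
            problems.append("budget must be positive")
        return problems


design = PipelineDesign(
    sources=["crm_database", "billing_files"],
    target="data_lake",
    mode="hybrid",
    expected_records_per_second=500,
    monthly_budget=2000.0,
)
print(design.validate())
```

A record like this gives the project manager and the engineers a single artifact to review for cost and timeline.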
Step #3 – Pipeline development: Ingestion tools and techniques
Once the necessary considerations are made, the development stage looks at the technical details behind implementing your data ingestion pipeline.
This is often where most businesses want to start.
Image: MarkLogic Content Pump (MLCP) is an example of a data ingestion tool.
Our recommendation is to take the time to identify the objective on paper first as it helps tremendously with giving developers an exact vision of what needs to be built.
There are a variety of ingestion tools today:
- Hevo – Recommended data ingestion tool
- MarkLogic Content Pump
- Amazon Kinesis
- Apache Kafka
Depending on your business requirements, you may want to use one or the other.
There is no “silver bullet” when it comes to picking data ingestion tools.
What does matter is the technique used: batched, streaming, or a mix of both. Each tool is best suited to a specific technique.
For example, Amazon Kinesis is great at handling streaming data.
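The difference between the two techniques can be sketched in a few lines of plain Python. This is a tool-agnostic illustration, not how Kinesis or any other product works internally; the function names `ingest_batch` and `ingest_stream` are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator


def ingest_batch(records: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Batch ingestion: group records and hand them over in chunks."""
    iterator = iter(records)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def ingest_stream(records: Iterable[dict]) -> Iterator[dict]:
    """Streaming ingestion: hand over each record as soon as it arrives."""
    for record in records:
        yield record


events = [{"id": i} for i in range(5)]
batches = list(ingest_batch(events, batch_size=2))
streamed = list(ingest_stream(events))
```

A hybrid pipeline would typically stream time-sensitive sources and batch the rest.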
Other technical details to consider are:
- Reliability and fault tolerance
- Response time (latency)
- Auditing and logging
- Data quality
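The four concerns above often show up together in a single write path. The following sketch assumes a hypothetical `write` callable that raises `ConnectionError` on transient failures, and a hypothetical quality rule that every record needs a non-empty `id`.

```python
import logging
import time
from typing import Callable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingestion")  # hypothetical logger name


def is_valid(record: dict) -> bool:
    """Data quality: a hypothetical rule requiring a non-empty 'id'."""
    return bool(record.get("id"))


def ingest_with_retry(record: dict, write: Callable[[dict], None],
                      attempts: int = 3, delay: float = 0.0) -> bool:
    """Write one record, retrying on failure and logging every outcome."""
    if not is_valid(record):
        log.warning("rejected invalid record: %r", record)
        return False
    for attempt in range(1, attempts + 1):
        started = time.perf_counter()
        try:
            write(record)
        except ConnectionError as exc:
            # Fault tolerance: log the failure, wait, and try again.
            log.warning("attempt %d failed: %s", attempt, exc)
            time.sleep(delay)
        else:
            # Auditing and latency: record what was written and how long it took.
            log.info("ingested %r in %.4fs", record, time.perf_counter() - started)
            return True
    log.error("gave up on %r after %d attempts", record, attempts)
    return False
```

In production you would likely add backoff between attempts and route rejected records to a separate store for inspection.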
When it comes to development, data ingestion is a vast topic.
So you shouldn’t underestimate the importance of a developer’s experience in the field and in the specific tooling required.
Step #4 – Data transformations and the user interface
Finally, we get to deliver value to the business in the form that was originally envisioned. In a modern ELT pipeline, data transformations happen on demand as the user requests the information they want.
This puts less strain on the pipeline, making it much more efficient.
The only tradeoff to make here is slightly longer processing times for the user, which is offset by the architectural savings.
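A simple way to picture on-demand transformation is a function that aggregates raw records only when a user asks, and remembers the answer for next time. The sample orders and the `revenue_for` function below are hypothetical; the caching uses Python's standard `functools.lru_cache`.

```python
from functools import lru_cache

# Raw records as they were loaded; hypothetical sample data.
RAW_ORDERS = (
    {"region": "EU", "total": 20.0},
    {"region": "US", "total": 35.5},
    {"region": "EU", "total": 10.0},
)


@lru_cache(maxsize=None)
def revenue_for(region: str) -> float:
    """Aggregate raw orders only when a user asks for a region."""
    return sum((o["total"] for o in RAW_ORDERS if o["region"] == region), 0.0)


eu_revenue = revenue_for("EU")  # computed on first request
eu_again = revenue_for("EU")    # served from the cache
```

The first request pays the processing cost; repeated requests do not, which softens the tradeoff described above. A real system would also need to invalidate cached results when new raw data arrives.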
Transformations can be of all kinds, and they depend entirely on your business use case.
Both structured data (e.g. data stored in relational databases) and unstructured data (e.g. documents in a file storage system) can be processed at this stage.
A powerful way to extract value from unstructured data is to identify the concepts that are most relevant to your business and surface their relationships.
For example, an enterprise in the life sciences field could better assess the correlation between the chemicals in a drug and positive health benefits.
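As a very simplified sketch of surfacing relationships between concepts, the code below counts how often pairs of known concepts are mentioned in the same document. The concept vocabulary and the sample sentences are invented for illustration, and co-occurrence counts are only a crude starting point, not a measure of correlation or causation.

```python
from collections import Counter
from itertools import combinations

# Hypothetical concept vocabulary and documents, for illustration only.
CONCEPTS = {"aspirin", "ibuprofen", "inflammation", "headache"}
documents = [
    "Aspirin reduced inflammation in the trial group.",
    "Ibuprofen is often taken for headache and inflammation.",
    "Patients reported headache relief after aspirin.",
]


def concepts_in(text: str) -> set[str]:
    """Return the known concepts mentioned in a piece of text."""
    words = {w.strip(".,;:").lower() for w in text.split()}
    return words & CONCEPTS


def co_occurrences(docs: list[str]) -> Counter:
    """Count how often each pair of concepts appears in the same document."""
    pairs = Counter()
    for doc in docs:
        for pair in combinations(sorted(concepts_in(doc)), 2):
            pairs[pair] += 1
    return pairs


relationships = co_occurrences(documents)
```

Production systems typically replace the word matching here with entity extraction against a curated ontology, but the idea of linking concepts that appear together is the same.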
Making the most of your data ingestion pipeline
To benefit from modern data pipelines, it’s important to use the ingestion tools that best fit your business requirements.
The business goal and architectural phases are key to identifying this.
Not all software tools will fit the specific needs of your business.
It’s important to focus on why they are used.
A data ingestion tool like Datavid Rover allows you to focus exclusively on your business goals, removing all operational overhead and offering a simple user interface.
It works by implementing the same principles discussed in this guide but is available for use in a matter of days instead of months.
Whether you need batch processing or real-time analytics, Datavid Rover works to fit your requirements, without getting in the way.
Get a free consultation today to learn more.
FREQUENTLY ASKED QUESTIONS
Data ingestion refers to the process of moving data from multiple sources (such as databases) to a target destination (such as a data lake), with the end goal of making the data available for further processing. Some use cases are business analytics, cross-department search, and more.
To create a data ingestion pipeline, you have to first identify what business outcome you’re looking to achieve, design the pipeline’s architecture to achieve those goals, develop it using tools such as MarkLogic or Hadoop, and finally transform the data for use in a business application.
Four things to think about in data ingestion are: 1) which data sources you need to extract data from; 2) what the performance requirements are; 3) which tools to use based on your choice of ingestion technique; and 4) how much it will cost to develop and maintain the pipeline.
The three most common data ingestion techniques are: 1) batch ingestion, where data is ingested on a set schedule; 2) streaming ingestion, where data is continuously ingested in small increments; and 3) a combination of both batch and streaming ingestion for hybrid data pipelines.