Debugging intermittent data discrepancies in enterprise data pipelines
One report out of 100+ showed intermittent data discrepancies. See the extraction-time cause and how to prevent non-deterministic outputs.
Intermittent discrepancies are among the most dangerous failure modes in enterprise environments. They do not break every run. They do not trigger consistent validation errors. And they often disappear the moment a team attempts to reproduce them.
That unpredictability makes them especially high risk in financial and regulated reporting contexts.
This article walks through a real-world scenario we encountered on a production data extraction pipeline. A financial amount field occasionally appeared inconsistent in downstream extracts, but only for one report out of more than a hundred delivered daily. The issue was rare, non-deterministic, and difficult to reproduce: a combination that creates high risk with low visibility.
In financial reporting environments, even occasional unexplained discrepancies can trigger audit scrutiny and erode stakeholder confidence.
The good news is that problems like this are solvable. The even better news is that when you solve them properly, you end up improving the overall reliability of your data platform, not just “fixing a bug”.
The scenario: one report, one field, and a problem that would not stay put
In production, the team observed intermittent discrepancies in a financial amount field in extracted data. The pipeline looked correct on the surface, and it was operating at scale, generating more than 100 reports per day, with only one report showing the problem.
Two aspects made the issue especially challenging:
- Rarity: It occurred very infrequently, making it hard to capture and replay.
- Non-determinism: It did not present a consistent failure pattern, which meant traditional debugging approaches led to dead ends.
In a financial context, even small discrepancies can undermine trust. The broader concern is not just “is this field wrong today?” but “can we rely on this platform to produce decision-grade outputs consistently?”
Investigation approach: do not assume the ETL layer is the culprit
In modern data stacks, ETL and orchestration are often blamed first because they are visible and easy to instrument. In this case, early evidence suggested the downstream flows were behaving correctly, and the anomaly was present before the data even entered harmonisation and ETL processing.
So, we focused on building an end-to-end view of the extraction path:
- Attempting to reproduce the issue in a controlled environment
- Reviewing the extraction queries and pipeline configuration
- Examining the interaction of joins, ordering, and how values materialise during extraction
- Running targeted regression tests to isolate which combinations of query behaviour and data shape triggered the discrepancy
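The reproduction step can be sketched as a determinism probe: run the same extraction over every ordering of the input and compare fingerprints of the output. The `extract` function below is a toy stand-in, not the real pipeline; it mimics a join with no total ordering, where the “first matching” row wins.

```python
import hashlib
import itertools
import json

def extract(rows):
    # Toy stand-in for an extraction query: for each key, the first
    # matching row wins, so the input order leaks into the output.
    by_key = {}
    for row in rows:
        by_key.setdefault(row["id"], row["amount"])
    return json.dumps(by_key, sort_keys=True)

# Two rows that collide on the join key but differ at float precision.
rows = [{"id": 1, "amount": 10.0}, {"id": 1, "amount": 10.00001}]

# Fingerprint the extract under every input ordering.
digests = {hashlib.sha256(extract(list(p)).encode()).hexdigest()
           for p in itertools.permutations(rows)}
print(len(digests))  # 2 => the extract output depends on row order
```

More than one distinct digest means the extraction is non-deterministic for that data shape, which is exactly the kind of evidence that redirects the investigation upstream of ETL.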
This is a key lesson for enterprise data teams. If a downstream system receives data that appears valid, it may not throw errors; it will simply process what it is given. You can end up debugging the wrong layer for weeks.
Root cause: subtle interactions between joins, ordering, and numeric serialisation
The issue turned out to be upstream, at the extraction layer.
Under specific conditions, the combination of:
- complex joins
- result ordering
- and numeric serialisation behaviour during extraction
led to non-deterministic representation of the numeric values for that field.
Because the discrepancy occurred before the ETL stage, downstream pipelines received outputs that were structurally valid and schema-compliant. Validation checks therefore passed, even though semantic consistency was occasionally compromised.
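A minimal illustration of how ordering alone can change a serialised numeric value while every individual input stays identical, using plain Python floats as a stand-in for the extraction engine's numeric handling:

```python
import json

# Floating-point addition is not associative, so aggregating the same
# rows in a different order can produce a different serialised value.
values = [0.1, 0.2, 0.3]
forward = sum(values)            # ((0.1 + 0.2) + 0.3)
reverse = sum(reversed(values))  # ((0.3 + 0.2) + 0.1)

print(json.dumps(forward))  # 0.6000000000000001
print(json.dumps(reverse))  # 0.6
print(forward == reverse)   # False
```

Both outputs are valid numbers and pass any schema check, yet a consumer comparing today's extract against yesterday's sees a discrepancy whenever the execution order happens to differ.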
That is why it was hard to detect through ETL checks alone, and why the issue appeared to “come and go” in a way that felt random.
This also explained why it impacted only one report. Shared infrastructure does not mean shared execution characteristics. A single report can have unique join patterns, data distributions, or ordering assumptions that expose edge cases.
The fix: enforce determinism at query time
Once the root cause was validated, the solution was precise and effective.
We enforced deterministic behaviour at the source by casting the numeric field to a string during the Optic query select. This prevented unexpected numeric transformations and ensured consistent representation of the value in extracts.
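The actual fix lives in the Optic select, but the idea translates to any stack. As a hedged Python sketch, the hypothetical `serialise_amount` helper below quantises the value and emits a string, so different computation paths collapse to one canonical representation:

```python
from decimal import Decimal
import json

def serialise_amount(value, places=2):
    # Hypothetical helper (not the actual Optic code): fix the
    # representation at extraction time by quantising the value and
    # emitting a string, so downstream parsers cannot reinterpret it
    # as a float.
    return str(Decimal(str(value)).quantize(Decimal(10) ** -places))

# Two aggregation orders that disagree at float precision...
a = sum([0.1, 0.2, 0.3])  # 0.6000000000000001
b = sum([0.3, 0.2, 0.1])  # 0.6
# ...collapse to a single canonical extract value.
print(json.dumps({"amount": serialise_amount(a)}))  # {"amount": "0.60"}
print(json.dumps({"amount": serialise_amount(b)}))  # {"amount": "0.60"}
```

The choice of two decimal places here is illustrative; the point is that the string is decided once, at the source, rather than re-derived by every consumer.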
Why this works:
- It resolves the issue at the layer where it originates, rather than introducing compensatory transformations downstream that increase fragility over time.
- It preserves integrity by controlling how values are serialised during extraction.
- It avoids broad refactoring or performance-heavy workarounds.
In parallel, short-term mitigation steps were also taken to reduce the probability of triggering the edge case, including simplifying the harmonisation flow and reducing join complexity where feasible.
Short-term mitigation and long-term architecture recommendation
Fixes like casting can be the right solution, but they should also trigger a broader question: “Are we relying on a mechanism that gives us enough control over determinism and serialisation?”
In this case, a long-term recommendation was made to move away from v1/rows-style extraction toward a custom extraction approach, providing tighter control over casting, serialisation rules, ordering guarantees, and execution behaviour.
That kind of architectural shift is not always necessary, but for systems producing business-critical financial outputs at scale, it can be a strong move. The goal is not to add complexity but to increase confidence.
After regression testing and technical review, the client gained confidence in both the diagnosis and the proposed solution.
Practical checklist: how to approach intermittent discrepancies faster
If you ever face a “rare, cannot reproduce” data discrepancy, these steps shorten the path to root cause:
- Confirm where the anomaly first appears
Validate whether it is present at extraction, harmonisation, ETL, or downstream consumption.
- Remove assumptions about determinism
If joins and ordering exist, explicitly check whether the query can produce non-deterministic results under different execution conditions.
- Treat casting and serialisation as first-class design choices
Do not assume numeric values will behave identically across query evaluation, materialisation, transport, and downstream parsing.
- Create regression tests that target combinations, not components
Intermittent issues often come from interactions between features, not a single “broken” step.
- Fix issues at the correct layer
If the discrepancy happens upstream, fix it upstream. Downstream compensations add fragility.
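The “combinations, not components” point from the checklist can be sketched as a parameterised sweep that re-runs each configuration and asserts the output never varies. All names below (`run_extract`, `order_by`, `batch_size`) are illustrative, not taken from the pipeline in this article:

```python
import itertools

def run_extract(order_by, batch_size, rows):
    # Hypothetical extraction under test: returns serialised amounts,
    # optionally ordered by a key and truncated to a batch size.
    if order_by:
        rows = sorted(rows, key=lambda r: r[order_by])
    return [f'{r["amount"]:.2f}' for r in rows][:batch_size]

rows = [{"id": 2, "amount": 5.0}, {"id": 1, "amount": 7.0}]

# Sweep interactions of settings rather than testing each in isolation.
for order_by, batch_size in itertools.product([None, "id"], [1, 2]):
    first = run_extract(order_by, batch_size, list(rows))
    # Determinism check: repeated runs of the same combination must agree.
    assert all(run_extract(order_by, batch_size, list(rows)) == first
               for _ in range(10)), (order_by, batch_size)
print("all combinations deterministic")
```

A real suite would also vary data shape (duplicate keys, nulls, boundary values), since those interactions are what exposed the single affected report in this case.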
Why this matters for enterprise data leaders
This scenario highlights a broader lesson. Not all data quality issues originate where teams expect them to. Intermittent discrepancies often stem from implicit behaviour in query execution, ordering, or serialisation rather than obvious pipeline failures.
If you are building enterprise data products for regulated and financial use cases, determinism is not a “nice to have.” It is a core platform requirement essential for auditability, repeatability, and sustained executive trust.
If you are encountering rare, non-reproducible inconsistencies in extraction or reporting outputs, it may be time to review whether determinism is being enforced explicitly at the correct layer.
Datavid teams work with enterprises to assess extraction reliability, reduce non-deterministic behaviour, and design financial-grade data platforms.
Want to compare notes on reliability patterns you are seeing? 