Data pipelines are growing larger and more complex every year, thanks to artificial intelligence (AI)/machine learning (ML) and other data-centric innovations. Yet even the explosive growth of the past couple of years is only the beginning: data pipelines will be three to four times as large by the end of the decade.
Worryingly, most of today’s pipeline architectures are nowhere near prepared.
Executives and visionaries dream up innovations and envision new revenue streams, but the database, DevOps and data experts who run the show are anxiously wondering how it is all going to work.
If it is going to work — and I think it will, because nothing can stump teams like these — it will hinge on those DevOps, developer and database teams unifying their workflows. Only then can we achieve the feedback loops, visibility and continuous optimization necessary for modern data pipelines to operate reliably and consistently lift revenues.
Back to the Roots of DevOps: Shift Left
For these teams to come together seamlessly, we need to return to the core tenets of DevOps and database DevOps. There must be a tight feedback and collaboration loop between these roles so they can support one another's software delivery concerns.
The concerns of developers differ from those of the Ops team, but they are all working toward the same goal. This tight feedback loop should allow developers to quickly and cost-effectively identify and address issues found in production. Shifting left with policies in earlier environments should help prevent these issues from occurring in the first place.
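As an illustration, a shift-left policy gate can be as simple as a script that runs in continuous integration and blocks risky schema changes before they reach a shared environment. The sketch below is a minimal example only: the `migrations/pending` directory and the rule list are assumptions for illustration, not the API of any particular tool.

```python
# Minimal sketch of a shift-left policy gate: scan pending SQL migrations
# for risky operations before they ever reach a shared environment.
# The migration directory and rule list are illustrative assumptions.
import re
import sys
from pathlib import Path

RISKY_PATTERNS = {
    r"\bDROP\s+TABLE\b": "dropping a table",
    r"\bDROP\s+COLUMN\b": "dropping a column",
    r"\bDELETE\s+FROM\s+\w+\s*;": "DELETE without a WHERE clause",
    r"\bTRUNCATE\b": "truncating a table",
}

def check_migrations(migration_dir: str = "migrations/pending") -> int:
    violations = []
    for sql_file in Path(migration_dir).glob("*.sql"):
        text = sql_file.read_text()
        for pattern, description in RISKY_PATTERNS.items():
            if re.search(pattern, text, re.IGNORECASE):
                violations.append(f"{sql_file.name}: {description}")
    for violation in violations:
        print(f"POLICY VIOLATION: {violation}")
    return 1 if violations else 0  # nonzero exit fails the CI job

if __name__ == "__main__":
    sys.exit(check_migrations())
```

The point is not the specific rules but where the check runs: in the earliest environment possible, so the feedback reaches the developer long before production.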
As the development pipeline and the new and/or rapidly growing data pipeline converge, these best practices should be applied in the DataOps world. While the roles differ — introducing data engineers, scientists and architects, as well as business intelligence analysts and executives — the core principles remain the same.
This includes ensuring the value stream is secure, reliable and compliant, regardless of whether the data is consumed internally for business intelligence or externally in production.
Focus on Common Goals
Integrating development and data pipelines might be easier said than done, but the starting point is simple: aligning on shared goals. Although the database, developer, data and ops workflows are different, they are all working toward the same target.
For example, data engineers on the DataOps side are responsible for taking upstream data sources and applying the right transformations so that downstream data scientists can work effectively. Database and DevOps teams should familiarize themselves with this data analytics pipeline. It involves raw data from sources including application databases, as well as external financial feeds, customer databases and other diverse streams.
In a self-service database change management scenario, a developer deploying database changes needs to understand how those changes might impact data pipelines and other DataOps workflows. For instance, that database schema might need to be adapted to fit a new data lakehouse or analytical process.
In both situations, developers and data teams must ensure quality, compliance and alignment with internal policies through automated checks, maintained collaboratively by database administrators (DBAs) and DataOps engineers. DBAs can act as a shared resource, unifying workflows with governance, compliance and operational support across both application databases and new data stores.
Since all this internal and external data goes through various stages of transformation, refinement and visualization, teams must ensure compliance, traceability, reproducibility and scalability at every stage of the shared data journey.
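One lightweight way to automate part of that assurance is a schema "contract" that downstream consumers publish and upstream database changes are validated against. The sketch below is purely illustrative, assuming a hypothetical contract format with made-up column names and types; it is not an established standard or a specific product's feature.

```python
# Minimal sketch of a shared schema-contract check, assuming a hypothetical
# "contract" that downstream DataOps consumers publish for the tables they
# depend on. Column names and types are illustrative.

# Columns the analytics pipeline expects, published by the data team.
EXPECTED_CONTRACT: dict[str, str] = {
    "order_id": "bigint",
    "customer_id": "bigint",
    "order_total": "numeric",
    "created_at": "timestamp",
}

def validate_contract(live_schema: dict[str, str]) -> list[str]:
    """Return a list of contract violations for one table."""
    problems = []
    for column, expected_type in EXPECTED_CONTRACT.items():
        if column not in live_schema:
            problems.append(f"missing column: {column}")
        elif live_schema[column] != expected_type:
            problems.append(
                f"type drift on {column}: expected {expected_type}, "
                f"found {live_schema[column]}"
            )
    return problems

# Example: a proposed schema change renamed a column the lakehouse relies on.
proposed = {"order_id": "bigint", "customer_id": "bigint",
            "total_amount": "numeric", "created_at": "timestamp"}
for issue in validate_contract(proposed):
    print("CONTRACT VIOLATION:", issue)
```

Running a check like this on every proposed database change gives both sides the same early warning: the developer sees the downstream impact, and the data team sees the change coming.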
Database Observability for the Shared Data Journey
In both realms, DataOps and DevOps, process observability is vital. As DevOps teams continue to invest more in pipeline observability, they must integrate each new data store into the broader pipeline.
When a report fails or an executive data visualization malfunctions, database observability enables the DataOps team to determine what went wrong upstream, whether it is a schema change, an issue in the data flow or something else. Even if a data consumer questions the validity of a report, observability can back up that report by tracing the source of the data and the refreshes and changes made along the way.
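As a concrete illustration, tracing a broken dashboard back to candidate upstream causes is essentially a walk over a lineage graph combined with a change log. The sketch below uses hypothetical in-memory structures and dataset names in place of a real observability or metadata platform.

```python
# Minimal sketch of upstream lineage tracing for a failed report, using an
# in-memory lineage graph and change log. In practice both would come from
# an observability or metadata platform; these structures are hypothetical.
from datetime import datetime

# Each dataset maps to the upstream datasets it is derived from.
LINEAGE = {
    "exec_revenue_dashboard": ["sales_mart"],
    "sales_mart": ["orders_cleaned", "customers_cleaned"],
    "orders_cleaned": ["app_db.orders"],
    "customers_cleaned": ["crm_feed"],
}

# Recent change events recorded against each dataset.
CHANGE_LOG = {
    "app_db.orders": [("schema change: column renamed", datetime(2024, 5, 2))],
    "crm_feed": [("late refresh", datetime(2024, 5, 1))],
}

def trace_upstream(dataset: str) -> list[str]:
    """Collect every upstream dataset reachable from the failing one."""
    seen, stack = [], [dataset]
    while stack:
        current = stack.pop()
        for parent in LINEAGE.get(current, []):
            if parent not in seen:
                seen.append(parent)
                stack.append(parent)
    return seen

# When the executive dashboard breaks, list upstream changes to investigate.
for upstream in trace_upstream("exec_revenue_dashboard"):
    for event, when in CHANGE_LOG.get(upstream, []):
        print(f"{upstream}: {event} on {when:%Y-%m-%d}")
```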
Without observability throughout the database, development and data pipeline, there is no foundation of trust in internal or external data products.
A Holistic Approach to Data Pipeline Modernization
It is not just about bringing DevOps lessons to the DataOps team; DataOps is not simply DevOps for the data stream, after all. DataOps teams have valuable practices of their own that the DevOps world should adopt.
This includes the importance of testing the entire data pipeline in lower environments before moving to production. While DevOps embraces its infinity loop for continuous optimization, that loop must expand to encompass today's, and tomorrow's, rapidly growing data pipelines.
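In practice, a "test the pipeline before production" step can look like an ordinary test run in a staging environment: feed a transformation a small fixture and assert the expectations downstream consumers depend on. The transformation and checks below are illustrative assumptions, not a specific framework or methodology.

```python
# Minimal sketch of a pipeline smoke test for a lower environment: run a
# transformation against a small, known fixture and assert basic data-quality
# expectations before promoting the change. Everything here is illustrative.
def transform_orders(rows: list[dict]) -> list[dict]:
    """Toy transformation: drop cancelled orders and compute net totals."""
    return [
        {**row, "net_total": row["gross_total"] - row["discount"]}
        for row in rows
        if row["status"] != "cancelled"
    ]

def test_transform_on_fixture():
    fixture = [
        {"order_id": 1, "status": "paid", "gross_total": 100.0, "discount": 10.0},
        {"order_id": 2, "status": "cancelled", "gross_total": 50.0, "discount": 0.0},
        {"order_id": 3, "status": "paid", "gross_total": 80.0, "discount": 5.0},
    ]
    result = transform_orders(fixture)

    # Quality expectations a downstream data consumer would rely on.
    assert len(result) == 2, "cancelled orders should be excluded"
    assert all(r["net_total"] >= 0 for r in result), "net totals must be non-negative"
    assert all("order_id" in r for r in result), "key column must survive the transform"

if __name__ == "__main__":
    test_transform_on_fixture()
    print("pipeline smoke test passed")
```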
With this holistic approach to data pipeline change management, teams can collaborate to ensure the final output — be it a report, application or data product — is secure, compliant and reliable.