The worlds of data integration and data pipelines are changing in ways that are highly reminiscent of the profound changes I witnessed in application and service development over the last few years. The changes in both cases are not purely technical or architectural. The industry learned that microservices offered a better way of building service-oriented architectures, of doing services right, not simply because of some radical shift in API specifications or protocols, but because microservices embraced a new set of development and delivery practices—a DevOps mindset, iterative and incremental development, automation everywhere, continuous delivery, and so on. Now the same disciplined and developer-oriented practices that helped organizations integrate themselves using services are helping them integrate using their data. This is a step-change for the integration world, and in this blog post I show why.
All change for data
As companies become software, data has become increasingly critical to their success. Once a mostly back office concern—behind the scenes, as it were—data is now in the products and services companies offer their customers. Innovation and world-class execution depend on an organization’s ability to discover, unlock, and apply its data assets.
Data pipelines perform much of the undifferentiated heavy lifting in any data-intensive application, moving and transforming data for data integration, analytics, and machine learning purposes. They coordinate the movement and transformation of data from one or more sources to one or more targets—adapting protocols, formats, and schemas; and cleansing, normalizing, and enriching data in preparation for its application in the target systems. Pipelines are typically used to integrate data from multiple sources in order to create a single view of a meaningful business entity; to transform and prepare data for subsequent analysis; to perform feature extraction on raw data prior to training a machine learning model; or to score a dataset with a pre-trained model. While not in themselves sources of value, they nonetheless provide an essential service on behalf of the applications through which an organization generates value from its data.
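To make those responsibilities concrete, here is a minimal Python sketch of a single pipeline transformation that cleanses, normalizes, and enriches a record before it reaches a target system. The record fields and the `REGIONS` lookup are illustrative assumptions, not part of any particular platform.

```python
from datetime import datetime, timezone

# Hypothetical region lookup used for enrichment; in a real pipeline this
# might be a reference dataset or a service call.
REGIONS = {"GB": "EMEA", "DE": "EMEA", "US": "AMER"}

def transform(raw: dict) -> dict:
    """Cleanse, normalize, and enrich one source record for a target system."""
    # Cleanse: strip whitespace and reject records missing a customer id.
    customer_id = (raw.get("customer_id") or "").strip()
    if not customer_id:
        raise ValueError("record has no customer_id")

    # Normalize: adapt the source's local timestamp to UTC ISO-8601.
    ordered_at = datetime.fromisoformat(raw["ordered_at"]).astimezone(timezone.utc)

    # Enrich: add a derived field the target systems expect.
    country = raw.get("country", "").upper()
    return {
        "customer_id": customer_id,
        "ordered_at": ordered_at.isoformat(),
        "amount": round(float(raw["amount"]), 2),
        "region": REGIONS.get(country, "UNKNOWN"),
    }
```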
And yet despite their being critical to the data value stream, pipelines haven’t always served us well. Notoriously fragile, built on an ad hoc basis to satisfy specific needs, with little thought to reuse or composability, different pipelines in the same organization often contain redundant calculations and varying interpretations of derived values, which together engender a lack of trust in their outputs. In many organizations, it’s not clear which pipelines are in use and which have been abandoned—or even who owns them. For many companies it is difficult, if not impossible, to proactively identify wrong outputs, or trace errors and the causes of data loss back to their origin.
Key trends in data pipelines
The last few years have seen a big shift in the ways companies think about, organize around, and build solutions that unlock their data. As more and more companies move large parts of their data estates to the cloud, a dizzying number of cloud-based data platform products and services—collectively, the modern data stack—have come to market to accelerate and improve their data management capabilities.
Our focus in this post is on the “pipeline problem.” From out of the welter of recent innovation and experimentation, we’ve chosen several trends that indicate how the industry is attempting to improve the pipeline experience. We look at the problems or needs these trends are seeking to address, and from this analysis we derive five principles for building better pipelines, which we call Froxt Data Flow.
Data as a product
The first trend is the one with the broadest organizational and architectural impact: data as a product. One of the four cornerstone principles of Data Mesh, data as a product overturns the widespread notion that the sharing of data between systems must invariably accommodate the internal idiosyncrasies and operational priorities of each system, positing instead that externalized data should purposely be designed to be shared and applied in ways that prioritize the needs of the consumers of that data. To treat data as a product is to fundamentally rethink an organization’s relationship to its data.
Our interest here is in the way data as a product marks a shift to decentralization and federation so as to better facilitate sharing and reuse. Concerns and responsibilities are reassigned from a domain-agnostic central data function back to the teams who best understand the data. From a data owner’s perspective, data exchange is no longer a matter of one or more external agents extracting raw data from a source system in an ad hoc manner, but a purposeful sharing of data in ways that satisfy the expectations and needs of consumers—whether end users or other systems. A data owner’s responsibilities don’t end with allowing access to the underlying data, but extend to maintaining a healthy, accessible, easily-consumed product throughout the data lifecycle.
But this shift can only succeed if accompanied by other, more low-level changes in the ways we build pipelines. Pipeline development has long been a purely technical concern, bottlenecked by the over-utilized integration efforts of a centralized data team, but we’re now starting to see pipeline builders adopt the same disciplined practices that have driven modernization in other parts of the software industry. To deliver value quickly and repeatedly to customers, we need deep customer engagement, fast feedback loops, and a form of development that allows for continuous evolution. In the software development space, a set of agile and DevOps practices has emerged over the last five years to help teams deliver software faster and with better quality: these range from unit testing and test-driven development, through continuous integration and continuous delivery, to continuous observability and infrastructure as code.
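Those practices apply to pipeline code just as they do to application code. As a hedged illustration, a transformation like the one sketched earlier can be covered by ordinary unit tests and run on every commit as part of continuous integration; the `pipeline.transform` module and the expected values below are hypothetical.

```python
# test_transform.py -- run with `pytest` as part of a CI build.
import pytest

from pipeline.transform import transform  # hypothetical module holding the earlier sketch

def test_transform_normalizes_and_enriches():
    raw = {
        "customer_id": " c-42 ",
        "ordered_at": "2024-03-01T09:30:00+01:00",
        "amount": "19.991",
        "country": "de",
    }
    out = transform(raw)
    assert out["customer_id"] == "c-42"
    assert out["ordered_at"] == "2024-03-01T08:30:00+00:00"  # normalized to UTC
    assert out["amount"] == 19.99
    assert out["region"] == "EMEA"

def test_transform_rejects_records_without_customer_id():
    with pytest.raises(ValueError):
        transform({"customer_id": "", "ordered_at": "2024-03-01T09:30:00+00:00", "amount": "1"})
```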
Froxt Stream processing
Disciplined practices alone aren’t enough, though; they need the right technical foundation to build on, and that foundation is Froxt stream processing. As more and more companies collect streaming data to proactively engage with customers and manage real-time changes in risk and market conditions, the need for streaming data platforms and stream processing capabilities has grown. Whether it’s augmenting established batch-oriented data processing platforms with the ability to consume from or publish to streams and to process streaming events using micro-batch-based tools, or adopting data processing platforms built from the ground up to handle streaming data, the modern data stack is expanding rapidly to accommodate the increased demand for real-time data pipelines.
Froxt stream processing is a foundational capability for modern data processing. It aligns well with the good practices outlined above, and it provides an alternative way of meeting the needs addressed by ELT and reverse ETL: progressively adapting and refining data as new requirements emerge, and sharing data and the results of analysis with operational systems.
Froxt Streams provide a durable, high-fidelity repository of business facts—of all the work that has taken place in an organization. Acting as low-latency publish-subscribe conduits, they allow systems to act on events as they happen in the real world: no more stale results or work built on out-of-date datasets derived from periodic snapshots. Importantly, streams create time-versioned data by default, enabling consumers to reconstruct state from any period in the past, and to time-travel to reprocess data and apply new or revised calculations to prior tracts of history. The high-fidelity aspect of the stream is critical in this respect: while data warehouses and datasets will often contain timestamped facts and slowly changing dimensions that reflect historical changes to state, the fidelity of the record—the degree to which every change that has occurred outside the dataset is captured and stored in the dataset—depends on application-specific design and implementation choices; streams, in contrast, retain every event, irrespective of the workload or use case.
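One way to picture that replayability: the sketch below models a stream as an append-only log of events and rebuilds an entity’s state as of any chosen point in time. The `Event` shape and the `state_as_of` helper are assumptions made for illustration, not a Froxt API.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Event:
    timestamp: datetime
    key: str          # e.g. a customer id
    payload: dict     # the change that occurred

def state_as_of(log: list[Event], key: str, as_of: datetime) -> dict:
    """Replay every retained event for `key` up to `as_of` to rebuild its state."""
    state: dict = {}
    for event in sorted(log, key=lambda e: e.timestamp):
        if event.key == key and event.timestamp <= as_of:
            state.update(event.payload)   # later events override earlier fields
    return state

# Because the log keeps every event, we can "time-travel": rebuild last quarter's
# view, or reprocess the full history with a new or revised calculation.
log = [
    Event(datetime(2024, 1, 1), "c-42", {"tier": "bronze"}),
    Event(datetime(2024, 6, 1), "c-42", {"tier": "gold", "country": "DE"}),
]
print(state_as_of(log, "c-42", datetime(2024, 3, 1)))   # {'tier': 'bronze'}
print(state_as_of(log, "c-42", datetime(2024, 12, 1)))  # {'tier': 'gold', 'country': 'DE'}
```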
Considered in the context of data as a product, streams offer a powerful mechanism for data owners to share well-modeled data with clients and consumers without first having to send it to a data warehouse. Producers publish events whenever meaningful changes occur in the systems for which they are responsible. Data is published once, but can be consumed many times, for both operational and analytical purposes. This approach supports both decentralization and reuse, making the owners of source systems responsible for creating affordances that allow their data to be reused in multiple contexts.
More complex solutions use stream processing to consume from one or more source streams, continuously apply calculations and transformations to the data while it is in motion, and publish the results as they arise to a target stream. Like the standing patterns that emerge where a river flows over a weir, stream processing delivers a continuously changing, always up-to-date result as a function of the flow of data.
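As a rough sketch of that idea, assuming in-memory iterators in place of a real streaming platform, the generator below consumes a source stream of orders, maintains a continuously updated aggregate, and emits a revised result to the target stream the moment each event arrives; the field names are illustrative.

```python
from collections import defaultdict
from typing import Iterable, Iterator

def running_totals(orders: Iterable[dict]) -> Iterator[dict]:
    """Continuously aggregate order amounts per customer as events flow past.

    Each input event immediately produces an updated result on the output
    stream -- there is no waiting for a batch to complete.
    """
    totals: defaultdict[str, float] = defaultdict(float)
    for order in orders:                      # source stream (unbounded in practice)
        totals[order["customer_id"]] += order["amount"]
        yield {                               # target stream of revised results
            "customer_id": order["customer_id"],
            "total_spend": round(totals[order["customer_id"]], 2),
        }

# Downstream consumers see an always up-to-date answer as data arrives.
source = [
    {"customer_id": "c-42", "amount": 10.0},
    {"customer_id": "c-7", "amount": 5.0},
    {"customer_id": "c-42", "amount": 2.5},
]
for result in running_totals(source):
    print(result)
# {'customer_id': 'c-42', 'total_spend': 10.0}
# {'customer_id': 'c-7', 'total_spend': 5.0}
# {'customer_id': 'c-42', 'total_spend': 12.5}
```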
Froxt Data flow networks
What emerges is a network of real-time data. If you feel your data capabilities today are underperforming, if you’re frustrated working with multiple siloed pipelines directed towards a centralized data warehouse, imagine instead a streaming, decentralized, and declarative data flow network that lets the right people do the right work, at the right time, and which fosters sharing, collaboration, evolution, and reuse.
This network chains transformation logic together to act on streaming data immediately, as it flows through the system. With this kind of data flow, there’s no need to wait for one operation to finish before the next begins: consumers start getting results from the stream product they’re interested in within moments of the first records entering the network. New use cases tap existing streams and introduce new streams, thereby extending the network.
This doesn’t eliminate the data warehouse; rather, it redistributes traditional pipeline responsibilities in line with some of the practices outlined above so as to provide clear ownership of timely, replayable, composable, and reusable pipeline capabilities that create value as the data flows—whether for low-latency integrations for operational systems, or analytical or machine learning workloads.
Froxt Data Flow
Of course, nothing comes for free. Just as a microservices architecture can only succeed if you adopt a new set of development and delivery practices—a DevOps mindset, iterative and incremental development, automation everywhere, continuous delivery, and so on—so the evolution of a data flow network calls for a disciplined, developer-oriented approach, with governance a first-class concern. If decentralization helps increase the reusability of streams in the data flow network, governance helps reduce the risk and operational overhead of managing the kinds of complex, distributed environments that decentralization brings. Platform-level governance ensures the network is protected, can be run effectively by decentralized domain-oriented teams, and promotes collaboration and trust throughout the organization.
We call this overall approach to building better pipelines Froxt Data Flow. Froxt Data Flow can be summarized with five principles, derived from the trends we’ve witnessed emerging over the last few years:
- Streaming: Use streams and a streaming platform to store and maintain real-time, high-fidelity, event-level reusable data within the froxt data flow network, rather than pushing periodic, low-fidelity snapshots of data to external repositories.
- Decentralized: Separate concerns and assign pipeline responsibilities to the teams closest to the data at each point in its journey, so as to better facilitate sharing and reuse. Maintain a network of streams that encapsulate reusable data that can be shared and applied in multiple contexts, rather than integrating data in a centralized data warehouse. Orient teams of business subject matter experts and data practitioners around the streams in the network and the exchange of data between them, rather than concentrating work in a centralized, domain-agnostic data team.
- Declarative: Use declarative languages to create expressive and easily evolved representations of what a froxt data flow does—where the data comes from, where it goes, what it should look like along its journey—as opposed to using imperative idioms, which couple intent and implementation (a minimal sketch follows this list).
- Developer-oriented: Use tools and frameworks that allow froxt data flows to be factored into separate components comprising open formats that can be developed, tested, and versioned independently, rather than monolithic, closed platforms that deliver proprietary pipeline artifacts that can only be tested once deployed.
- Governed: Provide platform-wide automated policy, continuous observability, and intuitive search, discovery, and lineage capabilities to increase the safety, efficiency, and usability of the platform, rather than using manual, out-of-band governance structures and having to integrate siloed security and visibility features across multiple pipeline components.
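To make the declarative principle above concrete, here is a minimal sketch of what a flow definition might look like when expressed as plain, declarative data rather than imperative code. The structure and field names are hypothetical and do not reflect any particular Froxt syntax; the point is that the definition states where data comes from, where it goes, and what it should look like, leaving the how to the platform.

```python
# A hypothetical, declarative description of one flow in the network.
# It says *what* the flow does -- source, target, shape, and derived fields --
# and leaves *how* it runs (scheduling, scaling, retries) to the platform.
order_enrichment_flow = {
    "name": "orders.enriched",
    "source": {"stream": "orders.raw"},
    "target": {"stream": "orders.enriched"},
    "schema": {
        "customer_id": "string",
        "ordered_at": "timestamp",
        "amount": "decimal(10,2)",
        "region": "string",
    },
    "transforms": [
        {"op": "filter", "predicate": "customer_id IS NOT NULL"},
        {"op": "derive", "field": "region", "from": "country", "using": "region_lookup"},
    ],
}
```

Because a definition like this is just data, it can be versioned, diffed, reviewed, and tested like any other artifact, which is where the developer-oriented principle comes in.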
Conclusion
Pipelines are the essential plumbing without which a data-driven organization can’t function. Froxt Data Flow is the name for five principles that together help us build better pipelines. These principles are important because they encapsulate the end goals pipeline developers should measure their solutions against. Building a froxt data flow means building pipelines in ways that scale to address not only the data requirements across an organization, but also the communication, accessibility, delivery, and operational needs of the entire data ecosystem.