By Chris Latimer, Vice President, Product Management, DataStax
There’s a lot of talk right now about the importance of streaming data and event-driven architectures. You may have heard of them, but do you really know why they matter to so many businesses? Streaming technologies unlock the ability to capture and act instantly on the data flowing through your organization; they are an essential part of building applications that can respond in real time to user actions, security threats, or other events. In other words, they play a key role in creating exceptional customer experiences and generating revenue.
Here’s a quick look at what streaming technologies do and why they’re so important to businesses.
Data in motion
Organizations have become quite good at building a relatively comprehensive view of what is known as “data at rest” – the kind of information captured in databases, data warehouses and even data lakes to feed applications and analytics later.
Increasingly, though, data generated by the activities, actions and events occurring in real time across an organization is flowing in from mobile devices, retail systems, sensor networks and telecommunications call-routing systems.
Although this “data in motion” may eventually be captured in a database or other store, it is most valuable while it is still in motion. For a bank, data in motion makes it possible to detect fraud in real time and act on it instantly. Retailers can make product recommendations based on a consumer’s search or purchase history the instant someone visits a webpage or clicks on a particular item.
Consider Overstock, the American online retailer. It must consistently deliver engaging customer experiences and drive revenue from instant monetization opportunities. In other words, Overstock needed the ability to make lightning-fast decisions on data as it arrived in real time (brands typically have 20 seconds to connect with customers before they move on to another website).
“It’s like a self-driving car,” says Thor Sigurjonsson, data engineering manager at Overstock. “If you’re expecting feedback, you’re going off the road.”
To maximize the value of its data while it’s being created – instead of waiting hours, days, or even longer to analyze it once it’s at rest – Overstock needed a streaming and messaging platform that would let it use real-time decision-making to deliver personalized experiences and recommend products likely to resonate with customers at exactly the right moment (really fast, in other words).
Messaging and data streaming are key components of an event-driven architecture: a software architecture or programming approach built around the capture, communication, processing, and persistence of events (mouse clicks, sensor outputs, and so on).
Data stream processing means taking action on a series of events from a system that creates them continuously. The ability to interrogate this continuous stream, spot anomalies, recognize that something significant has happened, and act quickly and meaningfully is what streaming technology enables.
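As a sketch of the idea – the event shape and the threshold rule here are invented for illustration, not taken from any particular streaming framework – a stream processor inspects each event as it arrives and reacts the moment something significant happens:

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Event:
    source: str
    value: float


def detect_anomalies(events: Iterable[Event], threshold: float) -> Iterator[Event]:
    """Act on each event the moment it arrives, instead of waiting for a batch."""
    for event in events:
        if event.value > threshold:
            yield event  # react immediately: raise an alert, block a transaction, etc.


# A short list stands in for a continuous stream of sensor readings.
stream = [Event("sensor-1", 0.2), Event("sensor-1", 9.7), Event("sensor-2", 0.4)]
alerts = list(detect_anomalies(stream, threshold=5.0))
```

In a real deployment the `events` iterable would be a consumer attached to a streaming platform, yielding events indefinitely; the generator structure stays the same.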
This contrasts with batch processing, where an application collects data, stores it, processes it, and then either stores the processed result or passes it to another application or tool. Processing may not begin until, say, 1,000 data points have been collected. That’s too slow for applications that require reactive engagement at the point of interaction.
It’s worth stopping to break down this idea:
The point of interaction can be a system making an API call or a mobile application.
Engagement is defined as adding value to the interaction. That could be giving a customer a tracking number after an order is placed, recommending a product based on a user’s browsing history, or authorizing a charge or a service upgrade.
Reactive means the engagement action happens in real time or near real time; for human interactions that translates to hundreds of milliseconds, while machine-to-machine interactions – in an energy supplier’s sensor network, for example – may not require such a near-real-time response.
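Put together, the three terms can be sketched in a few lines of code – the event fields, the responses, and the 300 ms budget below are illustrative assumptions, not any real API:

```python
import time

LATENCY_BUDGET_MS = 300  # "hundreds of milliseconds" for human interactions


def engage(interaction: dict) -> dict:
    """Respond to an event at the point of interaction with something of value."""
    start = time.monotonic()
    if interaction["type"] == "order_placed":
        # Engagement: give the customer a tracking number right away.
        response = {"tracking_number": f"TRK-{interaction['order_id']}"}
    else:
        # Engagement: recommend a product based on the current interaction.
        response = {"recommendation": "related-product"}
    elapsed_ms = (time.monotonic() - start) * 1000
    # Reactive: the response has to land within the latency budget.
    response["within_budget"] = elapsed_ms < LATENCY_BUDGET_MS
    return response


result = engage({"type": "order_placed", "order_id": 1138})
```

Here the point of interaction is whatever delivers the `interaction` dict (an API call, a mobile app event); the function body is the engagement; the budget check is the reactive constraint.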
When the message queue is not enough
Some companies have recognized that they need to derive value from their data in motion and have assembled their own event-driven architectures from a variety of technologies, including message-oriented middleware systems like the Java Message Service (JMS) or message queuing (MQ) platforms.
But these platforms were built on the fundamental principle that the data they processed was transient and should be deleted as soon as each message was delivered. This wastes a very valuable asset: data identifiable as arriving at a specific moment in time. Time-series information is essential for applications that involve asynchronous analysis, such as machine learning; data scientists cannot build machine learning models without it. A modern streaming system must not only forward events from one service to another, but also store them in a way that retains their value for future use.
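The difference can be sketched with an in-memory toy (not a real broker): a retained log keeps timestamped entries around after delivery, so an analysis job can replay the history later, whereas a delete-on-delivery queue could not.

```python
class RetainedLog:
    """Toy event log that keeps (timestamp, payload) entries after delivery."""

    def __init__(self):
        self.entries = []   # retained after delivery (a real system applies a retention policy)
        self.cursor = 0     # position of the next undelivered entry

    def publish(self, timestamp, payload):
        self.entries.append((timestamp, payload))

    def consume(self):
        """Deliver the next entry once, the way a message queue would."""
        if self.cursor < len(self.entries):
            entry = self.entries[self.cursor]
            self.cursor += 1
            return entry
        return None

    def replay(self):
        """Asynchronous analysis (e.g. model training) re-reads the full history."""
        return list(self.entries)


log = RetainedLog()
log.publish(1.0, "login")
log.publish(2.0, "purchase")
log.consume()              # delivered once to a downstream service...
history = log.replay()     # ...yet the timestamped series is still intact
```

A classic MQ system is essentially this class with `replay` missing and `entries` truncated on every `consume`.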
The system must also be able to scale to handle terabytes of data and millions of messages per second. Older MQ systems aren’t designed to do either.
Kafka and Pulsar: the old guard and the next-generation unified challenger
As I touched on above, there are many choices available when it comes to messaging and streaming technology.
They include various open source projects such as RabbitMQ, ActiveMQ and NATS, as well as proprietary solutions such as IBM MQ and Red Hat AMQ. Then there are the two best-known platforms for handling real-time data: Apache Kafka, a wildly popular technology that has become almost synonymous with streaming, and Apache Pulsar, a newer streaming and message-queuing platform.
Both of these technologies were designed to handle the high throughput and scalability required by many data-driven applications.
Kafka was developed at LinkedIn to facilitate data communication between different parts of the professional networking company and became an open source project in 2011. Over the years it has become a standard for many companies looking for ways to derive value from real-time data.
Pulsar was developed at Yahoo! to address messaging and data challenges in applications like Yahoo! Mail; it became a top-level open source project in 2018. While it is still catching up to Kafka in popularity, it offers more features and functionality. And it carries a very important distinction: MQ solutions are purely messaging platforms, and Kafka handles only the streaming needs of an organization. Pulsar handles both, making it the only unified platform available.
Pulsar can handle real-time, high-throughput use cases like Kafka, but it’s also a more comprehensive, durable, and reliable solution than the older platform. To get both streaming and queuing (an asynchronous communication pattern that lets applications talk to each other), a Kafka user would need to add something like RabbitMQ or another solution. Pulsar, on the other hand, can handle many of the use cases of a traditional queuing system without add-ons.
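The two delivery patterns themselves can be sketched without any broker at all. The toy model below only illustrates the semantics (in Pulsar they correspond roughly to different subscription modes, but nothing here is Pulsar API):

```python
from itertools import cycle


def queue_delivery(messages, consumers):
    """Queuing: each message goes to exactly one consumer, dividing the work."""
    assignment = {consumer: [] for consumer in consumers}
    for message, consumer in zip(messages, cycle(consumers)):
        assignment[consumer].append(message)
    return assignment


def stream_delivery(messages, subscribers):
    """Streaming: every subscriber reads the full ordered stream independently."""
    return {subscriber: list(messages) for subscriber in subscribers}


messages = ["m1", "m2", "m3", "m4"]
work = queue_delivery(messages, ["worker-a", "worker-b"])    # work is split
feeds = stream_delivery(messages, ["analytics", "ml-training"])  # everyone sees all
```

A unified platform supports both patterns on the same topics; bolting a queuing system onto a streaming one means operating two copies of this machinery.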
Pulsar has other advantages over Kafka, including higher throughput, better scalability, and geo-replication, which is especially important in the event of a data center or cloud-region failure. Geo-replication lets an application keep publishing events through another data center without disruption, keeping the application running and shielding end users from the outage. (Here is a more technical comparison of Kafka and Pulsar.)
In Overstock’s case, Pulsar was chosen as the retailer’s streaming platform. With it, the company has built what Sigurjonsson describes as an “integrated layer of connected data and processes, governed by a metadata layer that supports the deployment and use of reusable data across all environments”.
In other words, Overstock now has a way to understand and act on real-time data across the organization, letting the company impress customers with almost magically fast, relevant, and personalized experiences.
As a result, teams can reliably transform in-flight data in a way that’s easy to use and requires less data engineering. This makes it much easier to delight their customers and ultimately generate more revenue.
To learn more about DataStax, visit us here.
About Chris Latimer
Chris Latimer is a technology executive whose career spans over twenty years in a variety of roles including enterprise architecture, technical pre-sales and product management. He is currently Vice President of Product Management at DataStax, where he focuses on shaping the company’s product strategy around cloud messaging and event streaming. Prior to joining DataStax, Chris was a Senior Product Manager at Google, where he focused on APIs and API Management in Google Cloud. Chris is based near Boulder, CO, and when not working he is an avid skier and musician and enjoys the endless variety of outdoor activities Colorado has to offer with his family.