denoting or relating to a data-processing system in which a computer receives constantly changing data, such as information relating to air-traffic control, travel booking systems, etc, and processes it sufficiently rapidly to be able to control the source of the data
Collins English Dictionary – Complete & Unabridged 10th Edition
The world of technology systems is awash with data, and companies are at varying stages of extracting meaning from it. Some have mastered the harvesting and interpretation of data; others have gotten as far as collecting huge volumes and are still wrestling with turning them into actionable insight. Many more sense that Big Data and information systems are vital to their operations in the 21st century but don't yet have a strategy in place.
One of the defining aspects of understanding data is the concept of real-time. At the non-real-time end of the spectrum, data is collected in huge volumes and then processed en masse via a technique called Map-Reduce; Hadoop stands out as the industry standard for this kind of processing. Depending on the complexity of the processing, running a Map-Reduce job can take many hours, even days, to compute its aggregations.
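To make the batch model concrete, here is a toy sketch of the map-reduce idea in Python (an illustration of the model only, not Hadoop itself; the event records and field names are invented for the example):

```python
from collections import defaultdict

# Count events per user across a batch of log records, in three phases
# mirroring the Map-Reduce model: map, shuffle, reduce.

def map_phase(records):
    # Map: emit a (key, 1) pair for every record.
    for record in records:
        yield record["user"], 1

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into a final count.
    return {key: sum(values) for key, values in groups.items()}

records = [{"user": "alice"}, {"user": "bob"}, {"user": "alice"}]
counts = reduce_phase(shuffle(map_phase(records)))
# counts == {"alice": 2, "bob": 1}
```

The latency of the real thing comes from running these phases over terabytes across a cluster, with the shuffle moving data between machines; the entire batch must flow through before any result appears.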
At the other end of the spectrum we have systems that produce aggregations and summaries from a live stream of data: the metrics you see are calculated as the data is captured. These are called CEP, or Complex Event Processing, systems; an example for Java would be Esper. This kind of event processing is very common in financial trading, where truly real-time metrics are critical for fast responses.
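A minimal sketch of the streaming style (not Esper's actual API, just the shape of the technique): the metric is maintained incrementally, updated the moment each event arrives, rather than recomputed later in a batch job.

```python
from collections import deque

# CEP-style incremental aggregation: a rolling average over the last
# N events, recalculated on every arrival.

class RollingAverage:
    def __init__(self, window_size):
        # deque with maxlen evicts the oldest event automatically.
        self.window = deque(maxlen=window_size)

    def on_event(self, value):
        # Each incoming event immediately updates the metric.
        self.window.append(value)
        return sum(self.window) / len(self.window)

avg = RollingAverage(window_size=3)
latest = [avg.on_event(price) for price in (10.0, 20.0, 30.0, 40.0)]
# latest[-1] == 30.0, the average of the last three prices (20, 30, 40)
```

Note what is missing: the raw events fall out of the window and are gone, which is exactly the lack of queryable history discussed below.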
The issue with fully adopting either approach arises when you need to query the data in an ad-hoc manner. Map-Reduce systems produce results with high latency, whilst CEP systems generate metrics on the fly according to pre-declared queries, without being able to tap into event history.
We took the view that there is a very useful middle ground: ingest event data and store it in a format that allows fast ad-hoc querying and aggregation. The devil is in the details, but we've managed to achieve this with great success in Corral.
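To illustrate the middle ground in the abstract (this is emphatically not Corral's implementation, and the store, field names, and query interface here are invented for the sketch): events are ingested as they arrive and kept in a queryable form, so any aggregation over any slice of the history can be asked after the fact.

```python
# Hypothetical event store: ingest on arrival, answer ad-hoc
# aggregations over the full history on demand.

class EventStore:
    def __init__(self):
        self.events = []  # a real system would use an indexed or columnar store

    def ingest(self, event):
        # Events become queryable as soon as they land.
        self.events.append(event)

    def aggregate(self, group_by, value, where=lambda e: True):
        # Ad-hoc sum of `value`, grouped by `group_by`, over matching events.
        totals = {}
        for e in self.events:
            if where(e):
                key = e[group_by]
                totals[key] = totals.get(key, 0) + e[value]
        return totals

store = EventStore()
store.ingest({"country": "UK", "amount": 5, "ts": 1})
store.ingest({"country": "US", "amount": 3, "ts": 2})
store.ingest({"country": "UK", "amount": 2, "ts": 3})

# A query nobody anticipated at ingest time: recent sales by country.
sales = store.aggregate("country", "amount", where=lambda e: e["ts"] >= 2)
# sales == {"US": 3, "UK": 2}
```

The contrast with the two extremes: unlike a batch job, the answer comes back in seconds because the data is already laid out for querying; unlike a CEP engine, the question doesn't have to be declared before the events arrive.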
Ultimately, there’s a ‘Goldilocks’ zone for real-time. Waiting hours for events to turn into results is no longer acceptable, but a turnaround of a few seconds is fine for most purposes. And one of the most important aspects of metrics is being able to question the numbers and act upon them.
You can’t achieve that with high latency or without a history to query.