Using your data lake as a cheap time series database: do’s and don’ts
Big players like Microsoft and Amazon offer inexpensive cloud-based solutions for organizational data storage, known as data lakes, which have rapidly become a central component in the cloud strategy of many CIOs. The promise of these data lakes is significant: a centralized location for all data, available to all types of users to explore and analyze according to their needs and use cases. In doing so, users can make data-driven decisions and build data-driven organizations. According to market research firm Cambashi, which tracks the global market for industrial software, the IoT market is producing a rapidly increasing number of “connected applications” that deliver business value through analytics on the collected data.
In this post, I discuss the potential role of data lakes in handling massive time-series data from sensors in equipment and factories. This is a particularly relevant subset of data in manufacturing and, more generally, IoT, since the analysis of time-series data can reveal patterns that indicate likely failures and make it possible to prevent them.
Data Lakes For Time-Series Data: Getting The Expectations Right
Process industry factories have been using data historian technology from vendors such as Aveva-OSIsoft to store sensor data for decades. With the rise of IoT, we witnessed a rapid expansion of the time-series database market, and a set of new time-series databases such as InfluxDB, Azure Kusto / TSI, and Amazon Timestream entered the playing field. These historians and time-series databases feature purpose-built engines with data models and query languages that are optimized for time-series data.
When dealing with massive amounts of sensor data, it’s certainly tempting to consider data lakes as a cheaper alternative to specialized time-series databases or as a replacement for classical enterprise data historians. However, if care isn’t taken, a data lake can quickly deteriorate into a data swamp, where users struggle to extract the right data, performance is insufficient and organizational expectations aren’t met. When considering adopting data lakes for time-series data, it is therefore important to align stakeholders to set realistic expectations, make requirements explicit and obtain a common understanding of the end goal.
To clarify this, let’s examine some of the key challenges that need to be overcome.
1. Setting Up A Suitable Data Structure (For Good Analytics Performance)
Users who want to work interactively with data expect it to be available where and when they need it, with quick and easy access. The first challenge, therefore, is setting up a suitable data structure to obtain good read performance.
A common practice to improve read performance in data lakes is to use columnar file formats, which let users read only the columns or properties needed for a specific case. Since the entire file doesn’t have to be read, less data is loaded, resulting in faster response times. Another approach is partitioning: arranging data in folder-like structures by key properties, time or a combination of these, depending on the data. Such a structure makes it possible to quickly narrow down the space in which data is searched and to reduce query times, as illustrated in the sketch below.
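As a minimal sketch of both techniques, the snippet below writes sensor readings as Parquet files partitioned by site and date, then reads back only two columns from a single partition. The column names, folder layout and the pandas/pyarrow stack are illustrative assumptions, not a prescribed schema.

```python
# Minimal sketch: columnar (Parquet) storage partitioned by site and date.
# Column names and folder layout are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.date_range("2023-01-01", periods=4, freq="h"),
    "site": ["plant_a"] * 4,
    "sensor_id": ["temp_01", "temp_01", "pressure_02", "pressure_02"],
    "value": [21.5, 21.7, 3.2, 3.1],
})
df["date"] = df["timestamp"].dt.date.astype(str)

# Each partition becomes a folder such as lake/site=plant_a/date=2023-01-01/
df.to_parquet("lake", partition_cols=["site", "date"], index=False)

# Readers only touch the columns and partitions they actually need.
subset = pd.read_parquet(
    "lake",
    columns=["timestamp", "value"],
    filters=[("site", "=", "plant_a"), ("date", "=", "2023-01-01")],
)
print(subset)
```

Because readers skip both unneeded columns and unneeded partitions, query cost grows with the slice of data actually requested rather than with the size of the whole lake.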
2. Minimizing Ingestion Delays (Options For Dealing With Recent Data)
Depending on the specific requirements, another challenge you may face is writing all data into the data lake with minimal ingestion delay. Typical architectures only write to the data lake when a full file is ready for archival. Conceptually, this is an important difference from a typical setup in which data streams write directly to a historian or time-series data store.
To limit ingestion delays, consider storing the most recent data in an appendable file format such as CSV (as in a journal). Compaction jobs can then periodically transform these journals into a columnar format such as Parquet that is better suited for long-term storage and historical data analytics; a sketch of such a job follows below.
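The following is a rough sketch of such a compaction job, assuming recent data lands as small CSV journal files under a journal/ folder (each row carrying site, sensor_id, timestamp and value) and long-term data lives as partitioned Parquet under lake/. Paths, schema and scheduling are assumptions for illustration.

```python
# Rough sketch of a periodic compaction job (paths and schema are assumptions).
import glob
import os
import pandas as pd

def compact_journals(journal_dir: str = "journal", lake_dir: str = "lake") -> None:
    csv_files = sorted(glob.glob(os.path.join(journal_dir, "*.csv")))
    if not csv_files:
        return

    # Gather all pending journal files into one batch.
    frames = [pd.read_csv(path, parse_dates=["timestamp"]) for path in csv_files]
    batch = pd.concat(frames, ignore_index=True)
    batch["date"] = batch["timestamp"].dt.date.astype(str)

    # Rewrite the batch as partitioned Parquet, which is far cheaper to scan later.
    # (A production job would also use unique file names per batch to avoid
    # clobbering output from earlier runs.)
    batch.to_parquet(lake_dir, partition_cols=["site", "date"], index=False)

    # Remove journals only after the columnar copy has been written.
    for path in csv_files:
        os.remove(path)

# A scheduler (cron, Airflow, etc.) would invoke this every few minutes or hours.
compact_journals()
```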
Another option is a hybrid setup with a separate “hot storage” component. For instance, you might adopt a more expensive time-series-optimized storage solution for the running month (hot data) and transfer data to the data lake for long-term storage afterward (cold data). Some IoT platforms even facilitate automated offloading to your data lake for long-term storage out of the box. A simple sketch of routing reads between the two tiers follows below.
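As a minimal illustration of the hot/cold split, the helper below routes a read to one tier or the other based on an assumed 30-day boundary; the tier names and the cutoff are placeholders rather than part of any particular platform, and a query spanning the boundary would need to merge results from both tiers.

```python
# Minimal sketch: route reads to hot or cold storage based on an assumed 30-day window.
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(days=30)  # assumed boundary between hot and cold data

def storage_tier_for(query_start: datetime) -> str:
    """Return which tier should serve a query starting at query_start."""
    cutoff = datetime.now(timezone.utc) - HOT_WINDOW
    return "hot_tsdb" if query_start >= cutoff else "data_lake"

now = datetime.now(timezone.utc)
print(storage_tier_for(now - timedelta(days=2)))   # -> hot_tsdb
print(storage_tier_for(now - timedelta(days=90)))  # -> data_lake
```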
3. Providing The Connectivity For Advanced Analytics
Given that data lake storage is highly flexible, a good query layer can be instrumental in addressing a third challenge: making data in the data lake accessible for advanced analytics.
Query layers are tools or components in an organization’s data landscape that allow standard SQL queries to be written against the data. This means any tool that supports standard ODBC or JDBC connectivity can be used to connect to the data lake. In addition, some query layers leverage technologies such as Apache Arrow to further reduce query overhead and increase the efficiency of data access. Finally, query layers can provide a unified interface over the colder data lake storage and a hot storage component in which the most recent data resides. Depending on the query layer component, this may be possible through configuration alone, lowering integration costs. A small example of SQL over data lake files follows below.
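The post doesn’t prescribe a specific query layer, but as one illustration, the sketch below uses DuckDB to run a standard SQL aggregation directly over the Parquet files in the lake and hand the result back as an Apache Arrow table. The file layout and column names follow the schema assumed in the earlier sketches.

```python
# Illustrative only: DuckDB stands in here as a query layer over the lake's Parquet files.
import duckdb

con = duckdb.connect()
result = con.execute(
    """
    SELECT sensor_id,
           date_trunc('hour', "timestamp") AS hour,
           avg("value") AS avg_value
    FROM read_parquet('lake/**/*.parquet', hive_partitioning = true)
    WHERE site = 'plant_a'
    GROUP BY sensor_id, hour
    ORDER BY hour
    """
).arrow()  # Arrow keeps the hand-off to downstream analytics tools cheap
print(result)
```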
4. Managing Metadata
All too often it is assumed that a data lake will magically resolve challenges of data mapping. However, it remains important to address this challenge and explicitly consider data lineage, data age and metadata that provides common attributes/properties to link the data together.
Support for metadata management as part of data lake offerings is improving, but it is still a factor that needs to be considered, especially when compared with more domain-specific setups such as the enterprise historians offered by OT vendors. Even something as simple as the sidecar catalog sketched below has to be designed and maintained deliberately.
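As a purely illustrative sketch, the snippet below records lineage, data age and linking attributes for one dataset in a JSON sidecar file; real deployments would more likely rely on a managed data catalog, and every name here is an assumption.

```python
# Minimal sketch of a metadata "sidecar" entry (all names are illustrative assumptions).
import json
import os

catalog_entry = {
    "dataset": "lake/site=plant_a",
    "source_system": "plant_a_scada",        # lineage: where the data came from
    "ingested_at": "2023-01-01T06:00:00Z",   # data age
    "sensors": {                             # common attributes to link data together
        "temp_01": {"unit": "degC", "equipment": "reactor_1"},
        "pressure_02": {"unit": "bar", "equipment": "reactor_1"},
    },
}

os.makedirs("catalog", exist_ok=True)
with open("catalog/plant_a.json", "w") as f:
    json.dump(catalog_entry, f, indent=2)
```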
The Right Place For A Data Lake In Your Landscape
A data lake may be a great and cost-effective way to democratize your industrial IoT data; however, it is imperative to ditch the myth that a data lake is an inexpensive one-size-fits-all database. To some extent, optimizing a data lake for time-series data is reminiscent of building a simple time-series database on a cheap storage backend. Perhaps this approach is right for your company, but always keep in mind that understanding your organization’s data and aligning it with the expectations for its use is crucial in giving the data lake the right place in your landscape.
If you cannot afford the overhead of managing your own time-series optimizations, the better option may be to adopt a more out-of-the-box, specialized time-series data store, or to go for a hybrid setup.
Combining The Best Of Both Worlds
Luckily, it seems that the future might be one where the benefits of both approaches can coexist to a certain degree. Recent announcements such as this one about InfluxDB IOx hint at the possibility of future time-series databases seamlessly plugging into cheaper blob storage as their long-term durable store. So perhaps soon we’ll be able to have our cake and eat it too.