Have you been spending too much time with your head in the clouds or inside a traditional warehouse? Then it’s time to dive into the lake. The Data Lake, that is. With Data Lakes no information is ever lost; it’s just waiting for the right business use case to come along and claim it. Here is a brief overview of what a Data Lake is and how your business can benefit from it.
Data Lakes are the new black when it comes to big or new kinds of data. Put simply, a Data Lake is shared data storage that pays no mind to the data’s source or format. You can then use a variety of processing tools to extract value quickly and use it to make informed organizational decisions.
The difference between a Data Lake and a Data Warehouse
One way to understand Data Lakes is to examine how they differ from traditional data warehousing. They might appear similar at first glance, since Data Lakes and traditional Data Warehouses are both widely used for storing large volumes of data, but they are not interchangeable systems.
Data warehousing is a highly optimized way to process structured data. This means you have to organize, and sometimes pre-process, data as rows and columns (think of an Excel sheet or a simple table), which in turn allows for more advanced operations that produce reports and statistics. A traditional Data Warehouse gets data from a variety of enterprise applications, and that data then needs to be transformed to conform to the Data Warehouse’s own pre-defined schema. Designed to collect only data that is checked for quality and that conforms to the model, the Data Warehouse is capable of answering only a limited number of questions, while lots of potentially valuable data gets thrown away in the conformation process.
Building a Data Warehouse on top of a Data Lake, on the other hand, allows it to contain any type of data whatsoever, from any source. This could include X-ray images, street addresses, blood samples, DNA sequences, recorded audio, all your customer transactions, anonymous patient data or millions of emails, for example.
All data is fed into the Data Lake in its native form, to be used for any question that may lie ahead. Then, when you do have a question, the data is processed as needed to answer that specific question. This differs from a traditional Data Warehouse, where the data is processed before being stored, and storage means rigid tables meant for pre-determined questions. In other words, in a Data Lake the data is stored in raw format and kept separate from the actual processing, be it the creation of BI reports, ad hoc SQL analysis, or Machine Learning based model training. Hence the value of a Data Lake is significantly higher than that of a traditional Data Warehouse, as it enables flexible, ad hoc data warehousing and other workloads using the same data, in one secure and regulated solution.
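To make the idea of processing data only when a question arrives more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular Data Lake product works: the in-memory list `raw_lake`, the record fields, and both function names are hypothetical.

```python
import json

# Hypothetical raw "lake": records are kept exactly as they arrived,
# with no shared schema enforced at write time.
raw_lake = [
    '{"type": "transaction", "customer": "c1", "amount": 120.0}',
    '{"type": "support_call", "customer": "c1", "duration_s": 340}',
    '{"type": "transaction", "customer": "c2", "amount": 75.5}',
    '{"type": "sensor", "robot": "r7", "vibration": 0.82}',
]


def total_spend_per_customer(lake):
    """Interpret the raw records only now, when the question is asked."""
    totals = {}
    for line in lake:
        record = json.loads(line)
        if record.get("type") != "transaction":
            continue  # other record types stay in the lake, untouched
        customer = record["customer"]
        totals[customer] = totals.get(customer, 0.0) + record["amount"]
    return totals


def average_call_duration(lake):
    """A second, later question answered from the same raw data."""
    records = [json.loads(line) for line in lake]
    durations = [r["duration_s"] for r in records if r.get("type") == "support_call"]
    return sum(durations) / len(durations) if durations else None


spend = total_spend_per_customer(raw_lake)
avg_call = average_call_duration(raw_lake)
```

Neither question had to be known when the records were stored, and answering one does not discard the data needed for the other.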
Data generation is exploding worldwide, especially as enterprises and users turn to mobile interactions and Internet of Things (IoT) connected products. Because of the growing variety and volume of data, Data Lakes are emerging as a powerful architectural alternative to traditional data warehousing.
Eva Nahari is senior director of product management at the US-based software company Cloudera, where she helps companies produce and put data and cloud strategies into practice, acting as a driving force for the future of distributed data processing and Machine Learning applications.
– The traditional siloed systems, such as legacy Data Warehousing, have put a lot of pressure on IT budgets as data volumes have dramatically increased year over year. In addition, all these new data types that have come to have a crucial effect on core business decisions, such as satellite or medical images, support call recordings, or vibration data from manufacturing robots, don’t fit easily into inflexible tables.
According to Eva Nahari, the biggest advantage of Data Lakes is flexibility.
– The benefits of Data Lakes are plentiful. First of all, they are flexible, i.e. you don’t need to have all the business use cases or questions up front, or even all the data.
You can run a multitude of different types of big data analytics, such as data warehousing, text analytics, and machine learning, on top of the Data Lake to guide better decisions.
– Basically, you can use any stored, historical data to learn which patterns signal problems in the current operation or context. When you see those patterns repeating in new incoming data or real-time data, you can then act on them immediately and save your organization money before an incident occurs.
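As a rough illustration of learning a pattern from historical data and then spotting deviations in new data, the following Python sketch learns a "normal" band from past readings and flags incoming values outside it. The vibration numbers, the three-standard-deviation rule, and the function names are assumptions made for this example; real deployments would typically use richer models.

```python
from statistics import mean, stdev


def learn_normal_range(history, k=3.0):
    """Learn a simple 'normal' band: mean plus or minus k standard deviations."""
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def flag_anomalies(readings, normal_range):
    """Return the readings that fall outside the learned band."""
    low, high = normal_range
    return [r for r in readings if r < low or r > high]


# Hypothetical historical vibration data from a manufacturing robot.
historical_vibration = [0.50, 0.52, 0.49, 0.51, 0.48, 0.50, 0.53, 0.47]
normal = learn_normal_range(historical_vibration)

# Hypothetical new incoming readings; 0.95 lies far outside the learned band.
incoming = [0.51, 0.49, 0.95, 0.50]
alerts = flag_anomalies(incoming, normal)
```

Here `alerts` contains only the outlying reading, which is the kind of signal that could trigger maintenance before an incident occurs.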
There are of course certain challenges that come with mixing data sets, one of them being compliance.
– GDPR, for instance, is a good move towards protecting individuals’ data, and has pushed companies even outside the EU to track every little piece of information about a given individual, in order to be able to erase it upon request. Many companies are hesitant about storing data, as they don’t know how to make sure they stay compliant with GDPR.
– With the right mindset GDPR is great and storing data is valuable, and those two together don’t have to be a problem, Eva says. Information that should not be combined can be stored with different access permissions or even in geographically separated Data Lakes, and companies can establish strategies for how data should be masked or anonymized without sacrificing core values. There are lots of mature technologies today that can help trace data and make sure organizations stay compliant. Data is valuable, so organizations shouldn’t shy away from storing it, but they should endeavour to store it with privacy and access controls in mind and built into the original design of their Data Lakes.
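One possible masking strategy is sketched below in Python: sensitive fields are replaced with a keyed hash before a record is stored, while the remaining fields stay usable for analysis. This is only an illustration; the field names, the `SECRET_KEY` value, and the helper functions are hypothetical, and it is not a statement of what GDPR requires.

```python
import hashlib
import hmac

# Assumption: in practice the key would come from a managed secrets store.
SECRET_KEY = b"replace-with-a-managed-secret"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace a value with a keyed SHA-256 digest (stable for the same key and input)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_record(record: dict, sensitive_fields: set) -> dict:
    """Return a copy of the record with the sensitive fields pseudonymized."""
    return {
        field: pseudonymize(str(value)) if field in sensitive_fields else value
        for field, value in record.items()
    }


# Hypothetical patient record; only the identifying fields are masked.
patient = {"name": "Jane Doe", "email": "jane@example.com", "diagnosis_code": "J45"}
masked = mask_record(patient, {"name", "email"})
```

Because the same input always yields the same digest, masked records can still be joined and counted per individual. Note that keyed hashing is pseudonymization rather than full anonymization, so access to the key still needs to be controlled.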
– I’m a big believer in open data for global research. Interesting things are in motion globally in this area and the development pace is fast, Eva says. If used correctly, companies and communities will save millions of dollars using Data Lakes and open data. The question of the day is less about when the data revolution will occur and more about who will be part of it and who will get to claim the unique insights and values their Data Lakes can offer them first.
By Malin Hefvelin