Data Lakes make what was impossible yesterday possible today. Used properly, a Data Lake can help you discover problems before they arise, save resources, and promote the actions with the biggest impact on a given community or organization.


Today’s big data analytics use Data Lakes to uncover patterns, market trends, and customer or patient preferences in order to help businesses or communities make informed decisions faster and automate the process of making accurate predictions.

Eva Nahari is senior director of product management at the US-based software company Cloudera, an advisor to the Swedish Innovation Agency Vinnova, and a member of the FPX board evaluation council.

– We have hundreds of examples of how Data Lakes can benefit organizations. Let’s look at the turbine manufacturer that decided to collect turbine vibration data from all the plants where its turbines were installed. We’re talking hundreds of plants and thousands of turbines, all of which generate vibrations continuously. In a traditional Data Warehouse you couldn’t even store that type of data and, taking quality analysis into account, it would be far too massive and impossible to represent in a table. Instead they decided to store these enormous amounts of data in a Data Lake that can handle both the volume and the new data type. They would also store structured data about turbines that had previously failed in the Data Lake. This data could include information such as what problems arose, which subcomponents were involved, how long the downtime was, etc.

– Then the turbine manufacturer could run Machine Learning algorithms over the Data Lake to recognize patterns. For example, a pattern would emerge when a certain type of vibration was produced by a component that was soon to fail. These patterns could then be used to analyze new incoming data in real time and alert operating staff before a turbine failed. When this alert happens, the company may, for example, pre-order the subcomponent likely to be the root cause of the upcoming failure and send it to the plant, where it can be installed during a scheduled downtime.

Compare this with surprise downtime: analyzing the issue on-site, ordering spare parts, waiting for them to arrive and finally exchanging them, all of which would likely have a much larger impact on the business. The ability to predict failures and schedule downtime for maintenance ended up saving the turbine company millions every year.

– This is something that previously wasn’t possible, and there are many other companies that have saved or generated millions utilizing Data Lakes in combination with modern time series data warehousing and machine learning, says Eva Nahari.
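The alerting idea in the turbine example can be illustrated with a very small sketch. It is not the manufacturer's method: it swaps the Machine Learning step for a plain statistical threshold on vibration amplitude, and the function names, the synthetic sine-wave data and the factor `k` are all assumptions made for illustration.

```python
import math
import statistics


def rms(window):
    """Root-mean-square amplitude of one window of vibration samples."""
    return math.sqrt(sum(x * x for x in window) / len(window))


def fit_baseline(healthy_windows):
    """Learn the mean and standard deviation of RMS from known-healthy windows."""
    values = [rms(w) for w in healthy_windows]
    return statistics.mean(values), statistics.stdev(values)


def should_alert(window, baseline, k=3.0):
    """Flag a window whose RMS is more than k standard deviations above normal.

    k=3.0 is an arbitrary illustrative choice, not a recommended setting.
    """
    mean, std = baseline
    return rms(window) > mean + k * std


def sine_window(amplitude, n=100):
    """Synthetic stand-in for sensor data: a sine wave of the given amplitude."""
    return [amplitude * math.sin(2 * math.pi * i / 20) for i in range(n)]


# "Historical" data: ten healthy windows with slightly varying amplitude.
healthy = [sine_window(1.0 + 0.01 * j) for j in range(10)]
baseline = fit_baseline(healthy)

# "Incoming" data: one ordinary window and one with doubled amplitude.
normal_alert = should_alert(sine_window(1.02), baseline)
failing_alert = should_alert(sine_window(2.0), baseline)
```

In a real deployment the baseline would be learned from historical data in the Data Lake, including the records of past failures, and the check would run on each new window as it arrives.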


Another area where Data Lakes could be of use is medical care. In many US hospitals it is not uncommon for people coming in for routine surgery to get an infection afterwards, usually in the form of sepsis.

– After analyzing tons of data and distilled data patterns from hundreds of hospitals, including patient data, employer data, layout, blueprints, sourcing, processes, nurses’ notes etc., they managed to identify a dip in soap dispenser use and were able to connect it with dispensers near rooms where patients had died of sepsis. Long story short, and put in simple terms, it turns out that in certain hospitals, specifically those where the sepsis mortality rate was high, staff delivering food to the patients did not wash their hands as frequently or as carefully as doctors and nurses did. The affected hospitals could then introduce new sanitary regulations for all staff, and the number of fatal cases dropped. We actually saved lives!

– I can see Data Lakes being used, for example, for tracking and treating epidemics – Machine Learning can be used to detect probable cluster roots and geographical areas where the disease spreads more quickly. As a result, you can apply major relief efforts exactly where they’re most needed and stay focused with limited resources. This same pattern recognition could also be applied to unemployment programs, I’m sure, or to educational or even migration-focused groups. All in order to find better ways to serve and help, and to achieve better results for society as a whole. I’m convinced there are data-driven patterns just waiting to be extracted from those topics or situations that could help determine which programs will ultimately be more successful, and help put limited resources to use in the best possible way, rather than spending your time not knowing.

FPX and Gävle Innovation Arena are currently building a Data Lake with open data for the purpose of establishing an informational resource database for researchers, students and companies both in Gävle and abroad. The Data Lake keeps growing and the goal is to store information about things like buildings, traffic, roads, air quality, accidents, health, agriculture and safety in and around Gävle.

– Gävle Innovation Arena is very inspiring, Eva emphasizes. They’re very forward thinking and I hope people recognize that.

– They are on a mission to make open data accessible in a responsible way, bringing it all together in one portal and thereby making it easier to use for entrepreneurs, scientists, researchers, academia, policymakers, citizens and big industry. They really embrace the concept of what a Data Lake is meant for, which is to fuel innovation and give you the opportunity to gain previously unreachable insights by bringing all these data sets together and raising our awareness of them. By staying focused on the areas in which the region is strong, such as health, population, geo information and sustainable city planning, I think this region has a really good chance of flourishing and taking the lead.

As opposed to a traditional Data Warehouse, where data is stored for a predetermined purpose and set of questions, the purpose of the data in a Data Lake does not need to be known or defined prior to storage.
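The difference can be made concrete with a toy sketch. It assumes a hypothetical two-column warehouse table and uses a plain Python list of JSON strings as a stand-in for a Data Lake; none of these names come from a real product.

```python
import json

# Hypothetical fixed schema, decided before any data is stored.
WAREHOUSE_COLUMNS = ("turbine_id", "downtime_hours")


def warehouse_insert(table, record):
    """Warehouse style: reject any record that does not match the fixed columns."""
    if set(record) != set(WAREHOUSE_COLUMNS):
        raise ValueError(f"record does not match schema: {sorted(record)}")
    table.append(tuple(record[c] for c in WAREHOUSE_COLUMNS))


def lake_store(lake, record):
    """Lake style: keep the raw record as-is, with no schema decided up front."""
    lake.append(json.dumps(record))


def lake_read(lake, fields):
    """Pick the fields of interest at query time; missing fields become None."""
    for line in lake:
        raw = json.loads(line)
        yield {f: raw.get(f) for f in fields}


records = [
    {"turbine_id": "T1", "downtime_hours": 12},
    {"turbine_id": "T2", "vibration": [0.1, 0.4, 0.2]},  # a new kind of data
]

table, lake, rejected = [], [], []
for r in records:
    lake_store(lake, r)  # the lake accepts both records
    try:
        warehouse_insert(table, r)
    except ValueError:
        rejected.append(r)  # the warehouse has no column for vibration data

# A question nobody planned for at storage time can still be asked later.
vibration_view = list(lake_read(lake, ["turbine_id", "vibration"]))
```

The vibration record never reaches the warehouse table, while the lake keeps it and lets a later question choose which fields matter.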

By Malin Hefvelin
