
Towards a Next-Gen Data Lake Architecture

By Apoorv Kashyap

Imagine if your data lake could be more like a modern smartphone: you simply download apps to do the important things, such as serving product recommendations on your website, managing loyalty campaigns for marketing, or producing cash-flow predictions for finance.

Today, there’s no doubt in the minds of business and technology leaders that, to generate ROI from our data lake investments, we need to successfully productionise multiple AI-powered programs within the data lake – that is, programs that produce a continuous stream of actionable insights from the digital, enterprise, and third-party data stored inside the lake.


First-Gen Data Lakes

First-generation data lakes, already implemented widely across industries and domain verticals (often in multiple instances within a single organisation), have been struggling to generate returns on the investments poured into big data stacks such as Cloudera, Hortonworks, MapR, DataStax, and many others.

The single most common habit of the first-gen DL era was to reuse design patterns from the enterprise data warehouse and relational database worlds. Whilst this was a reasonable starting point, we quickly realised that we can't simply “port” our existing data warehouses onto the new big data stack and pretend to be the next Facebook or Google (where much of this technology originated). It's not that EDW and RDBMS concepts don't suit big data; it's that these patterns carry an inherent bottleneck: the schema.

Let me explain using a real-world example. Say you and your stakeholders want to combine Adobe data with your existing ERP systems, customer call logs, and third-party credit score data to create a complete view of each customer. That view is to be consumed by downstream applications that make decisions for business use cases – say, product recommendations on your website and next-best-action email campaigns. You and your stakeholders have been battling with this scenario for 18+ months. During that time you've had external consultants write Scala scripts to ingest the input datasets into your on-prem HDFS-based data lake, your internal IT team write Python jobs that run on a Spark cluster to build the single customer view, and your digital analytics teams write code to expose product recommendations as an API and produce a daily Excel export for the email campaigns.
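To make the “single customer view” step above concrete, here is a minimal PySpark sketch of the kind of join job the internal IT team might own. The dataset paths, table names, join key, and columns (adobe_clickstream, erp_customers, call_logs, credit_scores, customer_id, and so on) are hypothetical placeholders for illustration, not the actual pipeline described here.

```python
# Minimal sketch of a "single customer view" build in PySpark.
# All paths, table names, and columns below are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("single-customer-view").getOrCreate()

# Datasets already ingested into the HDFS-based lake (assumed locations).
adobe = spark.read.parquet("/lake/raw/adobe_clickstream")
erp = spark.read.parquet("/lake/raw/erp_customers")
calls = spark.read.parquet("/lake/raw/call_logs")
credit = spark.read.parquet("/lake/raw/credit_scores")

# Summarise behavioural and call data per customer before joining.
behaviour = adobe.groupBy("customer_id").agg(
    F.count("*").alias("page_views"),
    F.max("event_time").alias("last_seen"),
)
call_summary = calls.groupBy("customer_id").agg(
    F.count("*").alias("call_count"),
)

# Join everything onto the ERP master record to form the customer view.
customer_view = (
    erp.join(behaviour, "customer_id", "left")
       .join(call_summary, "customer_id", "left")
       .join(credit.select("customer_id", "credit_score"), "customer_id", "left")
)

# Persist for downstream apps (recommendations API, email campaign export).
customer_view.write.mode("overwrite").parquet("/lake/curated/single_customer_view")
```

Even a sketch this small hints at the coordination problem: the ingest scripts, this join job, and the downstream recommendation and export code all have to agree on schemas, paths, and run order.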

Stitching all of the above together and managing rapid change on a daily basis becomes a much bigger task than anticipated. Soon your teams start to disagree on which Hive and HBase tables contain the golden copies and which can be discarded. The IT and infrastructure teams become unsure which clusters need to keep running, and maintaining the correct execution sequence of these various programs becomes more and more taxing. Your teams grow reliant on a few unicorns, which increases operational risk many-fold. Your CDO becomes uneasy with the lack of data cataloguing and data governance built into the change processes. The data lake truly becomes a data swamp: plenty of work has gone into creating intellectual property inside it, yet no value can be drawn out because insights never make it into production. Next thing the CIO knows, it's 2019 and we still haven't gained even a 2x ROI on the millions invested in and around data programs over the last several years.

Next-Gen Architecture

Four years ago, a small team of Data Scientists and digital ninjas started on a journey to create a new and more robust way of implementing AI in the big data landscape. We imagined a world where disparate teams of technologists – Data Scientists, Data Engineers, IT and infrastructure – could create containerised solutions to their company's toughest challenges, deployed inside their existing data lake infrastructure in the form of interconnected apps.

Today the team at Syntasa does just that for Dixons, Sky, Lenovo, and many more leading digital brands. Your digital business wants product recommendations based on your customers’ behavioural data from Adobe Analytics – we have an app for that. You want to increase average basket size on your website – we have another app for that. Segment customers in real-time based on 1000s of online and offline data points – we have yet another app exactly for that.

We are on a mission to create many more configurable apps, but equally, your Data Science and Data Engineering teams can just as easily create novel apps using widely-adopted technologies such as Python and Apache Spark, or configure Syntasa's out-of-the-box apps into end-to-end solutions that can be deployed in your data lake as “containers” within hours. An inbuilt CI/CD pipeline lets your business experiment with and retune AI-powered customer and business strategies in real-time, all while having a full dashboard view of how these apps perform and interact with each other. Our architecture can co-exist with your current data lake implementations and help you package your existing code into apps, creating rapid business value out of your data lakes.

We would love for you to get in touch and discuss how Syntasa can help you reap the best outcomes from your next-generation data lake. If any of this resonates with you, email me (Apoorv Kashyap) or request a briefing here.
