Database vs Data Warehouse vs Data Lake | What is the Difference?

38 Comments



Database vs Data Warehouse vs Data Lake | Today we take a look at these 3 different ways to store data and the differences between them.

____________________________________________

SUBSCRIBE!
Do you want to become a Data Analyst? That’s what this channel is all about! My goal is to help you learn everything you need in order to start your career or even switch your career into Data Analytics. Be sure to subscribe to not miss out on any content!
____________________________________________

RESOURCES:

Coursera Courses:
Google Data Analyst Certification: https://coursera.pxf.io/5bBd62
Data Analysis with Python – https://coursera.pxf.io/BXY3Wy
IBM Data Analysis Specialization – https://coursera.pxf.io/AoYOdR
Tableau Data Visualization – https://coursera.pxf.io/MXYqaN

Udemy Courses:
Python for Data Analysis and Visualization- https://bit.ly/3hhX4LX
Statistics for Data Science – https://bit.ly/37jqDbq
SQL for Data Analysts (SSMS) – https://bit.ly/3fkqEij
Tableau A-Z – http://bit.ly/385lYvN

*Please note I may earn a small commission for any purchase through these links – Thanks for supporting the channel!*
____________________________________________

SUPPORT MY CHANNEL – PATREON/MERCH

Patreon Page – https://www.patreon.com/AlexTheAnalyst

Alex The Analyst Shop – https://teespring.com/stores/alex-the-analyst-shop

____________________________________________

Websites:
Website: AlexTheAnalyst.com
GitHub: https://github.com/AlexTheAnalyst
Instagram: @Alex_The_Analyst
____________________________________________

*All opinions or statements in this video are my own and do not reflect the opinion of the company I work for or have ever worked for*

source

38 Comments
    • blank
      I bet you read my username and like it
      May 23, 2022 02:41 am Reply

      Its 2:00am – Why am I watching a video about data lakes?

    • blank
      James Hughes
      May 23, 2022 02:41 am Reply

      This was an Excellent Summary, Alex. I'll be showing this to the young'ins next time I need to train one =D

    • blank
      B G
      May 23, 2022 02:41 am Reply

      Though often times by database people do mean a RDBMS (Relational Data Base Management System), and use SQL, a database is any organized data store and goes well beyond just relational data models. One of the more common generic alternatives are hierarchical databases. Some simple examples of that are xml with xquery or the windows registry. I have even used what was tab delineated files and a file system hierarchy to do bioinformatics research.
      A second thing is that databases don't need to be persistent, ACID, or on disk. The point being keep your mind open to things past RDBs. A different data store and query system may provide a better solution to the problem at hand.
      If you do use RDBs, please bring it to third normal form, preferably BCNF. I cannot tell you the number of times I have come into a company to help solve a problem and it turns out that their data doesn't even fit first normal form.

    • blank
      Walter White
      May 23, 2022 02:41 am Reply

      I think this guy's trying to look like me.

    • blank
      Dustin
      May 23, 2022 02:41 am Reply

      The "transaction" in OLTP doesn't mean retail sale. It means that a specific and complete interaction with the database has occurred. Yes, it works like a retail sale, but using such an analogy tends to confuse people that are unfamiliar with databases.

    • blank
      Zach Wiggin
      May 23, 2022 02:41 am Reply

      Not sure why this video was in my feed but it was interesting! I now know enough to be stupid.

    • blank
      Rajib Roy
      May 23, 2022 02:41 am Reply

      Like your Explanation Sir.

    • blank
      Stk
      May 23, 2022 02:41 am Reply

      As a dev its a bit confusing with the data warehouse. I mean it's just a concept and a design idea right? I mean if I create a new database in SQL server and optimized for Data warehouse its an data warehouse right?

    • blank
      Andrij Demianczuk
      May 23, 2022 02:41 am Reply

      Also have a look at Delta Lake. It’s about providing warehousing capabilities on a data lake, relying on a hive metastore and parquet normalization for column features. Delta lake is OSS and becoming more widely supported 🙂

    • blank
      Classicracer Uk
      May 23, 2022 02:41 am Reply

      Don’t forget Data Marts

    • blank
      Charlie
      May 23, 2022 02:41 am Reply

      Great video.

    • blank
      MastJedi
      May 23, 2022 02:41 am Reply

      I like very much work with Databases rather than Warehouses or Lakes. Especially when I can prove (many times did), that my report is right, and report based on Warehouse/Lake is wrong. Work with databases give me freedom and very fast answer on problems. Investigate, find issue and ask data owner to fix input data (usually).
      When chat with owners of reports based on Warehouse/Lakes is always the same: – I dont know.
      My report – headcount 167 ppl.
      Lake 152.
      Where did you take 152 from? I dont know
      I can give you 167 names with numbers of hours, when, where, how long. Can you do the same with 152 so we can compare our data? – I dont have this data. But 152 is THE number.
      Taken all data from Data Lake and check it took them TWO Months. lol.
      Welcome to the big corpo life.

    • blank
      Martijn Vos
      May 23, 2022 02:41 am Reply

      Nice breakdown. From your explanation, I get the impression that the neo4j graph database we used on my previous project was actually a data warehouse for us, because we filled it with data we drew from many other databases, and structured it in a specific way (relations between the many different data elements from the different systems) for the reporting tool we built on top of it.

      And much of our data didn't come directly from those other databases, but drew it daily from an intermediary portal that contained all sorts of different kinda of data from different systems, which I guess was a data lake of some sort.

    • blank
      Andrew Whitworth
      May 23, 2022 02:41 am Reply

      It's worth mentioning, and I don't think you said this explicitly, that the three things can all be running the same underlying software. You can run a Database, Data Warehouse and (at least part) of a Data Lake using the same SQL engine, for example. The big difference is in how you use it, which has more to do with convention (and configuration) than any kind of different technical solution.

    • blank
      WECON Technology
      May 23, 2022 02:41 am Reply

      Wecon IIOT V-BOX could support PLC connecting with database.

    • blank
      Richard Sanders
      May 23, 2022 02:41 am Reply

      Maybe a video on how the cluster/cloud companies charge the customer based on processing/RAM/data storage/bytes transferred, etc. 😀

    • blank
      VadimChes
      May 23, 2022 02:41 am Reply

      Don't put your face video to the right upper corner, where you close your slide text…

    • blank
      Knight Errant
      May 23, 2022 02:41 am Reply

      I disagree with your descriptions in some ways. Especially the bullet point and mention of the data warehouse's data being summarized. You seemed to say, without saying the actual word, that the database data was being aggregated when the ETL process brought the data into the data warehouse. Ok. That is fine and in my experience happens that way quite often by default. The issue I have with that is you also describe the Data Warehouse as where the the analysis and reporting of the data occurs. The sort of 'analysis' you can perform with aggregated data is severely limited. As in 'kindergarden through grade school' limited. Any sort of analysis with aggregated data will be comparing volumes of X and volumes of Y against dates when known incidents occurred or certain conditions obtained (i.e. an outage, inventory shortfall, holiday weekend, etc) at most. And that is pretty much the limit of what can be done when performing an analysis with aggregated data. I – myself – as a systems data analyst and data architect, have come to regard these terms as having become 'executive-speak' as of 2022 and are not very useful in describing the concepts required for conceptualizing modern data storage, processing, and analytics – especially in light of the revolution slowly unfolding with the emergence and evolution of cloud (distributed) technologies. Hmm. Perhaps I need to make some videos of my own (or collaborate on making a few) to put this new conceptualization model out there? What do you think, Alex? And my fellow YouTube video consumers and data nerds? 🙂

    • blank
      Random Debris
      May 23, 2022 02:41 am Reply

      Data lake makes me think of Apple's Newton's data storage (Soups)

    • blank
      Jesse Lima
      May 23, 2022 02:41 am Reply

      By the first time, those concepts got clear to me. Thanks!

    • blank
      おちゃむ
      May 23, 2022 02:41 am Reply

      I develop DWH.It's good video.

    • blank
      Moses Varghese
      May 23, 2022 02:41 am Reply

      Alex, it is awesome to see these videos man! Very informative.

    • blank
      Rik Killeen
      May 23, 2022 02:41 am Reply

      As a foodie, I propose we rename "data lake" to "data ragu."

    • blank
      Liam Dennehy
      May 23, 2022 02:41 am Reply

      You typically don't find a relational database with a flexible schema – they're most often mutually exclusive. A few other generalisations need attention. I understand this is a short video, but still…

    • blank
      D Lama
      May 23, 2022 02:41 am Reply

      Your head is covering parts of the slide.

    • blank
      Matt B
      May 23, 2022 02:41 am Reply

      Any videos on Data visualization?

    • blank
      Ou8y2k2
      May 23, 2022 02:41 am Reply

      One or all three? What about just two?

    • blank
      milligoree
      May 23, 2022 02:41 am Reply

      Great video, nice and clear. The only thing that could have been a helpful addition would be the names of the main tools for each.

    • blank
      DanDirindon
      May 23, 2022 02:41 am Reply

      Sorry but you have described a Data Lake in a too generic way. It sounds only like a place where you can put everything, and nothing more! So does any company that holds data own a Data Lake? I do not think so! In my humble opinion, a Data Like is more an idea than a well-defined concept, appreciated more by the technology vendors to sell huge systems. The term was coined by the Chief Technology Officer of Pentaho, James Dixon, but then it evolved depending on the interests of users, vendors and academic gurus. A serious Data Lake should have the technology, tools, space, resources to load and manage any kind of data mainly from the scratch, with no predetermined or predefined structure or schema. Including real-time data, raw, not cleaned data, or unstructured data, videos, voice, text or clickstream! Independently from the true presence of all these kinds of data in the company and from the real usage of this data. It is strategic to maintain a central repository with metatags and the references to all the data stored in the Data Lake to avoid losing large amount of time to look for something. The same for a system of Data Security to prevent unauthorized accesses to each portion of data. And a lot of effort to avoid that the Data Lake evolves in a Data Swamp! To be more precise, a Data Lake is useful but not strictly necessary for machine learning or artificial intelligence like many people say. And finally, let me state that at least 99% of the companies in the world do not need a real Data Lake, no matter what the technology vendors suggest.

    • blank
      amitava paul
      May 23, 2022 02:41 am Reply

      very comprehensive video..Thanks!

    • blank
      Łukasz Smykał
      May 23, 2022 02:41 am Reply

      clear, straightforward, fantastic 😉
      thx

    • blank
      Katlego
      May 23, 2022 02:41 am Reply

      SQL Developer all the way from South Africa 👋🏾. These lessons really help, thank you. But they (Datawarehouse Architects) don't trust me yet with WRITE access to the Datawarehouse 😂

    • blank
      EndMan7
      May 23, 2022 02:41 am Reply

      There is also a Data Lakehouse

    • blank
      FrostSpike
      May 23, 2022 02:41 am Reply

      Data in the core of the warehouse is normally full-grain (albeit not necessarily all the tables/columns – it's a subset not a summary). It's highly normalised to keep the size down as much as possible. Overlaying the data warehouse are the data marts which are structured (with summaries) for easy analytical consumption. Some people conflate the two and see the data marts (sometimes called information marts) as being part of the data warehouse (and, depending on the platform, might just be views over the curated store) but they're logically a separate layer performing a different function. The Data Lake fulfils the same function as the Data Warehouse core but is more flexible on the types of content it can capture. There's a new-ish architecture called a "Data Lakehouse" which sort of composites the two concepts.

    • blank
      CodingStream
      May 23, 2022 02:41 am Reply

      you could have included some examples of each, that would be of great help !!
      and also maybe a real life usage of these, I know you discussed it for database and datawarehouse but for data-lake as well in a littlr but more detail. datalake was very vaguely covered

    • blank
      JaeHo Yun
      May 23, 2022 02:41 am Reply

      Thanks Alex. This is simple and useful for me

    • blank
      Joshua Morley
      May 23, 2022 02:41 am Reply

      And now we see the latest iteration of a data topology, the lakehouse 🙂

      Great video! i really like that you mention the summarisation as a point of differentiation. Analytics on a data warehouse is typically done by a data analyst with traditional data analyst skillsets (SQL, R etc). Analytics on a data lake is typically done by a data scientist (Python, ML etc)

      An important point to note is "if you have all this data and you have no idea what to do with it" for data lake, a vital thing to focus on when creating a data lake is the structure of your data in the lake (as in directory and file structure)
      if you just dump it all in, it will become a data swamp and waste of money

    • blank
      Anas Farah
      May 23, 2022 02:41 am Reply

      Would it be fair to say that a database goal is to optimize for write operations while a DW is to optimize for read operations?

Leave us a comment