Dealing With Big Data - Computerphile



Big Data sounds like a buzzword and is hard to quantify, but the problems with large data sets are very real. Dr Isaac Triguero explains some of the challenges.

https://www.facebook.com/computerphile

This video was filmed and edited by Sean Riley.

Computer Science at the University of Nottingham: https://bit.ly/nottscomputer

Computerphile is a sister project to Brady Haran’s Numberphile. More at http://www.bradyharan.com


32 thoughts on “Dealing With Big Data – Computerphile”
  1. You could process a Petabyte of data on a cell phone.

    The real thing I think this video is missing is simply that you scale up when the business needs things to run faster.

  2. 16 GB makes him a lucky guy? That's like the bare minimum for a developer these days. I want 64 GB for my next upgrade in a year or so.

  3. One handy trick is to reduce the number of "reductions" in a map-reduce task. In other words, more training, less validation. The downside is that this could mean the training converges more slowly (a toy sketch of this appears after the comments).

  4. At my university there is a master's programme in data science and artificial intelligence.

    It's something I might go into after finishing my bachelor's in computational linguistics. However, I do need to take additional maths courses, which I haven't looked into yet.

    Apparently the supercomputer at the university has the largest memory in all of Europe, which is 8 TB per node.

  5. If you need 200+ servers, just run it on an IBM z Server as a plain-Jane compute task all by itself.

  6. Here is my biggest problem with all of the definitions of 'big data': they require multiple computers. What if it only requires multiple computers because the person who is 'analyzing' it doesn't know how to deal with large data efficiently? Quality of data? I will just use SQL/SSIS to cleanse the data. I normally deal with data in the multiple-TB range on either my laptop [not a typical laptop – 64 GB of RAM] or my workstation [again, perhaps not a normal computer, with 7 hard drives, mostly SSD, 128 GB of RAM and a whole lot of cores] and can build an OLAP cube from the OLTP data in minutes (a small sketch of that roll-up appears after the comments), then run more code doing some deeper analysis in a few minutes more. If it takes more than 30 minutes, I know that I screwed something up. If you have to run it on multiple servers, maybe you also messed something up. Python is great for the little stuff [less than 1 GB], and so is R, but for the big data you need to work with something that can handle it. I have 'data scientist' friends with degrees from MIT who couldn't handle simple SQL and would freak out if they had more than a couple of MB of data to work with. Meanwhile, I can handle TBs of data in less time with SQL, SSIS, OLAP, MDX.
    Yeah, those are the dreaded Microsoft words.

  7. No. Everyone is talking about COVID. And I listened to him until he mentioned COVID in the first few minutes; enough of the broken English anyway.

  8. Do a video on RAID storage! With all this talk about big data and storage, I would love some videos on RAID 5 and parity drives! (A quick sketch of the parity idea appears after the comments.)

  9. FDR InfiniBand is super cheap for networking! You can get twin-port cards on eBay for $50 that'll transfer 14 GB/sec, and 36-port switches for $200. I love it!

  10. It's not the size of your data set that matters, nor how many computers you use or the statistical methods you apply; what matters is how useful the knowledge you extract is.

  11. Interesting video. I worked on and designed big data, building large databases for litigation in the early 1980s… that was big at the time. Then a few years later I was creating big data for shopping analysis. The key is that big data is only big for the years you are working on it, not afterwards, as storage and processing get bigger and faster. I think that while analysis and reporting are important (otherwise there is no value to the data), I do believe that designing and building proper ingestion and storage is just as important. My two cents from over 30 years of building big data.

  12. As someone who works more on the practical side of this field, it really is a huge problem to solve. I work with data sets where we feed in multiple terabytes per day, and making sure the infrastructure stays healthy is a huge undertaking. It's cool to see it broken down in a digestible manner like this.

  13. "If you are using windows that is your own mistake" …well that is the hard truth for data scientists lol

  14. I prefer the strategy where I make everything super memory-efficient and then go do something else while it runs for a long time.

  15. I have worked with Redshift and then with Snowflake. Snowflake solved the problems Redshift had by storing all the data efficiently in central storage instead of on each machine. The paradigm is actually backwards now, as storage is cheap (the network is still the bottleneck).

  16. What? Buying more memory is cheaper than buying more computers… which just means you're throwing more memory and CPU at it. I think you meant that the alternative is to write a slower algorithm that uses less memory (a sketch of that strategy appears after the comments). Also, buying more memory is often cheaper than the labor cost of refactoring, especially when it comes to distributed systems. Also, why the Windows hate? I don't use Windows but still cringed there a bit.

  17. This video felt more like marketing than education, sorry. Surely you just use whatever solution is appropriate for your problem, right? Get that hammer out of your hand before fixing the squeaky door.
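
On comment 3's point about cutting down the reduce phase: a toy word count makes it concrete. The sketch below is a minimal, single-process stand-in for a real map-reduce framework, with made-up input splits; the local combine step pre-aggregates each mapper's output, so far fewer key/value pairs have to be grouped and shuffled before the final reduce.

    from collections import Counter, defaultdict

    # Toy "input splits" standing in for files on different machines (illustrative only).
    splits = [
        "big data is not just big",
        "data needs cleaning before analysis",
        "big clusters move data over the network",
    ]

    def map_phase(split):
        # Classic word-count mapper: emit a (word, 1) pair per word.
        return [(word, 1) for word in split.split()]

    def combine(pairs):
        # Local combiner: pre-aggregate on the mapper side so fewer
        # key/value pairs have to cross the expensive shuffle/reduce boundary.
        counts = Counter()
        for word, n in pairs:
            counts[word] += n
        return list(counts.items())

    def shuffle(all_pairs):
        # Group values by key, as the framework's shuffle stage would.
        grouped = defaultdict(list)
        for word, n in all_pairs:
            grouped[word].append(n)
        return grouped

    def reduce_phase(grouped):
        # Reducer: sum the partial counts for each word.
        return {word: sum(ns) for word, ns in grouped.items()}

    # Without the combiner every single (word, 1) pair would be shuffled;
    # with it, each "mapper" sends at most one pair per distinct word.
    mapped_and_combined = [combine(map_phase(s)) for s in splits]
    shuffled = shuffle(p for m in mapped_and_combined for p in m)
    print(reduce_phase(shuffled))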
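
On comment 6's single-machine workflow: rolling row-level transactional (OLTP) data up into an analytical (OLAP-style) summary is, at its core, one aggregation pass, which the SQLite engine bundled with Python is enough to sketch. The table, columns and figures below are invented for illustration; the commenter's actual SQL Server/SSIS/MDX tooling is not shown.

    import sqlite3

    # In-memory toy database; table and column names are invented for illustration.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (day TEXT, region TEXT, product TEXT, amount REAL)")
    con.executemany(
        "INSERT INTO sales VALUES (?, ?, ?, ?)",
        [
            ("2021-01-01", "EU", "widget", 10.0),
            ("2021-01-01", "EU", "gadget", 25.0),
            ("2021-01-01", "US", "widget", 12.5),
            ("2021-01-02", "EU", "widget", 7.5),
        ],
    )

    # One aggregation pass turns the row-level table into a summary table
    # keyed by the dimensions that analytical queries actually group on.
    con.execute("""
        CREATE TABLE sales_summary AS
        SELECT day, region, COUNT(*) AS orders, SUM(amount) AS revenue
        FROM sales
        GROUP BY day, region
    """)

    for row in con.execute("SELECT * FROM sales_summary ORDER BY day, region"):
        print(row)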
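
On comment 8's mention of parity drives: the recovery trick behind RAID 5 is plain XOR, so the arithmetic fits in a few lines. This is only a toy illustration of the maths, not of real stripe layout, controllers or rebuild logic.

    # The parity block is the XOR of the data blocks in a stripe, and any one
    # lost block can be rebuilt by XOR-ing the surviving blocks with the parity.
    def xor_blocks(blocks):
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, b in enumerate(block):
                out[i] ^= b
        return bytes(out)

    # One "stripe" spread across three data drives (toy 8-byte blocks).
    d0, d1, d2 = b"ABCDEFGH", b"IJKLMNOP", b"QRSTUVWX"
    parity = xor_blocks([d0, d1, d2])        # what the parity drive holds for this stripe

    # Simulate losing drive 1 and rebuilding its block from the survivors.
    rebuilt = xor_blocks([d0, d2, parity])
    assert rebuilt == d1
    print("recovered block:", rebuilt)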
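
Comments 14 and 16 both describe trading speed for memory by streaming the data instead of loading it all at once. Here is a minimal sketch of that strategy, using a hypothetical CSV file and column name:

    import csv

    def column_total(path, column):
        # Stream a CSV that may be far bigger than RAM, keeping only one
        # row in memory at a time while accumulating a running total.
        total = 0.0
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                total += float(row[column])
        return total

    # Hypothetical usage (file and column names are made up):
    # print(column_total("transactions.csv", "amount"))

The run takes longer than an in-memory approach, but the memory footprint stays flat no matter how large the file is, which is exactly the trade-off those comments describe.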
