Tags: concept, database, bigdata, mapreduce Status: 🟩


Big Data

Summary

Big Data refers to extremely large and complex datasets that traditional tools can’t manage effectively. It involves not just the data, but also advanced technologies and methods to handle, process, and analyze it. Key characteristics of Big Data include its large volume, high velocity, diverse variety, uncertain veracity, and the ability to derive value from it. Various processing systems, like MapReduce, distributed databases, and data lakehouses, are used to manage and analyze these datasets, enabling insights for business, science, and public sectors.

Big data

Big data refers to extremely large and complex datasets that are difficult or impossible to process, analyze or manage with traditional data processing tools and techniques such as an RDBMS. The concept covers not only the data itself but also the methods and technologies used to handle it effectively.

Characteristics of Big Data

There are five V’s that describe the key characteristics of big data.

Volume

The volume is the amount of data that is generated, stored and processed. Data management systems must be able to scale to the given data volume.

Velocity

Velocity is the speed at which data is generated, collected and processed. The data system must be able to process data at least as fast as it arrives.

Variety

Variety is the diversity of data types and sources. Data can come in structured, semi-structured or unstructured formats. It is challenging to manage different types of data, so the data management system should be chosen based on the use case.

Veracity

Veracity is the accuracy and trustworthiness of data. Data quality varies, and not all collected data can be trusted. It is essential to filter, clean and validate data to ensure accurate insights.
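The filter, clean and validate steps could look roughly like the sketch below. The record layout (`sensor_id`, `temperature`) and the plausible temperature range are hypothetical and only serve as an illustration.

```python
def clean_records(records, min_temp=-50.0, max_temp=60.0):
    """Filter, clean and validate raw sensor records (hypothetical schema)."""
    cleaned = []
    for record in records:
        sensor_id = str(record.get("sensor_id", "")).strip()
        if not sensor_id:
            continue  # filter: drop records without an identifier
        try:
            temperature = float(record.get("temperature"))  # clean: normalize the type
        except (TypeError, ValueError):
            continue  # drop values that cannot be parsed as a number
        if not (min_temp <= temperature <= max_temp):
            continue  # validate: drop physically implausible values
        cleaned.append({"sensor_id": sensor_id, "temperature": temperature})
    return cleaned


raw = [
    {"sensor_id": " a1 ", "temperature": "21.5"},
    {"sensor_id": "a2", "temperature": "n/a"},
    {"sensor_id": "", "temperature": "19.0"},
    {"sensor_id": "a3", "temperature": 999},
]
print(clean_records(raw))
```

Only the first record survives: the others fail parsing, lack an identifier, or fall outside the assumed range.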

Value

Value is the ability to turn data into insights for decision making and creating business value. Big data is only beneficial if it delivers insights.

  • Business: More effective operations, stronger customer relationships and higher revenue.
  • Science: New discoveries.
  • Public sector: Better service to citizens.

Classes of big data processing systems

Big Data Processing Overview

On the left side are the data sources, such as small data from OLTP systems, reports, social media feeds and Internet of Things devices.

To the right of the data sources is ETL (extract, transform, load), which collects the raw data from the sources, processes it by converting it into a format suitable for analysis, and then stores it in the Big Data System.
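A minimal sketch of the three ETL steps, assuming a CSV string as the source and a plain Python list as a stand-in for the Big Data System. The column names (`user`, `amount`) are made up for the example.

```python
import csv
import io
import json


def extract(csv_text):
    """Extract: read raw rows from a source (here an in-memory CSV string)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transform: convert raw strings into a format suitable for analysis."""
    return [
        {"user": row["user"].strip().lower(), "amount": float(row["amount"])}
        for row in rows
    ]


def load(rows, storage):
    """Load: append each record as a JSON line to the target store."""
    for row in rows:
        storage.append(json.dumps(row, sort_keys=True))
    return storage


source = "user,amount\nAlice,10.50\n Bob ,3\n"
store = load(transform(extract(source)), [])
print(store)
```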

In the center is the Processing System, which is the heart of the process. It stores and manages massive amounts of data. It supports maintenance, transformations and the creation of pipelines (automated workflows that process data continuously).

On the right side are the uses of the processed and stored data:

  • Analysis is about examining data to identify patterns, trends and insights.
  • Data mining is about discovering hidden relationships or structures within the data.
  • Machine learning uses the data to build predictive models and algorithms to make data-driven decisions.

At the bottom are the continuous processes that keep data relevant and up to date.

  • Maintenance: Ensures data integrity, reliability, and performance.
  • Transformations: Ongoing changes to adapt data formats or correct errors.
  • Pipelines: Automate the flow of data from raw sources to actionable insights, enabling real-time or batch processing.

Big Data Processing System

Big Data uses OLAP

Data collections typically consist of many millions of files.

  • The data is too large for a single server, hence it uses distributed storage.
  • The amounts of data read are very large, hence it uses sequential scans (full table scans).

Big Data uses distributed processing

When processing big data, typical tasks include

  • Applying filters to reduce the data quantity
  • Running complex processing pipelines
    • Periodic tasks & triggered tasks upon new data.
    • User-defined functions.
    • Server-side applications (like machine learning).

This is all too large for a single server, hence it uses distributed processing. (Large queries can go from taking many months on a single system to a couple of minutes on a large distributed system.)
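The idea of filtering partitioned data in parallel can be sketched as follows. This is only an illustration: threads on one machine stand in for separate servers, and the partition contents are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_partition(partition, predicate):
    """Sequentially scan one partition and keep the matching rows."""
    return [row for row in partition if predicate(row)]


def distributed_filter(partitions, predicate, workers=4):
    """Scan all partitions in parallel and concatenate the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the order of the input partitions.
        partial = pool.map(lambda p: scan_partition(p, predicate), partitions)
        return [row for rows in partial for row in rows]


partitions = [[1, 8, 3], [10, 2], [7, 9, 4]]
print(distributed_filter(partitions, lambda x: x > 5))
```

Each worker only touches its own partition, so adding partitions and workers spreads the scan across more machines in a real system.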

Classes of Big Data Processing Systems

Relational Database Systems

Systems: Snowflake, Amazon Redshift, Teradata.

Strengths

Weaknesses

  • Little support for unstructured data.
  • Little support for machine learning training and serving.

MapReduce Systems

Systems: Hadoop, Cloudera, Hortonworks.

Strengths

  • Distributed Storage
  • Great scalability
  • Able to process a great variety of datasets (structured, semi-structured, unstructured)

Weaknesses

  • Limited query interface (only map and reduce functions, no SQL)
  • Large maintenance cost.
  • Difficult to build applications on top.

Key-value Stores

Systems: Amazon DynamoDB, Redis, Cassandra.

Strengths

  • Distributed storage
  • Great scalability
  • Short latency for single items
  • Quick “out of the box” way to store data

Weaknesses

  • Limited storage model (only key-value pairs)
  • Limited query interface (no SQL, only lookups by key)
  • There are no tools to scan and filter data
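A toy sketch of why single-item access is fast while scans are not. The class and its hash partitioning scheme are hypothetical and do not model any of the systems named above. The `scan` helper is not part of a typical key-value interface; it shows the work an application would have to do itself.

```python
import hashlib


class KeyValueStore:
    """Toy key-value store that hash-partitions keys across nodes."""

    def __init__(self, num_nodes=3):
        self.nodes = [{} for _ in range(num_nodes)]

    def _node_for(self, key):
        # The key alone determines the node, so a lookup touches one node only.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        return self._node_for(key).get(key, default)

    def scan(self, predicate):
        """No index on values: a filter must visit every item on every node."""
        return {k: v for node in self.nodes for k, v in node.items() if predicate(v)}


store = KeyValueStore()
store.put("user:1", {"name": "Ada", "age": 36})
store.put("user:2", {"name": "Linus", "age": 54})
print(store.get("user:1"))
```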

Document Databases

Systems: MongoDB, Couchbase.

Strengths

  • Store objects / XML / JSON in hierarchical form
  • Good integration with object-oriented languages and JavaScript

Weaknesses

  • Limited query interface (no SQL)
  • No ACID guarantees
  • Not designed for scans (closely related to key-value stores)
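To illustrate hierarchical storage and the integration with object-oriented code, the sketch below turns a nested object into a JSON document and back. The `Customer` and `Address` classes are hypothetical, and no real document database is involved.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Address:
    city: str
    country: str


@dataclass
class Customer:
    name: str
    address: Address
    orders: list = field(default_factory=list)


customer = Customer("Ada", Address("London", "UK"), [{"item": "book", "qty": 2}])
document = json.dumps(asdict(customer))  # object -> hierarchical JSON document
restored = json.loads(document)          # document -> nested dicts and lists
print(restored["address"]["city"], restored["orders"][0]["qty"])
```

The nested address and the list of orders stay inside one document, so no joins across tables are needed to reassemble the object.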

Graph Databases

Systems: Neo4j, Apache Giraph.

Strengths

  • Capture graph relationships, e.g. knowledge graphs, social networks.
  • Fast at traversing edge chains, no joins needed

Weaknesses

  • Specific to graph applications
  • Not many such use cases
  • Relational databases often outperform graph databases these days (graph databases are most likely still faster at traversing edge chains, which relational databases handle with joins)
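Traversing edge chains without joins can be pictured as following pointers in an adjacency list. The sketch below uses a made-up "follows" graph and a plain breadth-first search; it does not reflect the internals of any specific graph database.

```python
from collections import deque


def reachable_within(graph, start, max_hops):
    """Breadth-first traversal that follows edges directly from node to node."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    return {node: hops for node, hops in seen.items() if node != start}


follows = {"ann": ["bob"], "bob": ["cat", "dan"], "cat": ["eve"], "dan": [], "eve": []}
print(reachable_within(follows, "ann", 2))
```

In a relational schema, each additional hop would typically be one more self-join on an edge table.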

Data Lakehouses

Systems: Databricks.

Strengths

  • Data storage in open data formats (Apache Parquet + Apache Iceberg) in the cloud.
  • Great ML support, e.g. training, but also Python-based notebooks.
  • Spark is significantly more flexible than Hadoop for complex processing pipelines.

Weaknesses

  • The typical API is Apache Spark, which is less versatile and less widely supported than SQL.
  • However, lakehouses are moving toward SQL.

MapReduce

MapReduce breaks down a large data processing task into smaller chunks, processes them in parallel, and combines the results. Its two main phases, map and reduce, are embedded in the following pipeline:

  • Read lots of data
  • Map: Process the data items (runs in parallel per file)
  • Sort and shuffle
  • Reduce: Aggregate the values of each key group, e.g. by merging duplicates and counting them (runs in parallel per group)
  • Write the result

Per definition (in the original MapReduce formulation): Map input: (k1, v1). Map output: list(k2, v2). Reduce input: (k2, list(v2)). Reduce output: list(v2).
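The pipeline above can be sketched with the classic word count example. This is a single-process illustration with made-up function names; a real MapReduce system would run the map calls per file and the reduce calls per group on different machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input file."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Sort and shuffle: group all values by key, in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce: merge the duplicates of one key into a single count."""
    return key, sum(values)


def word_count(documents):
    mapped = [pair for doc in documents for pair in map_phase(doc)]  # read + map
    grouped = shuffle(mapped)                                        # sort and shuffle
    return dict(reduce_phase(k, v) for k, v in grouped.items())      # reduce + write


print(word_count(["big data big", "data lake"]))
```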

MapReduce Illustration

MapReduce in a distributed system

DFS stands for “Distributed File System”, and typically divides large datasets into smaller chunks called data blocks.
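A rough sketch of splitting data into blocks and spreading them over nodes. The tiny block size, the node names, the round-robin placement and the replication factor are all assumptions for the example; real distributed file systems use far larger blocks and their own placement policies.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, nodes, replicas=2):
    """Assign each block to `replicas` nodes, round-robin.

    The nodes per block are distinct as long as replicas <= len(nodes).
    """
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replicas)]
        for block_id in range(num_blocks)
    }


blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_blocks(len(blocks), ["node-a", "node-b", "node-c"])
print(blocks)
print(placement)
```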

Concepts of MapReduce in Distributed Systems

  • Parallel computing: MapReduce divides data processing tasks into small, parallelizable units
  • Data processing: The data flow through the system is easy to understand
  • Fault tolerance: An internal mechanism recovers from failures
  • Scalability: Demonstrates how to scale “horizontally” (across many servers)
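One common way to recover from failures is to re-execute a failed task on the same input chunk. The sketch below assumes that strategy; the flaky task and the retry limit are hypothetical.

```python
def run_with_retries(task, chunk, max_attempts=3):
    """Re-run a failed task on the same input chunk, up to `max_attempts` times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(chunk)
        except RuntimeError:
            if attempt == max_attempts:
                raise  # give up after the last attempt


calls = {"count": 0}


def flaky_sum(chunk):
    """Hypothetical task that fails on its first invocation, then succeeds."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("simulated worker crash")
    return sum(chunk)


results = [run_with_retries(flaky_sum, chunk) for chunk in [[1, 2], [3, 4]]]
print(results)
```

Because each task depends only on its own chunk, rerunning it does not affect the other tasks.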

Transformation in MapReduce

It's possible to use transformations to manipulate the results you're getting. Some are applied before map, some between map and reduce, and some after reduce.
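As an illustration, the sketch below places one transformation at each of the three positions: a filter before map, a local pre-aggregation (often called a combiner) between map and reduce, and a sort with a top-N cut after reduce. These particular transformations are assumptions chosen for the example.

```python
from collections import Counter


def map_file(lines):
    """Map one file: emit a (word, 1) pair per word."""
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs):
    """Between map and reduce: pre-aggregate one file's pairs locally."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def run_job(files, top_n=2):
    # Before map: drop comment lines so they are never processed.
    files = [[line for line in f if not line.startswith("#")] for f in files]
    combined = [combine(map_file(f)) for f in files]
    # Reduce: sum the partial counts per word across all files.
    totals = Counter()
    for pairs in combined:
        for word, count in pairs:
            totals[word] += count
    # After reduce: sort by count (descending), then by word, and keep the top N.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


files = [["# header", "big data big"], ["data lake data"]]
print(run_job(files))
```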

Big Data Landscapes Throughout The Years

[Figures: Big Data landscape for 2012, 2014, 2016, 2017, 2018, 2019 and 2023]