NoSQL

Summary

NoSQL (Not Only SQL) is a type of database designed for flexibility and scalability, often used in distributed systems. Unlike traditional relational databases, NoSQL databases can store unstructured or semi-structured data using models like key-value pairs, documents, and graphs. These databases do not rely solely on SQL queries, offering alternative query methods like RESTful APIs. They are ideal for handling large volumes of data across multiple servers, with a focus on performance, fault tolerance, and scalability.

NoSQL Introduction

NoSQL is a database architecture that is usually deployed as a distributed system, with data spread across multiple servers.

Unlike relational databases, NoSQL databases do not rely on predefined tables and relationships. Instead, they use flexible data models such as key-value pairs, documents, wide-column stores, or graphs. This makes them well suited for unstructured or semi-structured data such as XML or JSON.

NoSQL stands for "Not only SQL", meaning that these databases do not rely solely on SQL queries. Instead, they may be accessed through RESTful APIs, graph query languages, or APIs designed for their specific data model. However, some NoSQL databases offer SQL-like query syntax for convenience, for example Cassandra with CQL.

Key-Value Stores

A key-value store is a NoSQL data model based on an associative array: each unique key points to a value, and the value is opaque to the database, which does not know or interpret its content.

Key-value stores support only simple key-based operations such as GET and PUT. Unlike a relational database, where you can write SELECT * FROM users WHERE age > 30, they cannot query or search on the value directly. You can retrieve (or update) a value only through its key; you cannot filter, sort, or aggregate data based on the content of the value.

The value can be an aggregate structure such as:

JSON: { "name": "Bob", "age": 22, "email": "bob@example.com" }.
Arrays: [1, 2, 3, 4].
Serialized data: Complex objects or binary files that the client must deserialize after retrieval.
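
The GET/PUT-only interface can be illustrated with a minimal in-memory sketch. The class name KeyValueStore and the keys "user:1" and "user:2" are made up for illustration and do not refer to any real product. The store treats values as opaque bytes, so filtering on age has to happen on the client side.

```python
import json
from typing import Optional


class KeyValueStore:
    """Toy in-memory key-value store; values are opaque bytes to the store."""

    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)


store = KeyValueStore()
store.put("user:1", json.dumps({"name": "Bob", "age": 22}).encode())
store.put("user:2", json.dumps({"name": "Alice", "age": 35}).encode())

# Lookup by key is the only operation the store offers.
bob = json.loads(store.get("user:1"))

# The store cannot answer "age > 30". The client must already know the keys,
# fetch each value, deserialize it, and filter on its own.
older_than_30 = []
for key in ("user:1", "user:2"):
    user = json.loads(store.get(key))
    if user["age"] > 30:
        older_than_30.append(user["name"])
```

In this sketch the client needs a separate way to know which keys exist, which is exactly the limitation described above.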

Document Stores

A document store is a NoSQL data model in which each value is a document. Each document is associated with a unique key (or id), which is used to retrieve it when needed.

The document is usually structured as a collection of fields and values, which can be retrieved, modified, or stored. The most common format is JSON, which is lightweight and human-readable; other formats such as BSON (Binary JSON) and XML are used as well. A document is therefore considered an aggregate structure: it can contain multiple fields, and each field can hold complex data types.

Document stores allow querying inside the document. You can search and filter on the fields or attributes inside the document, not just on the key. For example, in MongoDB's shell syntax, db.users.find({ "age": 30 }) finds all users who are 30 years old. You can also access specific parts of a document and query based on them.
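
To make the difference from a key-value store concrete, here is a small in-memory sketch of field-based querying. The Collection class and its find method are hypothetical and only imitate the shape of the query above; they are not the API of any real document database.

```python
from typing import Any, Optional

Document = dict[str, Any]


class Collection:
    """Toy in-memory document collection keyed by document id."""

    def __init__(self) -> None:
        self._docs: dict[str, Document] = {}

    def insert(self, doc_id: str, doc: Document) -> None:
        self._docs[doc_id] = doc

    def get(self, doc_id: str) -> Optional[Document]:
        """Key-based lookup, as in a key-value store."""
        return self._docs.get(doc_id)

    def find(self, query: Document) -> list[Document]:
        """Return documents whose fields equal every field in the query."""
        return [
            doc
            for doc in self._docs.values()
            if all(f in doc and doc[f] == v for f, v in query.items())
        ]


users = Collection()
users.insert("u1", {"name": "John", "age": 30, "city": "New York"})
users.insert("u2", {"name": "Jane", "age": 28})
users.insert("u3", {"name": "Bob", "age": 30})

# The store looks inside each document, so no keys are needed for the query.
names = [doc["name"] for doc in users.find({"age": 30})]
```

Note that the documents do not share a schema: only one of them has a city field, and the query still works.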

Graph Stores

A graph store is a NoSQL data model that represents and stores graphs. In a graph database, data is stored as a collection of nodes, edges, and properties.

Nodes: Nodes in a graph database represent entities or objects.

(John Doe)

Edges: Edges are connections between nodes and represent how entities are related to each other. An edge is directed: it points from one node to another, indicating the direction of the relationship. An edge cannot be bidirectional; a mutual relationship is represented by two directed edges, one in each direction.

John is friends with Jane
(John Doe) --[FRIEND]--> (Jane Smith)

John is friends with Jane and Jane is friends with John
(John Doe) --[FRIEND]--> (Jane Smith)
(Jane Smith) --[FRIEND]--> (John Doe)

Properties: In graph databases, nodes and edges can have properties, which hold information about the node or edge and are stored as key-value pairs.

A property is associated with the id of its node or edge and is not shared. This means that if 5 persons are 30 years old, the property "age": 30 exists 5 times, once on each node.

Property for a node:
{ "name": "John Doe", "age": 30, "city": "New York" }

Property for an edge:
{ "since": "2010" }
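
The node, edge, and property concepts above can be sketched as a small in-memory structure. The names Node, Edge, Graph, and neighbors are illustrative assumptions, not the API of a real graph database, and Jane's age is invented for the example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Node:
    node_id: str
    properties: dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    source: str  # id of the node the edge starts at
    target: str  # id of the node the edge points to
    label: str
    properties: dict[str, Any] = field(default_factory=dict)


class Graph:
    def __init__(self) -> None:
        self.nodes: dict[str, Node] = {}
        self.edges: list[Edge] = []

    def add_node(self, node_id: str, **properties: Any) -> None:
        self.nodes[node_id] = Node(node_id, properties)

    def add_edge(self, source: str, target: str, label: str, **properties: Any) -> None:
        self.edges.append(Edge(source, target, label, properties))

    def neighbors(self, node_id: str, label: str) -> list[str]:
        """Follow outgoing edges with the given label."""
        return [e.target for e in self.edges if e.source == node_id and e.label == label]


g = Graph()
# Each node stores its own copy of "age": 30; the property is not shared.
g.add_node("john", name="John Doe", age=30, city="New York")
g.add_node("jane", name="Jane Smith", age=30)

g.add_edge("john", "jane", "FRIEND", since="2010")
one_way = (g.neighbors("john", "FRIEND"), g.neighbors("jane", "FRIEND"))

# A mutual friendship needs a second edge in the opposite direction.
g.add_edge("jane", "john", "FRIEND", since="2010")
mutual = (g.neighbors("john", "FRIEND"), g.neighbors("jane", "FRIEND"))
```

After the first edge, only John has an outgoing FRIEND edge; after the second, both directions can be traversed.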

Distributed Architecture

The goals of distributed storage are to ensure that data can be reliably stored and accessed across multiple machines, or nodes, in a way that maximizes:

Performance
Fault Tolerance
Scalability

Consistent Hashing

Consistent hashing is a technique used in distributed systems to map data (i.e., keys) to a set of nodes (i.e., servers) in a way that minimizes data reorganization when nodes are added or removed. In this method, both nodes and data are hashed to a virtual ring, and each piece of data is assigned to the nearest node in a clockwise direction. This ensures that adding or removing nodes only affects a small subset of data, leading to a more efficient and scalable system.
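
A minimal sketch of such a ring is shown below, assuming SHA-256 truncated to 32 bits as the hash function and made-up server names. The HashRing class is illustrative only; it ignores hash collisions between nodes and uses one ring position per server.

```python
import bisect
import hashlib


def ring_hash(value: str) -> int:
    """Map a string to a position on the ring (0 .. 2**32 - 1)."""
    digest = hashlib.sha256(value.encode()).digest()
    return int.from_bytes(digest[:4], "big")


class HashRing:
    def __init__(self, nodes: list[str]) -> None:
        self._positions: list[int] = []   # sorted ring positions
        self._owner: dict[int, str] = {}  # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        pos = ring_hash(node)
        bisect.insort(self._positions, pos)
        self._owner[pos] = node

    def remove_node(self, node: str) -> None:
        pos = ring_hash(node)
        self._positions.remove(pos)
        del self._owner[pos]

    def node_for(self, key: str) -> str:
        """Return the first node at or after the key's position, clockwise."""
        idx = bisect.bisect_left(self._positions, ring_hash(key))
        # Wrap around to the first node when the key lies past the last one.
        return self._owner[self._positions[idx % len(self._positions)]]


ring = HashRing(["server-a", "server-b", "server-c"])
keys = [f"key-{i}" for i in range(1000)]

before = {k: ring.node_for(k) for k in keys}
ring.add_node("server-d")
after = {k: ring.node_for(k) for k in keys}

# Keys that changed owner; every one of them now belongs to the new node.
moved = [k for k in keys if before[k] != after[k]]
```

Keys that do not fall in the new node's segment keep their previous owner, which is the property that distinguishes this scheme from hashing modulo the number of servers.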

Balance

In consistent hashing, we want an even distribution of data across the available nodes. The goal is to prevent any node from being overloaded with too much data while others hold too little. Achieving balance ensures efficient resource usage and avoids performance bottlenecks.

Problems

Randomness: Even though consistent hashing tries to distribute data evenly across nodes, the distribution can still become uneven, because the hash function places nodes at effectively random positions on the ring and some nodes may end up with much larger segments than others.

Unused node capability: Consistent hashing typically treats all nodes the same, but in reality, servers in a distributed system may have different hardware capabilities (e.g., processing power, memory, storage). A powerful server then receives no more data than a weak one, and its extra capacity goes unused.

High load from 1 node: When a new server is added, consistent hashing redistributes data only for the affected part of the hash ring. The new server takes over keys from just one existing node (its clockwise neighbor), so the work of handing over data falls on that single node instead of being spread evenly across all servers.

Virtual Servers

This technique is about representing a single physical server as multiple smaller virtual servers (or virtual nodes) in the hash ring. This helps achieve a more even distribution of data across the servers in a distributed system.

Each physical server is mapped to multiple points on the hash ring. These multiple points are often called virtual nodes.

The placement of virtual servers on the hash ring is random or pseudo-random. This randomness helps distribute the data more evenly across all available nodes. By assigning virtual servers randomly, the system can avoid clusters of nodes with a heavy concentration of data, achieving a more balanced data distribution.
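
The idea can be sketched by hashing each server name together with a virtual-node index, which gives pseudo-random ring positions. The function names, the "name#i" labelling scheme, and the server names are assumptions made for this example. Giving a stronger server more virtual nodes, as done for server-c here, is one common way to address differing hardware capabilities, but it is an addition to what the text above describes.

```python
import bisect
import hashlib
from collections import Counter


def ring_hash(value: str) -> int:
    """Map a string to a pseudo-random position on the ring."""
    digest = hashlib.sha256(value.encode()).digest()
    return int.from_bytes(digest[:8], "big")


def build_ring(servers: dict[str, int]) -> list[tuple[int, str]]:
    """Place each server on the ring once per virtual node.

    `servers` maps a server name to its number of virtual nodes.
    """
    ring = [
        (ring_hash(f"{name}#{i}"), name)
        for name, vnodes in servers.items()
        for i in range(vnodes)
    ]
    ring.sort()
    return ring


def server_for(ring: list[tuple[int, str]], key: str) -> str:
    """Find the first virtual node clockwise from the key's position."""
    # ("" sorts before any server name, so this finds position >= hash.)
    idx = bisect.bisect_left(ring, (ring_hash(key), ""))
    return ring[idx % len(ring)][1]


keys = [f"key-{i}" for i in range(10_000)]

# server-c is assumed to be twice as powerful, so it gets twice the vnodes.
ring = build_ring({"server-a": 100, "server-b": 100, "server-c": 200})
load = Counter(server_for(ring, k) for k in keys)
```

Inspecting load shows how many keys each physical server received; with more virtual nodes per server, the shares tend to track the configured weights more closely.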

Redundancy

Redundancy is the practice of duplicating or replicating data across multiple nodes in a distributed system to ensure reliability, fault tolerance, and high availability.

Redundancy is often achieved by storing copies of data on multiple nodes (replicas). This ensures that if one node fails or becomes unavailable, the data can still be accessed from another node that holds a replica.
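
As a rough illustration, the sketch below stores every key on several nodes and reads from the first replica that is still available. The ReplicatedStore class, the CRC32-based placement, and the node names are assumptions for this example; real systems choose replicas in their own ways.

```python
import zlib
from typing import Optional


class ReplicatedStore:
    """Toy store that keeps each key on several nodes (replicas)."""

    def __init__(self, node_names: list[str], replicas: int) -> None:
        self.names = sorted(node_names)
        self.replicas = replicas
        self.data: dict[str, dict[str, str]] = {name: {} for name in self.names}
        self.down: set[str] = set()  # nodes that are currently unavailable

    def replicas_for(self, key: str) -> list[str]:
        """Pick `replicas` consecutive nodes, starting at a hashed offset."""
        start = zlib.crc32(key.encode()) % len(self.names)
        return [
            self.names[(start + i) % len(self.names)]
            for i in range(self.replicas)
        ]

    def put(self, key: str, value: str) -> None:
        # Write a copy to every replica that is reachable.
        for name in self.replicas_for(key):
            if name not in self.down:
                self.data[name][key] = value

    def get(self, key: str) -> Optional[str]:
        # Read from the first reachable replica that holds the key.
        for name in self.replicas_for(key):
            if name not in self.down and key in self.data[name]:
                return self.data[name][key]
        return None


store = ReplicatedStore(["node-a", "node-b", "node-c", "node-d"], replicas=3)
store.put("user:1", "Bob")

# Simulate a failure of the first replica; the data is still readable.
first_replica = store.replicas_for("user:1")[0]
store.down.add(first_replica)
value = store.get("user:1")  # "Bob", served by another replica
```

Only when all three replicas of a key are unavailable does the read fail in this sketch.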

Consistency

NoSQL databases often focus on availability and partition tolerance (as per the CAP theorem), but this comes at the cost of consistency. Different NoSQL systems offer different levels of consistency based on how they handle replicas (copies of data) in distributed systems.

Replica Consistency

Replica consistency defines how data replicas are synchronized across distributed systems. It ensures that when data is updated, all replicas reflect those updates in a specific way.

Sequential (or Strong) Consistency: All updates are seen by all processes in the same order, ensuring no inconsistency between replicas.
Weak Consistency: There is no guarantee that all replicas will be updated in the same way, or in the same order. Observers might see different versions of the data at different times.
Eventual Consistency: A form of weak consistency where, provided no new updates are made, all replicas will eventually converge to the same state. However, there is no guarantee of when this will happen.

Tunable Consistency

Tunable consistency is about the ability to configure the level of consistency that a NoSQL system provides. Unlike traditional systems with strict consistency, NoSQL databases often allow the user to tune consistency based on the trade-off between performance, availability, and consistency.

N: Total number of replicas of the data in the system.
R: Number of replicas that must respond to a read operation.
W: Number of replicas that must confirm a write operation before it is considered successful.
R = W = 1 (Eventual Consistency): Both reads and writes are confirmed by just one replica. With N > 1, a read may reach a replica that has not yet received the latest write, so only eventual consistency is provided.
R + W > N (Strong Consistency): Every set of replicas used for a read overlaps with every set used for a write, so a read always reaches at least one replica that holds the latest write. For example, with N = 3, choosing R = 2 and W = 2 satisfies the condition.
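
The effect of these settings can be checked with a short sketch. It enumerates every possible read set and write set for given N, R, and W, and tests whether they always share at least one replica. The function names and the (version, value) representation of a replica are assumptions made for illustration.

```python
from itertools import combinations


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True if every possible read set shares a replica with every write set."""
    replica_ids = range(n)
    return all(
        bool(set(read_set) & set(write_set))
        for write_set in combinations(replica_ids, w)
        for read_set in combinations(replica_ids, r)
    )


# One (version, value) pair per replica, with N = 3. A write with W = 2
# reached replicas 0 and 1; replica 2 still holds the older version.
replicas = [(2, "new"), (2, "new"), (1, "old")]


def read_latest(read_set: tuple[int, ...]) -> tuple[int, str]:
    """Read from the given replicas and keep the highest version seen."""
    return max((replicas[i] for i in read_set), key=lambda entry: entry[0])


strong = quorums_overlap(3, 2, 2)    # R + W > N  -> True
eventual = quorums_overlap(3, 1, 1)  # R + W <= N -> False

fresh = read_latest((1, 2))  # R = 2 must include replica 0 or 1 -> (2, "new")
stale = read_latest((2,))    # R = 1 may hit only replica 2      -> (1, "old")
```

With R = 2, every possible read set contains replica 0 or 1, so the newest version is always seen; with R = 1, a read that lands on replica 2 returns stale data until that replica catches up.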

