Tags:conceptdatabaseindexindexingquery Status:🟩


Database Indexing

Summary

Indexes are data structures in a database that enhance data retrieval speed, similar to an index in a book. They allow the Database Management System (DBMS) to quickly locate rows based on specific attributes. Queries can be full table scans, which read all rows, or selective queries, which return smaller subsets. Selective queries include point queries for single values and range queries for multiple values. Indexing techniques include hashing for exact matches and tree search for dynamic retrieval. Indexes can be clustered, where data is stored in order, or unclustered for additional indexing options, with covering indexes containing all necessary attributes to eliminate extra data reads.

Indexes

An index is a data structure that supports efficient access to data. To speed up queries the DBMS may build an index on an attribute. It is similar to an index in the back of a book. For every piece of data you might be interested in, the index says where to find the row with the actual data.

For every instance of a query, the DBMS chose whether or not to use an index, and the user do not need to take indexes into account when writing queries. The choice of index may depend on query parameters like WHERE, GROUP BY or ORDER BY. However if any aggregate functions are used on an attribute, then an index on that may be useful. I.e: AVG, SUM etc.

Full table scans and selective queries

A full table scan (also called Sequential Scan) is when querying through all the rows in a table and return those rows that satisfy the condition. It would make sense to make a full table scan if we have to return 80% of the rows in the table.

SELECT * FROM t WHERE <condition>

A selective query is when we return just a small percentage of the rows.

Point queries and Range queries

Both point queries and range queries are selective queries that return a small percentage of the rows.

A point query is where we look for a single value which will return a small result set with a few records.

SELECT * FROM person WHERE bithdate='2002-01-05';

Points queries are more efficient to perform if data is sorted by the attribute used in the condition, since binary search can be used to find the result(s).

A range query is where we look for a range of values, which could result in a much larger result set.

SELECT * FROM person
WHERE birthdate BETWEEN '2002-05-01' and '2005-10-25'

Range queries are also more efficient to perform when data is sorted by the right attribute, but they are often less specific than point queries.

Searching through indexes

Primarily two techniques are used to search through indexes.

Hashing uses a fixed transformation algorithm to convert the attribute value into a database address. However hashing only supports equality queries (queries that retrieve data based on an exact match), which doesn’t make them very versatile.

Tree Search builds a dynamic search structure that guides the search for a given attribute value to the proper database location.

Clustered and unclustered index

Clustered index

A clustered index determines the physical order of data in a table. The table rows are stored on disk in the same order as the clustered index. A table can have only one clustered index.

When the rows in a table are stored in order based on an attribute, an index on that attribute is called a clustered index. Clustered indexes make point and range queries on the key very efficient. Many DBMSs automatically build a clustered index on the primary key of each table. When saying something like “The data is clustered” its like saying “The data is ordered”. i.e. with id, it starts from 0 to the highest.

Unclustered index

A table can have multiple non-clustered indexes, which do not affect the physical order of the data but instead store a logical pointer to the data rows.

CREATE INDEX myIndex ON tableName(attributeName);

They are sometimes also called, non-clustered, secondary indexes or dense indexes. They make points queries efficient and some range queries more efficient.

Clustered vs. Unclustered index

Clustered Index: More efficient for retrieving a large number of records (m is large) because data is physically stored together, enabling better sequential I/O.

Unclustered Index: More efficient for retrieving a small number of records (m is small) because it’s optimized for selective point queries.

Covering index

An index that contains all attributes used in a query is called covering. This means that the data from the table is not needed and therefore no disk reads are required to retrieve rows.

CREATE INDEX movieyear ON movie(year);
SELECT COUNT(*) FROM movie WHERE year=2012;
 
CREATE INDEX phn ON person(height, name);
SELECT name FROM person WHERE height=180;

Multi attribute indexes

It is possible to define an index on several attributes.

CREATE INDEX myIndex ON person (age,salary)

It can speed up point and range queries such as

SELECT * FROM person
WHERE age = 22 and salary < 20000;

An unclustered index on multiple attributes typically allows for indexing on any starting combination of those attributes because they are sorted in order.