Query Optimization

Chapter 15

 

From SQL to machine language

•      In a high level language like SQL, a query must be

–     scanned – identifies the tokens

–    parsed – checks the query syntax

–    validated – checks the table and attribute names are valid

•      Then an internal representation of the query is created

–    The query must be translated into relational algebra

–    A query tree is built with the relational operators

•      The DBMS then must devise an execution strategy for retrieving the result of the query

–    To do this it optimizes the query tree

 

Steps in processing a SQL query

•      Scanning, parsing and validating

•      Query optimizer module

–    Translate the SQL into relational algebra

–    Build a query tree, then optimize it

–    Produce an execution plan from a query tree

•      Code generator

–    Generate the code to execute the plan

–    Use algorithms available in the DBMS for each relational algebra operator

•      Runtime database processor

–    Run the query code to produce the result

•    May be either compiled or interpreted

 

The query optimizer

•      Disclaimer: it cannot really produce the optimal strategy, just a reasonably efficient strategy

•      First, translate SQL into relational algebra

–    Nested queries are decomposed into query blocks

•    Then the select, from , where, group by and having clauses are translated into their relational algebra equivalent

•      A query tree is then built

•      Then heuristic rules are applied to “optimize” the tree

•      After you have an efficient tree, algorithms to implement the relational algebra operations are applied

 

Example of an algorithm to implement a relational operator

•      There must be algorithms to implement every relational algebra operation

•      Methods for implementing the JOIN operation

–   Nested-loop join (brute force)

•   Retrieve every record from both relations

•   Test whether the two records satisfy the join condition

–   Single-loop join (one relation has an index on the attribute in the join condition)

•   Retrieve each record in the unindexed relation

•   Use the index to retrieve only those records in the second relation that match

•      Sort-merge join (both relations can be searched for the join attribute)

•    If the files are not sorted on the join attribute, they are sorted

•    Both relations are scanned concurrently in order of the join attribute

•    If the join attributes match, the records are joined

•      Hash-join ( the records of the two relations are both hashed to the same hash file using the same hashing function on the join attributes)

 

Heuristic rules

•     One of the main rules is to apply SELECT and PROJECT operations before applying JOIN.

–   SELECT and PROJECT operations reduce the size of a file

–   JOIN usually multiplies the size of the file

•     The further down the tree an operations appears, the earlier it is applied

 

Query trees

•      A query tree  is a tree that corresponds to a relational algebra expression

•      The input relations are the leaf nodes of the tree

•      The relational algebra operations are the internal nodes

•      The query parser will generate a standard initial query tree (without optimization)

•      How the initial tree is built:

–    Apply Cartesian cross for relations in the FROM clause

–    Then the conditions (join & selections) from WHERE clause are added

–    Then projections from the SELECT clause

 

Example query

•      Find the last names of employees born after 1957 who work on a project names ‘Aquarius’.

•      SQL query

   SELECT lname
FROM employee, works_on, project
WHERE pname = ‘Aquarius’
          AND pnumber = pno
          AND essn = ssn
          AND bdate > ‘1957-12-31’;

 

Optimization of Query trees

•      Making the initial tree more efficient

–   Many different relational algebra expressions correspond to the same query

–   The trick is to take the initial tree and convert it into an equivalent, more efficient tree

–   There are rules to help do this

–   Basically, we want to move select and project as far down the tree as possible and still maintain equivalence

•   This is because they limit the size of the relation

 

The evolving query tree

•      1) First the initial query tree with Cartesian crosses

•      2) Then move the select operations down the tree

•      3) Then apply the more restrictive operations first

–    pname is a key attribute, so will retrieve only one tuple

•      4) Replace Cartesian product and select with join operations

•      5) Move project operations down the query tree

 

Transformation rules for relational algebra operations

•      See overhead or page 519 of text

 

Practice optimizing a query

•      Draw Draw the initial and optimized query trees for the following queries

–   Select fname, lname address
from employee, department
where dname = ‘research and dnumber = dno;

–   Select E.lname S.lname
from employee as E, employee as S
Where E.superssn = S.ssn
    and S.sex = ‘F’;

 

Converting the query tree into a query execution plan

•      The execution plan must include information about the access methods that exist for each relation

–    Each relation is stored ordered on one field; there may be other indexes into the relation

•      It must also know the algorithms that exist for  the relational operators represented in the tree

–    There must be a different algorithm for each access into the relation

•      With this information, it can choose the algorithm to implement the operation with the given data

 

Query code generator

•      The query tree gives the order of execution of each relational algebra operation; the execution plan involves choosing the correct algorithm to implement the operation

•      Temporary results are computed, then used as input for the next operation in the query tree

•      The RDBMS must include algorithms to implement the different types of relational operations

•      There are often several algorithms to implement the same operation, since one may apply only to a particular storage structure and access path

 

Typical algorithms

•      Any query that specifies order by must use an external sort algorithm

•      Implementing the select operation

–    These can include linear search, binary search, using a primary key or hash table, using a clustering index or a secondary key (usually a B+ tree)

•      Implementing a join operation

–    These can include nested-loop join, single-loop join, sort-merge join, hash-join

•      Implementing project operations

–   This just involves eliminating some of the columns in a table