Indexing files

Primary organization

•      The data file must first be saved on disk with one of the primary organizations we talked about

–   Unordered file

–   Ordered file

–   Hashed file

•      With an index, you have two files stored on disk

–   The data file

–   The index file

 

Indexes

•      An index is an additional access structure used to speed up the retrieval of records in response to search conditions

–    This means it contains block addresses

•      Usually an index structure provides a secondary access path to the data based on the indexing field

•      Basically, any field can be used to create an index

•      There can be multiple indexes on the fields of a file

•      If there is an index on every field it is called a fully inverted file.

 

What indexes do

•      Speed up access by associating a field with the disk address where that data can be found

–    This reduces the number of disk I/O

–    Often you need to speed up both sequential access and direct or random access

•      Slow down updates

–    More indexes = more overhead

–    Every index on the table must be updated every time the data is changed

•      When to index

–    Only if you frequently retrieve less than 15% of the rows in a large table

–    Usually it is a good idea to index foreign keys

–    Do not index small tables with few rows

 

Using indexes

•      To find a record or records based on a condition on an indexed field, first the disk block with the index must be accessed.

–   Example:  an SQL query says:
Select * from table where name=‘Wong, James’

•      The index will point to one or more blocks in the file where the required records are stored

 

Type of indexes

•      Single level indexes on the physically ordered files

–   Primary indexes on the ordered key fields

–   Secondary indexes on the nonordered field

–   Clustering indexes on ordered nonkey fields

•      Indexes with a tree data structure

–   Multilevel indexes (indexes of indexes)

–   B+ trees

 

Primary indexes

•      Specified on the ordering key field of the records on disk (the key is unique)

•      The primary index is an ordered file whose records have two fields

–    The first is the same data type as the key filed

–    The second is a pointer to a disk block

•    The first record in each block of the data file is called the block anchor

•      The total number of entries in the index is the same as the number of disk blocks in the data file.

•      A binary search can be done on the index

•      This index file is smaller than the data file, so takes fewer block accesses to search

–    There are fewer entries, one for each block

–    The index records are smaller than the data records (only two fields)

 

An example of how an index helps

•      Suppose we have an ordered file with r=30,000 records of fixed size 100bytes

•      The size of a block B=1024 bytes

–    How many records/block if unspanned?

–    This is the blocking factor: 10 records/block

•      How many blocks are needed to store the file?

–    b = 3000 blocks

•      A binary search on the data file would need
log2 3000, or 12 block accesses with no index

 

How much does a primary index help?

•      Suppose we create a primary index;

–    The index will have 3000 records

–    Each index record is 15 bytes

–    The block size is still 1024 bytes

–    The blocking factor for the index is (block_size/size of entry)
1024/15 = 68 entries/block

•      How many blocks will be necessary to hold the index?

–    Num of blocks to hold data/entries per block

–    3000/68 = 45 blocks to hold index

•      A binary search of the index log2 45 = 6 block accesses, then one more to get the data

 

Problems with a primary index

•      As with any ordered file, insertion of records means we have to move records to make space for the new records

–    Deletions have similar problems

•      In addition, some index entries also must be changed with each insertion or deletion

–    One solution is to use an unordered overflow file.

–    Another solution is to use a linked list of overflow records for each block.

•      Because a primary index is on the physical ordering field, there can only be one primary index for a file

 

Clustering Indexes

•      If the records are ordered on a non key field the ordering field is not unique

–    Example:  physical order on DNO

•      This is called a clustering field

–    Example-- there are a cluster of values where DNO=5

•      The cluster index is an ordered file with two fields, ordering field and block address

•      There is one entry in the clustering index for each distinct value of the clustering field

–    The index contains a value and a pointer to the first block that has a record with that value

 

Two type of clustering index

1. The block pointer in the index point to the first block that contains that index.

–   There may be more than one index value in a block

2. Each index value has its own separate block

–   If a block fills up, a pointer points to another block that contains more items with that index value

•      Same problems with insertions and deletions as primary indices

–   A solution may be to reserve extra block(s) for each value of the clustering field

•      Clustering is a type of nondense index, because it does not have an entry in the index for every record in the file

–   Instead, it has an entry for each distinct value of the indexing field

 

Dense vs. Nondense Indexing

•      Dense indexing – contains an entry for every stored record in the indexed file

•      Nondense indexing

–   Does not contain an entry for every stored record in the indexed file

–   Relies on the physical sequence of the stored file

•      Advantages of nondense indexing

–   Occupy less space

–   Quicker to scan

 

Secondary Indexes

•      A secondary index provides a secondary means of accessing a file for which some primary access already exists.

•      A secondary index is an ordered file with two fields

–    The first field is the same data type as some nonordering field of the data file 

•    Example:  SSN where the files is physically ordered on the name

–    The second is a block pointer or a record pointer

•      There can be many secondary indices for the same file

•      The field that is indexed either provides a unique value, or it does not.  It is called a secondary key if it does.

 

Secondary key indexes

•      In this index, there is one index entry for each record in the data file

–    This is a dense index

•      The entries are ordered in the index by the value of the secondary key

–    But they are not ordered physically this way in the data file

•      Because the data file is NOT physically ordered by the secondary key, an index entry is necessary for each record

•      Each value in the index is associated with a block pointer where the record can be found.

–    A sequential search of the block may be necessary if there is more than one record per block

 

Secondary indexes on non key fields

•      In this case, numerous records in the data file can have the same value for the indexing field

•      Options

  1)A dense index with several index entries with the same field and the block address where the record can be found

  2) A nondense two level index (usually used)

•   The first index file has two fields: the field value and a block pointer to a block record pointers

•   The second index file has a a variable number of block pointers, one to each record with the field value

–   This requires at least one additional block access:

 

An example of how an index helps

•      Ordered file with r=30,000

•      Each record =100bytes

•      The size of a block B=1024 bytes

•      A binary search on the data file would need
log2 3000, or 12 block accesses with no index

•      We saw that a primary index would reduce the number of block accesses to 7

•      How much would a secondary index help?

 

How much does a secondary index help

•      Same file: 30,000 records, BF=10 records/block; 3000 blocks of data

•      A sequential search requires how many block accesses?

–    A linear search requires 3000/2 block accesses  on average.

•      Secondary index with each entry 15 bytes as before

–    So the BF = 1024/15 = 68 entries per block

•      This will be a dense index  30,000 entries with 68 entries/block:  30,000/68 = 442 blocks for the index

•      How many block accesses to find an item?

–    Log 442 = 9 block accesses plus the additional one to access the data

 

 

Secondary indexes

•      They need more storage space than primary indexes

•      They also require a longer search time

•      But, the improvement in search time is great, because otherwise a sequential search is necessary

 

Indexes into  indexes

•      If we index the secondary index, we can find the correct record with fewer block accesses

•      For the problem we looked at, it took 442 blocks to store the index

–    This secondary index is an ordered file, so we could create a nondense index on it.

•      How many blocks would this index into the index take

–    # blocks for the index/block factor  is 442/68 = 7

•      If we also indexed this index, we can find the data with 4 disk accesses.

 

Multilevel indexes

•      The first level is a primary, clustering or secondary index into the data file

–    This may be either a dense or nondense index

•      The second level is a primary index into the first level

–    The first level must be an ordered index

•      The third level is a primary index into the second level….and so on

•      Each of these indexes is a nondense index with a block anchor that points to the beginning of the block

•      This organization relies on the ordered nature of the file

 

Multilevel indexes

•      If we divide the list into thirds, or fourths, or 20th the search space is reduced much faster than halves

•      The fan-out(fo) is the number of parts the list is divided into each time

–    This is also the blocking factor for the index

–    For a binary search, the fan-out is 2

 

Multiway search trees

•      A multilevel index can be thought of as a variation of a search tree

•      This allows the “indexes” (the interior nodes of the tree) to change dynamically.

 

What are Multi-way search trees?

•      Multi-way search trees generalize binary search trees into m-ary search trees

•      They allow more than one key to be stored in a  node

•      There will always be one more child pointer than key

–    So if a node has two keys, it would have three pointers, with possibly three children

•      Increasing the number of children decreases the height of a tree, given the same number of nodes

•      The keys in a node are maintained in order

•      The ordering principal of binary search trees still holds