Indexing files

Primary organization

• The data file must first be saved on disk with one of the primary organizations we talked about

– Unordered file

– Ordered file

– Hashed file

• With an index, you have two files stored on disk

– The data file

– The index file

Indexes

• An index is an additional access structure used to speed up the retrieval of records in response to search conditions

– This means it contains block addresses

• Usually an index structure provides a secondary access path to the data based on the indexing field

• Basically, any field can be used to create an index

• There can be multiple indexes on the fields of a file

• If there is an index on every field it is called a fully inverted file.

What indexes do

• Speed up access by associating a field with the disk address where that data can be found

– This reduces the number of disk I/O

– Often you need to speed up both sequential access and direct or random access

• Slow down updates

– More indexes = more overhead

– Every index on the table must be updated every time the data is changed

• When to index

– Only if you frequently retrieve less than 15% of the rows in a large table

– Usually it is a good idea to index foreign keys

– Do not index small tables with few rows

Using indexes

• To find a record or records based on a condition on an indexed field, first the disk block with the index must be accessed.

– Example: an SQL query says:
Select * from table where name=‘Wong, James’

• The index will point to one or more blocks in the file where the required records are stored

Type of indexes

• Single level indexes on the physically ordered files

– Primary indexes on the ordered key fields

– Secondary indexes on the nonordered field

– Clustering indexes on ordered nonkey fields

• Indexes with a tree data structure

– Multilevel indexes (indexes of indexes)

– B+ trees

Primary indexes

• Specified on the ordering key field of the records on disk (the key is unique)

• The primary index is an ordered file whose records have two fields

– The first is the same data type as the key filed

– The second is a pointer to a disk block

• The first record in each block of the data file is called the block anchor

• The total number of entries in the index is the same as the number of disk blocks in the data file.

• A binary search can be done on the index

• This index file is smaller than the data file, so takes fewer block accesses to search

– There are fewer entries, one for each block

– The index records are smaller than the data records (only two fields)

An example of how an index helps

• Suppose we have an ordered file with r=30,000 records of fixed size 100bytes

• The size of a block B=1024 bytes

– How many records/block if unspanned?

– This is the blocking factor: 10 records/block

• How many blocks are needed to store the file?

– b = 3000 blocks

• A binary search on the data file would need
log₂ 3000, or 12 block accesses with no index

How much does a primary index help?

• Suppose we create a primary index;

– The index will have 3000 records

– Each index record is 15 bytes

– The block size is still 1024 bytes

– The blocking factor for the index is (block_size/size of entry)
1024/15 = 68 entries/block

• How many blocks will be necessary to hold the index?

– Num of blocks to hold data/entries per block

– 3000/68 = 45 blocks to hold index

• A binary search of the index log₂ 45 = 6 block accesses, then one more to get the data

Problems with a primary index

• As with any ordered file, insertion of records means we have to move records to make space for the new records

– Deletions have similar problems

• In addition, some index entries also must be changed with each insertion or deletion

– One solution is to use an unordered overflow file.

– Another solution is to use a linked list of overflow records for each block.

• Because a primary index is on the physical ordering field, there can only be one primary index for a file

Clustering Indexes

• If the records are ordered on a non key field the ordering field is not unique

– Example: physical order on DNO

• This is called a clustering field

– Example-- there are a cluster of values where DNO=5

• The cluster index is an ordered file with two fields, ordering field and block address

• There is one entry in the clustering index for each distinct value of the clustering field

– The index contains a value and a pointer to the first block that has a record with that value

Two type of clustering index

1. The block pointer in the index point to the first block that contains that index.

– There may be more than one index value in a block

2. Each index value has its own separate block

– If a block fills up, a pointer points to another block that contains more items with that index value

• Same problems with insertions and deletions as primary indices

– A solution may be to reserve extra block(s) for each value of the clustering field

• Clustering is a type of nondense index, because it does not have an entry in the index for every record in the file

– Instead, it has an entry for each distinct value of the indexing field

Dense vs. Nondense Indexing

• Dense indexing – contains an entry for every stored record in the indexed file

• Nondense indexing

– Does not contain an entry for every stored record in the indexed file

– Relies on the physical sequence of the stored file

• Advantages of nondense indexing

– Occupy less space

– Quicker to scan

Secondary Indexes

• A secondary index provides a secondary means of accessing a file for which some primary access already exists.

• A secondary index is an ordered file with two fields

– The first field is the same data type as some nonordering field of the data file

• Example: SSN where the files is physically ordered on the name

– The second is a block pointer or a record pointer

• There can be many secondary indices for the same file

• The field that is indexed either provides a unique value, or it does not. It is called a secondary key if it does.

Secondary key indexes

• In this index, there is one index entry for each record in the data file

– This is a dense index

• The entries are ordered in the index by the value of the secondary key

– But they are not ordered physically this way in the data file

• Because the data file is NOT physically ordered by the secondary key, an index entry is necessary for each record

• Each value in the index is associated with a block pointer where the record can be found.

– A sequential search of the block may be necessary if there is more than one record per block

Secondary indexes on non key fields

• In this case, numerous records in the data file can have the same value for the indexing field

• Options

1)A dense index with several index entries with the same field and the block address where the record can be found

2) A nondense two level index (usually used)

• The first index file has two fields: the field value and a block pointer to a block record pointers

• The second index file has a a variable number of block pointers, one to each record with the field value

– This requires at least one additional block access:

An example of how an index helps

• Ordered file with r=30,000

• Each record =100bytes

• The size of a block B=1024 bytes

• A binary search on the data file would need
log₂ 3000, or 12 block accesses with no index

• We saw that a primary index would reduce the number of block accesses to 7

• How much would a secondary index help?

How much does a secondary index help

• Same file: 30,000 records, BF=10 records/block; 3000 blocks of data

• A sequential search requires how many block accesses?

– A linear search requires 3000/2 block accesses on average.

• Secondary index with each entry 15 bytes as before

– So the BF = 1024/15 = 68 entries per block

• This will be a dense index 30,000 entries with 68 entries/block: 30,000/68 = 442 blocks for the index

• How many block accesses to find an item?

– Log 442 = 9 block accesses plus the additional one to access the data

Secondary indexes

• They need more storage space than primary indexes

• They also require a longer search time

• But, the improvement in search time is great, because otherwise a sequential search is necessary

Indexes into indexes

• If we index the secondary index, we can find the correct record with fewer block accesses

• For the problem we looked at, it took 442 blocks to store the index

– This secondary index is an ordered file, so we could create a nondense index on it.

• How many blocks would this index into the index take

– # blocks for the index/block factor is 442/68 = 7

• If we also indexed this index, we can find the data with 4 disk accesses.

Multilevel indexes

• The first level is a primary, clustering or secondary index into the data file

– This may be either a dense or nondense index

• The second level is a primary index into the first level

– The first level must be an ordered index

• The third level is a primary index into the second level….and so on

• Each of these indexes is a nondense index with a block anchor that points to the beginning of the block

• This organization relies on the ordered nature of the file

Multilevel indexes

• If we divide the list into thirds, or fourths, or 20^th the search space is reduced much faster than halves

• The fan-out(fo) is the number of parts the list is divided into each time

– This is also the blocking factor for the index

– For a binary search, the fan-out is 2

Multiway search trees

• A multilevel index can be thought of as a variation of a search tree

• This allows the “indexes” (the interior nodes of the tree) to change dynamically.

What are Multi-way search trees?

• Multi-way search trees generalize binary search trees into m-ary search trees

• They allow more than one key to be stored in a node

• There will always be one more child pointer than key

– So if a node has two keys, it would have three pointers, with possibly three children

• Increasing the number of children decreases the height of a tree, given the same number of nodes

• The keys in a node are maintained in order

• The ordering principal of binary search trees still holds