Indexing files
Primary
organization
The data file must first be saved on disk with one of
the primary organizations we talked about
Unordered file
Ordered file
Hashed file
With an index, you have two files stored on disk
The data file
The index file
Indexes
An index is an additional access structure used
to speed up the retrieval of records in response to search conditions
This means it
contains block addresses
Usually an index structure provides a secondary
access path to the data based on the indexing field
Basically, any field can be used to create an index
There can be multiple indexes on the fields of a file
If there is an
index on every field it is called a fully inverted file.
What indexes do
Speed up access
by associating a field with the disk address where that data can be found
This reduces the
number of disk I/O
Often you need to
speed up both sequential access and direct or random access
Slow down
updates
More indexes =
more overhead
Every index on
the table must be updated every time the data is changed
When to index
Only if you
frequently retrieve less than 15% of the rows in a large table
Usually it is a
good idea to index foreign keys
Do not index
small tables with few rows
Using indexes
To find a record or records based on a condition on an
indexed field, first the disk block with the index must be accessed.
Example: an SQL query says:
Select * from table where name=Wong, James
The index will point to one or more blocks in the file
where the required records are stored
Type of indexes
Single level
indexes on the physically ordered files
Primary indexes on the ordered key fields
Secondary indexes on the nonordered
field
Clustering indexes on ordered nonkey
fields
Indexes with a tree data structure
Multilevel indexes (indexes of indexes)
B+ trees
Primary indexes
Specified on the ordering key field of the
records on disk (the key is unique)
The primary index is an ordered file whose records
have two fields
The first is the
same data type as the key filed
The second is a
pointer to a disk block
The first record in each block of the data file
is called the block anchor
The total number of entries in the index is the same
as the number of disk blocks in the data file.
A binary search can be done on the index
This index file is smaller than the data file,
so takes fewer block accesses to search
There are fewer entries,
one for each block
The index records
are smaller than the data records (only two fields)
An example of how an index helps
Suppose we have an ordered file with r=30,000 records
of fixed size 100bytes
The size of a block B=1024 bytes
How many
records/block if unspanned?
This is the
blocking factor: 10 records/block
How many blocks are needed to store the file?
b = 3000 blocks
A binary search
on the data file would need
log2 3000, or 12 block accesses with no index
How much does a primary index help?
Suppose we create a primary index;
The index will
have 3000 records
Each index record
is 15 bytes
The block size is
still 1024 bytes
The blocking factor for the index is (block_size/size of entry)
1024/15 = 68 entries/block
How many blocks will be necessary to hold the index?
Num of blocks to
hold data/entries per block
3000/68 = 45
blocks to hold index
A binary search of the index log2 45 = 6
block accesses, then one more to get the data
Problems with a primary index
As with any ordered file, insertion of records means
we have to move records to make space for the new records
Deletions have
similar problems
In addition, some index entries also must be changed
with each insertion or deletion
One solution is
to use an unordered overflow file.
Another solution
is to use a linked list of overflow records for each block.
Because a primary index is on the physical ordering
field, there can only be one primary index for a file
Clustering Indexes
If the records are ordered on a non key
field the ordering field is not unique
Example: physical order on DNO
This is called a clustering field
Example-- there
are a cluster of values where DNO=5
The cluster index is an ordered file with two
fields, ordering field and block address
There is one entry in the clustering index for each
distinct value of the clustering field
The index contains a value and a pointer to the first
block that has a record with that value
Two type of clustering index
1. The block pointer in the index point to
the first block that contains that index.
There may be more than one index value in a block
2. Each index value has its own separate
block
If a block fills up, a pointer points to another block
that contains more items with that index value
Same problems with insertions and deletions as primary
indices
A solution may be to reserve extra block(s) for each
value of the clustering field
Clustering is a type of nondense index, because
it does not have an entry in the index for every record in the file
Instead, it has an entry for each distinct value of
the indexing field
Dense vs. Nondense Indexing
Dense indexing
contains an entry for every stored record in the indexed file
Nondense
indexing
Does not contain an entry for every stored record in
the indexed file
Relies on the physical sequence of the stored file
Advantages of nondense indexing
Occupy less space
Quicker to scan
Secondary Indexes
A secondary index provides a secondary means of
accessing a file for which some primary access already exists.
A secondary index is an ordered file with two fields
The first field
is the same data type as some nonordering
field of the data file
Example: SSN where the files is physically ordered on
the name
The second is a
block pointer or a record pointer
There can be many secondary indices for the same file
The field that is indexed either provides a unique
value, or it does not. It is called a secondary
key if it does.
Secondary key indexes
In this index, there is one index entry for each
record in the data file
This is a dense
index
The entries are ordered in the index by the value of
the secondary key
But they are not
ordered physically this way in the data file
Because the data file is NOT physically ordered by the
secondary key, an index entry is necessary for each record
Each value in the index is associated with a block
pointer where the record can be found.
A sequential
search of the block may be necessary if there is more than one record per block
Secondary indexes on non key fields
In this case, numerous records in the data file can
have the same value for the indexing field
Options
1)A dense index with several index entries with the
same field and the block address where the record can be found
2) A
nondense two level index (usually used)
The first index file has two fields: the field value
and a block pointer to a block record pointers
The second index file has a a
variable number of block pointers, one to each record with the field value
This requires at least one additional block access:
An example of how an index helps
Ordered file with r=30,000
Each record =100bytes
The size of a block B=1024 bytes
A binary search
on the data file would need
log2 3000, or 12 block accesses with no index
We saw that a
primary index would reduce the number of block accesses to 7
How much would a secondary index help?
How much does a secondary index help
Same file: 30,000 records, BF=10 records/block; 3000
blocks of data
A sequential search requires how many block accesses?
A linear search
requires 3000/2 block accesses on average.
Secondary index with each entry 15 bytes as before
So the BF =
1024/15 = 68 entries per block
This will be a dense index 30,000 entries with 68
entries/block: 30,000/68 = 442 blocks
for the index
How many block accesses to find an item?
Log 442 = 9 block
accesses plus the additional one to access the data
Secondary indexes
They need more storage space than primary indexes
They also require a longer search time
But, the improvement in search time is great, because
otherwise a sequential search is necessary
Indexes into indexes
If we index the secondary index, we can find the
correct record with fewer block accesses
For the problem we looked at, it took 442 blocks to
store the index
This secondary
index is an ordered file, so we could create a nondense index on it.
How many blocks would this index into the index take
# blocks for the index/block factor is 442/68 = 7
If we also indexed this index, we can find the data
with 4 disk accesses.
Multilevel indexes
The first level is a primary, clustering or secondary
index into the data file
This may be
either a dense or nondense index
The second level is a primary index into the first
level
The first level
must be an ordered index
The third level is a primary index into the second
level
.and so on
Each of these indexes is a nondense index with a block
anchor that points to the beginning of the block
This organization relies on the ordered nature of the
file
Multilevel indexes
If we divide the list into thirds, or fourths, or 20th
the search space is reduced much faster than halves
The fan-out(fo) is the number of parts the list is divided into
each time
This is also the
blocking factor for the index
For a binary
search, the fan-out is 2
Multiway search trees
A multilevel index can be thought of as a variation of
a search tree
This allows the
indexes (the interior nodes of the tree) to change dynamically.
What are Multi-way search trees?
Multi-way search trees generalize binary search
trees into m-ary search trees
They allow more than one key to be stored in a node
There will always
be one more child pointer than key
So if a node has
two keys, it would have three pointers, with possibly three children
Increasing the number of children decreases the
height of a tree, given the same number of nodes
The keys in a node are maintained in order
The ordering
principal of binary search trees still holds