A Parallel Hardware Implementation of the

JPEG Still Image Compression Standard

Bob Wall

Course Paper

EE 565 - Parallel and Associative Processors

Oct. 20, 2003

The JPEG image compression standard was developed by the JPEG (Joint Photographic Experts Group) committee for use in compressing digital images, and in particular full color photographic images. Implementations of this standard have received a great deal of attention due to the widespread adoption of the JPEG image format; it is one of the primary formats used for exchanging pictures on the World Wide Web, and it is commonly used in digital cameras as the storage format.

Because of its adoption for use by digital cameras and similar electronic devices, there has been a great deal of research into efficient hardware implementations of the algorithm. The standard (which is actually a collection of algorithms that are combined to achieve significant image compression) lends itself well to implementations in data flow architectures and exhibits a significant degree of parallelism.

This paper does not introduce any novel approaches to implementation; others have studied this problem extensively from various aspects, including its implementation in an FPGA. However, it does serve as an excellent problem to demonstrate parallel dataflow architectures and highlights the usefulness of FPGAs in both creating limited runs of high-performance circuits and in prototyping architectures that will subsequently be designed into custom ASICs.

The following block diagram depicts the steps involved in compressing an image using the JPEG standard:

If the input image is a full color image (with each pixel in the image represented by a 24-bit value, with eight bits each of red, green, and blue information), the first step is to map the image to an alternate color space. (If the image is grayscale, this step and the next are skipped.) The RGB information representing each pixel is mapped to an equivalent luminance-chrominance representation (the *YCbCr* color space). Note that this mapping is lossless (i.e. fully reversible), except for round-off error.

If the image has been converted to the *YCbCr* color space, the *Cb* and *Cr* components can be *down sampled*; that is, blocks of adjacent pixels in the *Cb* and *Cr* components can be replaced with their average value. This technique can be applied because the human eye is much more sensitive to changes in intensity than to changes in color; thus, down sampling can yield a significant amount of compression with little visual effect on the image. There are several alternatives for down sampling – the pixels are typically reduced 2:1 horizontally and either 2:1 or 1:1 (unchanged) vertically (in JPEG terminology, these are referred to as 2h2v or 411 and 2h1v or 422, respectively).

Once the pixel data has been pre-processed and separated into three separate components, the pixels in each component are grouped into 8x8 blocks. Each block is transformed using the Discrete Cosine Transform (DCT). This transform is similar to the Discrete Fourier Transform; it maps the block into the spatial frequency domain.

This step does not actually produce any compression – the DCT is reversible, except for round-off error in the mathematical operations. It is performed because higher frequency data can subsequently be discarded without significant visual effect on the image – again, the eye is much less sensitive to high-frequency changes in intensity or color than to low-frequency changes.

Once the data has been transformed to the frequency domain,
each of the 64 frequency components is divided by a separate *quantization
coefficient* and the result is rounded to an integer. As mentioned previously, more of the
high-frequency information is discarded in this step, by using larger
coefficients for those elements. Also,
the luminance component is quantized more accurately than the chrominance component,
by using separate quantization tables, for reasons discussed previously.

If the user can tolerate a lower quality resulting image in exchange for a smaller compressed image, the quantization coefficients can be increased. There are a number of different ways to tune the quantization tables, and this step can be influenced by a user-selected “quality” setting. Many implementations do simple linear scaling of the example coefficient tables from the JPEG standard, based on the quality setting. This is an area of active research.

This is the step that introduces significant loss of data (and thus generates significant compression).

Once the frequency data has been quantized, it is encoded
using an *entropy encoder*. This
can use either Huffman or arithmetic coding (specifically Q coding) – since
arithmetic coding is covered by several patents by IBM and others, only Huffman
coding is required by the standard. The
goal is to represent the data using a minimal number of bits - in practice, Q
coding introduces only marginal (5% to 10%) increases in coding efficiency, so
Huffman coding is an effective choice.

In order to encode the data, the elements are traversed in a
*zigzag* order; this places the low frequency elements at the start of the
data stream and the high frequency elements later. Since the high frequency elements are more likely to be zero,
this results in longer strings of zeroes, which can be more efficiently encoded. The data is run-length encoded to compress
the zeroes, then run through the Huffman coder.

The output of the entropy coder must be assembled into bytes, since the output of the Huffman coder for each input is a variable number of bits.

Other operations are typically performed, including padding to byte boundaries at the end of blocks and prepending headers to the result – these headers contain information required to reverse the process. They include the quantization tables and Huffman coding tables. If the encoding and decoding are done in a closed system, this information can be omitted.

The remainder of this discussion will only be concerned with the first part of this process, assembling the output into bytes.

The decompression process is essentially the reverse of the compression process. It requires the entropy decoder, an inverse DCT, and the conversion back to the RGB color space. In addition, better decompressors might include some smoothing steps to reduce discontinuities when the 8x8 blocks are tiled together to form the final image.

This paper will concentrate on the compression algorithm.

This problem is an ideal candidate for implementation in a data-flow architecture - the major tasks required are readily apparent from the system block diagram. There are a number of levels of parallelism inherent in the process. For the remaining discussion, we will assume that the compressor will be fed 16x16 blocks of RGB image data. Assuming the image will be larger than a single 16x16 block, the steps depicted in the compressor block diagram can each be pipelined.

The decomposition of the problem into tasks follows directly from the system block diagram – the individual tasks are described in more detail below.

The initial color space conversion requires matrix multiplication. There are a number of different coefficients that are used – the formulas typically used for JPEG are as follows (from [2]):

Y  =  0.2989 R + 0.5866 G + 0.1145 B

Cb = -0.1687 R - 0.3312 G + 0.5000 B + 128

Cr =  0.5000 R - 0.4183 G - 0.0816 B + 128

RGB values are normally on a scale of 0 to 1 or, since they are stored as unsigned single bytes, 0 to 255. The resulting luminance value is also on the scale of 0 to 255; the chrominance values, whose raw range is -127.5 to 127.5, are offset by 128 so that they can be stored in an unsigned byte.
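For reference, the per-pixel conversion can be sketched in software. This is a behavioral model only, not the fixed-point datapath a hardware implementation would use; the round-and-clamp policy is an assumption:

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to YCbCr using the coefficients above.
    Results are rounded and clamped to the unsigned-byte range 0..255
    (the clamping policy is an assumption of this sketch)."""
    y  =  0.2989 * r + 0.5866 * g + 0.1145 * b
    cb = -0.1687 * r - 0.3312 * g + 0.5000 * b + 128
    cr =  0.5000 * r - 0.4183 * g - 0.0816 * b + 128
    clamp = lambda v: max(0, min(255, int(round(v))))
    return clamp(y), clamp(cb), clamp(cr)
```

Note that black (0, 0, 0) maps to (0, 128, 128) and white (255, 255, 255) to (255, 128, 128), reflecting the 128 offset applied to the chrominance components.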

The color space conversion takes the 16x16 array of 3 byte
pixel values and produces 3 equivalent 16x16 arrays. The *Cb* and *Cr* arrays are then down sampled, using
the formula

C’_{i,j} = (C_{2i,2j} + C_{2i+1,2j} + C_{2i,2j+1} + C_{2i+1,2j+1}) / 4,    i = 0 to 7, j = 0 to 7

The result will be six 8x8 blocks (four *Y*, one *Cb*, and one *Cr*).
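The down sampling step amounts to a 2x2 box average over a 16x16 chrominance array. A minimal software sketch follows (integer truncation is an assumption here; a hardware version might round instead):

```python
def downsample_2x2(c):
    """Average each 2x2 neighborhood of a 16x16 chrominance array,
    producing the 8x8 block C' defined by the formula above.
    Integer division truncates; rounding is a possible alternative."""
    return [[(c[2*i][2*j] + c[2*i+1][2*j] + c[2*i][2*j+1] + c[2*i+1][2*j+1]) // 4
             for j in range(8)]
            for i in range(8)]
```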

Each of the 8x8 blocks is transformed to the frequency domain using the Discrete Cosine Transform. The DCT is a Fourier-related transform – the one-dimensional transform is defined as

t_{u} = c_{u} · Σ_{x=0}^{N-1} s_{x} cos[ (2x + 1) u π / 2N ]

where *s* is the array of *N* original values, *t* is the array of *N* transformed values, and the coefficients *c* are given by

c_{0} = √(1/N),    c_{u} = √(2/N) for u > 0

In two dimensions, this becomes

t_{u,v} = c_{u} c_{v} · Σ_{x=0}^{N-1} Σ_{y=0}^{N-1} s_{x,y} cos[ (2x + 1) u π / 2N ] cos[ (2y + 1) v π / 2N ]

where *N*, *s*, and *t* are analogous, and the coefficients are the products c_{u} c_{v} of the one-dimensional values.

A great deal of research has been done on ways to pipeline this computation; many are based on the algorithm by Weiping Li ([6]). Several algorithms have been proposed that use integer multipliers ([1], [3]).

After I started my research, I discovered that the Xilinx Development Kit includes both 1-D and 2-D DCT blocks, so these will be used in the design in order to minimize design and development effort.

Each transformed block is quantized. The JPEG standard defines two separate tables of quantization constants, one for luminance and one for chrominance. Each element in a transformed block is divided by its corresponding quantization coefficient. There are a number of algorithms which can be used to simplify this operation so that a full divider is not required – for example, a pipelined quantizer is described in a paper by Sun and Lee ([7]) that requires five adders and five pipeline registers.
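Behaviorally, quantization is a per-element divide-and-round. The sketch below uses a full division for clarity; as noted above, hardware implementations avoid the divider, and the table values in the test are placeholders, not the standard's example tables:

```python
def quantize(t, q):
    """Divide each DCT coefficient by its quantization coefficient and
    round to the nearest integer. This is the lossy step; hardware
    implementations replace the divide with adder-based schemes."""
    return [[int(round(t[i][j] / q[i][j])) for j in range(8)]
            for i in range(8)]
```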

As mentioned, the elements of the quantized block are processed in a zigzag order, as indicated by this sequence:

 1 |  2 |  6 |  7 | 15 | 16 | 28 | 29
 3 |  5 |  8 | 14 | 17 | 27 | 30 | 43
 4 |  9 | 13 | 18 | 26 | 31 | 42 | 44
10 | 12 | 19 | 25 | 32 | 41 | 45 | 54
11 | 20 | 24 | 33 | 40 | 46 | 53 | 55
21 | 23 | 34 | 39 | 47 | 52 | 56 | 61
22 | 35 | 38 | 48 | 51 | 57 | 60 | 62
36 | 37 | 49 | 50 | 58 | 59 | 63 | 64

The upper left element is the DC coefficient – its value is proportional to the average of the 64 input samples. This coefficient is treated separately from the remaining AC coefficients – since the average value of successive blocks in the image is likely to change slowly, it is differentially encoded as the difference between the DC value and the DC value of the preceding block.

The remaining coefficients are processed in the order indicated and run-length encoded, producing pairs of numbers – the count of zeroes that preceded a value and the value itself. These pairs are passed through the Huffman encoder, which produces a variable-length code (VLC) to represent each pair. The Huffman coding is done using a set of tables provided by the standard (for more details, see [8]). The output of the coder is accumulated in a barrel shifter until full bytes are available; these are then shifted into an output buffer.
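The zigzag traversal and the run-length pairing can be modeled in a few lines. This is a behavioral sketch only – actual JPEG limits zero runs to 15 and inserts a ZRL code for longer runs, which the sketch omits:

```python
def zigzag_order(n=8):
    """Generate (row, col) coordinates of an n x n block in zigzag
    order: diagonals in turn, alternating traversal direction."""
    return sorted(((i, j) for i in range(n) for j in range(n)),
                  key=lambda p: (p[0] + p[1],
                                 p[0] if (p[0] + p[1]) % 2 else -p[0]))

def run_length_pairs(block):
    """Traverse the AC coefficients in zigzag order and emit
    (zero_run, value) pairs; a trailing run of zeroes becomes the
    end-of-block marker (0, 0). Runs longer than 15 are not split
    here as real JPEG requires."""
    order = zigzag_order(len(block))[1:]   # skip the DC coefficient
    pairs, run = [], 0
    for i, j in order:
        if block[i][j] == 0:
            run += 1
        else:
            pairs.append((run, block[i][j]))
            run = 0
    if run:
        pairs.append((0, 0))               # end-of-block
    return pairs
```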

The output of the Huffman coder is assembled into bytes and buffered for output. In addition, when a block is completed, the output is padded to a byte boundary and a marker code is inserted.
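The byte-assembly stage behaves like the following software model of the barrel shifter. The class and method names are illustrative, and the padding-with-1-bits policy follows the usual JPEG convention:

```python
class BitPacker:
    """Accumulate variable-length codes and emit whole bytes – the
    software analogue of the barrel shifter described above."""
    def __init__(self):
        self.acc, self.nbits, self.out = 0, 0, bytearray()

    def put(self, code, length):
        """Append `length` bits of `code` to the stream."""
        self.acc = (self.acc << length) | code
        self.nbits += length
        while self.nbits >= 8:
            self.nbits -= 8
            self.out.append((self.acc >> self.nbits) & 0xFF)

    def pad_to_byte(self):
        """Pad the final partial byte with 1-bits (JPEG convention)."""
        if self.nbits:
            self.put((1 << (8 - self.nbits)) - 1, 8 - self.nbits)
```

For example, packing a 3-bit code followed by a 5-bit code yields exactly one output byte containing both codes.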

The naïve approach is to assign each task to its own process, and in fact that seems to be a reasonable approach. The only instance where there seems to be much benefit to combining tasks within a process involves the encoding and assembly of the output. The following are the processes:

- Color Space Converter
- Down Sampler – this will handle both the *Cr* and *Cb* components
- DCT – this will handle a block of any of the components (*Y*, *Cr*, or *Cb*)
- Quantization – this will also handle a block of any type. It is necessary for the process to have access to the type of the block, so it can determine the appropriate coefficient table to use.
- Encoding – this process will Huffman encode quantized blocks of any type. It will also handle the task of assembling the output.

Since this is a dataflow architecture, orchestration between processes is straightforward. Since most processes require a complete 8x8 block of input data for processing, in most cases it is sufficient to provide a pair of buffers of sufficient size to hold a single block of data. One block will serve as a space to accumulate the output of the process; once the block is complete, it will serve as the input to the next process, and the other block will become the output buffer. The two blocks will continue to swap between output accumulator and input buffer until all the data is processed.

Thus the orchestration will primarily involve detecting when a process’s output buffer is full and pausing that process until an empty buffer is available, detecting when an input buffer is full and can be processed, and coordinating the swap of input and output buffers. Once the input is done, the process must be paused until a new buffer is available and the pair of buffers can be swapped.
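The buffer-swapping handshake between adjacent stages can be modeled in a few lines. This is a behavioral sketch only; the class name and status flags are illustrative, not part of any proposed RTL:

```python
class PingPongBuffers:
    """Sketch of the two-buffer handshake between pipeline stages: the
    producer fills one buffer while the consumer drains the other, and
    the roles swap only when both sides have finished."""
    def __init__(self, size):
        self.fill = [0] * size      # being written by the upstream stage
        self.drain = [0] * size     # being read by the downstream stage
        self.fill_done = False      # producer has filled its buffer
        self.drain_done = True      # consumer starts with nothing to read

    def try_swap(self):
        # Swap only when the producer has filled its buffer and the
        # consumer has emptied its own; otherwise the stage must stall.
        if self.fill_done and self.drain_done:
            self.fill, self.drain = self.drain, self.fill
            self.fill_done, self.drain_done = False, False
            return True
        return False
```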

If there is sufficient logic available, the most desirable mapping is to provide special-purpose logic to execute each process. Additionally, since there are multiple instances of most of the processes (for each of the components of the input), parallel instances of these logic blocks will improve performance.

However, some of the logic blocks necessary to perform the
computations will be large, so it is probably not feasible to provide separate
processors for every process. This is
true in particular for the DCT transform processors. A good compromise is to provide a pair of processors running in
parallel; since there will be four times as many blocks of the luminance (*Y*)
component as of either chrominance component (*Cb* or *Cr*), one
pipeline can do both the chrominance components, plus one third of the luminance
blocks. This will result in a
concurrent flow of data through the DCT processors.

Given this decision, the following diagram depicts the processors and the assignment of processes:

The following is a description of the data flow through the system:

- The color space conversion is run on the complete 16x16x3 byte input block, and the result is separated out into three 16x16x1 byte blocks, one per component (*Y*, *Cb*, and *Cr*).
- Once the color space conversion has been completed, the first two blocks of *Y* data are dispatched to the parallel DCT processors.
- While the first pair of blocks is being transformed, the down sampler is applied to the *Cb* data. This will produce a single 8x8 block. The down sampling can easily be completed prior to the completion of the first transform process.
- When the first blocks are transformed, the third *Y* block and the *Cb* block are fed to the DCT processors.
- While the second pair of blocks is being transformed, the down sampler is applied to the *Cr* data.
- When the second pair of blocks is transformed, the last *Y* block and the *Cr* block are fed to the DCT processors.

The time required to perform the 2-D DCT will dominate the
pipeline. This relaxes the performance
requirements on other stages in the pipeline.
For instance, there could be separate processors to down sample the *Cb
*and *Cr *components, but that is not necessary, since once the
pipeline is filled, they will spend most of their time waiting on the DCT block
anyway. Likewise, there could be
separate quantizers and encoders for each DCT processor, but they will work at
least twice as fast as the DCT, so a single processor will be able to keep
up. Additional orchestration is
required to switch the quantizer between the output buffers of both DCT
processors.

Amdahl’s Law states that the maximum speedup attainable by enhancing the solution to a problem (i.e. by executing some parts of it in parallel) is limited by the fraction of execution time during which the enhancement is actually in use. In this example, the primary enhancement is the parallelization of the DCT transforms – running the two processors in parallel, plus the performance gained by the pipelined implementation of each processor. Once the pipeline is filled, the DCT processors should be in use continuously. Therefore, this implementation should yield close to the optimal speedup attainable with two DCT processors.

The speedup should be significantly greater than that of a software implementation – once the pipeline is full, the execution time for an image consisting of *N* 16x16 input blocks will be approximately 3 · *N* · *T_{DCT}*, where *T_{DCT}* is the time required to transform a single 8x8 block (each input block requires six DCTs, shared between the two DCT processors).

The primary concerns with implementing this logic in an FPGA are the availability of logic blocks and the availability of multipliers. The available embedded multipliers can be used in several of the logic blocks; this requires that the arithmetic be converted to fixed point, but this should not be too difficult.

Also, while custom VHDL logic blocks will be an optimal implementation of each “processor”, the availability of embedded microprocessors provides a number of alternatives. The MicroBlaze or PicoBlaze processor cores available in the Xilinx FPGAs could be used to perform the color space conversion and down sampling, and that processor or another could be used to handle the quantization and the encoding. The flexibility of the FPGA core and the availability of different logic cores allows a number of design tradeoffs to be explored.

The most significant impact is the availability of the 2-D DCT core. Designing, implementing, and testing a VHDL implementation of the DCT processor would be a time-consuming exercise; its availability as a pre-packaged core will greatly speed development time.

The JPEG image compression standard is an ideal candidate for implementation in a pipelined dataflow architecture. There are multiple levels of parallelism which can be exploited; the individual tasks that make up the overall algorithm can be executed in a pipelined fashion, and several tasks can also be pipelined internally (the DCT, the quantization, the down sampler, and the color space converter could all be enhanced to some degree by parallel computation). The limiting factor is the speed at which the DCT transform can be performed. Given sufficient hardware resources, this could be optimized by using multiple transform processors executing in parallel. The implementation proposed here is a tradeoff that should allow the entire circuit to be implemented practically within a single FPGA.

[1] Agostini, Luciano, Ivan Silva, and Sergio Bampi, “Pipelined Fast 2-D DCT Architecture for JPEG Image Compression,” __Symposium on Integrated Circuits and Systems Design, SBCCI Proceedings__, 2001, pp. 226-231.

[2] Bourke, Paul, “YCC Colour Space and Image Compression,” URL http://astronomy.swin.edu.au/~pbourke/colour/ycc/.

[3] Bousselmi, M., M.S. Bouhlel, N. Masmoudi, and L. Kamoun, “New Parallel Architecture for the DCT and Its Inverse for Image Compression,” URL http://www.csgroup.tunet.tn/publications_equipe_c&s/moncef/ICECS2000/icecs2000_moncef.pdf.

[4] Gailly, Jean-loup (moderator), “comp.compression Frequently Asked Questions,” URL http://www.faqs.org/faqs/compression-faq/.

[5] Kovac, M. and N. Ranganathan, “JAGUAR: A Fully Pipelined VLSI Architecture for JPEG Image Compression Standard,” __Proceedings of the IEEE__, Vol. 83, No. 2, February 1995.

[6] Li, Weiping, “A New Algorithm to Compute the DCT and Its Inverse,” __IEEE Trans. on Signal Processing__, Vol. 39, 1991, pp. 1305-1313.

[7] Sun, Sung-Hsien, and Shie-Jue Lee, “A JPEG Chip for Image Compression and Decompression,” __Journal of VLSI Signal Processing__, Vol. 35, pp. 43-60, 2003.

[8] Wallace, Gregory K., “The JPEG Still Picture Compression Standard,” __Communications of the ACM__, Vol. 34, No. 4, pp. 31-44, April 1991.