A Parallel Hardware Implementation of the
 JPEG Still Image Compression Standard



Bob Wall

Course Paper

EE 565  -  Parallel and Associative Processors

Oct. 20, 2003



1.0         Introduction

The JPEG image compression standard was developed by the JPEG (Joint Photographic Expert Group) committee for use in compressing digital images, and in particular full color photographic images.  Implementations of this standard have received a great deal of attention due to the widespread adoption of the JPEG image format; it is one of the primary formats used for exchanging pictures on the World Wide Web, and it is commonly used in digital cameras as the storage format.

Because of its adoption for use by digital cameras and similar electronic devices, there has been a great deal of research into efficient hardware implementations of the algorithm.  The standard (which is actually a collection of algorithms that are combined to achieve significant image compression) lends itself well to implementations in data flow architectures and exhibits a significant degree of parallelism.

This paper does not introduce any novel approaches to implementation; others have studied this problem extensively from various aspects, including its implementation in an FPGA.  However, it does serve as an excellent problem to demonstrate parallel dataflow architectures and highlights the usefulness of FPGAs in both creating limited runs of high-performance circuits and in prototyping architectures that will subsequently be designed into custom ASICs.

1.1           Compression Algorithm

The following block diagram depicts the steps involved in compressing an image using the JPEG standard:







1.1.1       Color Space Conversion

If the input image is a full color image (with each pixel in the image represented by a 24-bit value, with eight bits each of red, green, and blue information), the first step is to map the image to an alternate color space. (if the image is grayscale, this step and the next are skipped.)  The RGB information representing each pixel is mapped to an equivalent luminance-chrominance representation (the YCbCr color space).  Note that this mapping is lossless (i.e. fully reversible), except for round-off error.

1.1.2       Down Sampling

If the image has been converted to the YCbCr color space, the Cb and Cr components can be down sampled; that is, blocks of adjacent pixels in the Cb and Cr color spaces can be replaced with their average value.  This technique can be applied due to the fact that the human eye is much more sensitive to changes in intensity than changes in color.  Thus, down sampling can yield a significant amount of compression with little visual effect on the image.  There are several alternatives for down sampling – the pixels are typically  reduced 2:1 horizontally and either 2:1 or 1:1 (unchanged) horizontally (in JPEG terminology, these are referred to as 2h2h or 411 and 2h1v or 422, respectively).

1.1.3       2-D DCT

Once the pixel data has been pre-processed and separated into three separate component, the pixels in each component are grouped into 8x8 blocks.  Each block is transformed using the Discrete Cosine Transform (DCT).  This transform is similar to the Discrete Fourier Transform, and the transforms the block into the spatial frequency domain.

This step does not actually produce any compression – the DCT is reversible, except for round off error in the mathematical operations.  It is performed because you can subsequently discard higher frequency data without significantly visual effect on the image – again, the eye is much less sensitive to high-frequency changes in intensity or color than low-frequency changes.

1.1.4       Quantization

Once the data has been transformed to the frequency domain, each of the 64 frequency components is divided by a separate quantization coefficient and the result is rounded to an integer.  As mentioned previously, more of the high-frequency information is discarded in this step, by using larger coefficients for those elements.  Also, the luminance component is quantized more accurately than the chrominance component, by using separate quantization tables, for reasons discussed previously.

If the user can tolerate a lower quality resulting image in exchange for a smaller compressed image, the quantization coefficients can be increased.  There are a number of different ways to tune the quantization tables.  In addition, this step can be influenced by a user-selected “quality” setting.  Many implementations do simple linear scaling of the example coefficient tables from the JPEG standard, based on the quality setting.  this is an area of active research.

This is the step that introduces significant loss of data (and thus generates significant compression).

1.1.5       Entropy Coder

Once the frequency data has been quantized, it is encoded using an entropy encoder.  This can use either Huffman or arithmetic coding (specifically Q coding) – since arithmetic coding is covered by several patents by IBM and others, only Huffman coding is required by the standard.  The goal is to represent the data using a minimal number of bits - in practice, Q coding introduces only marginal (5% to 10%) increases in coding efficiency, so Huffman coding is an effective choice.

In order to encode the data, the elements are traversed in a zigzag order; this places the low frequency elements at the start of the data stream and the high frequency elements later.  Since the high frequency elements are more likely to be zero, this results in longer strings of zeroes, which can be more efficiently encoded.  The data is run-length encoded to compress the zeroes, then run through the Huffman coder.

1.1.6       Finalizing Data Stream

The output of the entropy coder must be assembled into bytes, since the output of the Huffman coder for each input is a variable number of bits.

Other operations are typically performed, including padding to byte boundaries at the end of blocks and prepending headers to the result – these headers contain information required to reverse the process.  They include the quantization tables and Huffman coding tables.  If the encoding and decoding are done in a closed system, this information can be omitted.

The remainder of this discussion will only be concerned with the first part of this process, assembling the output into bytes.

1.2           Decompression Algorithm

The decompression process is essentially the reverse of the compression process.  It requires the entropy decoder, an inverse DCT, and the conversion back to the RGB color space.  In addition, better decompressors might include some smoothing steps to reduce discontinuities when the 8x8 blocks are tiled together to form the final image.

This paper will concentrate on the compression algorithm.

2.0         Problem Decomposition

This problem is an ideal candidate for implementation in a data-flow architecture - the major tasks required  are readily apparent from the system block diagram. There are a number of levels of parallelism inherent in the process.  For the remaining discussion, we will assume that the compressor will be fed 16x16 blocks of RGB image data. Assuming the image will be larger than a single 16x16 block, the steps depicted in the compressor block diagram can each be pipelined.

The decomposition of the problem into tasks follows directly from the system block diagram – the individual tasks are described in more detail below.

2.1           Color Space Conversion

The initial color space conversion requires matrix multiplication.  There are a number of different coefficients that are used – the formulas typically used for JPEG are as follows (from [2]):

   Y  =  0.2989 R + 0.5866 G + 0.1145 B

   Cb = -0.1687 R - 0.3312 G + 0.5000 B + 128

   Cr =  0.5000 R - 0.4183 G - 0.0816 B + 128

RGB values are normally on the scale of 0 to 1, or since they are stored as unsigned single bytes, 0 to 255. The resulting luminance value is also on the scale of 0 to 255, the chrominance values need 127.5 added to them so they can be saved in an unsigned byte.

2.2           Down Sampling

The color space conversion takes the 16x16 array of 3 byte pixel values and produces 3 equivalent 16x16 arrays.  The Cb and Cr arrays are then down sampled, using the formula

C’i,j  =  (C 2i,2j + C 2i+1,2j + C 2i, 2j+1 + C 2i+1,2j+1) / 4               i = 0 to 7, j = 0 to 7

The result will be 6 8x8 blocks (4 Y, one Cb, and one Cr).

2.3           2-D DCT

Each of the 8x8 blocks is transformed to the frequency domain using the Discrete Cosine Transform.  The DCT is a type of wavelet transform – the one-dimensional transform is defined as

where s is the array of N original values, t is the array of N transformed values, and the coefficients c are given by

In two dimensions, this becomes

where N, s, and t are analogous, and c is given by

A great deal of research has been done on ways to pipeline this computation; many are based on the algorithm by Weiping Li ([6]).  Several algorithms have been proposed that use integer multipliers ([1], [3]).

After I started my research, I discovered that the Xilinx Development Kit includes both 1-D and 2-D DCT blocks, so these will be used in the design in order to minimize the design and development process.

2.4           Quantization

Each transformed block is quantized.  The JPEG standard defines two separate tables of quantization constants, one for luminance and one for chrominance.  Each element in a transformed block is divided by its corresponding quantization coefficient.  There are a number of algorithms which can be used to simplify this operation so that a full divider is not required – for example, a pipelined quantizer is described in a paper by Sun and Lee ([7]) that requires five adders and five pipeline registers.

2.5           Encoding

As mentioned, the elements of the quantized block are processed in a zigzag order, as indicated by this sequence:



































































The upper left element is the DC coefficient – its value will be the average of all the 64 input samples.  This coefficient is treated separately from the remaining AC coefficients – since the average value of successive blocks in the image is likely to change slowly, this value is differentially encoded as the difference between the DC value and that value from the preceding block.  The remaining coefficients are then processed in the order indicated and run-length encoded, producing a pair of numbers – the count of zeroes that preceded the value and the value itself.  These pairs are passed through the Huffman encoder, which produces a variable-length code (VLC) to represent each pair.  The Huffman coding is done using a set of tables that are provided by the standard.  (For more details, see [8]).  The output of the coder must be accumulated in a barrel shifter until full bytes are accumulated; these are then shifted into an output buffer.

2.6           Assembling Output

The output of the Huffman coder is assembled into bytes and buffered for output.  In addition, when a block is completed, the output is padded to a byte boundary and a marker code is inserted.

3.0         Assignment

The naïve approach is to assign each task to its own process, and in fact that seems to be a reasonable approach.  The only instance where there seems to be much benefit to combining tasks within a process involves the encoding and assembly of the output.  The following are the processes:

  1. Color Space Converter
  2. Down Sampler – this will handle both Cr and Cb component
  3. DCT – this will handle a block of any the components (Y, Cr, or Cb).
  4. Quantization – this will also handle a block of any type.  It is necessary for the process to have access to the type of the block, so it can determine the appropriate coefficient table to use.
  5. Encoding – this process will Huffman encode quantized blocks of any type.  It will also handle the task of assembling the output.

4.0         Orchestration

Since this is a dataflow architecture, orchestration between processes is straightforward.  Since most processes require a complete 8x8 block of input data for processing, in most cases it is sufficient to provide a pair of buffers of sufficient size to hold a single block of data.  One block will serve as a space to accumulate the output of the process; once the block is complete, it will serve as the input to the next process, and the other block will become the output buffer.  The two blocks will continue to swap between output accumulator and input buffer until all the data is processed.

Thus the orchestration will primarily involve detecting when a process’s output buffer is full and pausing that process until an empty buffer is available, detecting when an input buffer is full and can be processed, and coordinating the swap of input and output buffers.  Once the input is done, the process must be paused until a new buffer is available and the pair of buffers can be swapped.

5.0         Mapping

If there is sufficient logic available, the most desirable mapping is to provide special-purpose logic to execute each process.  Additionally, since there are multiple instances of most of the processes (for each of the components of the input), parallel instances of these logic blocks will improve performance.

However, some of the logic blocks necessary to perform the computations will be large, so it is probably not feasible to provide separate processors for every process.  This is true in particular for the DCT transform processors.  A good compromise is to provide a pair of processors running in parallel; since there will be four times as many blocks of the luminance (Y) component as of either chrominance component (Cb or Cr), one pipeline can do both the chrominance components, plus one third of the luminance blocks.  This will result in a concurrent flow of data through the DCT processors.

Given this decision, the following diagram depicts the processors and the assignment of processes:


The following is a description of the data flow through the system:

·         The color space conversion is run on the complete 16x16x3 byte input block, and the result is separated out into three 16x16x1 byte blocks, one per component (Y, Cb, and Cr).

·         Once the color space conversion has been completed, the first two blocks of Y data are dispatched to the parallel DCT processors.

·         While the first pair of blocks is being transformed, the down sampler is applied to the Cb blocks.  This will produce a single 8x8 block.  The down sampling can be easily completed prior to the completion of the first transform process.

·         When the first blocks are transformed, the third Y block and the Cb block are fed to the DCT processors.

·         While the second pair of blocks is being transformed, the down sampler is applied to the Cr blocks.

·         When the second pair of blocks are transformed, the last Y block and the Cr block are fed to the DCT processors.

The time required to perform the 2-D DCT will dominate the pipeline.  This relaxes the performance requirements on other stages in the pipeline.  For instance, there could be separate processors to down sample the Cb and Cr components, but that is not necessary, since once the pipeline is filled, they will spend most of their time waiting on the DCT block anyway.  Likewise, there could be separate quantizers and encoders for each DCT processor, but they will work at least twice as fast as the DCT, so a single processor will be able to keep up.  Additional orchestration is required to switch the quantizer between the output buffers of both DCT processors.

6.0         Amdahl’s Law and the Maximum Projected Speedup

Amdahl’s Law states that the maximum speedup possible by enhancing the solution to a problem (i.e. by executing some parts of it in parallel) is limited by the percentage of time that the enhanced portion of the solution is used.  In this example, the primary enhancement to the problem is the parallelization of the DCT transforms – running the two processors in parallel, plus the performance increases gained by the pipelined implementation of each processor.  Once the pipeline is filled, the DCT processors should be in use continuously.  Therefore, this implementation should yield optimal speedup, given the use of two DCT processors.

The speedup should be significant greater than a software implementation – the new execution time for an image will be 3 N  TDCT  + TP, where N is the number of 16x16 pixel blocks in the raw image, TDCT  is the time it takes for the DCT block to complete the transform on an 8x8 input block, and TP is the additional time to initially fill the pipeline and to accumulate the results of the last block (which should be insignificant for any practical input image).

7.0         FPGA Implications

The primary concerns with implementing this logic in an FPGA are the availability of logic blocks and the availability of multipliers.  The available embedded multipliers can be used in several of the logic blocks; this requires that the arithmetic be converted to fixed point, but this should not be too difficult.

Also, while custom VHDL logic blocks will be an optimal implementation of each “processor”,  the availability of embedded microprocessors provides a number of alternatives.  The MicroBlaze or PicoBlaze processor cores available in the Xilinx FPGAs could be used to perform the color space conversion and down sampling, and that processor or another could be used to handle the quantization and the encoding.  The flexibility of the FPGA core and the availability of different logic cores allows a number of design tradeoffs to be explored.

The most significant impact is the availability of the 2-D DCT core.  Designing, implementing, and testing a VHDL implementation of the DCT processor would be a time-consuming exercise; its availability as a pre-packaged core will greatly speed development time.

8.0         Conclusions

The JPEG image compression standard is an ideal candidate for implementation in a pipelined dataflow architecture.  There are multiple levels of parallelism which can be exploited; the individual tasks that make up the overall algorithm can be executed in a pipelined fashion, and several tasks can also be pipelined internally (the DCT, the quantization, the down sampler, and the color space converter could all be enhanced to some degree by parallel computation).  The limiting factor is the speed at which the DCT transform can be performed.  Given sufficient hardware resources, this could be optimized by using multiple transform processors executing in parallel.  The implementation proposed here is a tradeoff that should allow the entire circuit to be implemented practically within a single FPGA.

9.0         References

[1]           Agostini, Luciano, Ivan Silva, and Sergio Bampi, “Pipelined Fast 2-D DCT Architecture for JPEG Image Compression,” Symposium on Integrated Circuits and System Design, SBCCI Proceedings, 2001, pp.226-231.

[2]           Bourke, Paul, “YCC Colour Space and Image Compression,” URL http://astronomy.swin.edu.au/-~pbourke/colour/ycc/.

[3]           Bousselmi, M., M.S. Bouhlel, N. Masmoudi, L. Kamoun, “New Parallel Architecture for the DCT and Its Inverse for Image Compression,”  URL

[4]           Gailly, Jean-loup (moderator), “comp.compression Frequently Asked Questions,” URL http://www.faqs.org/faqs/compression-faq/.

[5]           Kovac, M. and N. Ranganathan, “JAGUAR: A Fully Pipelined VLSI Architecture for JPEG Image Compression Standard,” Proceedings of the IEEE, Vol. 83, No 2, February 1995.

[6]           Li, Weiping, “A New Algorithm to Cmopute the DCT and Its Inverse,” IEEE Trans. on Signal Processing, Vol. 39, 1991, pp 1305-1313.

[7]           Sun, Sung-Hsien, and Shie-Jue Lee, “A JPEG Chip for Image Compression and Decompression,” Journal of VLSI Signal Processing, Vol. 35, pp 43-60, 2003.

[8]           Wallace, Gregory K, “The JPEG Still Picture Compression Standard,” Communications of the ACM, Vol. 34, No. 4, pp 31-44, April 1991.