Compiler Optimizations for Modern Hardware Architectures - Part II

Bob Wall

CS 550 (Fall 2003) Class Presentation



The presentation will be held on Wed., Dec. 3rd at 10:00 in EPS 347 (the Computer Science conference room).

(This is the second in our lecture series on compiler optimizations - see details of the first lecture by Neal Richter here).



Abstract

The topic of optimization in modern compilers is a broad one. Google handily offers up the following Web definition:

compiler optimization - (n.) Rearranging or eliminating sections of a program during compilation to achieve higher performance. Compiler optimization is usually applied only within basic blocks and must account for the possible dependence of one section of a program on another.

There is an abundance of ways in which this process can be done (many of which can be combined to further increase overall performance); we found it particularly interesting to focus on optimization techniques that are aimed at improving the performance of modern computing systems. Certain architectural advancements in systems, and in CPUs in particular, realize higher performance gains if the code they are executing is arranged to take advantage of hardware features.

These hardware features include the following:

Neal's lecture looks at these hardware features as they are implemented in some of the popular current-generation processors and explores the ways in which these features are taken into account in modern compilers for those processors.

My lecture will explore some optimization topics as they pertain to the new 64-bit Intel® IA-64 (Itanium®) processor. This is an example of one of the newest directions in processor architectures, EPIC (Explicitly Parallel Instruction-set Computing). I'll discuss what EPIC is all about, give an overview of the Itanium architecture, then look in more detail at some compiler optimizations that are newly introduced for or have been heavily influenced by the architecture.

As an aside: I originally planned to present information on both the Itanium and the 64-bit AMD architectures, but subsequent research revealed that while the Itanium branches off in a significantly new direction, the AMD processor (which has been dubbed "Hammer") is essentially just a souped-up successor to the Athlon processor -- a major refinement to the x86 architecture, but not a real leap. Here's an interesting article that compares Hammer to the Itanium, and actually gives a pretty good overview of each processor. Based on this analysis, I decided to only examine the Itanium.



Featured Materials

This paper, "An Overview of the Intel® IA-64 Compiler", was written by members of the team that worked on Intel's compiler for the new IA-64 architecture. It describes how some new architectural features of the processor require special compiler optimizations, and how other features simplify the optimization process.

If you would like more details, here is a set of slides (in PDF format) from a presentation on the topic by one of the Intel compiler gurus, Dr. Yong-fong Lee. This is optional - I paraphrased some of this material in my presentation, and I will try to hit its high points and show some of these slides.



Presentation Slides

These are the PowerPoint slides of my presentation.



References

For the papers, I have tried to provide links to the URLs from which they can be downloaded. Note that many of them go through the MSU Library's Electronic Journal Finder to the ACM Digital Library - this means that in order to follow the link, you will be prompted to enter your ID number and password.

For books, I provided links to their listings on www.nerdbooks.com, a site that I have found to be a very good source for technical books. I also included links to their listings on Amazon.


General Material on Compiler Optimization

Allen, Randy and Ken Kennedy. Optimizing Compilers for Modern Architectures: A Dependence-based Approach, Morgan Kaufmann Publishers, San Francisco, CA, 2002.

A whole book on this very topic - half the book surveys modern architectures and describes the challenges they impose for compilers. They focus on the concept of using the analysis of data dependence to direct the optimization. They then look at dependence-based methods applied to superscalar and VLIW architectures. (Here's the Amazon link.)

Muchnick, Steven S. Advanced Compiler Design and Implementation, Morgan Kaufmann Publishers, San Francisco, CA, 1997.

This seems to be a very good presentation of topics related to the "back end" of the compiler - intermediate representation, run-time support, and in particular optimization. I mostly looked at the chapters on register allocation and code scheduling.

This book is used as the textbook for a number of advanced compiler courses around the country - I'd definitely recommend it for people who want to dig a little deeper in compiler implementation. (Here's the Amazon link.)

Fox, Armando, Michael Hsiao, James Reed, and Brent Whitlock. "A Survey of General and Architecture-Specific Compiler Optimization Techniques."

An excellent survey of various optimization techniques. Unfortunately, neither the paper nor the Citeseer reference indicates when or where the paper was published; from the list of references, I think it was probably written around 1992.

Schneck, Paul B. "A Survey of Compiler Optimization Techniques", ACM/CSC-ER Proceedings of the Annual Conference, 1973, pp. 106-113.

For the curious and the computer history buff, this is an older paper on the topic - it discusses optimizing FORTRAN, which gives you an idea of how old it is! The list of references contains much of the pioneering work in the field.

Bacon, David F., Susan L. Graham, and Oliver J. Sharp. "Compiler Transformations for High-Performance Computing", ACM Computing Surveys, Vol. 26, No. 4, Dec. 1994, pp. 345-420.

A much more extensive survey on the topic of compiler optimization, with more attention to hardware-specific optimizations. The reference list is enormous.

Chaitin, Gregory J. "Register Allocation and Spilling via Graph Coloring", Proceedings of the 1982 SIGPLAN Symposium on Compiler Construction, Boston, Mass., June 1982, pp. 98-105.

One of the important original papers on this topic, often cited in later research. This is an algorithm for allocating variables to registers and for generating the code to save registers to memory (spill them) and later restore them, when there are not enough to meet the program's demands.
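
As a rough illustration of the idea (not the algorithm exactly as the paper gives it), the following Python sketch colors an interference graph in the simplify-then-select style. The function name, the dictionary representation of the graph, and the highest-degree spill heuristic are all assumptions made for this example; a real allocator would weigh spill costs, insert spill code, and then rebuild the graph and repeat.

```python
def color_registers(interference, k):
    """Simplified graph-coloring register allocation (illustrative sketch).

    interference: dict mapping each variable to the set of variables
                  that are live at the same time (its neighbors).
    k:            number of available registers (colors).
    Returns (assignment, spilled): a dict variable -> register number,
    and the set of variables chosen for spilling.
    """
    # Work on a copy so the caller's graph is left untouched.
    graph = {v: set(ns) for v, ns in interference.items()}
    stack = []
    spilled = set()

    while graph:
        # Simplify: a node with fewer than k neighbors can always be
        # colored later, so remove it and remember it on the stack.
        candidate = next((v for v in sorted(graph) if len(graph[v]) < k), None)
        if candidate is None:
            # Every remaining node has k or more neighbors: pick a spill
            # candidate. Highest degree is an assumed heuristic here.
            candidate = max(sorted(graph), key=lambda v: len(graph[v]))
            spilled.add(candidate)
        else:
            stack.append(candidate)
        # Remove the node and its edges from the working graph.
        for neighbor in graph.pop(candidate):
            graph[neighbor].discard(candidate)

    # Select: pop nodes in reverse order and give each the lowest
    # register not already used by a colored neighbor.
    assignment = {}
    while stack:
        v = stack.pop()
        used = {assignment[n] for n in interference[v] if n in assignment}
        assignment[v] = next(r for r in range(k) if r not in used)
    return assignment, spilled


# Hypothetical example: a, b, c are all live together; d overlaps only a.
example = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
two_regs, two_spills = color_registers(example, 2)
three_regs, three_spills = color_registers(example, 3)
```

With three registers the example graph colors without spilling; with two, one of the three mutually interfering variables has to be spilled.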

Jouppi, Norman P. and David W. Wall. "Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines", in Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, Boston, Mass., 1989, pp. 272-282.

An early paper exploring the impact of emerging superscalar architectures on compiler optimization. (Both Jouppi and Wall were very involved with the design of the original superscalar processors.)

Kistler, Thomas and Michael Franz. "Continuous Program Optimization: A Case Study", ACM Transactions on Programming Languages and Systems (TOPLAS), Vol. 25, No. 4, July 2003, pp. 500-548.

An interesting paper on two relatively new areas of research - optimizing programs at load time, to take advantage of specific knowledge about the hardware and of exactly which modules will be loaded (i.e. system libraries), and dynamic run-time optimization, where the program's live performance profile can be used to direct on-the-fly reoptimization.


General Material on Hardware Architectures

Hennessy, John L., David A. Patterson, and David Goldberg. Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, San Francisco, CA, 1990.

The seminal book on advanced computer architectures - a must-have for the hard-core bit head. (The third edition was published in 2002.) (Here's the Amazon link.)

Patterson, David A. and John L. Hennessy. Computer Organization & Design: The Hardware / Software Interface, Morgan Kaufmann Publishers, San Francisco, CA, 1994.

A companion to the Hennessy / Patterson book - actually, more of a precursor or introduction to topics that are covered in more detail in the other book. This is an excellent starting point for learning about computer architectures, or a good reference for software people who need to familiarize themselves with computer architecture but don't want to dig into it really deeply. (The second edition was published in 1997.)

Here's an amusing little site that spells out the differences between the Hennessy/Patterson book and the Patterson/Hennessy book. (Here's the Amazon link.)


Material on the IA-64 Hardware Architecture

Smotherman, Mark. "Understanding EPIC Architectures and Implementations", available online at http://www.cs.clemson.edu/~mark/464/acmse_epic.pdf.

Contains a good history of the work that led to the development of the IA-64 / Itanium architecture, and a very nice characterization of the differences between VLIW, superscalar, and EPIC architectures.

Gwennap, Linley. "Intel, HP Make EPIC Disclosure", Microprocessor Report, Vol. 11, No. 14, Oct. 27, 1997, pp. 1-5.

One of the initial articles following the announcement of the new EPIC instruction set by HP and Intel. It includes a very straightforward description of the advantages EPIC processors have over the current CISC, RISC, and VLIW processors.

Gwennap, Linley. "Intel Discloses New IA-64 Features", Microprocessor Report, Vol. 13, No. 3, Mar. 8, 1999, pp. 1-4.

Intel reveals more details about the IA-64 instruction set and the first implementation, Merced. These include the implementation of a rotating register set, the special branch and loop count registers, the ability of the compiler to provide branch prediction "hints", and the Advanced Load Address Table (ALAT).

Gwennap, Linley. "IA-64: A Parallel Instruction Set", Microprocessor Report, Vol. 13, No. 7, May 31, 1999, pp. 1-7.

More early details on the IA-64 instruction set. (Obviously, I like Gwennap's articles.) This is the "full disclosure" by HP and Intel of the IA-64 architecture and instruction set. New additions include register frames and the Register Stack Engine.

Gwennap, Linley. "Merced Shows Innovative Design", Microprocessor Report, Vol. 13, No. 13, Oct. 6, 1999, pp. 1-6.

Details about the Merced microarchitecture.

Dulong, Carole. "The IA-64 Architecture at Work", IEEE Computer, Vol. 31, No. 7, July 1998, pp. 24-32.

A demonstration of how the Itanium architecture's predication and control speculation features enable compilers to extract instruction-level parallelism.
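
To make the predication idea concrete, here is a small Python model (only a sketch, not IA-64 code and not taken from the article) of what a compiler does when it converts a branch into predicated instructions. The helper `commit` and both function names are invented for this illustration. The comparison produces a pair of complementary predicates, both arms of the original if/else are executed, and each result is kept only when its guarding predicate is true. With no branch left in the code, the two arms are independent and can be scheduled in parallel.

```python
def commit(predicate, value, old):
    """Model a predicated instruction: the new value is written only when
    the guarding predicate is true; otherwise the old value is kept."""
    return value if predicate else old


def branchy_abs_diff(a, b):
    # Original form: a conditional branch selects one of two arms.
    if a > b:
        r = a - b
    else:
        r = b - a
    return r


def predicated_abs_diff(a, b):
    # One compare sets two complementary predicates.
    p1 = a > b
    p2 = not p1
    # Both arms are executed; neither depends on the other, so a wide
    # machine could run them in the same cycle. Only the arm whose
    # predicate is true actually updates r.
    r = None
    r = commit(p1, a - b, r)
    r = commit(p2, b - a, r)
    return r


# The two forms agree on every pair in this small sample.
pairs = [(5, 3), (3, 5), (4, 4), (-2, 7)]
results = [(branchy_abs_diff(a, b), predicated_abs_diff(a, b)) for a, b in pairs]
```

The cost of this transformation is that work is done for both arms, so it pays off mainly when the arms are short or the branch is hard to predict.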

McNairy, Cameron and Don Soltis. "Itanium 2 Processor MicroArchitecture", IEEE Micro, Vol. 23, No. 2, Mar.-Apr. 2003, pp. 44-55.

A good description of the new Itanium 2 internals, including paging and caching, the processor pipeline, instruction issue, branch prediction, and the system interface.

Evans, James S. and Gregory L. Trimper. Itanium Architecture for Programmers: Understanding 64-Bit Processors and EPIC Principles, Prentice Hall PTR, 2003.

Details about the Itanium architecture - the instruction set, addressing, the register stack engine, predication, I/O, procedure calls, floating-point operations, etc. Also includes some pointers on optimizing code.

I haven't read any of the book, but the authors have a Web site that contains a lot of material and links related to the Itanium. (Here's the Amazon link.)

Triebel, Walter. Itanium Architecture for Software Developers, Intel Press, 2003.

I haven't read any of this book, but it is supposedly a good starter for people who need to understand the IA-64 architecture in depth. (Here's the Amazon link.)

Barcella, Marco, Karthik Sankaranarayanan, and Ganesh J. Pai. "Itanium: An EPIC Architecture", Web site at the University of Virginia, Mar. 2001.

Students did a survey of different architectures for a class - the Itanium page includes the Gwennap articles I referenced above, plus a number of links to information from Intel and HP about the IA-64 architecture, microarchitecture, compiler, etc.

Intel Itanium 2 Web Site.

Information on the Itanium 2, if you can wade through the hype.

Hewlett Packard IA-64 Web Site.

A number of white papers on EPIC, IA-64, and the Itanium 2.


Material on IA-64 Compiler Optimization

Dulong, Carole, Rakesh Krishnaiyer, Dattatraya Kulkarni, Daniel Lavery, Wei Li, John Ng, and David Sehr. "An Overview of the Intel® IA-64 Compiler", Intel Technology Journal, Q4, 1999, pp. 1-15.

Details on how Intel® built their compiler for the Itanium processor. It includes a good discussion of some general optimizations plus specifics on modifications to these optimizations and new optimizations that are specific to the IA-64 architecture. This was one of my primary sources of information.

Dulong, Carole, Priti Shrivastav, and Azita Refah. "The Making of a Compiler for the Intel® Itanium(TM) Processor", Intel Technology Journal, Q3, 2001, pp. 1-7.

A peripheral article describing the difficulties of developing the new Itanium compiler in the absence of hardware (Intel planned to have the compiler ready when the hardware was released). There are no real details of the compiler or the hardware, but it is an interesting look at the development process.

Lee, Yong-fong. "An Overview of IA-64 Architectural Features and Compiler Optimization", presented at 14th International Conference on Parallel and Distributed Computing Systems (PDCS-2001), Aug. 8-10, 2001, Richardson, TX.

A more detailed version of my talk, presented by one of the key members of Intel's IA-64 compiler research group. I referred to these slides quite a bit for ideas and clarification for my presentation.

Zahir, Rumi, Jonathan Ross, Dale Morris, and Drew Hess. "OS and Compiler Considerations in the Design of the IA-64 Architecture", ACM SIGPLAN Notices, Vol. 35, No. 11, Nov. 2000, pp. 212-221.

A much more in-depth analysis of the IA-64's new features that support control and data speculation and register stacking, and how these features can be used to the advantage of the computer system.

Settle, Alex, Daniel A. Connors, Gerolf Hoflehner, and Dan Lavery. "Optimization for the Intel® Itanium® Architecture Register Stack", Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, San Francisco, CA, 2003, pp. 115-124.

A more in-depth look at the Itanium register stack and what the compiler can do to maximize its positive impact on program speed.

Kästner, Daniel and Sebastian Winkel. "ILP-based Instruction Scheduling for IA-64", Proceedings of the ACM SIGPLAN Workshop on Languages, Compilers and Tools for Embedded Systems, Snowbird, UT, 2001, pp. 145-154.

Generating a good instruction schedule on the IA-64 requires modeling the data dependence, resource constraints, and the bundling mechanism of the processor. The paper examines how Integer Linear Programming (ILP) techniques can be used to perform this modeling to find nearly optimal solutions in acceptable time.

Yang, Liu, Sun Chan, G. R. Gao, Roy Ju, Guei-Yuan Lueh, and Zhaoqing Zhang. "Inter-procedural Stacked Register Allocation for Itanium® Like Architecture", Proceedings of the 17th Annual International Conference on Supercomputing, San Francisco, CA, 2003, pp. 215-225.

More ideas about the tradeoffs between using the Register Stack Engine and using explicit spilling if there is a large register demand. The paper introduces an algorithm to solve this problem across procedural boundaries.

Collard, Jean-Francois and Daniel Lavery. "Optimizations to Prevent Cache Penalties for the Intel® Itanium® 2 Processor", Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, San Francisco, CA, 2003, pp. 105-114.

Rather than focusing on improving cache hit rates, this paper examines the use of memory disambiguation techniques to minimize the penalties for cache misses, and analyzes the performance increases that can be obtained by combining both techniques.




Mail me at: bwall@cs.montana.edu

Last modified: Dec. 1, 2003