  • PC Architecture


    Chapter 1. The PC, history and logic


  • The PC is a fascinating subject, and I want to take you on an illustrated, guided tour of its workings. But first I will tell you a bit about the background and history of computers. I will also have to introduce certain terms and expressions, since computer science is a subject with its own terminology. Then I will start to go through the actual PC architecture!

    1. The historical PC

    The PC is a microcomputer, according to the traditional division of computers based on size.

    Microcomputers

    No-one uses the expression microcomputer much anymore, but that is what the PC actually is. If we look at computers based on size, we find the PC at the bottom of the hierarchy.
  • Mainframes and supercomputers are the biggest computers – million-dollar machines, as big as a refrigerator or bigger. An example is the IBM System/390.
  • Minicomputers are large, powerful machines which are often found at the centre of networks of “dumb” terminals and PC’s. For example, IBM’s AS/400. A definition that was used in the past, was that minicomputers cost between $10,000 and $100,000.
  • Workstations are very powerful user machines. They have the capacity to execute technical/scientific programs and calculations, and typically use a UNIX variant or Windows NT as their operating system. Workstations used to be equipped with powerful RISC processors, like Digital Alpha, Sun Sparc or MIPS, but today workstations can be configured with one or more of Intel’s more powerful CPU’s.
  • The PC is the baby of the family: Small, cheap, mass-produced computers which typically run Windows and which are used for standard programs which can be purchased anywhere.
    Fig.  1. Data processing in 1970. Digital PDP 11/20.

    The PC’s childhood

    Let’s take a short look at the historical background of the modern PC, which originated in 1981. In less than 20 years, the PC went through a technological development which has surpassed everything we have seen before. The PC has simply revolutionised society’s production and communication in just about every sector. And the revolution appears to be set to continue for many more years.
    Today the PC is an industry standard. More than 90% of all microcomputers are based on Microsoft’s software (Windows) and standardised hardware designed primarily by Intel. This platform or design is sometimes called Wintel, a combination of the two product names.
    But at the time that the PC was introduced by IBM, it was just one of many 16-bit microcomputers. For example, the company Digital sold many of their “Rainbow” machines in the mid-1980’s – machines I have worked with myself. These other machines were not IBM-compatible, but they weren’t very different from IBM’s machines either, since they were all based on Intel’s 8088 CPU. There were actually a number of different types of PC in the 1980’s.
    Fig.  2. DEC Rainbow from 1982. It cost around 8,000 euros – at the time!
    But over just a few years, late in the 1980’s, the market got behind IBM’s standards for PC architecture. Using the Intel 8086 and 8088 processors and Microsoft’s operating systems (DOS initially, later Windows), the PC revolution got seriously underway. From that time on, we talked about IBM-compatible PCs, and as the years passed, the PC developed to become the triumphant industry standard.
    In parallel with the IBM/Intel project, Apple developed the popular Macintosh computers, which from the very start were very user-friendly, with a graphical user interface. The Macintosh is a completely different platform from the Windows-based PC’s I am describing in this guide.
    The Macintosh has also been released in generation after generation, but it is not compatible with IBM/Intel/Microsoft’s PC standard.
    Fig.  3. An almost IBM-compatible PC from 1984.
    In the table below you can see the development of the PC and its associated operating systems. The PC was actually a further development of the 8-bit home computers (like the Commodore 64, etc.), which were very popular until late in the 1980’s.
    The computer shown in Fig. 2, is a very interesting hybrid. It marked the transition from 8 to 16-bit architecture. The computer contains two processors: an 8-bit Z80 and a 16-bit 8088. This enabled it to run several different operating systems, such as CP/M and MS-DOS 2. The two processors, each with their own bus, shared the 128 KB RAM. It was a particularly advanced machine.
    Fig.  4. The microprocessor has entered its fourth decade.

    IBM and the PC’s success

    If we look back at the early PC, there are a number of factors which contributed to its success:
  • From the very beginning the PC had a standardised and open architecture.
  • It was well-documented and had extensive expansion options.
  • The PC was cheap, simple and robust (but definitely not advanced technology).
    Initially, the PC was an IBM product. It was their design, built around an Intel processor (8088) and adapted to Microsoft’s simple operating system, MS-DOS.
    But other companies were quick to get involved. They found that they could freely copy the important BIOS system software and the central ISA bus. None of the components were patented. That wouldn’t happen today! But precisely because of this open architecture, a whole host of companies gradually appeared, which developed and supplied IBM-compatible PC’s and parts.

    Clones

    In the late 1980’s there was a lot of talk about clones. A clone is a copycat machine. A machine which can do exactly the same things as an original PC (from IBM), and where the individual components (e.g. the hard disk) could be identical to the original’s. The clone just has another name, or is sold without any name.
    We don’t distinguish as much today between the various PC manufacturers; but they can still be divided into two groups:
  • Brand name PC’s from IBM, Compaq, Dell, Fujitsu-Siemens, etc. Companies which are large enough to develop (potentially) their own hardware components.
  • Clones, which are built from standard components. Anyone can build their own clone, like the one shown in Fig. 15 on page 10.
    However, the technology is basically the same for all PC’s – regardless of the manufacturer. And this common technology is the subject I am going to expound.
    Finally, I just want to mention the term servers. They are special PC’s built to serve networks. Servers can, in principle, be built using the same components that are used in normal PC’s. However, other motherboards and a different type of RAM and other controllers are often used. My review will concentrate primarily on standard PC’s.

    Bit width

    The very first microprocessor Intel produced (the model 4004, also discussed on page 26) was 4 bit. This meant that in a single operation, the processor could process numbers which were 4 bits long. One can say that the length of a machine word was 4 bits. The Intel 4004 was a 4-bit processor with a 4-bit architecture. Later came processors which could process 8 bits at a time, like the Intel 8008, 8080, and not least, the Zilog Z80 (a very large number were sold). These were used in a large number of 8-bit computers throughout the 1970’s and well into the 1980’s.
    The PC (in the 1980’s) was initially a 16-bit computer. With the development of the 80386 processor, there was a change to the 32-bit architecture which we are still using today.
    Now there is a 64-bit architecture on the way, both from Intel (with the Itanium processor) and from AMD (with various Athlon 64 processors). But it is still too early to predict the extent to which the 64-bit architecture will spread into normal, Windows-based PC’s.
    Width    Processor             Application
    4-bit    4004                  Pocket calculators
    8-bit    8080                  Small CP/M-based home computers
    16-bit   8086, 8088, 80286     IBM-compatible PC’s running MS-DOS
    32-bit   80386 - Pentium 4     32-bit versions of Windows (Windows 95/98/2000/XP)
    64-bit   Athlon 64,            Server software; 64-bit versions of
             Pentium 4, Itanium    Windows, Linux, etc.
    Fig.  5. Today’s PC’s use mostly 32-bit architecture.
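    If you want to see for yourself how quickly the number of possible values grows with the word length, here is a small Python sketch (my own illustration, not part of the original guide):

```python
# Each extra bit doubles the number of values a machine word can hold:
# an n-bit word can represent 2**n different values.
for width in (4, 8, 16, 32, 64):
    print(f"{width:>2}-bit word: {2 ** width:,} possible values")
```

    A 4-bit word can only hold 16 different values, while a 64-bit word can hold more than 18 quintillion. That is the real payoff of the wider architectures.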

    The pre-history of computers

    Our PC’s have “spiritual roots” going back 350 years. Mathematicians and philosophers like Pascal, Leibniz, Babbage and Boole laid the foundations with their theoretical work.
    The Frenchman, Blaise Pascal, lived from 1623 to 1662, and was a mathematical genius from a very young age.
    As an 18-year-old, he constructed a calculating machine, and his mathematical theories have had enormous significance to all later scientific research.
    The Englishman, George Boole (1815-1864), was also a natural talent. He grew up in very humble surroundings, and was largely self-taught.
    When he was 20 years old, Boole founded a mathematics school and then began to develop the symbolic logic which is currently the cornerstone of every program.
    Another Englishman, Charles Babbage, began developing various mechanical calculating machines in 1823, which are today considered to be the theoretical forerunners of the computer. Babbage’s “analytical machine” could perform data calculations using punched cards. The machine was never fully realised; the plan was to power it using steam.
    Fig.  6. A construction drawing for one of Babbage’s calculating machines, which consisted of several tons of brass machinery.

    Fig.  7. Charles Babbage (1791-1871) and his staff constructed various programs (software) for his calculating machine. Babbage is therefore called the ”father of the computer” today.
     
    However, it was only in the 20th century that electronics advanced sufficiently to make practical exploitation of these theories interesting.
    John Vincent Atanasoff (1903-1995), an American of Bulgarian descent, is considered the inventor of the electronic digital computer.
    Atanasoff was a genius. At the age of nine, he studied algebra with the help of his mother, Iva Lucena Purdy, a mathematics schoolteacher.

    In the 1930’s Atanasoff was a professor of mathematics and physics at Iowa State University in the USA. There he used existing tools like the Monroe calculator and the IBM tabulator for his calculations, but he found these machines too slow and inaccurate. For years he worked on the idea that there had to be better machines for calculation. His thought was to produce a digital machine, since Atanasoff had concluded that mathematical devices fell into two classes, analog and digital. The term digital had not yet been invented, so he called this class of devices “computing machines proper”.
    In the winter of 1939 Atanasoff was very frustrated by his lack of progress. After a long car ride (Atanasoff was fond of fast cars) he found himself drinking whisky in a bar (he was fond of scotch as well). Suddenly he had the solution: a machine built on four principles. It would work with base-two (binary) numbers instead of base-10, and use condensers (capacitors) for memory. Atanasoff teamed up with a brilliant young engineer, Clifford Berry, and the 700-pound machine called the Atanasoff-Berry Computer was later developed. This was the first digital computer.
    Another pioneer was the German Konrad Zuse (1910-1995). In the 1930’s, while still a young engineer, he constructed his own mechanical binary computer, the Z1.
    During the Second World War, Zuse’s computer Z3 was used in the German aircraft industry. It was the first computer in the world to be programmed with software. It is interesting that Zuse’s computers were developed entirely independently of other contemporary scientists’ work.
    Fig.  8. Konrad Zuse, one of the first scientists to produce working computers.
    During the war, the Germans also used an advanced code machine (Fig. 9), which the English expended a great deal of effort on “hacking”. They were successful, and this contributed to laying the foundation for the later development of computing.
    An interesting piece of trivia: In 1947, the American computer expert Howard Aiken stated that there was a need for only six computers in the entire USA. History proved him wrong.
    Fig.  9. The German ”ENIGMA” code machine.


  • Chapter 2. The Von Neumann model

  • The modern microcomputer has roots going back to the USA in the 1940’s. Of the many researchers, the Hungarian-born mathematician John von Neumann (1903-57) is worthy of special mention. He developed a very basic model for computers which we are still using today.
    Fig.  10. John von Neumann (1903-57). Progenitor of the modern, electronic PC.
    Von Neumann divided a computer’s hardware into 5 primary groups:
  • CPU
  • Input
  • Output
  • Working storage
  • Permanent storage
    This division provided the actual foundation for the modern PC, as von Neumann was the first person to construct a computer which had working storage (what we today call RAM). And the amazing thing is, his model is still completely applicable today. If we apply the von Neumann model to today’s PC, it looks like this:
    Fig.  11. The Von Neumann model in the year 2004.
    Fig.  12. Cray supercomputer, 1976.
    In April 2002 I read that the Japanese had developed the world’s fastest computer. It is a huge thing (the size of four tennis courts), which can execute 35.6 trillion mathematical operations per second. That’s five times as many as the previous record holder, a supercomputer from IBM.
    The report from Japan shocked the Americans, who considered themselves to be the leaders in the area of computer technology. While the American supercomputers are used for the development of new weapons systems, the Japanese one is to be used to simulate climate models.

    2. The PC’s system components

    This chapter is going to introduce a number of the concepts which you have to know in order to understand the PC’s architecture. I will start with a short glossary, followed by a brief description of the components which will be the subject of the rest of this guide, and which are shown in Fig. 11.

    The necessary concepts

    I’m soon going to start throwing words around like: interface, controller and protocol. These aren’t arbitrary words. In order to understand the transport of data inside the PC we need to agree on various jargon terms. I have explained a handful of them below. See also the glossary in the back of the guide.
    The concepts below are quite central. They will be explained in more detail later in the guide, but start by reading these brief explanations.

    Binary data
    Data, be it instructions, user data or something else, which has been translated into sequences of 0’s and 1’s.
    Bus width
    The size of the packet of data which is processed (e.g. moved) in each work cycle. This can be 8, 16, 32, 64, 128 or 256 bits.
    Bandwidth
    The data transfer capacity. This is measured in, for example, kilobits per second (Kbps) or megabytes per second (MBps).
    Cache
    A temporary storage area; a buffer.
    Chipset
    A collection of one or more controllers. Many of the motherboard’s controllers are gathered together into a chipset, which is normally made up of a north bridge and a south bridge.
    Controller
    A circuit which controls one or more hardware components. The controller is often part of the interface.
    Hubs
    This expression is often used in relation to chipset design, where the north and south bridge controllers are called hubs in modern designs.
    Interface
    A system which transfers data from one component (or subsystem) to another. An interface connects two components (e.g. a hard disk and a motherboard) and is responsible for the exchange of data between them. It consists of both software and hardware elements.
    I/O units
    Components like mice, keyboards, serial and parallel ports, screens, network and other cards, along with USB, FireWire and SCSI controllers, etc.
    Clock frequency
    The rate at which a component works, i.e. the number of clock ticks per second. This varies quite a lot between the various components of the PC. Usually measured in MHz.
    Clock tick (or clock cycle)
    A single clock tick is the smallest unit in the working cycle. A working cycle (e.g. the transport of a portion of data) might be executed over a period of about 5 clock ticks (it “costs” 5 clock cycles).
    Logic
    An expression I use to refer to software built into chips and controllers. E.g. an EIDE controller has its own “logic”, and the motherboard’s BIOS is “logic”.
    MHz (megahertz)
    A ”speed” unit used to indicate clock frequency. It means: million cycles per second. The more MHz, the more data operations can be performed per second.
    North bridge
    A chip on the motherboard which serves as a controller for the data traffic close to the CPU. It interfaces with the CPU through the Front Side Bus (FSB) and with the memory through the memory bus.
    Protocols
    Electronic traffic rules which regulate the flow of data between two components or systems. Protocols form part of interfaces.
    South bridge
    A chip on the motherboard which works together with the north bridge. It looks after the data traffic which is remote from the CPU (I/O traffic).
    Fig.  13. These central concepts will be used again and again. See also the definitions in the glossary at the back of the guide.


  • Chapter 3. A data processor

    The PC is a digital data processor. In practice this means that all analogue data (text, sound, pictures) gets translated into masses of 0’s and 1’s. These numbers (binary values) exist as tiny electrical charges in microscopic circuits, where a transistor can take on one of two states: charged or not charged. This is one picture of a bit, which you can say is either turned on or off.
    There can be billions of these microscopic bits hidden inside a PC, and they are all managed using electronic circuits (EDP stands for electronic data processing). For example, the letter ”A” (like all other characters) can be represented by a particular 8-digit bit pattern. For ”A”, this 8-digit bit pattern is 01000001.
    When you type an ”A” on your keyboard, you create the digital data sequence, 01000001. To put it simply, the ”A” exists as a pattern in eight transistors, where some are “turned on” (charged) and others are not. Together these 8 transistors make up one byte.
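    You can reproduce this bit pattern yourself. Here is a minimal Python sketch (my own example, using the standard ASCII encoding described above):

```python
# The ASCII code for "A" is 65, which is 01000001 in binary -
# the same 8-bit pattern (one byte) described above.
char = "A"
code = ord(char)              # 65
bits = format(code, "08b")    # "01000001"
print(char, code, bits)
```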
    The same set of data can be stored in the video card’s electronics, in RAM or even as a magnetic pattern on your hard disk:
    Fig.  14. The same data can be found on the screen, on the hard disk and in RAM.
    The set of data can also be transferred to a printer, if you want to print out your text. The printer electronically and mechanically translates the individual bits into analogue letters and numbers which are printed on the paper. In this way, there are billions of bytes constantly circulating in your PC for as long as it is switched on. But how are these 0’s and 1’s moved around, and which components are responsible?

    The physical PC

    The PC is made up of a central unit (also called a system unit) and some external devices. The central unit is a box (a cabinet), which contains most of the computer’s electronics (the internal devices). The external devices are connected to the central unit (shown below) using cables.
    Fig.  15. The central unit contains the majority of a PC’s electronics.
    The cabinet shown in Fig. 15 is a minitower. In this cabinet, the motherboard is mounted vertically down one side. You can buy a taller cabinet of the same type. It’s called a tower. If the cabinet is designed to be placed on a desk (under the monitor), it is called a desktop cabinet.
    Fig. 16. A desktop cabinet.
    Fig. 17 shows a list of most of the components of the PC. Some of them are internal, i.e. they are inside the cabinet. Other components are external; they are located outside the cabinet.
    Read through the list and think about what the words refer to. Do you know all these devices?
    Internal devices

    Motherboard: CPU, RAM, cache, ROM circuits containing the BIOS and startup programs. Chipsets (controllers). Ports, busses and slots. EIDE interface, USB, AGP, etc.
    Drives: hard disk(s), diskette drive, CD-ROM, DVD, etc.
    Plug-in cards: graphics card (video adapter), network card, SCSI controller, sound card, video and TV card, modem and ISDN card.

    External devices

    Keyboard, mouse, joystick, screen, printer, scanner, speakers, external drives, tape drive, MIDI units, modem, digital camera.
    Fig.  17. The PC’s components can be divided into internal and external groups.

    Speed – the more we get, the more we want

    The PC processes data. It performs calculations and moves data between the various components. It all happens at our command, and we want it to happen fast.
    It is interesting to note that current technological development is basically focusing exclusively on achieving faster data processing. The entire PC revolution over the last 20 years is actually just a sequence of ever increasing speed records in the area of data transfer. And there doesn’t seem to be any upper limit to how much data transfer speed we need.
    This continual speed optimisation is not just occurring in one place in the PC; it’s happening everywhere that data is moved.
  • The transfer from RAM to CPU – it has to be faster.
  • The transfer between hard disk and motherboard – it has to be faster.
  • Data to the screen – it has to be faster.
  • Etc.
    The PC can be viewed as a series of more or less independent subsystems, which can each be developed to permit greater capacity and higher speed. We constantly get new standards: new and faster interfaces, busses and protocols, which together deliver better performance.
    Fig. 18. Data transfer between all the components of the PC has to be fast.

    Interfaces hold it all together

    The PC is the sum of all these subsystems. At each boundary between one subsystem and another, we find an interface. That is, an electrical system which connects the two subsystems together and enables them to exchange data.
    Fig.  19. The hardware components are connected to each other via interfaces.
    The concept of an interface is a little abstract, as it most accurately refers to a standard (a set of rules for the exchange of data). In practice, an interface can consist of, for example, two controllers (one at each end of the connection), a cable, and some software (protocols, etc.) contained in the controllers.
    The controllers are small electronic circuits which control the movement of data to and from the device.
    Fig. 20. An interface connects two hardware devices. An interface can consist of controllers with built-in software, cables, etc.
    There are many interfaces in the PC, because there are many subsystems which have to be connected. Each interface is normally tailor-made for the job, and tuned to achieve maximum bandwidth (data transfer capacity) between the two components.

    An example of an interface

    Later in the guide I want to explore the EIDE interface in more detail, but I will use it here as a specific example of an interface. Keep your attention focused on the concept of an interface – you may not understand all the details, but that doesn’t matter here.
    Fig.  21. Underneath the hard disk you can see a small printed circuit board. This incorporates the controller functions which work together with the corresponding controller in the PC’s motherboard.
    The advantage of this system is that the hard disk can be connected directly to the motherboard with a cable. But the cable still runs from one controller to the other.
    The two controllers work according to a common standard, which is the ATA standard. This standard includes a set of protocols which are continually being developed in new versions. Let’s say our specific hard disk can use the ATA/100 protocol. That means the controller on the motherboard also has to be compatible with ATA/100, and the cable as well. When all that is in place, we have a working ATA interface.
    Fig.  22. A specific example of an interface.
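    The logic of this matching can be sketched in a few lines of Python. This is purely an illustration of the principle; the component names and version lists are invented for the example, not taken from the ATA specification:

```python
# Hypothetical example: an interface runs at the fastest protocol
# level that every link in the chain (disk, board, cable) supports.
disk_controller  = {"ATA/33", "ATA/66", "ATA/100"}
board_controller = {"ATA/33", "ATA/66", "ATA/100", "ATA/133"}
cable            = {"ATA/33", "ATA/66", "ATA/100", "ATA/133"}

common = disk_controller & board_controller & cable
fastest = max(common, key=lambda p: int(p.split("/")[1]))
print(fastest)   # ATA/100 - the hard disk is the limiting component
```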




  • Chapter 4. Intro to the motherboard

  • Construction of the motherboard.
  • The CPU.
  • The busses.
  • Chipsets (controllers).
    I will work through the individual components in more detail later in the guide. This chapter will describe the architecture in “broader” brush strokes.

    Data exchange in the motherboard

    The motherboard is a large printed circuit board, which has lots of chips, connectors and other electronics mounted on it. Computer nerds simply call it a board.
    Inside the PC, data is constantly being exchanged between or via the various devices shown in Fig. 17. Most of the data exchange takes place on the motherboard itself, where all the components are connected to each other:
    Fig.  23. Data exchange on the motherboard.
    In relation to the PC’s external devices, the motherboard functions like a central railway station.
    Fig.  24. The motherboard is the hub of all data exchange.
    All traffic originates from or ends up in the motherboard, which can appropriately be called the most important component of the PC. I will show you pictures of the individual components of the motherboard later, but this is what it looks like as a total unit:
    Fig.  25. A motherboard is a board covered with electronics.

    Find your motherboard

    If you are in position to look at a motherboard, I would recommend you do so. It is a very good exercise to try to identify the various components on a motherboard.
    The motherboard is really just a big plastic sheet which is full of electrical conductors. The conductors (also called tracks) run across and down, and in several layers, in order to connect all the individual components, and transfer data between them.
    The motherboard is mounted in the PC box using small plastic brackets and screws. The cabinet and the motherboard are made to suit each other, so there are holes in the metal for the connectors mounted on the board. Finally, the motherboard has to be connected to the PC’s power supply installed in the cabinet. This is done using a standard connector:
    Fig. 26. The power supply is connected to the motherboard via a multicoloured cable and a large white plastic connector.
    Now we’ll look at the various types of components on the motherboard.

    Chips

    The active devices on the motherboard are gathered together in chips. These are tiny electronic circuits which are crammed with transistors. The chips have various functions. For example, there are:
  • ROM chips, which store the BIOS and other programs.
  • CMOS storage, which contains user-defined data used by the setup program.
  • The chipset, which normally consists of two so-called controllers incorporating a number of very essential functions.
    You’ll learn a lot about these chips and their functions later in the guide.

    Sockets

    You will also find sockets on the motherboard. These are holders, which have been soldered to the motherboard. The sockets are built to exactly match a card or a chip.
    This is how a number of components are directly connected to the motherboard. For example, there are sockets (slots) to mount:
  • The CPU and working storage (the RAM modules).
  • Expansion cards, also called adapters (PCI, AGP and AMR slots, etc.).
    The idea of a socket is that you can install a component directly on the motherboard without needing special tools. The component has to be pushed carefully and firmly into the socket, and will then hopefully stay there.
    Fig. 27. Here you can see three (white) PCI sockets, in which plug-in cards can be installed.

    Plugs, connectors and ports…

    The motherboard also contains a number of inputs and outputs, to which various equipment can be connected. Most ports (also called I/O ports) can be seen where they end in a connector at the back of the PC. These are:
  • Ports for the keyboard and mouse.
  • Serial ports, the parallel port, and USB ports.
  • Sockets for speakers/microphone etc.
    Often, the various connectors are soldered onto the motherboard, so that the external components, like the keyboard, mouse, printer, speakers, etc., can be connected directly to the motherboard.
    Fig.  28. Connectors mounted directly on a motherboard.
    In addition to these sockets, connectors and ports, the motherboard contains a number of other contacts. These include:
  • The big connector which supplies the motherboard with power from the power supply (see Fig. 26).
  • Other connectors for the diskette drive, hard disk, CD-ROM drive, etc.
  • So-called jumpers, which are used on some motherboards to configure voltage and various operating speeds, etc.
  • A number of pins used to connect the reset button, LED for hard disk activity, built-in speaker, etc.
    Fig.  29. A connector can be an array of pins like this, which suits a special cable.
    Take a look at Fig. 30 and Fig. 31, which show connectors and jumpers from two different motherboards.
    Fig. 30. The tiny connectors and jumpers that are hidden on any motherboard.
    The ROM BIOS chip (Award brand) in Fig. 31 contains a small collection of programs (software) which are permanently stored on the motherboard, and which are used, for example, when the PC starts up:
    Fig. 31. At the bottom left, you can see the two rows of pins which connect, for example, to the little speaker inside the cabinet. On the bottom right you can see two “jumpers”.
    The round thing in Fig. 31 is the motherboard battery, which maintains the clock function and any settings saved in the CMOS storage.
    In a later chapter I will describe the motherboard seen through the eyes of a PC builder. But first we shall take a look at the motherboard’s architecture and the central components found on it.

  • Chapter 5. It all starts with the CPU

    There are two very fundamental components to study on the motherboard. The CPU and the busses. The CPU does all the data processing, and the busses handle all data transfer.
    Fig. 32. The CPU is mounted on the motherboard, hidden under the cooling fan and heat sink.

    What is a CPU?

    CPU stands for Central Processing Unit. There can be several processors in a computer, but one of them is the central one – the CPU.
    The reason the CPU is called a processor is that it can process data. And it has two important jobs:
  • It can do calculations.
  • It can move data.
    The CPU is very fast at doing both jobs. The faster the CPU can do calculations and move data, the faster we say the PC is. What follows is a short description of how to achieve faster data processing. Read it, and see if you understand all the concepts. There are three ways to improve a PC’s performance: 
  • Higher clock frequencies (which means more clock ticks per second).
  • Greater bus width.
  • Optimising the core of the processor and other components so that the maximum amount of work is done for each clock tick.
    All this can lead to better bandwidth, which is required throughout the PC. The entire development process is focused around the motherboard, and especially the CPU. But all of the electronics has to be able to keep up with the high pace, and that is what makes the motherboard so fascinating.
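    As a toy model (my own simplification, not a formula from any manufacturer), you can think of throughput as the clock frequency multiplied by the work done per clock tick:

```python
# Toy model: relative CPU throughput ~ clock frequency x work per tick.
# The numbers are invented, purely to show why both factors matter.
def throughput(clock_mhz, work_per_tick):
    return clock_mhz * work_per_tick

old = throughput(1000, 1.0)   # 1000 MHz, 1 unit of work per tick
new = throughput(1500, 1.5)   # 50% higher clock AND a more efficient core
print(new / old)              # 2.25 - more than the clock increase alone
```

    This is why an optimised core can outperform a higher-clocked but less efficient one.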
    The CPU is physically quite small. At its core is an electronic circuit (called a die), which is no bigger than your little fingernail.
    Fig.  33. The CPU circuit (the ”die”) can be seen in the middle of the chip (an Athlon XP, shown close to actual size).
    Despite its small size, the CPU is full of transistors. The die in a Pentium 4 CPU contains 125 million transistors, all squashed together into a very tight space. It is about 1 cm x 1 cm in size:
    Fig. 34. Close up of a CPU circuit (die).
    The electronic circuit is encapsulated in a much bigger plastic square. This is in order to make room for all the electrical contacts which are used to connect the CPU to the motherboard.
    The individual contacts are called pins, and a CPU can have 478 of them (as does the Pentium 4). The large number of pins means that the socket has to be relatively large.
    Fig.  35. The underside of a (Pentium 4) CPU, showing the many pins.

    Which CPU?

    The companies Intel and AMD make most CPU’s. Intel laid the foundations for the development of CPU’s for PCs with their more than 20-year-old 8086 and 8088 processors.
    CPU’s are developed in series, or generations. Each series is known by its name. The last four generations of Intel processors, for example, have been the Pentium, Pentium II, Pentium III and Pentium 4. Running alongside these is the Celeron series, which are cheaper versions, typically with reduced L2 cache and a slower front side bus:
    Fig. 36. A Celeron processor supplied in a box from Intel, with heat sink and fan.
    Within each generation there are many variants with different clock frequencies. For example, when the Pentium 4 was released in the year 2000, it was as a 1400 MHz version. The original model was later followed up by versions with 1800, 2000, etc. MHz, up to 2400 MHz (the clock frequencies came in intervals of 100 MHz). In the year 2002, a new model came out for which the clock frequencies started at 2266, 2400 and 2533 MHz, and increased in intervals of 133 MHz. A year later the clock frequencies were raised to intervals of 200 MHz, with the Pentium 4 chips running from 2600 to 3600 MHz. And so it continues.
    The company, AMD, produces similar processors in the Sempron and Athlon 64 series, which also come with different clock frequencies.
    Fig. 37. The Pentium 4 socket 478 on a motherboard.

    Find your CPU

    If you are not sure which CPU your PC uses, you can investigate this in several ways. You could check your purchase receipt. The name of the CPU should be specified there.
    You could look inside your PC and locate the CPU. But it is quite difficult to get to see the model name, because there is a fan mounted on the actual chip. The fan is often glued directly onto the processor, so that it is not easy to remove it.
    Fig. 38. A CPU is shown here without a cooling fan. It is mounted in a small socket which it clicks into without needing any tools.
    In Windows, you can select the System Properties dialog box, where you can see the processor name and clock frequency:
    You can also watch carefully when your PC starts up. Your CPU name and clock frequency is shown as one of the first things displayed on the screen. You can press the Pause key to pause the startup process. Below you can see a picture of the startup screen for a PC. This PC has an Intel Pentium 4, with a clock frequency (work rate) of 2553 MHz:
    Fig. 39. If you are not sure which CPU your PC uses, you can see it on the screen, shortly after you switch on your PC.

    CPU testing programs

    Finally, let me just mention some small utility programs which you can download from the Internet (e.g. search for “WCPUID” or “CPU-Z” on www.google.com, and you’ll find them). The programs WCPUID and CPU-Z reveal lots of information about your CPU, chipset, etc. They are used by motherboard nerds.
    Fig. 40. Here CPU-Z reports that the Pentium 4 processor is a ”Prescott” model. Due to Hyper-Threading, the processor presents itself as two logical processors.

  • Chapter 6. The CPU and the motherboard

    The heart and soul of the PC’s data processing is the CPU. But the processor is not alone in the world; it communicates with the rest of the motherboard. There will be many new terms introduced in the following sections, so remember that you can find definitions for all the abbreviations in the back of the guide.

    Busses do the transfers

    Data packets (of 8, 16, 32, 64 or more bits at a time) are constantly being moved back and forth between the CPU and all the other components (RAM, hard disk, etc.). These transfers are all done using busses.
    The motherboard is designed around some very powerful data channels (or pathways, as they are also called). It is these busses which connect all the components to each other.

    Fig.  41. The busses are the data channels which connect the PC’s components together. Some are designed for small transfers, others for large ones.

    Busses with varying capacities

    There is not just one bus on a motherboard; there are several. But they are all connected, so that data can run from one to another, and hence reach the farthest corners of the motherboard.
    We can say that a bus system is subdivided into several branches. Some of the PC components work with enormous amounts of data, while others manage with much less. For example, the keyboard only sends very few bytes per second, whereas the working storage (RAM) can send and receive several gigabytes per second. So you can’t attach RAM and the keyboard to the same bus.
    Two busses with different capacities (bandwidths) can be connected if we place a controller between them. Such a controller is often called a bridge, since it functions as a bridge between the two different traffic systems.
    Fig.  42. Bridges connect the various busses together.
    The entire bus system starts close to the CPU, where the load (traffic) is greatest. From here, the busses work outwards towards the other components. Closest to the CPU we find the working storage. RAM is the component which has the very greatest data traffic, and is therefore connected directly to the CPU by a particularly powerful bus. It is called the front side bus (FSB) or (in older systems) the system bus.
    Fig.  43. The PC’s most important bus looks after the “heavy” traffic between the CPU and RAM.
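    The peak bandwidth of a bus can be estimated as its width (in bytes) times its clock frequency, times the number of transfers per tick. A quick Python sketch, with figures that are typical of the period (chosen by me for illustration):

```python
# Peak bus bandwidth = width (bits) / 8 * clock (MHz) * transfers per tick.
def bandwidth_mb_per_s(width_bits, clock_mhz, transfers_per_tick=1):
    return width_bits / 8 * clock_mhz * transfers_per_tick

print(bandwidth_mb_per_s(64, 133))     # 1064.0 MB/s: a classic 64-bit system bus
print(bandwidth_mb_per_s(64, 100, 4))  # 3200.0 MB/s: a "quad-pumped" Pentium 4 FSB
```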
    The busses connecting the motherboard to the PC’s peripheral devices are called I/O busses. They are managed by the controllers.

    The chipset

    The motherboard’s busses are regulated by a number of controllers. These are small circuits which have been designed to look after a particular job, like moving data to and from EIDE devices (hard disks, etc.).
    A number of controllers are needed on a motherboard, as there are many different types of hardware devices which all need to be able to communicate with each other. Most of these controller functions are grouped together into a couple of large chips, which together comprise the chip set.
    Fig.  44. The two chips which make up the chipset, and which connect the motherboard’s busses.
    The most widespread chipset architecture consists of two chips, usually called the north and south bridges. This division applies to the most popular chipsets from VIA and Intel. The north bridge and south bridge are connected by a powerful bus, which is sometimes called a link channel:
    Fig.  45. The north bridge and south bridge share the work of managing the data traffic on the motherboard.

    The north bridge

    The north bridge is a controller which controls the flow of data between the CPU and RAM, and to the AGP port.
    In Fig. 46  you can see the north bridge, which has a large heat sink attached to it. It gets hot because of the often very large amounts of data traffic which pass through it. All around the north bridge you can see the devices it connects:
    Fig.  46. The north bridge and its immediate surroundings. A lot of traffic runs through the north bridge, hence the heat sink.
    The AGP is actually an I/O port. It is used for the video card. In contrast to the other I/O devices, the AGP port is connected directly to the north bridge, because it has to be as close to the RAM as possible. The same goes for the PCI Express x16 port, which replaces AGP in newer motherboards. But more on that later.

  • Chapter 7. The south bridge

    The south bridge incorporates a number of different controller functions. It looks after the transfer of data to and from the hard disk and all the other I/O devices, and passes this data into the link channel which connects to the north bridge.
    In Fig. 44 you can clearly see that the south bridge is physically located close to the PCI slots, which are used for I/O devices.
    Fig.  47. The chipset’s south bridge combines a number of controller functions into a single chip.

    The various chipset manufacturers

    Originally it was basically only Intel who supplied the chipsets to be used in motherboards. This was quite natural, since Intel knows everything about their own CPU’s and can therefore produce chipsets which match them. But at the time the Pentium II and III came out, other companies began to get involved in this market. The Taiwanese company, VIA, today produces chipsets for both AMD and Intel processors, and these are used in a large number of motherboards.
    Other companies (like SiS, nVidia, ATI and ALi) also produce chipsets, but these haven’t (yet?) achieved widespread use. The CPU manufacturer, AMD, produces some chipsets for their own CPU’s, but they also work together closely with VIA as the main supplier for Athlon motherboards.
    Fig.  48. The Taiwanese company, VIA, has been a leader in the development of new chipsets in recent years.
    Since all data transfers are managed by the chipset’s two bridges, the chipset is the most important individual component on the motherboard, and new chipsets are constantly being developed.
    The chipset determines the limits for clock frequencies, bus widths, etc. The chipset’s built-in controllers are also responsible for connecting I/O devices like hard disks and USB ports; thus the chipset also determines, in practice, which types of devices can be connected to the PC.
    Fig.  49. The two chips which make up a typical chipset. Here we have VIA’s model P4X266A, which was used in early motherboards for Pentium 4 processors.

    Sound, network, and graphics in chipsets

    Developments in recent years have led chipset manufacturers to attempt to place more and more functions in the chipset.
    These extra functions are typically:
  • Video card (integrated into the north bridge)
  • Sound card (in the south bridge)
  • Modem (in the south bridge)
  • Network and Firewire (in the south bridge)
    All these functions have traditionally been managed by separate devices, usually plug-in cards, which connect to the PC. But it has been found that these functions can definitely be incorporated into the chipset.
    Fig.  50. Motherboard with built-in sound functionality.
    Intel has, for many years, managed to produce excellent network cards (Ethernet 10/100 Mbps), so it is only natural that they should integrate this functionality into their chipsets.
    Sound facilities in a chipset cannot be compared with “real” sound cards (like, for example, Sound Blaster Audigy). But the sound functions work satisfactorily if you only want to connect a couple of small speakers to the PC, and don’t expect perfect quality.
    Fig.  51. This PC has two sound cards installed, as shown in this Windows XP dialog box. The VIA AC’97 is a sound card emulation which is built into the chipset.
    Many chipsets also come with a built-in video card. The advantage is clear; you can save having a separate video card, which can cost $100 or more.
    Again, the quality can’t be compared with what you get with a separate, high quality, video card. But if you don’t particularly need support for multiple screens, DVI (for flat screens), super 3D performance for games, or TV-out, the integrated graphics controller can certainly do the job.
    Fig.  52. This PC uses a video card which is built into the Intel i810 chipset.
    It is important that the integrated sound and graphics functions can be disabled, so that you can replace them with a real sound or video card. The sound functions won’t cause any problems; you can always ask Windows to use a particular sound card instead of another one.
    But the first Intel chipset with integrated graphics (the i810) did not allow for an extra video card to be installed. That wasn’t very smart, because it meant users were locked into using the built-in video card. In the subsequent chipset (i815), the problem was resolved.

    Buying a motherboard

    If you want to build a PC yourself, you have to start by choosing a motherboard. It is the foundation for the entire PC.
    Most of the motherboards on the market are produced in Taiwan, where manufacturers like Microstar, Asus, Epox, Soltek and many others supply a wide range of different models. Note that a producer like Microstar supplies motherboards to brand name manufacturers like Fujitsu-Siemens, so you can comfortably trust the quality. Taiwan is the leader in the area of motherboards.
    The first issue to work out is, which CPU you want to use. For example, if you want to use a Pentium 4 from Intel, there is one line of motherboards you can choose between. If you choose an AthlonXP, there is another line. And the difference lies in which chipset is being used in the motherboard.
    Fig.  53. A typical (technical) advertisement for a motherboard.
    Once you have decided on a processor, you should try to get a motherboard with the latest chipset available, because new versions of chipsets continue to be released, with greater functionality. At the time of writing, for example, chipsets often include these functions:
  • USB version 2.0.
  • Dual channel RAM.
  • Support for 400 and 533 MHz DDR2 RAM.
  • Integrated Firewire ports.
  • Serial ATA.
  • Surround sound.
  • Gigabit Ethernet.
    You will most likely want to have these facilities (which are described later in the guide) on your PC. That is why it is important to choose the right motherboard with the latest generation chipset.

    Extra facilities

    Some motherboards offer extra facilities, such as:
  • Built-in RAID or (seldom) SCSI controller.
  • Other network, screen and sound facilities.
  • Wireless LAN.
  • SmartCard/MemoryStick/etc. readers.
    One of the advantages of building your own PC is that you can choose a really exciting motherboard.
    Development is taking place rapidly, and by choosing the right motherboard, you can design the absolute latest PC on the market.
    You can also find hundreds of articles on the Internet about each motherboard and chipset. So I can comfortably recommend you build your own PC, as long as you do your homework first! Make sure you read the rest of the guide before you start choosing a new motherboard!

  • Chapter 8. Inside and around the CPU

  • In this and the following chapters, I will focus on a detailed look at the CPU. One of the goals is to help you understand why manufacturers keep releasing new and more powerful processors. In order to explain that, we will have to go through what will at times be a quite detailed analysis of the CPU’s inner workings.
    Some of the chapters will probably be fairly hard to understand; I have spent a lot of time myself on my “research”, but I hope that what I present in these chapters will shed some light on these topics.
    Naturally, I will spend most of my time on the latest processors (the Athlon XP and Pentium 4). But we need to examine their internal architectures in light of the older CPU architectures, if we want to understand them properly. For this reason I will continually make comparisons across the various generations of CPU’s.
    I will now take you on a trip inside the CPU. We will start by looking at how companies like Intel and AMD can continue to develop faster processors.
  • Two ways to greater speed

  • Of course faster CPU’s are developed as a result of hard work and lots of research. But there are two quite different directions in this work:
  • More power and speed in the CPU, for example, from higher clock frequencies.
  • Better exploitation of existing processor power.
    Both approaches are used. It is a well-known fact that bottlenecks of various types drain the CPU of up to 75 % of its power. So if these can be removed or reduced, the PC can become significantly faster without having to raise the clock frequency dramatically.
    It’s just that it is very complicated to remove, for example, the bottleneck surrounding the front side bus, which I will show you later. So the manufacturers are forced to continue to raise the working rate (clock frequency), and hence to develop new process technology, so that CPU’s with more power can come onto the market.
  • Clock frequencies

  • If we look at a CPU, the first thing we notice is the clock frequency. All CPU’s have a working speed, which is regulated by a tiny crystal.
    The crystal is constantly vibrating at a very large number of “beats” per second. For each clock tick, an impulse is sent to the CPU, and each pulse can, in principle, cause the CPU to perform one (or more) actions.
    Fig.  54. The CPU’s working speed is regulated by a crystal which “oscillates” millions of times each second.
    The number of clock ticks per second is measured in Hertz. Since the CPU’s crystal vibrates millions of times each second, the clock speed is measured in millions of oscillations (megahertz or MHz). Modern CPU’s actually have clock speeds running into billions of ticks per second, so we have started having to use gigahertz (GHz).
    These are unbelievable speeds. See for yourself how short the period of time is between individual clock ticks at these frequencies. We are talking about billionths of a second:
    Clock frequency    Time period per clock tick
    133 MHz            0.000 000 007 5 seconds
    1200 MHz           0.000 000 000 83 seconds
    2 GHz              0.000 000 000 50 seconds
    Fig.  55. The CPU works at an incredible speed.
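    The periods in the table are simply the reciprocal of the clock frequency, which you can verify in Python (the table above rounds the figures):

```python
# Time per clock tick = 1 / clock frequency.
for label, hz in (("133 MHz", 133e6), ("1200 MHz", 1.2e9), ("2 GHz", 2e9)):
    print(f"{label}: {1 / hz:.12f} seconds per tick")
```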
    The trend is towards ever increasing clock frequencies. Let’s take a closer look at how this is possible.
  • More transistors

  • New types of processors are constantly being developed, for which the clock frequency keeps getting pushed up a notch. The original PC from 1981 worked at a modest 4.77 MHz, whereas the clock frequency 20 years later was up to 2 GHz.
    In Fig. 56 you can see an overview of the last 20 years of development in this area. The table shows the seven generations of Intel processors which have brought about the PC revolution. The latest version of the Pentium 4 is known under the code name Prescott.
    Gen.   CPU            Year introduced   Clock frequency   No. of transistors
    1      8088           1979              4.77-8 MHz        29,000
    2      80286          1982              6-12.5 MHz        134,000
    3      80386          1985              16-33 MHz         275,000
    4      80486          1989              25-100 MHz        1,200,000
    5      Pentium        1993              60-200 MHz        3,100,000
           Pentium MMX    1997              166-300 MHz       4,500,000
    6      Pentium Pro    1995              150-200 MHz       5,500,000
           Pentium II     1997              233-450 MHz       7,500,000
           Pentium III    1999              450-1200 MHz      28,000,000
    7      Pentium 4      2000              1400-2200 MHz     42,000,000
           Pentium 4      2002              2200-2800 MHz     55,000,000
           Pentium 4      2003              2600-3200 MHz     55,000,000
           Prescott       2004              2800-3600 MHz     125,000,000
    Fig.  56. Seven generations of CPU’s from Intel. The number of transistors in the Pentium III and 4 includes the L2 cache.
    Each processor has been on the market for several years, during which time the clock frequency has increased. Some of the processors were later released in improved versions with higher clock frequencies. I haven’t included the Celeron processor in the overview; Celerons are special discount versions of the Pentium II, III, and 4 processors.
    Anyone can see that there has been an unbelievable development. Modern CPU’s are one thousand times more powerful than the very first ones.
    In order for the industry to be able to develop faster CPU’s each year, new manufacturing methods are required. More and more transistors have to be squeezed into smaller and smaller chips.
    Fig.  57.
    A photograph from one of Intel’s factories, in which a technician displays the Pentium 4 processor core. It is a tiny piece of silicon which contains 42 million transistors.


  • Chapter 9. Moore’s Law

  • This development was actually described many years ago, in what we call Moore’s Law.
    Right back in 1965, Gordon Moore predicted (in the journal Electronics) that the number of transistors in processors (and hence their speed) would double every 18 months.
    Moore expected that this regularity would apply at least up until 1975. But he was too cautious; we can see that the development continues to follow Moore’s Law today, as is shown in Fig. 59.
    Fig.  58. In 1968, Gordon Moore helped found Intel.
    If we try to look ahead in time, we can work out that in 2010 we should have processors containing 3 billion transistors. And with what clock frequencies? You’ll have to guess that for yourself.
    Fig.  59. Moore’s Law (from Intel’s website).
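    The 2010 projection can be reproduced from the doubling rule. A small sketch, using the Pentium 4’s 42 million transistors (year 2000, from Fig. 56) as my chosen starting point:

```python
# Moore's Law: the transistor count doubles roughly every 18 months.
transistors_2000 = 42e6            # Pentium 4, year 2000 (Fig. 56)
doublings = (2010 - 2000) / 1.5    # one doubling per 18 months
print(f"{transistors_2000 * 2 ** doublings:,.0f}")
# ~4.3 billion - the same order of magnitude as the estimate above
```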
  • Process technology

  • The many millions of transistors inside the CPU are made of, and connected by, ultra thin electronic tracks. By making these electronic tracks even narrower, even more transistors can be squeezed into a small slice of silicon.
    The width of these electronic tracks is measured in microns (or micrometers), which are millionths of a metre.
    For each new CPU generation, the track width is reduced, based on new technologies which the chip manufacturers keep developing. At the time of writing, CPU’s are being produced with a track width of 0.13 microns, and this will be reduced to 0.09 and 0.06 microns in the next generations.
    Fig.  60. CPU’s are produced in extremely high-technology environments (“clean rooms”). Photo courtesy of AMD.
    In earlier generations, aluminium was used for the current-carrying tracks in the chips. With the change to 0.18 and 0.13-micron technology, aluminium began to be replaced with copper. Copper is cheaper, and it carries current better than aluminium. It had previously been impossible to insulate the copper tracks from the surrounding silicon, but IBM solved this problem in the late 1990’s.
    AMD became the first manufacturer to mass-produce CPU’s with copper tracks, in their chip factory Fab 30 in Dresden, Germany. A new generation of chips requires new chip factories (fabs) to produce it, and these cost billions of dollars to build. That’s why manufacturers like a few years to pass between each successive generation. The old factories have to have time to pay for themselves before new ones start to be used.
    Fig.  61. AMD’s Fab 30 in Dresden, which was the first factory to mass-produce copper-based CPU’s.
  • A grand new world …

  • We can expect a number of new CPU’s in this decade, all produced in the same way as they are now – just with smaller track widths. But there is no doubt that we are nearing the physical limits for how small the transistors produced using the existing technology can be. So intense research is underway to find new materials, and it appears that nanotransistors, produced using organic (carbon-based) semiconductors, could take over the baton from the existing process technology.
    Bell Labs in the USA has produced nanotransistors with widths of just one molecule. It is claimed that this process can be used to produce both CPU’s and RAM circuits up to 1000 times smaller than what we have today!
  • Less power consumption

  • The types of CPU’s we have today use a fairly large amount of electricity when the PC is turned on and is processing data. The processor, as you know, is installed in the motherboard, from which it receives power. There are actually two different voltage levels, which are both supplied by the motherboard:
  • One voltage level which powers the CPU core (kernel voltage).
  • Another voltage level which powers the CPU’s I/O ports, which is typically 3.3 volts.
    As the track width is reduced, more transistors can be placed within the same area, and hence the voltage can be reduced.
    As a consequence of the narrower process technology, the kernel voltage has been reduced from 3 volts to about 1 volt in recent years. This leads to lower power consumption per transistor. But since the number of transistors increases by a corresponding amount in each new CPU generation, the end result is often that the total power consumption is unchanged.
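    The underlying relationship is the standard approximation for dynamic power in CMOS logic, P ≈ C · V² · f (not a formula given in this guide, but it explains the numbers). Because the voltage enters squared, lowering it pays off disproportionately:

```python
# Standard CMOS approximation: dynamic power ~ capacitance * voltage^2 * frequency.
# The constants below are illustrative, not measured values.
def dynamic_power(capacitance, voltage, frequency):
    return capacitance * voltage ** 2 * frequency

p_3v = dynamic_power(1.0, 3.0, 1.0)
p_1v = dynamic_power(1.0, 1.0, 1.0)
print(p_3v / p_1v)   # 9.0 - dropping from 3 V to 1 V saves ~9x per transistor
```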
    Fig.  62. A powerful fan. Modern CPU’s require something like this.
    It is very important to cool the processor; a CPU can easily burn 50-120 Watts. This produces a fair amount of heat in a very small area, so without the right cooling fan and motherboard design, a Gigahertz processor could quickly burn out.
Modern processors contain a thermal diode which can raise the alarm if the CPU gets too hot. If the motherboard and BIOS are designed to pay attention to the diode’s signal, the processor can be shut down temporarily so that it can cool down.
Fig.  63. The temperatures on the motherboard are constantly reported to this program.
    Cooling is a whole science in itself. Many “nerds” try to push CPU’s to work at higher clock speeds than they are designed for. This is often possible, but it requires very good cooling – and hence often huge cooling units.
  • 30 years development

  • Higher processor speeds require more transistors and narrower electronic tracks in the silicon chip. In the overview in Fig. 64 you can see the course of developments in this area.
Note that the 4004 processor was never used for PC’s. The 4004 was Intel’s first commercial product in 1971, and it laid the foundation for all their later CPU’s. It was a 4-bit processor which worked at 108 KHz (0.1 MHz) and contained 2,250 transistors. It was used in the first pocket calculators, which I can personally remember from around 1973-74 when I was at high school. No-one could have predicted that the device which replaced the slide rule could develop, in just 30 years, into a Pentium 4 based super PC.
    If, for example, the development in automobile technology had been just as fast, we would today be able to drive from Copenhagen to Paris in just 2.8 seconds!
Year        Intel CPU            Technology (track width)
1971        4004                 10 microns
1979        8088                 3 microns
1982        80286                1.5 microns
1985        80386                1 micron
1989        80486                1.0/0.8 microns
1993        Pentium              0.8/0.5/0.35 microns
1997        Pentium II           0.28/0.25 microns
1999        Pentium III          0.25/0.18/0.13 microns
2000-2003   Pentium 4            0.18/0.13 microns
2004-2005   Pentium 4 Prescott   0.09 microns
    Fig.  64. The high clock frequencies are the result of new process technology with smaller electronic ”tracks”.
A conductor which is 0.09 microns (90 nanometres) wide is about 1,150 times thinner than a normal human hair. These are tiny things we are talking about here.
  • Wafers and die size

  • Another CPU measurement is its die size. This is the size of the actual silicon sheet containing all the transistors (the tiny area in the middle of Fig. 33 on page 15).
    At the chip factories, the CPU cores are produced in so-called wafers. These are round silicon sheets which typically contain 150-200 processor cores (dies).
    The smaller one can make each die, the more economical production can become. A big die is also normally associated with greater power consumption and hence also requires cooling with a powerful fan (e.g. see Fig. 63 on page 25 and Fig. 124 on page 50).
Fig.  65. A technician from Intel holding a wafer. This slice of silicon contains hundreds of tiny processor cores, which end up as CPU’s in everyday PC’s.
    You can see the measurements for a number of CPU’s below. Note the difference, for example, between a Pentium and a Pentium II. The latter is much smaller, and yet still contains nearly 2½ times as many transistors. Every reduction in die size is welcome, since the smaller this is, the more processors can fit on a wafer. And that makes production cheaper.
CPU            Track width (microns)   Die size   Number of transistors
Pentium        0.80                    294 mm²    3.1 mil.
Pentium MMX    0.28                    140 mm²    4.5 mil.
Pentium II     0.25                    131 mm²    7.5 mil.
Athlon         0.25                    184 mm²    22 mil.
Pentium III    0.18                    106 mm²    28 mil.
Pentium III    0.13                    80 mm²     28 mil.
Athlon XP      0.18                    128 mm²    38 mil.
Pentium 4      0.18                    217 mm²    42 mil.
Pentium 4      0.13                    145 mm²    55 mil.
Athlon XP+     0.13                    115 mm²    54 mil.
Athlon 64 FX   0.13                    193 mm²    106 mil.
Pentium 4      0.09                    112 mm²    125 mil.
    Fig.  66. The smaller the area of each processor core, the more economical chip production can be.
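A quick calculation (assuming a 200 mm wafer and ignoring the unusable part-dies along the wafer’s edge) shows why this matters: such a wafer has an area of about 3.14 x 100 x 100 ≈ 31,400 mm². At 294 mm² per die (the original Pentium) there is room for at most around 106 dies, while at 145 mm² (the 0.13 micron Pentium 4) the same wafer yields roughly 216 – about twice as many processors for broadly the same production cost.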
  • The modern CPU generations

  • As mentioned earlier, the various CPU’s are divided into generations (see also Fig. 56 on page 23).
At the time of writing, we have started on the eighth generation. Below you can see the latest processors from Intel and AMD, divided into these generations. The transitions can be a bit hazy. For example, I’m not sure whether AMD’s K6 belongs to the 5th or the 6th generation. But as a whole, the picture is as follows:
Generation   CPU’s
5th          Pentium, Pentium MMX, K5, K6
6th          Pentium Pro, K6-II, Pentium II, K6-3, Athlon, Pentium III
7th          Pentium 4, Athlon XP
8th          Athlon 64 FX, Pentium 5
    Fig.  67. The latest generations of CPU’s.



  • Chapter 10. The cache

  • In the previous chapter, I described two aspects of the ongoing development of new CPU’s – increased clock frequencies and the increasing number of transistors being used. Now it is time to look at a very different yet related technology – the processor’s connection to the RAM, and the use of the L1 and L2 caches.
  • Speed conflict

  • The CPU works internally at very high clock frequencies (like 3200 MHz), and no RAM can keep up with these.
    The most common RAM speeds are between 266 and 533 MHz. And these are just a fraction of the CPU’s working speed. So there is a great chasm between the machine (the CPU) which slaves away at perhaps 3200 MHz, and the “conveyor belt”, which might only work at 333 MHz, and which has to ship the data to and from the RAM. These two subsystems are simply poorly matched to each other.
If nothing could be done about this problem, there would be no reason to develop faster CPU’s. If the CPU had to wait for a bus which worked at one sixth of its speed, the CPU would be idle five sixths of the time. And that would be pure waste.
    The solution is to insert small, intermediate stores of high-speed RAM. These buffers (cache RAM) provide a much more efficient transition between the fast CPU and the slow RAM. Cache RAM operates at higher clock frequencies than normal RAM. Data can therefore be read more quickly from the cache.
  • Data is constantly being moved

  • The cache delivers its data to the CPU registers. These are tiny storage units which are placed right inside the processor core, and they are the absolute fastest RAM there is. The size and number of the registers is designed very specifically for each type of CPU.
    Fig.  68. Cache RAM is much faster than normal RAM.
The CPU can move data in different sized packets, such as bytes (8 bits), words (16 bits), dwords (32 bits) or blocks (larger groups of bits), and this often involves the registers. The different data packets are constantly moving back and forth:
  • from the CPU registers to the Level 1 cache.
  • from the L1 cache to the registers.
  • from one register to another
  • from L1 cache to L2 cache, and so on…
    The cache stores are a central bridge between the RAM and the registers which exchange data with the processor’s execution units.
The optimal situation would be for the CPU to work constantly, fully utilizing all clock ticks. This would mean that the registers would always be able to fetch the data which the execution units require. But this is not the reality, as the CPU typically only utilizes 35% of its clock ticks. However, without a cache, this utilization would be even lower.
  • Bottlenecks

• CPU caches are a remedy against a very specific set of “bottleneck” problems. There are lots of “bottlenecks” in the PC – transitions between fast and slower systems, where the fast device has to wait before it can deliver or receive its data. These bottlenecks can have a very detrimental effect on the PC’s total performance, so they must be minimised.
    Fig.  69. A cache increases the CPU’s capacity to fetch the right data from RAM.
    The absolute worst bottleneck exists between the CPU and RAM. It is here that we have the heaviest data traffic, and it is in this area that PC manufacturers are expending a lot of energy on new development. Every new generation of CPU brings improvements relating to the front side bus.
    The CPU’s cache is “intelligent”, so that it can reduce the data traffic on the front side bus. The cache controller constantly monitors the CPU’s work, and always tries to read in precisely the data the CPU needs. When it is successful, this is called a cache hit. When the cache does not contain the desired data, this is called a cache miss.
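As an illustration of the hit/miss principle, here is a minimal sketch in C of a direct-mapped cache lookup. The sizes, and the direct-mapped organisation itself, are simplifying assumptions for this example – real CPU caches are set-associative and implemented entirely in hardware:

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define LINE_SIZE 64    /* bytes per cache line (assumed)           */
    #define NUM_LINES 512   /* 512 x 64 bytes = a 32 KB cache (assumed) */

    struct cache_line {
        bool     valid;
        uint32_t tag;
    };

    static struct cache_line cache[NUM_LINES];

    /* Returns true on a cache hit; on a miss the line is filled,
       which on a real CPU means a slow trip to L2 or to RAM. */
    bool cache_lookup(uint32_t address)
    {
        uint32_t line = (address / LINE_SIZE) % NUM_LINES;  /* which slot  */
        uint32_t tag  = address / (LINE_SIZE * NUM_LINES);  /* which block */

        if (cache[line].valid && cache[line].tag == tag)
            return true;                 /* cache hit */

        cache[line].valid = true;        /* cache miss: fill the line */
        cache[line].tag   = tag;
        return false;
    }

    int main(void)
    {
        printf("%s\n", cache_lookup(0x1234) ? "hit" : "miss");  /* miss (cold) */
        printf("%s\n", cache_lookup(0x1234) ? "hit" : "miss");  /* hit         */
        return 0;
    }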
  • Two levels of cache

• The idea behind cache is that it should function as a “near store” of fast RAM – a store which the CPU can always be supplied from.
    Fig.  70. The cache system tries to ensure that relevant data is constantly being fetched from RAM, so that the CPU (ideally) never has to wait for data.
  • L1 cache

  • Level 1 cache is built into the actual processor core. It is a piece of RAM, typically 8, 16, 20, 32, 64 or 128 Kbytes, which operates at the same clock frequency as the rest of the CPU. Thus you could say the L1 cache is part of the processor.
    L1 cache is normally divided into two sections, one for data and one for instructions. For example, an Athlon processor may have a 32 KB data cache and a 32 KB instruction cache. If the cache is common for both data and instructions, it is called a unified cache.

  • Chapter 11. The L2 cache

  • The level 2 cache is normally much bigger (and unified), such as 256, 512 or 1024 KB. The purpose of the L2 cache is to constantly read in slightly larger quantities of data from RAM, so that these are available to the L1 cache.
    In the earlier processor generations, the L2 cache was placed outside the chip: either on the motherboard (as in the original Pentium processors), or on a special module together with the CPU (as in the first Pentium II’s).
    Fig.  71. An old Pentium II module. The CPU is mounted on a rectangular printed circuit board, together with the L2 cache, which is two chips here. The whole module is installed in a socket on the motherboard. But this design is no longer used.
    As process technology has developed, it has become possible to make room for the L2 cache inside the actual processor chip. Thus the L2 cache has been integrated and that makes it function much better in relation to the L1 cache and the processor core.
    The L2 cache is not as fast as the L1 cache, but it is still much faster than normal RAM.
CPU                                  L2 cache
Pentium, K5, K6                      External, on the motherboard
Pentium Pro                          Internal, in the CPU
Pentium II, Athlon                   External, in a module close to the CPU
Celeron (1st generation)             None
Celeron (later gen.), Pentium III,
Athlon XP, Duron, Pentium 4          Internal, in the CPU
    Fig.  72. It has only been during the last few CPU generations that the level 2 cache has found its place, integrated into the actual CPU.
Traditionally the L2 cache is connected to the front side bus, through which it connects to the chipset’s north bridge and the RAM:
    Fig.  73. The way the processor uses the L1 and L2 cache has crucial significance for its utilisation of the high clock frequencies.
    The level 2 cache takes up a lot of the chip’s die, as millions of transistors are needed to make a large cache. The integrated cache is made using SRAM (static RAM), as opposed to normal RAM which is dynamic (DRAM).
  • Powerful bus

  • The bus between the L1 and L2 cache is presumably THE place in the processor architecture which has the greatest need for high bandwidth. We can calculate the theoretical maximum bandwidth by multiplying the bus width by the clock frequency. Here are some examples:
CPU                 Bus width   Clock frequency   Theoretical bandwidth
Intel Pentium III   64 bits     1400 MHz          11.2 GB/sec.
AMD Athlon XP+      64 bits     2167 MHz          17.3 GB/sec.
AMD Athlon 64       64 bits     2200 MHz          17.6 GB/sec.
AMD Athlon 64 FX    128 bits    2200 MHz          35.2 GB/sec.
Intel Pentium 4     256 bits    3200 MHz          102 GB/sec.
    Fig.  74. Theoretical calculations of the bandwidth between the L1 and L2 cache.
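As a check of one row: the Pentium 4’s 256-bit bus is 256 / 8 = 32 bytes wide. At 3200 MHz it can therefore, in theory, move 32 bytes x 3,200 million ticks per second ≈ 102 GB/sec. The real, sustained throughput is lower, since the bus is not transferring data on every single tick.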
  • Different systems

• There are a number of different ways of using caches. Both Intel and AMD have saved on L2 cache in some series, in order to make cheaper products. But there is no doubt that the better the cache – both L1 and L2 – the more efficient the CPU will be, and the higher its performance.
    AMD have settled on a fairly large L1 cache of 128 KB, while Intel continue to use relatively small (but efficient) L1 caches.
    On the other hand, Intel uses a 256 bit wide bus on the “inside edge” of the L2 cache in the Pentium 4, while AMD only has a 64-bit bus (see Fig. 74).
    Fig. 75. Competing CPU’s with very different designs.
    AMD uses exclusive caches in all their CPU’s. That means that the same data can’t be present in both caches at the same time, and that is a clear advantage. It’s not like that at Intel.
CPU                           L1 cache   L2 cache
Athlon XP                     128 KB     256 KB
Athlon XP+                    128 KB     512 KB
Pentium 4 (I)                 20 KB      256 KB
Pentium 4 (II, “Northwood”)   20 KB      512 KB
Athlon 64                     128 KB     512 KB
Athlon 64 FX                  128 KB     1024 KB
Pentium 4 (III, “Prescott”)   28 KB      1024 KB
    Fig.  76. The most common processors and their caches.
  • Latency

• A very important aspect of all RAM – cache included – is latency. All RAM storage has a certain latency, which means that a certain number of clock ticks (cycles) must pass between, for example, two reads. L1 cache has lower latency than L2, which is why it is so efficient.
When the cache is bypassed to read directly from RAM, the latency is many times greater. In Fig. 77 the number of wasted clock ticks is shown for various CPU’s. Note that when the processor core has to fetch data from the actual RAM (when both L1 and L2 have missed), it costs around 150 clock ticks. This situation is called stalling and needs to be avoided.
    Note that the Pentium 4 has a much smaller L1 cache than the Athlon XP, but it is significantly faster. It simply takes fewer clock ticks (cycles) to fetch data:
Latency    Pentium II   Athlon     Pentium 4
L1 cache   3 cycles     3 cycles   2 cycles
L2 cache   18 cycles    6 cycles   5 cycles
    Fig.  77. Latency leads to wasted clock ticks; the fewer there are of these, the faster the processor will appear to be.
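The cost of latency can be estimated with a simple weighted average. The sketch below uses the Pentium 4 figures from Fig. 77 plus the roughly 150 ticks for a RAM access mentioned above; the hit rates are assumptions for illustration (the 97% L1 hit rate lies within the 96-98% range quoted in the conclusion):

    #include <stdio.h>

    int main(void)
    {
        double l1_latency  = 2.0;    /* cycles, Fig. 77                    */
        double l2_latency  = 5.0;    /* cycles, Fig. 77                    */
        double ram_latency = 150.0;  /* cycles, rough figure from the text */
        double l1_hit      = 0.97;   /* assumed                            */
        double l2_hit      = 0.95;   /* assumed                            */

        double average = l1_hit * l1_latency
                       + (1.0 - l1_hit) * (l2_hit * l2_latency
                       + (1.0 - l2_hit) * ram_latency);

        printf("Average access time: %.1f cycles\n", average);  /* ~2.3 */
        return 0;
    }

With these assumptions the average access costs about 2.3 cycles – compared to something like 150 cycles if every access had to go all the way to RAM.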
  • Intelligent ”data prefetch”

  • In CPU’s like the Pentium 4 and Athlon XP, a handful of support mechanisms are also used which work in parallel with the cache. These include:
A hardware auto data prefetch unit, which attempts to guess which data should be read into the cache. This device monitors the instructions being processed and predicts what data the next job will need.
    Related to this is the Translation Look-aside Buffer, which is also a kind of cache. It contains information which constantly supports the supply of data to the L1 cache, and this buffer is also being optimised in new processor designs. Both systems contribute to improved exploitation of the limited bandwidth in the memory system.
    Fig.  78. The WCPUID program reports on cache in an Athlon processor.
  • Conclusion

• L1 and L2 cache are important components in modern processor design. The cache is crucial for the utilisation of the high clock frequencies which modern process technology allows. Modern L1 caches are extremely effective: in about 96-98% of cases, the processor can find the data and instructions it needs in the cache. In the future, we can expect to keep seeing CPU’s with larger L2 caches and more advanced memory management, as this is the way forward if we want to achieve more effective utilisation of the CPU’s clock ticks. Here is a concrete example:
    In January 2002 Intel released a new version of their top processor, the Pentium 4 (with the codename, “Northwood”). The clock frequency had been increased by 10%, so one might expect a 10% improvement in performance. But because the integrated L2 cache was also doubled from 256 to 512 KB, the gain was found to be all of 30%.
CPU                             L2 cache   Clock freq.   Improvement
Intel Pentium 4 (0.18 micron)   256 KB     2000 MHz      –
Intel Pentium 4 (0.13 micron)   512 KB     2200 MHz      +30%
    Fig.  79. Because of the larger L2 cache, performance increased significantly.
In 2002 AMD updated the Athlon processor with the new “Barton” core. Here the L2 cache was also doubled from 256 to 512 KB in some models. In 2004 Intel introduced the “Prescott” core with 1024 KB of L2 cache, which is the same size as in AMD’s Athlon 64 processors. Some Extreme Editions of the Pentium 4 even use 2 MB of L2 cache.
  • Xeon for servers

  • Intel produces special server models of their Pentium III and Pentium 4 processors. These are called Xeon, and are characterised by very large L2 caches. In an Intel Xeon the 2 MB L2 cache uses 149,000,000 transistors.
    Xeon processors are incredibly expensive (about Euro 4,000 for the top models), so they have never achieved widespread distribution.
    They are used in high-end servers, in which the CPU only accounts for a small part of the total price.

There is also Intel’s 64-bit server CPU, the Itanium. This processor is supplied in modules which include 4 MB of L3 cache, made up of 300 million transistors.
  • Multiprocessors

  • Several Xeon processors can be installed on the same motherboard, using special chipsets. By connecting 2, 4 or even 8 processors together, you can build a very powerful computer.
    These MP (Multiprocessor) machines are typically used as servers, but can also be used as powerful workstations, for example, to perform demanding 3D graphics and animation tasks. AMD has the Opteron processors, which are server-versions of the Athlon 64. Not all software can make use of the PC’s extra processors; the programs have to be designed to do so. For example, there are professional versions of Windows NT, 2000 and XP, which support the use of several processors in one PC.


See also the discussion of Hyper-Threading, which allows a Pentium 4 processor to appear as an MP system. Both Intel and AMD are also working on dual-core processors.

  • Chapter 12. Data and instructions

  • Now it’s time to look more closely at the work of the CPU. After all, what does it actually do?
  • Instructions and data

• Our CPU processes instructions and data. It receives orders from the software. The CPU is fed a steady stream of binary data via the RAM.
    These instructions can also be called program code. They include the commands which you constantly – via user programs – send to your PC using your keyboard and mouse. Commands to print, save, open, etc.
Data is typically user data. Think about that email you are writing. The actual contents (the text, the letters) is user data. But when you and your software say “send”, you are sending program code (instructions) to the processor:
    Fig.  80. The instructions process the user data.
  • Instructions and compatibility

  • Instructions are binary code which the CPU can understand. Binary code (machine code) is the mechanism by which PC programs communicate with the processor.
    All processors, whether they are in PC’s or other types of computers, work with a particular instruction set. These instructions are the language that the CPU understands, and thus all programs have to communicate using these instructions. Here is a simplified example of some “machine code” – instructions written in the language the processor understands:
proc near
mov AX,01    ; load the value 1 into register AX
mov BX,01    ; load the value 1 into register BX
inc AX       ; increase AX by one (AX is now 2)
add BX,AX    ; add AX to BX (BX is now 3)
    You can no doubt see that it wouldn’t be much fun to have to use these kinds of instructions in order to write a program. That is why people use programming tools. Programs are written in a programming language (like Visual Basic or C++). But these program lines have to be translated into machine code, they have to be compiled, before they can run on a PC. The compiled program file contains instructions which can be understood by the particular processor (or processor family) the program has been “coded” for:
    Fig.  81. The program code produced has to match the CPU’s instruction set. Otherwise it cannot be run.
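To make the translation concrete, here is a hedged sketch (the variable names and instructions are chosen for illustration – real compiler output varies): a single line in a language like C, sum = a + b; could be compiled into machine code along these lines:

    mov AX, [a]     ; fetch the variable a into register AX
    add AX, [b]     ; add the variable b to it
    mov [sum], AX   ; store the result in the variable sum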
    The processors from AMD and Intel which we have been focusing on in this guide, are compatible, in that they understand the same instructions.
    There can be big differences in the way two processors, such as the Pentium and Pentium 4, process the instructions internally. But externally – from the programmer’s perspective – they all basically function the same way. All the processors in the PC family (regardless of manufacturer) can execute the same instructions and hence the same programs.
    And that’s precisely the advantage of the PC: Regardless of which PC you have, it can run the Windows programs you want to use.
    Fig.  82. The x86 instruction set is common to all PC’s.
As the years have passed, changes have been made in the instruction set along the way. A PC with a Pentium 4 processor from 2002 can handle very different applications to those which an IBM XT with an 8088 processor from 1985 can. But on the other hand, you can expect all the programs which could run on the 8088 to still run on a Pentium 4 and on an Athlon 64. The software is backwards compatible.
The entire software industry built up around the PC is based on the common x86 instruction set, which goes back to the earliest PC’s. Extensions have been made, but the original instruction set from 1979 is still being used.
  • x86 and CISC

  • People sometimes differentiate between RISC and CISC based CPU’s. The (x86) instruction set of the original Intel 8086 processor is of the CISC type, which stands for Complex Instruction Set Computer.
That means that the instructions are quite diverse and complex. The individual instructions vary in length from 8 to 120 bits. The instruction set was designed for the 8086 processor, which had just 29,000 transistors. The opposite of CISC is RISC.
    RISC stands for Reduced Instruction Set Computer, which is fundamentally a completely different type of instruction set to CISC. RISC instructions can all have the same length (e.g. 32 bits). They can therefore be executed much faster than CISC instructions. Modern CPU’s like the AthlonXP and Pentium 4 are based on a mixture of RISC and CISC.
    Fig.  83. PC’s running Windows still work with the old fashioned CISC instructions.
In order to maintain compatibility with the older DOS/Windows programs, the later CPU’s still understand CISC instructions. They are just converted to shorter, more RISC-like sub-operations (called micro-ops) before being executed. Most CISC instructions can be converted into 2-3 micro-ops.
    Fig.  84. The CISC instructions are decoded before being executed in a modern processor. This preserves compatibility with older software.
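As a schematic example (the real micro-op encoding is internal to each processor and not published in this form), one CISC instruction that adds a register to a value in memory might be split into three RISC-like micro-ops:

    add [total], AX        ; one complex CISC instruction

    ; ...is decoded into roughly:

    load  tmp, [total]     ; micro-op 1: fetch the value from memory
    add   tmp, AX          ; micro-op 2: perform the addition
    store [total], tmp     ; micro-op 3: write the result back

This matches the 2-3 micro-ops per CISC instruction mentioned above.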
  • Extensions to the instruction set

  • For each new generation of CPU’s, the original instruction set has been extended. The 80386 processor added 26 new instructions, the 80486 added six, and the Pentium added eight new instructions.
    At the same time, execution of the instructions was made more efficient. For example, it took an 80386 processor six clock ticks to add one number to a running summation. This task could be done in the 80486 (see page 40), in just two clock ticks, due to more efficient decoding of the instructions.
    These changes have meant that certain programs require at least a 386 or a Pentium processor in order to run. This is true, for example, of all Windows programs. Since then, the MMX and SSE extensions have followed, which are completely new instruction sets which will be discussed later in the guide. They can make certain parts of program execution much more efficient.
Another innovation is the 64-bit extension, which both AMD and Intel use in their top processors. Normally the PC operates in 32-bit mode, but one way to improve performance is to use a 64-bit mode. This requires new software, which is not widely available yet.
• Inside the CPU

  • Instructions have to be decoded, and not least, executed, in the CPU. I won’t go into details on this subject; it is much too complicated. But I will describe a few factors which relate to the execution of instructions. My description has been extremely simplified, but it is relevant to the understanding of the microprocessor. This chapter is probably the most complicated one in the guide – you have been warned! It’s about:
  • Pipelines
  • Execution units
    If we continue to focus on speeding up the processor’s work, this optimisation must also apply to the instructions – the quicker we can shove them through the processor, the more work it can get done.
  • Pipelines

  • As mentioned before, instructions are sent from the software and are broken down into micro-ops (smaller sub-operations) in the CPU. This decomposition and execution takes place in a pipeline.
    The pipeline is like a reverse assembly line. The CPU’s instructions are broken apart (decoded) at the start of the pipeline. They are converted into small sub-operations (micro-ops), which can then be processed one at a time in the rest of the pipeline:
    Fig.  85. First the CISC instructions are decoded and converted into more digestible micro instructions. Then these are processed. It all takes place in the pipeline.
The pipeline is made up of a number of stages. Older processors have only a few stages, while the newer ones have many (from 10 to 31). At each stage “something” is done with the instruction, and each stage requires one clock tick from the processor.
    Fig.  86. The pipeline is an assembly line (shown here with 9 stages), where each clock tick leads to the execution of a sub-instruction.
    Modern CPU’s have more than one pipeline, and can thus process several instructions at the same time. For example, the Pentium 4 and AthlonXP can decode about 2.5 instructions per clock tick.
    The first Pentium 4 has several very long pipelines, allowing the processor to hold up to 126 instructions in total, which are all being processed at the same time, but at different stages of execution (see Fig. 88). It is thus possible to get the CPU to perform more work by letting several pipelines work in parallel:
    Fig.  87. Having two pipelines allows twice as many instructions to be executed within the same number of clock ticks.
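A small calculation shows the strength of the assembly line principle: with the 9-stage pipeline of Fig. 86, the very first instruction needs 9 clock ticks to travel all the way through. But once the pipeline is full, one instruction (ideally) comes out every tick. Executing 1,000,000 instructions thus takes about 1,000,008 ticks rather than 9,000,000 – and a second, parallel pipeline could in the best case halve that again.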

CPU                              Instructions executed at the same time
AMD K6-II                        24
Intel Pentium III                40
AMD Athlon                       72
Intel Pentium 4 (first gen.)     126
    Fig.  88. By making use of more, and longer, pipelines, processors can execute more instructions at the same time.
  • The problems of having more pipelines

• One might imagine that the engineers at Intel and AMD could just put even more parallel pipelines into one CPU. Perhaps performance could be doubled? Unfortunately it is not that easy.
    It is not possible to feed a large number of pipelines with data. The memory system is just not powerful enough. Even with the existing pipelines, a fairly large number of clock ticks are wasted. The processor core is simply not utilised efficiently enough, because data cannot be brought to it quickly enough.
    Another problem of having several pipelines arises when the processor can decode several instructions in parallel – each in its own pipeline. It is impossible to avoid the wrong instruction occasionally being read in (out of sequence). This is called misprediction and results in a number of wasted clock ticks, since another instruction has to be fetched and run through the “assembly line”.
    Intel has tried to tackle this problem using a Branch Prediction Unit, which constantly attempts to guess the correct instruction sequence.
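The effect of prediction can be glimpsed from ordinary software. In the little C program below (my own illustration, not Intel’s), the branch is taken in a random pattern which is hard for a Branch Prediction Unit to guess; if data[] were sorted first, the pattern would become regular, mispredictions would drop, and on most CPU’s the loop would run measurably faster:

    #include <stdio.h>
    #include <stdlib.h>

    #define N 1000000

    int main(void)
    {
        static int data[N];
        long sum = 0;

        for (int i = 0; i < N; i++)
            data[i] = rand() % 256;      /* random values 0-255 */

        for (int i = 0; i < N; i++)
            if (data[i] >= 128)          /* unpredictable on random data */
                sum += data[i];

        printf("%ld\n", sum);
        return 0;
    }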
  • Length of the pipe

  • The number of “stations” (stages) in the pipeline varies from processor to processor. For example, in the Pentium II and III there are 10 stages, while there are up to 31 in the Pentium 4.
    In the Athlon, the ALU pipelines have 10 stages, while the FPU/MMX/SSE pipelines have 15.
    The longer the pipeline, the higher the processor’s clock frequency can be. This is because in the longer pipelines, the instructions are cut into more (and hence smaller) sub-instructions which can be executed more quickly.

CPU                    Number of pipeline stages   Maximum clock frequency
Pentium                5                           300 MHz
Motorola G4            4                           500 MHz
Motorola G4e           7                           1000 MHz
Pentium II and III     12                          1400 MHz
Athlon XP              10/15                       2500 MHz
Athlon 64              12/17                       >3000 MHz
Pentium 4              20                          >3000 MHz
Pentium 4 “Prescott”   31                          >5000 MHz
    Fig.  89. Higher clock frequencies require long “assembly lines” (pipelines).
    Note that the two AMD processors have different pipeline lengths for integer and floating point instructions. One can also measure a processor’s efficiency by looking at the IPC number (Instructions Per Clock), and AMD’s Athlon XP is well ahead of the Pentium 4 in this regard. AMD’s Athlon XP processors are actually much faster than the Pentium 4’s at equivalent clock frequencies.
    The same is even more true of the Motorola G4 processors used, for example, in Macintosh computers. The G4 only has a 4-stage pipeline, and can therefore, in principle, offer the same performance as a Pentium 4, with only half the clock frequency or less. The only problem is, the clock frequency can’t be raised very much with such a short pipeline. Intel have therefore chosen to future-proof the Pentium 4 by using a very long pipeline.
  • Execution units

• What is it that actually happens in the pipeline? This is where we find the so-called execution units. And we must distinguish between two types of unit:
  • ALU (Arithmetic and Logic Unit)
  • FPU (Floating Point Unit)
If the processor has a brain, it is the ALU. It is the calculating device that does operations on whole numbers (integers). The computer’s work with ordinary text, for example, is looked after by the ALU.
The ALU is good at working with whole numbers. When it comes to decimal numbers, and especially numbers with many decimal places (real numbers, as they are called in mathematics), the ALU chokes and can take a very long time to process the operations. That is why an FPU is used to relieve the load. An FPU is a number cruncher, specially designed for floating point operations.
    There are typically several ALU’s and FPU’s in the same processor. The CPU also has other operation units, for example, the LSU (Load/Store Unit).
  • An example sequence

  • Look again at Fig. 73 on page 29. You can see that the processor core is right beside the L1 cache. Imagine that an instruction has to be processed:
  • The processor core fetches a long and complex x86 instruction from the L1 instruction cache.
  • The instruction is sent into the pipeline where it is broken down into smaller units.
  • If it is an integer operation, it is sent to an ALU, while floating point operations are sent to an FPU.
• After processing, the data is sent back to the L1 cache.
    This description applies to the working cycle in, for example, the Pentium III and Athlon. As a diagram it might look like this:
    Fig.  90. The passage of instructions through the pipeline.
    But the way the relationship between the pipeline and the execution units is designed differs greatly from processor to processor. So this entire examination should be taken as a general introduction and nothing more.
  • Pipelines in the Pentium 4

  • In the Pentium 4, the instruction cache has been placed between the “Instruction fetch/Translate” unit (in Fig. 90) and the ALU/FPU. Here the instruction cache (Execution Trace Cache) doesn’t store the actual instructions, but rather the “half-digested” micro-ops.

    Fig.  91. In the Pentium 4, the instruction cache stores decoded micro instructions.
    The actual pipeline in the Pentium 4 is longer than in other CPU’s; it has 20 stages. The disadvantage of the long pipeline is that it takes more clock ticks to get an instruction through it. 20 stages require 20 clock ticks, and that reduces the CPU’s efficiency. This was very clear when the Pentium 4 was released; all tests showed that it was much slower than other processors with the same clock frequency.
    At the same time, the cost of reading the wrong instruction (misprediction) is much greater – it takes a lot of clock ticks to fill up the long assembly line again.
    The Pentium 4’s architecture must therefore be seen from a longer-term perspective. Intel expects to be able to scale up the design to work at clock frequencies of up to 5-10 GHz. In the “Prescott” version of Pentium 4, the pipeline was increased further to 31 stages.
AMD’s 32-bit Athlon line can barely get much above a clock frequency of 2 GHz because of the short pipeline. In comparison, the Pentium 4 is almost “light years” ahead.


  • Chapter 13. FPU’s and multimedia

  • The computer is constantly performing calculations, which can be divided into two groups.
  • Whole numbers
  • Floating point numbers
    The whole number calculations are probably the most important, certainly for normal PC use – using office programs and the like. But operations involving floating point numbers have taken on greater significance in recent years, as 3D games and sound, image and video editing have become more and more a part of everyday computing. Let’s have a brief look at this subject.
  • Floating point numbers

  • The CPU has to perform a lot of calculations on decimal (or real) numbers when the PC runs 3D games and other multimedia programs.
These decimal numbers are processed in the CPU by a special unit called an FPU (Floating Point Unit). In case you are wondering (as I did) about the name, floating point – here is an explanation: Real numbers (like the number π, pi) can have an infinite number of decimal places. In order to work with these, often very large, numbers, they are converted into a special format.
    First the required level of precision is set. Since the numbers can have an infinite number of decimals, they have to be rounded. For example, one might choose to have five significant digits, as I have done in the examples below (inside the PC, one would probably choose to have many more significant digits).
    Once the precision has been set, the numbers are converted as shown below (the decimal point floats):
• The number 1,257.45 is written as 0.12575 x 10^4.
• The number 0.00696784 is written as 0.69678 x 10^-2.
    Now the FPU can manage the numbers and process them using the arithmetic operators.
Normal form   Rewritten          In the FPU
1,257.45      0.12575 x 10^4     12575  +4
0.00696784    0.69678 x 10^-2    69678  -2
    Fig.  92. Re-writing numbers in floating point format.
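The same kind of rewriting can be demonstrated in C with the standard library function frexp(), which splits a number into a mantissa and an exponent. Note that frexp() uses a base-2 exponent where Fig. 92 uses base 10, but the principle is the same:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double numbers[] = { 1257.45, 0.00696784 };

        for (int i = 0; i < 2; i++) {
            int exponent;
            double mantissa = frexp(numbers[i], &exponent);
            printf("%g = %f x 2^%d\n", numbers[i], mantissa, exponent);
        }
        return 0;
    }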
  • FPU – the number cruncher

• Floating point numbers are extremely difficult for the CPU’s standard processing unit (the ALU) to process. A huge number of bits are required in order to perform a precise calculation. Calculations involving whole numbers (integers) are much simpler, and the result is correct every time.
That is why an FPU is used – a special calculating unit which operates with floating point numbers of various bit lengths, depending on how much precision is needed. FP numbers can be up to 80 bits long, whereas normal whole numbers can “only” be up to 32 bits (permitting about 4.3 billion different values). So the FPU is a number cruncher, which relieves the load on the ALU’s. You can experiment with large numbers yourself, for example, in a spreadsheet.
In Excel 2000, 2^1023 (the number 2 multiplied by itself 1023 times) is the biggest calculation I can perform. The result is slightly less than 9 followed by 307 zeroes.
    Fig.  93. Experiments with big numbers in Excel.
    Modern CPU’s have a built-in FPU (Floating Point Unit) which serves as a number cruncher. But it hasn’t always been like this.
    For example, Intel’s 80386 processor didn’t have a built-in FPU calculating unit. All calculations were done using the processor’s ALU. But you could buy a separate FPU (an 80387), which was a chip which you mounted in a socket on the motherboard, beside the CPU. However, in the 80486 processor, the FPU was built-in, and it has been that way ever since.
    Fig.  94. A “separate” FPU, Intel’s 80387 from 1986.
  • 3D graphics

  • Much of the development in CPU’s has been driven by 3D games. These formidable games (like Quake and others) place incredible demands on CPU’s in terms of computing power.  When these programs draw people and landscapes which can change in 3-dimensional space, the shapes are constructed from tiny polygons (normally triangles or rectangles).
    Fig.  95. The images in popular games like Sims are constructed from hundreds of polygons.
A character in a PC game might be built using 1500 such polygons. Each time the picture changes, these polygons have to be drawn again in a new position. That means that every corner (vertex) of every polygon has to be re-calculated.
    In order to calculate the positions of the polygons, floating point numbers have to be used (integer calculations are not nearly accurate enough). These numbers are called single-precision floating points and are 32 bits long. There are also 64-bit numbers, called double-precision floating points, which can be used for even more demanding calculations.
When the shapes in a 3D landscape move, a so-called matrix multiplication has to be done to calculate the new vertices. For just one shape, made up of, say, 1000 polygons, up to 84,000 multiplications have to be performed on pairs of 32-bit floating point numbers. And this has to happen for each new position the shape has to occupy. There might be 75 new positions per second. This is quite heavy computation, which the traditional PC is not very good at. The national treasury’s biggest spreadsheet is child’s play compared to a game like Quake, in terms of the computing power required.
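Here is a minimal sketch in C of the core operation – one vertex (as a 4-element vector) multiplied by a 4x4 transformation matrix, using the 32-bit single-precision floats mentioned above. Each vertex costs 16 multiplications and 12 additions, and a game repeats this for thousands of vertices, up to 75 times per second:

    #include <stdio.h>

    /* Transform one vertex: out = m x in (4x4 matrix times 4-vector). */
    void transform_vertex(const float m[4][4], const float in[4], float out[4])
    {
        for (int row = 0; row < 4; row++) {
            out[row] = 0.0f;
            for (int col = 0; col < 4; col++)
                out[row] += m[row][col] * in[col];  /* 16 multiplications in all */
        }
    }

    int main(void)
    {
        float identity[4][4] = { {1,0,0,0}, {0,1,0,0}, {0,0,1,0}, {0,0,0,1} };
        float vertex[4] = { 1.0f, 2.0f, 3.0f, 1.0f };
        float result[4];

        transform_vertex(identity, vertex, result);
        printf("%.1f %.1f %.1f %.1f\n", result[0], result[1], result[2], result[3]);
        return 0;
    }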
    The CPU can be left gasping for breath when it has to work with 3D movements across the screen. What can we do to help it? There are several options:
  • Generally faster CPUs. The higher the clock frequency, the faster the traditional FPU performance will become.
  • Improvements to the CPU’s FPU, using more pipelines and other forms of acceleration. We see this in each new generation of CPU’s.
  • New instructions for more efficient 3D calculations.
    We have seen that clock frequencies are constantly increasing in the new generations of CPU. But the FPU’s themselves have also been greatly enhanced in the latest generations of CPU’s. The Athlon, especially, is far more powerfully equipped in this area compared to its predecessors.
The last method has also been shown to be very effective. CPU’s have simply been given new registers and new instructions which programmers can use.
  • MMX instructions

  • The first initiative was called MMX (multimedia extension), and came out with the Pentium MMX processor in 1997. The processor had built-in “MMX instructions” and “MMX registers”.
    The previous editions of the Pentium (like the other 32 bit processors) had two types of register: One for 32-bit integers, and one for 80-bit decimal numbers. With MMX we saw the introduction of a special 64-bit integer register which works in league with the new MMX instructions. The idea was (and is) that multimedia programs should exploit the MMX instructions. Programs have to be “written for” MMX, in order to utilise the new system.
    MMX is an extension to the existing instruction set (IA32). There are 57 new instructions which MMX compatible processors understand, and which require new programs in order to be exploited.
Many programs were rewritten to work both with and without MMX (see Fig. 96). Thus these programs could continue to run on older processors without MMX, where they just ran slower.
    MMX was a limited success. There is a weakness in the design in that programs either work with MMX, or with the FPU, and not both at the same time – as the two instruction sets share the same registers. But MMX laid the foundation for other multimedia extensions which have been much more effective.
Fig.  96. This drawing program (Painter) supports MMX, as do all modern programs.
  • 3DNow!

  • In the summer of 1998, AMD introduced a collection of CPU instructions which improved 3D processing. These were 21 new SIMD (Single Instruction Multiple Data) instructions. The new instructions could process several chunks of data with one instruction. The new instructions were marketed under the name, 3DNow!. They particularly improved the processing of the 32-bit floating point numbers used so extensively in 3D games.
    Fig.  97. 3DNow! became the successor to MMX.
    3DNow! was a big success. The instructions were quickly integrated into Windows, into various games (and other programs) and into hardware manufacturers’ driver programs.
  • SSE

  • After AMD’s success with 3DNow!, Intel had to come back with something else. Their answer, in January 1999, was SSE (Streaming SIMD Extensions), which are another way to improve 3D performance. SSE was introduced with the Pentium III.
    In principle, SSE is significantly more powerful than 3DNow! The following changes were made in the CPU:
  • 8 new 128-bit registers, which can contain four 32-bit numbers at a time.
• 50 new SIMD instructions which make it possible to do advanced calculations on several floating point numbers with just one instruction (see the sketch after this list).
  • 12 New Media Instructions, designed, for example, for the encoding and decoding of MPEG-2 video streams (in DVD).
  • 8 new Streaming Memory instructions to improve the interaction between L2 cache and RAM.
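As a small taste of SIMD, here is a hedged sketch in C using the SSE intrinsics that common compilers expose through the xmmintrin.h header. A single _mm_add_ps operation adds four pairs of 32-bit floating point numbers at once – precisely the “several chunks of data with one instruction” idea:

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    int main(void)
    {
        /* Two 128-bit SSE registers, each holding four 32-bit floats. */
        __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
        __m128 b = _mm_set_ps(8.0f, 7.0f, 6.0f, 5.0f);

        __m128 sum = _mm_add_ps(a, b);   /* four additions, one instruction */

        float out[4];
        _mm_storeu_ps(out, sum);

        printf("%.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]);
        return 0;                        /* prints: 6.0 8.0 10.0 12.0 */
    }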
    SSE also quickly became a success. Programs like Photoshop were released in new SSE optimised versions, and the results were convincing. Very processor-intensive programs involving sound, images and video, and in the whole area of multimedia, run much more smoothly when using SSE.
    Since SSE was such a clear success, AMD took on board the technology. A large part of SSE was built into the AthlonXP and Duron processors. This was very good for software developers (and hence for us users), since all software can continue to be developed for one instruction set, common to both AMD and Intel.
  • SSE2 and SSE3

  • With the Pentium 4, SSE was extended to use even more powerful techniques. SSE2 contains 144 new instructions, including 128-bit SIMD integer operations and 128-bit SIMD double-precision floating-point operations.
    SSE2 can reduce the number of instructions which have to be executed by the CPU in order to perform a certain task, and can thus increase the efficiency of the processor. Intel mentions video, speech recognition, image/photo processing, encryption and financial/scientific programs as the areas which will benefit greatly from SSE2. But as with MMX, 3DNow! and SSE, the programs have to be rewritten before the new instructions can be exploited.
SSE2 was adopted by the competitor, AMD, in the Athlon 64 processors. Here AMD even doubled the number of SSE2 registers compared to the Pentium 4. More recently, Intel has introduced 13 new instructions in SSE3, which are used in the Prescott version of the Pentium 4.
    We are now going to leave the discussion of instructions. I hope this examination has given you some insight into the CPU’s work of executing programs.

• Chapter 14. Examples of CPU’s

  • In this chapter I will briefly describe the important CPU’s which have been on the market, starting from the PC’s early childhood and up until today.
    One could argue that the obsolete and discontinued models no longer have any practical significance. This is true to some extent; but the old processors form part of the “family tree”, and there are still legacies from their architectures in our modern CPU’s, because the development has been evolutionary. Each new processor extended and built “on top of” an existing architecture.
    Fig.  98. The evolutionary development spirals ever outwards.
    There is therefore value (one way or another) in knowing about the development from one generation of CPU’s to the next. If nothing else, it may give us a feeling for what we can expect from the future.
  • 16 bits – the 8086, 8088 and 80286

  • The first PC’s were 16-bit machines. This meant that they could basically only work with text. They were tied to DOS, and could normally only manage one program at a time.
    But the original 8086 processor was still “too good” to be used in standard office PC’s. The Intel 8088 discount model was therefore introduced, in which the bus between the CPU and RAM was halved in width (to 8 bits), making production of the motherboard much cheaper. 8088 machines typically had 256 KB, 512 KB or 1 MB of RAM. But that was adequate for the programs at the time.
The Intel 80286 (from 1982) was the first step towards faster and more powerful CPU’s. The 286 was much more efficient; it simply performed much more work per clock tick than the 8086/8088 did. A new feature was also protected mode – a new way of working which made the processor much more efficient than under real mode, which the 8086/8088 processor forced programs to work in:
  • Access to all system memory – even beyond the 1MB limit which applied to real mode.
• Access to multitasking, which means that the operating system can run several programs at the same time.
  • The possibility of virtual memory, which means that the hard disk can be used to emulate extra RAM, when necessary, via a swap file.
  • 32 bit access to RAM and 32 bit drivers for I/O devices.
    Protected mode paved the way for the change from DOS to Windows, which only came in the 1990’s.
Fig.  99. Bottom: an Intel 8086, the first 16-bit processor. Top: the incredibly popular 8-bit processor, the Zilog Z80, which the 8086 and its successors outcompeted.
  • 32 bits – the 80386 and 486

• The Intel 80386 was the first 32-bit CPU. The 386 has 32-bit registers and a 32-bit data bus, both internally and externally. But for a traditional DOS-based PC, it didn’t bring about any great revolution. A good 286 ran nearly as fast as the first 386’s – under DOS, anyway, since DOS doesn’t exploit the 32-bit architecture.
    The 80386SX became the most popular chip – a discount edition of the 386DX. The SX had a 16-bit external data bus (as opposed to the DX’s 32-bit bus), and that made it possible to build cheap PC’s.
    Fig.  100. Discount prices in October 1990 – but only with a b/w monitor.
  • The fourth generation

  • The fourth generation of Intel’s CPU’s was called the 80486. It featured a better implementation of the x86 instructions – which executed faster, in a more RISC-like manner. The 486 was also the first CPU with built-in L1 cache. The result was that the 486 worked roughly twice as fast as its predecessor – for the same clock frequency.
    With the 80486 we gained a built-in FPU. Then Intel did a marketing trick of the type we would be better off without. In order to be able to market a cheap edition of the 486, they hit on the idea of disabling the FPU function in some of the chips. These were then sold under the name, 80486SX. It was ridiculous – the processors had a built-in FPU; it had just been switched off in order to be able to segment the market.
    Fig.  101. Two 486’s from two different manufacturers.
    But the 486 was a good processor, and it had a long life under DOS, Windows 3.11 and Windows 95. New editions were released with higher clock frequencies, as they hit on the idea of doubling the internal clock frequency in relation to the external (see the discussion later in the guide). These double-clocked processors were given the name, 80486DX2.
    A very popular model in this series had an external clock frequency of 33 MHz (in relation to RAM), while working at 66MHz internally. This principle (double-clocking) has been employed in one way or another in all later generations of CPU’s. AMD, IBM, Texas Instruments and Cyrix also produced a number of 80486 compatible CPU’s.
  • Pentium

  • In 1993 came the big change to a new architecture. Intel’s Pentium was the first fifth-generation CPU. As with the earlier jumps to the next generation, the first versions weren’t especially fast. This was particularly true of the very first Pentium 60 MHz, which ran on 5 volts. They got burning hot – people said you could fry an egg on them. But the Pentium quickly benefited from new process technology, and by using clock doubling, the clock frequencies soon skyrocketed.
    Basically, the major innovation was a superscalar architecture. This meant that the Pentium could process several instructions at the same time (using several pipelines). At the same time, the RAM bus width was increased from 32 to 64 bits.
    Fig.  102. The Pentium processor could be viewed as two 80486’s built into one chip.
    Throughout the 1990’s, AMD gained attention with its K5 and K6 processors, which were basically cheap (and fairly poor) copies of the Pentium. It wasn’t until the K6-2 (which included the very successful 3DNow! extensions), that AMD showed the signs of independence which have since led to excellent processors like the AthlonXP.
    Fig.  103. One of the earlier AMD processors. Today you’d hesitate to trust it to run a coffee machine…
    In 1997, the Pentium MMX followed (with the model name P55), introducing the MMX instructions already mentioned. At the same time, the L1 cache was doubled and the clock frequency was raised.
    Fig.  104. The Pentium MMX. On the left, the die can be seen in the middle.
  • Pentium II with new cache

  • After the Pentium came the Pentium II. But Intel had already launched the Pentium Pro in 1995, which was the first CPU in the 6th generation. The Pentium Pro was primarily used in servers, but its architecture was re-used in the popular Pentium II, Celeron and Pentium III models, during 1997-2001.
    The Pentium II initially represented a technological step backwards. The Pentium Pro used an integrated L2 cache. That was very advanced at the time, but Intel chose to place the cache outside the actual Pentium II chip, to make production cheaper.
    Fig.  105. L2 cache running at half CPU speed in the Pentium II.
    The Level 2 cache was placed beside the CPU on a circuit board, an SEC module (e.g. see Fig. 71, on page 28).  The module was installed in a long Slot 1 socket on the motherboard. Fig. 106 shows the module with a cooling element attached.  The CPU is sitting in the middle (under the fan). The L2 cache is in two chips, one on each side of the processor.
    Fig.  106. Pentium II processor module mounted on its edge in the motherboard’s Slot 1 socket (1997-1998).
The disadvantage of this system was that the L2 cache became markedly slower than it would have been if it were integrated into the CPU. The L2 cache typically ran at half the CPU’s clock frequency. AMD used the same system in their first Athlons, for which the socket was called Slot A (see Fig. 107).
    At some point, Intel decided to launch a discount edition of the Pentium II – the Celeron processor. In the early versions, the L2 cache was simply scrapped from the module. That led to quite poor performance, but provided an opportunity for overclocking.
    Overclocking means pushing a CPU to work at a higher frequency than it is designed to work at. It was a very popular sport, especially early on, and the results were good.
    Fig.  107. One of the first AMD Athlon processors, mounted in a Slot A socket. See the large cooling element.
    One of the problems of overclocking a Pentium II was that the cache chips couldn’t keep up with the high speeds. Since these Celerons didn’t have any L2 cache, they could be seriously overclocked (with the right cooling).
    Fig.  108. Extreme CPU cooling using a complete refrigerator built into the PC cabinet. With equipment like this, CPU’s can be pushed up to very high clock frequencies (See Kryotech.com and Asetek.com).
Intel later decided to integrate the L2 cache into the processor. That happened in new versions of the Celeron in 1998 and new versions of the Pentium III in 1999. The socket design was also changed so that the processors could be mounted directly on the motherboard, in a socket called Socket 370. Similarly, AMD introduced their Socket A.
  • Pentium 4 – long in the pipe

  • The Pentium III was really just (yet) another edition of the Pentium II, which again was a new version of the Pentium Pro. All three processors built upon the same core architecture (Intel P6).
    It wasn’t until the Pentium 4 came along that we got a completely new processor from Intel. The core (P7) had a completely different design:
  • The L1 cache contained decoded instructions.
  • The pipeline had been doubled to 20 stages (in later versions increased to 31 stages).
• The integer calculation units (ALU’s) had been double-clocked so that they could perform two micro operations per clock tick.
• Furthermore, the memory bus, which connects the RAM to the north bridge, had been quad-pumped, so that it transfers four data packets per clock tick. That is equivalent to 4 x 100 MHz and 4 x 133 MHz in the earliest versions of the Pentium 4. In later versions the bus was pumped up to 4 x 200 MHz, and an update with 4 x 266 MHz is scheduled for 2005.
• The processor was Hyper-Threading enabled, meaning that under certain circumstances it can operate as two individual CPU’s.
    All of these factors are described elsewhere in the guide. The important thing to understand, is that the Pentium 4 represents a completely new processor architecture.
    Fig.  109. The four big changes seen in the Pentium 4.


  • Chapter 15. Evolution of the Pentium 4

• As was mentioned earlier, the older P6 architecture was released back in 1995. Up to 2002, the Pentium III processors were sold alongside the Pentium 4. That means, in practice, that Intel’s sixth CPU generation lasted 7 years.
Similarly, we may expect this seventh-generation Pentium 4 to dominate the market for a number of years. The processors may still be called Pentium 4, but they come in a lot of varieties.
A major modification comes with the version using 0.065 micron process technology. It will open the way for higher clock frequencies, but there will also be a number of other improvements.
    Hyper-Threading Technology is a very exciting structure, which can be briefly outlined as follows: In order to exploit the powerful pipeline in the Pentium 4, it has been permitted to process two threads at the same time. Threads are series of software instructions. Normal processors can only process one thread at a time.
    In servers, where several processors are installed in the same motherboard (MP systems), several threads can be processed at the same time. However, this requires that the programs be set up to exploit the MP system, as discussed on page 31.
    The new thing is that a single Pentium 4 logically can function as if there physically were two processors in the pc. The processor core (with its long pipelines) is simply so powerful that it can, in many cases, act as two processors. It’s a bit like one person being able to carry on two independent telephone conversations at the same time.
Fig. 110. The Pentium 4 is ready for MP functions.
Hyper-Threading works very well in Intel’s Prescott versions of the Pentium 4. You gain performance when you run more than one task at a time. If you have two programs working simultaneously, both putting heavy pressure on the CPU, you will benefit from this technology. But you need an MP-compatible operating system (like Windows XP Professional) to benefit from it.
The next step in this evolution is the production of dual-core processors. AMD produces Opteron chips which hold two processors in one chip. Intel is working on dual-core versions of the Pentium 4 (with the codename “Smithfield”). These chips will find use in servers and high-performance PC’s. A dual-core Pentium 4 with Hyper-Threading enabled will in fact operate as a virtual quad-core processor.
Fig. 111. A dual-core processor with Hyper-Threading operates as a virtual quad-processor.
Intel also produces EE versions of the Pentium 4. EE stands for Extreme Edition, and these processors are extremely speedy versions carrying 2 MB of L2 cache.
    In late 2004 Intel changed the socket design of the Pentium 4. The new processors have no ”pins”; they connect to the socket through small contact pads on the processor surface.
    Fig.  112. The LGA 775 socket for the Pentium 4.
  • Athlon

  • The last processors I will discuss are the popular Athlon and Athlon 64 series (also known as K7 and K8).
    It was a big effort on the part of the relatively small manufacturer, AMD, when they challenged the giant Intel with a completely new processor design.
    The first models were released in 1999, at a time when Intel was the completely dominant supplier of PC processors. AMD set their sights high – they wanted to make a better processor than the Pentium II, and yet cheaper at the same time. There was a fierce battle between AMD and Intel between 1999 and 2001, and one would have to say that AMD was the victor. They certainly took a large part of the market from Intel.
    The original 1999 Athlon was very powerfully equipped with pipelines and computing units:
  • Three instruction decoders, which translated the X86 CISC program instructions into more efficient RISC instructions (ROP’s) – 9 of which could be executed at the same time.
  • It could handle up to 72 instructions (ROP’s, out of order) at the same time (the Pentium III could manage 40, the K6-2 only 24).
  • Very strong FPU performance, with three simultaneous instructions.
    All in all, the Athlon was in a class above the Pentium II and III in those years. Since Athlon processors were sold at competitive prices, they were incredibly successful. AMD also launched the Duron line of processors, as the counterpart to Intel’s Celeron, and were just as successful with it.
    Fig.  113. The Athlon was a huge success for AMD. During 2001-2002, the Athlon XP was in strong competition with the Pentium 4.
     
  • Athlon XP versus Pentium 4

  • The Athlon processor came in various versions. It started as a Slot A module (see Fig. 107 on page 42). It was then moved to Socket A when the L2 cache was integrated.
    In 2001, a new Athlon XP version was released, which included improvements like a new Hardware Auto Data Prefetch Unit and a bigger Translation Look-aside Buffer. The Athlon XP was much less advanced than the Pentium 4, but quite superior at clock frequencies below 2000 MHz. A 1667 MHz version of the Athlon XP was sold as the 2000+, indicating that the processor performs at least like a 2000 MHz Pentium 4.
    Later we saw Athlons in other versions. The latest was based on a new core called ”Barton”, introduced in 2003 with an L2 cache of 512 KB. AMD tried to sell the 2166 MHz version under the brand 3000+. It did not work; a Pentium 4 running at 3000 MHz had no problems outperforming the Athlon (see the sketch below).
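    A commonly cited (though unofficial) rule of thumb for these model ratings was: rating ≈ 1.5 x clock frequency − 500. Under that assumption – it is not an official AMD formula – this little sketch shows why the 2000+ branding was plausible while 3000+ was a stretch:

    # Assumed, unofficial rule of thumb for Athlon XP model ratings.
    # Not an official AMD formula - used here only for illustration.
    def xp_rating(clock_mhz):
        return round(1.5 * clock_mhz - 500)

    print(xp_rating(1667))  # about 2000 - matches the "2000+" branding
    print(xp_rating(2166))  # about 2749 - well short of the "3000+" branding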
  • Opteron/ Athlon64

  • AMD’s 8th generation CPU was released in 2003. It is based on a completely new core called Hammer.
    The new series of 64-bit processors comprises the Athlon 64, the Athlon 64 FX and the Opteron. These CPU’s have a new design in two areas:
  • The memory controller is integrated in the CPU. Traditionally this function has been housed in the north bridge, but now it is placed inside the processor.
  • AMD introduces a completely new 64-bit instruction set.
    Moving the memory controller into the CPU is a great innovation. It gives much more efficient communication between the CPU and RAM (which has to be ECC DDR SDRAM – 72 bit modules with error correction).
    Every time the CPU has to fetch data from normal RAM, it first has to send a request to the chipset’s controller. It then has to wait for the controller to fetch the desired data – and that can take a long time, resulting in wasted clock ticks and reduced CPU efficiency. By building the memory controller directly into the CPU, this waste is reduced. The CPU gets much more direct access to RAM, which should reduce latency and increase the effective bandwidth.
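    To see why this matters, here is a rough, purely illustrative calculation (the latency figures below are assumptions of mine, not measurements):

    # Illustrative latency comparison - all numbers are assumptions.
    CPU_CLOCK_MHZ = 2000
    ns_per_tick = 1000 / CPU_CLOCK_MHZ   # 0.5 ns per clock tick at 2000 MHz

    dram_access_ns = 50        # assumed raw DRAM access time
    north_bridge_hop_ns = 25   # assumed extra round trip through the chipset

    for name, total_ns in (("via north bridge", dram_access_ns + north_bridge_hop_ns),
                           ("integrated controller", dram_access_ns)):
        print(f"{name}: {total_ns} ns, roughly {total_ns / ns_per_tick:.0f} clock ticks")

    Even with these made-up numbers, the point is visible: every nanosecond shaved off the path to RAM saves the CPU many clock ticks of waiting.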
    The Athlon 64 processors are designed for 64-bit applications, which should be more powerful than the existing 32-bit software. We will probably see plenty of new 64-bit software in the future, since Intel is also releasing 64-bit processors compatible with the Athlon 64 series.
    Fig.  114. In the Athlon 64 the memory controller is located inside the processor. Hence, the RAM modules interface directly with the CPU.
    Overall, the Athlon 64 is an updated Athlon processor with an integrated north bridge function and 64-bit instructions. Other new features are:
  • Support for SSE2 instructions, with 16 registers for them.
  • A dual channel interface to DDR RAM, giving a 128 bit memory bus (see the sketch after this list), although the discount version of the Athlon 64 keeps the 64 bit bus.
  • Communication to and from the south bridge via a new HyperTransport bus, operating with high-speed serial transfer.
  • New sockets with 754 and 940 pins.
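    As a small sketch of what the dual channel interface means for bandwidth (illustrative figures of mine, assuming DDR400 memory at a 200 MHz clock):

    # Sketch: DDR memory bandwidth with one vs. two 64-bit channels.
    # DDR transfers data twice per clock tick ("double data rate").
    def ddr_bandwidth_gb_s(mem_clock_mhz, channels):
        bus_width_bytes = 8 * channels          # 64 bits = 8 bytes per channel
        transfers_per_s = mem_clock_mhz * 2e6   # two transfers per clock tick
        return transfers_per_s * bus_width_bytes / 1e9

    print(ddr_bandwidth_gb_s(200, 1))  # single channel DDR400: ~3.2 GB/s
    print(ddr_bandwidth_gb_s(200, 2))  # dual channel DDR400: ~6.4 GB/s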
  • A complete line of chips

  • AMD expects to use the K8 core in all types of processors:

    The Opteron is the most expensive and advanced version, to be used in multi-processor servers. The models are called 200, 400 and 800, and they use 2, 4 or 8 CPUs on the same motherboard – without the use of a north bridge.
    All processors share a common memory of up to 64 GB. Each Opteron has three HyperTransport I/O channels, each of which can move 6.4 GB/second.
    The Athlon 64 FX is an Opteron to be used in single-processor configurations, high-end PC’s and workstations. It has a dual-channel RAM interface, but only one HyperTransport link.
    The Athlon 64 is the discount version, with reduced performance and lower prices: only a 64 bit RAM interface and a smaller L2 cache.
    Fig.  115. Three versions of the latest AMD processor.
  • Historical overview

  • I will close off this review with a graphical summary of a number of different CPU’s from the last 25 years. The division into generations is not always crystal clear, but I have tried to present things in a straightforward and reasonably accurate way:
    Fig.  116. There are scores of different processors. A selection of them is shown here, divided into generations.
    But what is the most powerful CPU in the world? IBM’s Power4 must be a strong contender. It is a monster made up of 8 integrated 64-bit processor cores. It has to be installed in a 5,200-pin socket, uses 500 watts of power (there are 680 million transistors), and connects to a 32 MB L3 cache, which it controls itself. Good night to Pentium.
