KAVALIPOST

Tuesday 25 June 2013

  • Chapter 16. Choosing a CPU

    If you happen to need to choose a CPU for your new PC, what should you choose? Let me give you a bit of food for thought.






Processor                            Euro
INTEL Celeron 2.66 GHz/533, 256 KB    70,00
INTEL P4 520 2.8 GHz/800, 1 MB       130,00
INTEL P4 530 3.0 GHz/800, 1 MB       145,00
INTEL P4 540 3.2 GHz/800, 1 MB       175,00
INTEL P4 550 3.4 GHz/800, 1 MB       220,00
INTEL P4 560 3.6 GHz/800, 1 MB       330,00
AMD Sempron 3100+ (1.8 GHz)          105,00
AMD Athlon 64 3000+ (2.0 GHz)        209,00
AMD Athlon 64 3400+ (2.4 GHz)        227,00
AMD Athlon 64 FX-53 (2.4 GHz)        670,00

Fig. 117. Price list from October 2004 (without VAT).

    How long does a CPU last?

The individual components have different lifetimes. The way development has gone up until now, CPU's and motherboards have been the components that become obsolete most quickly. CPU's and motherboards also go together: you normally change them both at the same time.
    You have by now read all of my review of new processor technology; new and faster models are constantly being developed. The question is, then, how important is it to have the latest technology? You have to decide that for yourself. But if you know that your PC has to last for many years, you probably should go for the fastest CPU on the market.
    For the rest of us, who regularly upgrade and replace our PC’s insides, it is important to find the most economic processor. There is usually a price jump in the processor series, such that the very latest models are disproportionately expensive.
    By the time you read this, prices will have fallen in relation to those shown in Fig. 117. There will no doubt also be significantly faster CPU’s to choose between. But the trend will most likely be the same: The latest models are a lot more expensive than the other ones. You have to find the model that gives the most power in proportion to the price.
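As a rough, back-of-the-envelope illustration of "power in proportion to price", you can compare clock speed per euro using a few rows from Fig. 117. (MHz per euro is a crude metric, since it ignores architecture differences between the models; this is only a sketch of the idea.)

```python
# Crude price/performance check. Clock speeds and prices are taken
# from the October 2004 price list (Fig. 117); MHz per euro ignores
# architectural differences, so treat it as illustration only.
cpus = {
    "Celeron 2.66 GHz":  (2660, 70),   # (MHz, euro)
    "P4 520 2.8 GHz":    (2800, 130),
    "P4 530 3.0 GHz":    (3000, 145),
    "Athlon 64 FX-53":   (2400, 670),
}

for name, (mhz, euro) in cpus.items():
    print(f"{name}: {mhz / euro:.1f} MHz per euro")
```

The pattern matches the point made above: the cheapest chips deliver far more clock speed per euro, and the flagship model (the FX-53 here) is disproportionately expensive.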
Also, the amount of RAM is just as important as the speed of the processor. RAM prices fluctuate unbelievably; in just a year the price can double or halve. So it's a good idea to buy your RAM when prices are low.

    CPU-intensive operations

    Back in the 1990’s it was quite important to get the latest and fastest CPU, because CPU’s were not fast enough for things like image processing. For example, if you try to work with fairly large images in Photoshop, on a PC with a 233 MHz processor, you will probably quickly decide to give up the project.
    But whether you have 2400 or 3200 MHz – that’s not as critical, especially if you have enough RAM and are working with normal tasks. A processor running at 3200 MHz is roughly 33% faster than a 2400 MHz model, but it doesn’t always make that much difference to your daily work.
Here are some tasks that might require more CPU power:
  • Video editing.
  • DVD ripping (often illegal).
  • Photo editing, especially with high resolution images and 48 bit color depth.
  • Production of PDF-files in high quality.
  • Speech recognition.
Video (including DVD) contains huge amounts of data. The CPU and RAM therefore have to work very hard when you edit video footage. At the time of writing, it is legal to copy DVD films for personal use, but that may change. Legal or not, it's certainly possible. The actual ripping can take 10-20 minutes, during which the CPU slowly chews through the more than 4 GB of data on a DVD. A 5 GHz CPU would obviously be able to reduce the time taken considerably.
Finally, speech recognition software is considered to be very CPU-intensive. This is probably only going to be a problem in the distant future. Some people imagine future versions of Windows and Office programs will have built-in speech recognition, which will place quite new demands on CPU's. However, it is far from certain that speech recognition will ever be introduced into PC's. There appear to be big problems in getting it to work well.

  • Chapter 17. The CPU’s immediate surroundings

In this part of the guide, we have dug down into the inner workings of the CPU. We will let it rest in peace now, and concentrate on the processor's immediate surroundings. That is, the RAM and the chipset – or more precisely, the north bridge.
In the first section of the guide I introduced the chipset, including the north bridge (see, for example, Fig. 46 on page 19), which connects the CPU to the PC's memory, the RAM.

    The pathway to RAM

The most important data path on the motherboard runs between the CPU and the RAM. Data is constantly pumped back and forth between the two, and this bus therefore often comes under focus when new generations of CPU's, chipsets and motherboards are released.
The RAM sends and receives data on a bus, and this work involves a clock frequency. This means that all RAM has a speed, just like a CPU does. Unfortunately RAM is much slower than the CPU, and the buses on the motherboard have to make allowance for this fact.

    The XT architecture

    In the original PC design (the IBM XT), the CPU, RAM and I/O devices (which we will come to later) were connected on one and the same bus, and everything ran synchronously (at a common speed). The CPU decided which clock frequency the other devices had to work at:
    Fig. 118. In the original PC architecture, there was only one bus with one speed.
    The problem with this system was that the three devices were “locked to each other”; they were forced to work at the lowest common clock frequency. It was a natural architecture in the first PC’s, where the speed was very slow.

    The first division of the bus

    In 1987, Compaq hit on the idea of separating the system bus from the I/O bus, so that the two buses could work at different clock frequencies. By letting the CPU and RAM work on their own bus, independent of the I/O devices, their speeds could be increased.
    In Fig. 119, the CPU and RAM are connected to a common bus, called the system bus, where in reality the CPU’s clock frequency determines the working speed. Thus the RAM has the same speed as the CPU; for example, 12, 16 or 25 MHz.
    Fig. 119. With this architecture, the I/O bus is separate from the system bus (80386).
    The I/O devices (graphics card, hard disk, etc.) were separated from the system bus and placed on a separate low speed bus. This was because they couldn’t keep up with the clock frequencies of the new CPU versions.
    The connection between the two buses is managed by a controller, which functions as a “bridge” between the two paths. This was the forerunner of the multibus architecture which all motherboards use today.

    Clock doubling

    With the introduction of the 80486, the CPU clock frequency could be increased so much that the RAM could no longer keep up. Intel therefore began to use clock doubling in the 80486 processor.
    The RAM available at the time couldn’t keep up with the 66 MHz speed at which an 80486 could work. The solution was to give the CPU two working speeds.
  • An external clock frequency
  • An internal clock frequency
    Inside the processor, the clock frequency of the system bus is multiplied by a factor of 2, doubling the working speed.
    Fig. 120. The bus system for an 80486 processor.
    But this system places heavy demands on the RAM, because when the CPU internally processes twice as much data, it of course has to be “fed” more often. The problem is, that the RAM only works half as fast as the CPU.
    For precisely this reason, the 486 was given a built-in L1 cache, to reduce the imbalance between the slow RAM and the fast processor. The cache doesn’t improve the bandwidth (the RAM doesn’t work any faster), but it ensures greater efficiency in the transfer of data to the CPU, so that it gets the right data supplied at the right time.
Clock doubling made it possible for Intel to develop processors with higher and higher clock frequencies. At the time the Pentium was introduced, new RAM modules became available, and the system bus was increased to 66 MHz. In the case of the Pentium II and III, the system bus was increased to 100 and 133 MHz, with the internal clock frequency set to a multiple of these.
Fig. 121. The bus system for a Pentium III processor.

  • Chapter 18. Overclocking

    The Pentium II was subjected to a lot of overclocking. It was found that many of Intel’s CPU’s could be clocked at a higher factor than they were designed for.
If you had a 233 MHz Pentium II, you could set up the motherboard to run at, for example, 4.5 x 66 MHz, so that the processor ran at 300 MHz. I tried it myself for a while, and it worked well. At a factor of 5 it didn't work, but at a factor of 4.5 it functioned superbly.

CPU           System bus    Clock factor   Internal clock frequency
Pentium       66 MHz        1.5            100 MHz
Pentium MMX   66 MHz        2.5            166 MHz
Pentium II    66 MHz        4.5            300 MHz
Pentium II    100 MHz       6              600 MHz
Celeron       100 MHz       8              800 MHz
Pentium III   133 MHz       9              1200 MHz
Athlon XP     133 MHz x 2   13             1733 MHz
Athlon XP+    166 MHz x 2   13             2167 MHz
Pentium 4     100 MHz x 4   22             2200 MHz
Pentium 4     133 MHz x 4   23             3066 MHz
Pentium 4     200 MHz x 4   18             3600 MHz

Fig. 122. The CPU's internal clock frequency is locked to the system bus frequency.
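The relationship in the table above boils down to one multiplication: internal clock = system bus frequency x clock factor. A minimal sketch (the function name is mine; note that the "66 MHz" bus is nominally 66.67 MHz, which is why 4.5 x 66 comes out as "300 MHz"):

```python
# The rule behind Fig. 122: internal clock = system bus x clock factor.
# Function name is illustrative, not from the guide.
def internal_clock_mhz(system_bus_mhz, clock_factor):
    return system_bus_mhz * clock_factor

# A few rows from the table (66 MHz is nominally 66.67 MHz):
print(internal_clock_mhz(66.67, 4.5))   # Pentium II  -> ~300 MHz
print(internal_clock_mhz(133.33, 9))    # Pentium III -> ~1200 MHz
print(internal_clock_mhz(200, 18))      # Pentium 4   -> 3600 MHz
```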

    Overclocking the system bus

    Another method of overclocking was to turn up the system bus clock frequency. In the early versions of the Pentium II, the system bus was at 66 MHz, which suited the type of RAM used at that time.
    You could increase the bus speed, for example to 68 or 75 MHz, depending on how fast your RAM was. This type of tuning makes both the CPU and RAM faster, since it is the actual system clock speed which is increased.
    The disadvantage is that the system clock in these motherboard architectures also controls the I/O bus, which runs synchronously with the system bus. PCI bus devices (which we will come to in a later chapter) cannot handle being overclocked very much; otherwise faults can occur, for example in reading from the hard disk.
    Overclocking typically requires a higher voltage for the CPU, and most motherboards can be set up to supply this:

Fig. 123. Setting the CPU voltage using the motherboard's Setup program.
Many still use the same kind of overclocking on the Athlon XP and Pentium 4. It requires a motherboard on which the system clock can be adjusted in small increments, and many allow this.

Fig. 124. A gigantic cooler with two fans and pure silver contact surfaces. The Silverado, a German product which is used for overclocking CPU's.

    Example using the Pentium 4

A Pentium 4 processor is designed for a system clock of 200 MHz. If you have a 3200 MHz model with a 200 MHz system bus, it can theoretically be clocked up to 4000 MHz by turning up the system clock. However, the processor needs a very powerful cooling system to operate at the increased frequencies:
System clock   CPU clock
200 MHz        3200 MHz
230 MHz        3700 MHz
250 MHz        4000 MHz

Fig. 125. Overclocking a Pentium 4 processor.
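Numerically, this works because the multiplier stays locked while the system clock rises. A minimal sketch, assuming a locked multiplier of 16 (3200 / 200):

```python
# Sketch of the Fig. 125 scenario: with a locked multiplier of 16,
# the CPU clock follows the system clock proportionally.
multiplier = 3200 / 200   # = 16

for system_clock in (200, 230, 250):
    cpu_clock = int(system_clock * multiplier)
    print(system_clock, "MHz ->", cpu_clock, "MHz")
```

230 MHz x 16 gives 3680 MHz, which the table above rounds to 3700 MHz.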
    The manufacturers, Intel and AMD, don’t like people overclocking their CPU’s. They have sometimes attempted to prevent this by building a lock into the processors, so that the processor can only work at a specific clock frequency. In other cases the CPU’s can be overclocked. In any case, you shouldn’t expect your warranty to apply if you play around with overclocking.


• Chapter 19. Different types of RAM

    RAM works in synch with the system bus, as I described in the previous chapter. But what is RAM actually? RAM is a very central component in a PC, for without RAM there can be no data processing. RAM is simply the storage area where all software is loaded and works from.

    Silicon storage area

RAM consists of electronic chips made of so-called semiconductor material, just like processors and many other types of chips.
    In RAM, transistors make up the individual storage cells which can each “remember” an amount of data, for example, 1 or 4 bits – as long as the PC is switched on.


    Fig. 126. Manufacturing RAM.

Normal RAM is dynamic (called DRAM), and requires constant electrical refreshing to preserve its data contents. Without power, all RAM cells are cleared. RAM is very closely linked to the CPU, and it is important both to have enough RAM and to have fast RAM. If both conditions are not met, the RAM will be a bottleneck which slows down the PC. What follows is an introduction to RAM as it is used in modern PC's. After this I will discuss the various types in more detail.

    Several types

    At the time of writing, there are several types of RAM, which are quite different. This means that they normally cannot be used on the same motherboard – they are not compatible.
    However, some motherboards can have sockets for two types of RAM. You typically see this during periods where there is a change taking place from one type of RAM to another. Such motherboards are really designed for the new type, but are made “backwards compatible”, by making room for RAM modules of the old type.
    The RAM types on the market at the moment are:
RAM type     Pins   Width    Usage
SD RAM       168    64 bit   Older and slower type. No longer used in new PC's.
Rambus RAM   184    16 bit   Advanced RAM. Only used for very few Pentium 4's with certain Intel chipsets.
DDR RAM      184    64 bit   A faster version of SD RAM. Used for both Athlon and Pentium 4. 2.5 Volt.
DDR2 RAM     240    64 bit   New version of DDR RAM with higher clock frequencies. 1.8 Volt.

Fig. 127. Very different types of RAM.
In any case, there is a lot of development taking place in DDR. A number of new RAM products will be released within the next few years. The modules are packaged differently, so they cannot be mixed. The notches in the sides are different, as are the bottom edges of the modules.
    SDRAM is an old and proven type, which is used in the majority of existing PC’s. DDR RAM is a refinement of SDRAM, which is in reality double clocked. Rambus RAM is an advanced technology which in principle is superior to DDR RAM in many ways. However, Rambus has had a difficult birth. The technology has been patented by Rambus Inc., which has been involved in many legal suits. A number of important manufacturers (such as VIA) have opted out of Rambus, and only develop products which use DDR RAM. With the new DDR2 standard, there is no obvious need for Rambus RAM.

    Notes on the physical RAM

    RAM stands for Random Access Memory. Physically, RAM consists of small electronic chips which are mounted in modules (small printed circuit boards). The modules are installed in the PC’s motherboard using sockets — there are typically 2, 3 or 4 of these. On this motherboard there are only two, and that’s a bit on the low side of what is reasonable.


Fig. 128. RAM modules are installed in sockets on the motherboard. In the background you see the huge fan on a Pentium 4 processor.

    Each RAM module is a rectangular printed circuit board which fits into the sockets on the motherboard:

    Fig. 129. A 512 MB DDR RAM module.
    On a module there are typically 8 RAM chips which are soldered in place. There can also be 16 if it is a double-sided module. Below is a single RAM chip:
Fig. 130. A single RAM chip, a 256 megabit circuit.

On the bottom edge of the module you can see the copper-coated tracks which make electrical contact (the edge connector). Note also the profile of the module; this makes it possible to install it only one way round and in the right socket.

    Fig. 131. Anatomy of the SD RAM module.
    The notches in the sides of the module fit the brackets or “handles” which hold the module in place in the motherboard socket:

    Module or chip size

    All RAM modules have a particular data width, which has to match the motherboard, chipset, and ultimately the CPU. Modules using the two most common RAM types, SD and DDR RAM, are 64 bits wide.
    The modules are built using chips which each contain a number of megabits. And since each byte needs 8 bits, more than one chip is needed to make a module. Look at the RAM chip in Fig. 130. You can see the text ”64MX4”:

    Fig. 132. Text on a RAM chip.
    The text in Fig. 132 indicates that the chip contains 64 x 4 mega bits of data, which is the same as 256 megabits. If we want to do some calculations, each chip contains 1024 x 1024 x 64 = 67,108,864 cells, which can each hold 4 bits of data. That gives 268,435,456 bits in total, which (when divided by 8) equals 33,554,432 bytes = 32,768 KB = 32 MB.
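The arithmetic above can be checked in a few lines:

```python
# Reproduce the "64M x 4" calculation from the text.
cells = 64 * 1024 * 1024                 # 64M addressable cells -> 67108864
bits_per_cell = 4
total_bits = cells * bits_per_cell       # 268435456 bits
total_bytes = total_bits // 8            # 33554432 bytes
total_mb = total_bytes // (1024 * 1024)  # 32768 KB = 32 MB

print(total_mb)   # -> 32
```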
    So a 64MX4 circuit contains 32 MB, and is a standard product. This type of chip is used by many different manufacturers to make RAM modules. Modules are sold primarily in four common sizes, from 64 to 512 MB:
Number of chips per module   Module size
2 (single-sided)             2 x 256 Mbit = 64 MB
4 (single-sided)             4 x 256 Mbit = 128 MB
8 (single-sided)             8 x 256 Mbit = 256 MB
16 (double-sided)            16 x 256 Mbit = 512 MB

Fig. 133. The module size is determined by which RAM chips are used.

    RAM speeds

    For each type of RAM, there are modules of various sizes. But there are also modules with various speeds. The faster the RAM chips are, the more expensive the modules.

Name               Type
PC700              2 x 356 MHz Rambus RAM
PC800              2 x 400 MHz Rambus RAM
PC1066             2 x 533 MHz Rambus RAM
DDR266 or PC2100   2 x 133 MHz DDR RAM
DDR333 or PC2700   2 x 166 MHz DDR RAM
DDR400 or PC3200   2 x 200 MHz DDR RAM
DDR2-400           400 MHz DDR2 RAM
DDR2-533           533 MHz DDR2 RAM
DDR2-667           667 MHz DDR2 RAM

Fig. 134. The speed is measured in megahertz, but it is not possible to compare directly between the three types of RAM shown in the table. The names shown are those used on the market at the time of writing, but they may change.

  • Chapter 20. RAM technologies

    Let’s look a little more closely at the technology which is used in the various types of RAM.

    In the old days

    Back in the 1980’s, DRAM was used. This was dynamic RAM, which was relatively slow. It was replaced by FPM (Fast Page Mode) RAM which was also dynamic, only a bit faster.
    Originally, loose RAM chips were installed directly in large banks on the motherboard. Later people started combining the chips in modules. These came in widths of 8 bits (with 30 pins) and 32 bits (with 72 pins). The 32-bit modules were suited to the system bus for the 80486 processor, which was also 32 bits wide.
    Fig. 135. Older RAM modules.
    FPM RAM could not run any faster than 66 MHz, but that was fine for the system bus clock frequency in the original Pentium processors.
    After FPM came EDO RAM (Extended Data Out). EDO is a bit faster than FPM because the data paths to and from the RAM cells have been optimised. The gain was a 3-5 % improvement in bandwidth. The clock frequency could be increased to 75 MHz, but basically, EDO is not very different to FPM RAM.
When Intel launched the Pentium processor, there was a change to using 64-bit wide RAM modules (with 168 pins, as in Fig. 127 on page 51), which are still used for SDRAM today.
    Fig. 136. An old motherboard with sockets for both 64-bit and 32-bit RAM modules. From the transition period between EDO and SDRAM.

    SDRAM

The big qualitative shift came around 1997, when SDRAM (Synchronous DRAM) began to break through. This was a completely new technology, which of course required new chipsets. SDRAM, in contrast to the earlier types of RAM, operates synchronously with the system bus.
    Data can (in burst mode) be fetched on every clock pulse. Thus the module can operate fully synchronised with (at the same beat as) the bus – without so-called wait states (inactive clock pulses). Because they are linked synchronously to the system bus, SDRAM modules can run at much higher clock frequencies.
    The 100 MHz SDRAM (PC100) quickly became popular, and with new processors and chipsets, the speed was brought up to 133 MHz (PC133).
Another innovation in SDRAM is the small EEPROM chip called the Serial Presence Detect (SPD) chip, which is mounted on the modules. It is a very small chip containing data on the module's speed, etc.
    Fig. 137. The motherboard BIOS can now read SDRAM module specifications directly.

    DDR RAM

    It is expensive to produce fast RAM chips. So someone hit on a smart trick in 1999-2000, which in one blow made normal RAM twice as fast. That was the beginning of DDR RAM (Double Data Rate). See the module in Fig. 131.
    In DDR RAM, the clock signal is used twice. Data is transferred both when the signal rises, and when it falls. This makes it possible to perform twice as many operations per clock pulse, compared to earlier RAM types:
    Fig. 138. DDR RAM sends off two data packets for each clock pulse.

    Timings

DDR RAM exists in many versions, with different clock frequencies and timings. The timing indicates how many clock cycles are wasted while the motherboard waits for the memory to deliver the requested data.
With smaller numbers, the timings are better and the CPU has fewer idle clock cycles. You may find memory modules with the same clock frequency but different timings. The better the timing, the more expensive the RAM module.
Ordinary PC users need not speculate about special RAM with fast timings; these are primarily sold to gamers and overclockers, who try to achieve maximum performance from their motherboards.
Module   Clock frequency   Timings
PC2100   2 x 133 MHz       2-2-2
PC2700   2 x 166 MHz       2-2-2 or 2-3-3
PC3200   2 x 200 MHz       2-3-2 or 2-3-3
PC3700   2 x 233 MHz       3-4-4
PC4000   2 x 250 MHz       3-4-4
PC4400   2 x 275 MHz       3-4-4

Note that different timing means that the gain in terms of increased bandwidth doesn't quite match the clock speed. It is a bit less.
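One way to see why faster modules with looser timings don't gain the full clock-speed advantage is to convert the CAS figure (the first number in the timing) into nanoseconds. This is only a rough sketch; real access latency involves more factors than CAS alone:

```python
# Rough sketch: convert CAS cycles into absolute time.
# One clock cycle lasts 1000/clock_mhz nanoseconds.
def cas_latency_ns(cas_cycles, clock_mhz):
    return cas_cycles * 1000.0 / clock_mhz

print(round(cas_latency_ns(2, 133), 1))   # PC2100 at CL2 -> ~15.0 ns
print(round(cas_latency_ns(3, 200), 1))   # PC3200 at CL3 -> 15.0 ns
```

In this example the faster PC3200 module with CL3 waits roughly as long, in absolute time, as the slower PC2100 module with CL2; the bandwidth gain comes from the higher clock, not from a shorter wait.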

    Next generation RAM

In the beginning, the problem with DDR RAM was that the RAM modules were poorly standardized. A module might work with one motherboard but not with another. But this, which was typical for a new technological standard, is not a big problem anymore. Intel was initially against DDR RAM. They claimed that Rambus was a much better design, and that they wouldn't use DDR RAM. But consumers wanted DDR RAM, which Intel's competitors were able to deliver, and in the end even Intel had to give in. At the end of 2001, the i845 chipset was released, which uses DDR RAM for the Pentium 4, and later we had the i865 and i875 chipsets, which use dual channel DDR RAM.
The next generation of RAM is DDR2, which is a new and better-standardized version of DDR that uses less power. DDR2 modules operate at higher clock speeds thanks to a better design with higher signal integrity and a more advanced internal data bus. The first chipsets to use DDR2 were Intel's i915 and i925. Later, DDR3 is expected, with clock frequencies of up to 1.6 GHz!

    Rambus RAM

    Rambus Inc., as already mentioned, has developed a completely new type of RAM technology. Rambus uses a completely different type of chip, which are mounted in intelligent modules that can operate at very high clock frequencies. Here is a brief summary of the system:
  • The memory controller delivers data to a narrow high-speed bus which connects all the RDRAM modules in a long series.
  • The modules contain logic (Rambus ASIC), which stores the data in the format the chips use.
  • Data is written to one chip at a time, in contrast to SDRAM where it is spread across several chips.
  • The modules work at 2.5 volts, which is reduced to 0.5 volts whenever possible. In this way, both the build up of heat, and electromagnetic radiation can be kept down. They are encapsulated in a heat conducting, aluminium casing.
Rambus RAM thus has a completely new and different design. The modules are only 16 bits wide. Less data is transferred per clock pulse, but the clock frequencies are much higher. The actual Rambus modules (also called RIMM modules) look a bit like the normal SDRAM modules as we know them. They have 184 pins, but as mentioned, the chips are protected by a heat-conducting casing:
    Fig. 139. Rambus module.
    As the advanced Rambus modules are quite costly to produce, the technology is on its way out of the market.


  • Chapter 21. Advice on RAM

    RAM can be a tricky thing to work out. In this chapter I will give a couple of tips to anyone having to choose between the various RAM products.

    Bandwidth

    Of course you want to choose the best and fastest RAM. It’s just not that easy to work out what type of RAM is the fastest in any given situation.
We can start by looking at the theoretical maximum bandwidth for the various systems. This is easy to calculate by multiplying the clock frequency by the bus width. This gives:
Module type          Max. transfer
SD RAM, PC100         800 MB/sec
SD RAM, PC133        1064 MB/sec
Rambus, PC800        1600 MB/sec
Rambus, Dual PC800   3200 MB/sec
DDR266 (PC2100)      2128 MB/sec
DDR333 (PC2700)      2664 MB/sec
DDR400 (PC3200)      3200 MB/sec
Dual DDR PC3200      6400 MB/sec
Dual DDR2-400        6400 MB/sec
Dual DDR2-533        8528 MB/sec

Fig. 140. The highest possible bandwidth (peak bandwidth) for the various types of RAM.
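The calculation behind the table is clock frequency x transfers per clock x bus width (in bytes) x number of channels. A minimal sketch (the function name and parameters are mine):

```python
# Peak bandwidth in MB/sec: clock x transfers per clock x bytes per
# transfer x channels. Function name is illustrative.
def peak_bandwidth_mb(clock_mhz, transfers_per_clock, bus_width_bits, channels=1):
    return clock_mhz * transfers_per_clock * bus_width_bits // 8 * channels

print(peak_bandwidth_mb(100, 1, 64))      # SD RAM PC100    -> 800
print(peak_bandwidth_mb(133, 2, 64))      # DDR266 (PC2100) -> 2128
print(peak_bandwidth_mb(400, 2, 16))      # Rambus PC800    -> 1600
print(peak_bandwidth_mb(200, 2, 64, 2))   # Dual DDR PC3200 -> 6400
```

Note how the narrow 16-bit Rambus bus compensates with a much higher clock, while DDR gets its factor of two from transferring on both edges of the clock signal.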
    However, RAM also has to match the motherboard, chipset and the CPU system bus. You can try experimenting with overclocking, where you intentionally increase the system bus clock frequency. That will mean you need faster RAM than what is normally used in a given motherboard. However, normally, we simply have to stick to the type of RAM currently recommended for the chosen motherboard and CPU.

    RAM quality

    The type of RAM is one thing; the RAM quality is something else. There are enormous differences in RAM prices, and there are also differences in quality. And since it is important to have a lot of RAM, and it is generally expensive, you have to shop around.
One of the advantages of buying a clone PC (whether you build it yourself or buy it complete) is that you can use standard RAM. The brand name suppliers (like IBM and Compaq) use their own RAM, which can be several times more expensive than the standard product. The reason for this is that the RAM modules have to meet very specific specifications. That means that out of a particular production run, only 20% may be "good enough", and that makes them expensive.
    Over the years I have experimented with many types of RAM in many combinations. In my experience, for desktop PC’s (not servers), you can use standard RAM without problems. But follow these precautions:
  • Avoid mixing RAM from various suppliers and with various specifications in the same PC – even if others say it is fine to do so.
  • Note that the RAM chips are produced at one factory, and the RAM modules may be produced at another.
  • Buy standard RAM from a supplier you trust. You need to know who manufactured the RAM modules and the seller needs to have sold them over a longer period of time. Good brands are Samsung, Kingston and Corsair.
  • The modules have to match the motherboard. Ensure that they have been tested at the speed you need to use them at.
  • The best thing is to buy the motherboard and RAM together. It’s just not always the cheapest.
  • Avoid modules with more than 8 chips on each side.

    How much RAM?

    RAM has a very big impact on a PC’s capacity. So if you have to choose between the fastest CPU, or more RAM, I would definitely recommend that you go for the RAM. Some will choose the fastest CPU, with the expectation of buying extra RAM later, “when the price falls again”. You can also go that way, but ideally, you should get enough RAM from the beginning. But how much is that?
    If you still use Windows 98, then 256 MB is enough. The system can’t normally make use of any more, so more would be a waste. For the much better Windows 2000 operating system, you should ideally have at least 512 MB RAM; it runs fine with this, but of course 1024 MB or more is better. The same goes for Windows XP:

               128 MB   256 MB   512 MB   1024 MB
Windows 98     **       ***      Waste    Waste
Windows 2000   *        **       ***      ****
Windows XP     -        *        ***      ****

Fig. 141. Recommended amount of PC RAM, which has to be matched to the operating system.
    The advantage of having enough RAM is that you avoid swapping. When Windows doesn’t have any more free RAM, it begins to artificially increase the amount of RAM using a swap file. The swap file is stored on the hard disk, and leads to a much slower performance than if there was sufficient RAM in the PC.

    RAM addressing

    Over the years there have been many myths, such as ”Windows 98 can’t use more than 128 MB of RAM”, etc. The issue is RAM addressing.
    Below are the three components which each have an upper limit to how much RAM they can address (access):
  • The operating system (Windows).
  • The chipset and motherboard.
  • The processor.
    Windows 95/98 has always been able to access lots of RAM, at least in theory. The fact that the memory management is so poor that it is often meaningless to use more than 256 MB, is something else. Windows NT/2000 and XP can manage gigabytes of RAM, so there are no limits at the moment.
In Windows XP, you press Control+Alt+Delete to open the Task Manager. A window is then displayed with tabs such as Processes and Performance, which provide information on RAM usage:
Under the Processes tab, you can see how much RAM each program is using at the moment. In my case, the image browser FotoAlbum is using 73 MB; Photoshop, 51 MB; etc., as shown in Fig. 143.
    Modern motherboards for desktop use can normally address in the region of 1½-3 GB RAM, and that is more than adequate for most people. Server motherboards with special chipsets can address much more.
    Figur 143. This window shows how much RAM each program is using (Windows XP).
    Standard motherboards normally have a limited number of RAM sockets. If, for example, there are only three, you cannot use any more than three RAM modules (e.g. 3 x 256 MB or 3 x 512 MB).
    CPU’s have also always had an upper limit to how much RAM they can address:
Processor                                            Address bus width   Maximum system RAM
8088, 8086                                           20 bits             1 MB
80286, 80386SX                                       24 bits             16 MB
80386DX, 80486, Pentium, Pentium MMX, K5, K6, etc.   32 bits             4 GB
Pentium Pro, Pentium II, III, Pentium 4              36 bits             64 GB

Fig. 144. The width of the CPU's address bus determines the maximum amount of RAM that can be used.
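The maximum in the right-hand column is simply 2 raised to the address bus width, counted in bytes:

```python
# Maximum addressable RAM = 2 ^ (address bus width), as in Fig. 144.
def max_ram_bytes(address_bits):
    return 2 ** address_bits

print(max_ram_bytes(20))           # -> 1048576 bytes, i.e. 1 MB (8088/8086)
print(max_ram_bytes(24) // 2**20)  # -> 16 (MB, 80286)
print(max_ram_bytes(32) // 2**30)  # -> 4  (GB, 80486/Pentium class)
print(max_ram_bytes(36) // 2**30)  # -> 64 (GB, Pentium Pro and later)
```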
    Let me conclude this examination with a quote. It’s about RAM quantity:
    ”640K ought to be enough for anybody.”
    Bill Gates, 1981.

  • Chapter 22. Chipsets and hubs

    Since 1997, there has been more and more focus on refinement of the chipset, and not least the north bridge, which looks after data transfer to and from RAM. The south bridge has also been constantly developed, but the focus has been on adding new facilities.
    For the north bridge, the development has focused on getting more bandwidth between the RAM and CPU. Let’s look at a few examples of this.

    Bridge or Hub

    Fig. 145. In this architecture (from the Pentium II chipset), the PCI bus connects to the chipset’s two bridges.
    In 1998-99, new developments took place at both AMD and Intel. A new architecture was introduced based on a Memory Controller Hub (MCH) instead of the traditional north bridge and an I/O Controller Hub (ICH) instead of the south bridge. I am using Intel’s names here; the two chips have other names at AMD and VIA, but the principle is the same. The first Intel chipset with this architecture was called i810.
    The MCH is a controller located between the CPU, RAM and AGP. It regulates the flow of data to and from RAM. This new architecture has two important consequences:
  • The connection between the two hubs is managed by a special bus (link channel), which can have a very high bandwidth.
  • The PCI bus comes off the ICH, and doesn’t have to share its bandwidth with other devices.
    The new architecture is used for both Pentium 4 and Athlon processors, and in chipsets from Intel, VIA, and others. In reality, it doesn’t make a great deal of difference whether the chipset consists of hubs or bridges, so in the rest of the guide I will use the two names interchangeably.

    Fig. 146. The MCH is the central part of the i875P chipset.

    The i875P chipset

    In 2003, Intel launched a chipset which works with the Pentium 4 and dual-channel DDR RAM, each channel running at 200 MHz. This chipset became very popular because of its very good performance.

    Fig. 147. The architecture surrounding the Intel® 82875P Memory Controller Hub (MCH).
    Another new feature in this chipset is that a Gigabit Ethernet controller can have direct access to the MCH (the north bridge). Traditionally, the LAN controller is connected to the PCI bus. But since a gigabit network controller can consume a great deal of bandwidth, it is better to connect it directly to the north bridge. This new design is called Communication Streaming Architecture (CSA).

    Fig. 148. Report from the freeware program CPU-Z.

    The i925 chipset

    In late 2004 Intel introduced a new 900-series of chipsets. They were intended for the new generation of Pentium 4 and Celeron processors based on the LGA 775 socket (as in Fig. 112 on page 44). The chipsets come with support for the PCI Express bus, which is replacing the AGP bus, and with support for DDR2 RAM:

    Fig. 149. The new chipset architecture, where the north bridge has become a hub. Here the Intel i925 chipset.
    By making use of dual channel DDR2 RAM, a bandwidth of up to 8.5 GB/sec is achieved.
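    The 8.5 GB/sec figure can be reproduced from the raw numbers: two 64-bit channels, double data rate, at a 266.67 MHz clock. A small sketch (the function name is my own):

```python
def memory_bandwidth_mb(clock_mhz, transfers_per_clock, bus_width_bits, channels):
    """Peak theoretical memory bandwidth in MB/sec (MHz x bytes = MB/sec)."""
    bytes_per_transfer = bus_width_bits // 8
    return clock_mhz * transfers_per_clock * bytes_per_transfer * channels

# i875P: dual-channel DDR400 – 200 MHz clock, double data rate, 64-bit modules
print(memory_bandwidth_mb(200, 2, 64, 2))     # 6400 MB/sec
# i925: dual-channel DDR2-533 – 266.67 MHz clock, double data rate
print(memory_bandwidth_mb(266.67, 2, 64, 2))  # ~8533 MB/sec, i.e. the 8.5 GB/sec above
```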

    Big bandwidth for RAM

    One might be tempted to think that the bandwidth to the RAM ought to be identical with that of the system bus. But that is not the case. It would actually be good if it was higher. That’s because the RAM doesn’t only deliver data to the CPU. Data also goes directly to the graphics port and to and from the I/O devices – bypassing the CPU. RAM therefore needs even greater bandwidth. In future architectures we will see north bridges for both Pentium 4 and Athlon XP processors which employ more powerful types of RAM, such as 533 MHz DDR2.

    Fig. 150. In reality, the RAM needs greater bandwidth than the CPU.

  • Chapter 23. Data for the monitor

    I have several times mentioned the AGP port, which is directly connected to the CPU and RAM. It is a high-speed port used for the video card, which has its own RAM access.
    Fig. 151. AGP is a high-speed bus for the video card, developed by Intel.

    About screens and video cards

    As users, we communicate with PC programs via the screen. The screen shows us a graphical representation of the software which is loaded and active in the PC. But the screen has to be fed data in order to show a picture. This data comes from the video card, which is a controller.
    Screens can be analogue (the traditional, big and heavy CRT monitors) or digital devices, like the modern, flat TFT screens. Whatever the case, the screen has to be controlled by the PC (and ultimately by the CPU). This screen control takes place using a video card.
    A video card can be built in two ways:
  • As a plug-in card (an adapter).
  • Integrated in chips on the motherboard.
    Finally, the video card can be connected to the PCI bus, the AGP bus, or the PCI Express x16 bus.

    Big bandwidth

    Traditionally, the video card (also called the graphics card) was connected as an I/O device. This meant that data for the video card was transferred using the same I/O bus which looks after data transfer to and from the hard disks, network card, etc.
    In the late 1990’s, the demands placed on the graphics system increased dramatically. This was especially due to the spread of the many 3D games (like Quake, etc.). These games require enormous amounts of data, and that places demands on the bus connection. At that time, the video card was connected to the PCI bus, which has a limited bandwidth of 133 MB/sec. The same bus, as just mentioned, also looks after the hard disk and other I/O devices, which all need bandwidth. The PCI bus therefore became a bottleneck, and the solution to the problem was to release the video card from the I/O bus.
    Fig. 152. The data path to the video card before the AGP standard.

    AGP

    AGP (Accelerated Graphics Port) is a special I/O port which is designed exclusively for video cards. AGP was developed by Intel. The AGP port is located close to the chipset’s north bridge.
    Fig. 153. The AGP slot can be seen on this motherboard, left of the three PCI slots.
    The video card port was moved from the south to the north bridge. The new architecture gives optimal access to RAM and hence to the bandwidth which the 3D games require.
    At the same time, the PCI system is spared from the large amount of graphic data traffic to and from the video card. It can now focus on the other intensive transfer tasks, such as transfer to and from the network adapter and disk drive.
    Fig. 154. AGP Video card from ATI.

    Technical details

    AGP consists of a number of different technical elements, of which I will highlight two:
  • A bus structure built around a double-clocked PCI bus.
  • The ability to use the motherboard RAM as a texture cache.
    The texture cache is used by games, and by giving access to the motherboard RAM, less RAM is needed on the cards.
    The AGP bus is actually a 66 MHz variant of the 32-bit PCI bus. You can also see that, on the surface, the motherboard’s AGP slot looks a fair bit like a PCI slot. But it is placed in a different position on the motherboard, to avoid confusion (see Fig. 156). The slot also has a different colour.
    The first version of AGP was 1X, with a bandwidth of 254 MB/sec. But AGP was quickly released in a new mode, called 2X, with 508 MB/sec.
    Later came 4X and 8X, which are the standards today. This involves further clock doubling, just as we have seen, for example, with DDR RAM: four or eight data packets are sent for each clock pulse. In this way, a bandwidth of 2,032 MB/sec has been reached.
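    The progression of AGP modes is a simple doubling series starting from the 1X figure quoted above; a quick sketch:

```python
# AGP bandwidth doubles with each mode. The base figure (1X) is one 32-bit
# packet per 66 MHz clock pulse, quoted above as 254 MB/sec (binary megabytes).
BASE_1X_MB = 254

agp_bandwidth = {mode: BASE_1X_MB * mode for mode in (1, 2, 4, 8)}
for mode, mb in agp_bandwidth.items():
    print(f"AGP {mode}X: {mb} MB/sec")
# AGP 8X gives 2,032 MB/sec – the figure quoted above
```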

    Texture cache and RAMDAC

    Textures are things like backgrounds in games. They can be uploaded directly from RAM to the video card. The system is called DIME (Direct Memory Execute). This allows the video card memory to be extended using standard RAM on the motherboard. In Fig. 155 you can see the system shown graphically.
    In this figure you can also see the RAMDAC device. This is a chip on the video card which looks after the “translation” of digital data into analogue signals, when the card is connected to an analogue screen. The RAMDAC is a complete little processor in itself; the higher its clock frequency, the higher the refresh rate with which the card can supply the screen image.
    Fig. 155. The AGP bus provides direct access to RAM.
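    As a rough illustration of the RAMDAC relationship described above: a common rule of thumb estimates the required pixel clock as resolution times refresh rate, plus roughly a third extra for the blanking intervals. The 1.32 factor below is that rule-of-thumb assumption, not a figure from this guide:

```python
def required_ramdac_mhz(h_pixels, v_pixels, refresh_hz, blanking_overhead=1.32):
    """Rough RAMDAC (pixel clock) estimate in MHz; ~32% extra for blanking."""
    return h_pixels * v_pixels * refresh_hz * blanking_overhead / 1e6

print(round(required_ramdac_mhz(1024, 768, 85)))   # roughly 88 MHz
print(round(required_ramdac_mhz(1600, 1200, 85)))  # roughly 215 MHz
```

    This is why cards aimed at high resolutions and refresh rates advertise RAMDACs of 300 MHz and more.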

    Video card on PCI Express

    With the new PCI Express bus, we get a new system for the video card. Replacing AGP, the PCI Express x16 bus offers a transfer rate of 8 GB/sec, which leaves plenty of room for even the most graphics-intensive PC games.
    Fig. 156. The black PCI Express x16 slot to the left is for the video card.

    The PC’s I/O system

    There are a lot of I/O ports in the PC’s architecture, with their associated I/O devices and standards. I/O stands for Input/Output, and these ports can both send and receive data from the processor and RAM.

    The I/O system provides flexibility

    The use of I/O devices has contributed to making the PC an incredibly flexible machine. Computers can be used for anything from normal office tasks, processing text and numbers, to image processing using scanners and cameras, to producing video, light and music.
    The PC can also be used industrially. In 1987-88 I worked in a company that produced special PC’s which could control the production of concrete. This was achieved using special I/O cards which could monitor the weighing of sand, gravel, cement and water. The core of the system was a standard office PC from Olivetti.
    This particularly flexible architecture is based on an I/O system which can be extended almost without limit. This is one place we really see the PC’s open architecture: any engineer or technician can, in principle, develop their own plug-in cards and other special devices, if they just meet one of the I/O standards. The opportunities for extension really are unlimited!
    In the following chapters we will look at the various I/O buses which link the PC’s other devices with the CPU and RAM.

  • Chapter 24. Intro to the I/O system

    During the last 10-15 years we have seen numerous technological innovations, the goal of which has been to increase the amount of traffic in the PC. This increase in traffic has taken place in the motherboard – with the system bus at the centre.
    But the high clock frequencies and the large capacity for data transfer also affect the I/O system. Demands are being made for faster hard disks and greater bandwidth to and from external devices such as scanners and cameras. This has led to ongoing development of the I/O controllers in the south bridge.
    Fig. 157. The south bridge connects a large number of different devices with the CPU and RAM.
    You can work out the capacity of a data bus based on data width and clock frequency. Here is a brief comparison between the system bus and the I/O buses:



    Bus              The north bridge’s buses               The I/O buses
    Variants         FSB, RAM, AGP,                         ISA, PCI, PCI Express,
                     PCI Express X16, CSA                   USB, ATA, SCSI, FireWire
    Connects         CPU, RAM, video, Ethernet              All other devices
    Clock freq.      66 - 1066 MHz                          Typically 10-33 MHz
    Max. capacity    > 3 GB/sec                             Typically 20-500 MB/sec per bus
    Fig. 158. The system bus is much faster than any other bus.
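    The rule stated above (capacity = data width times clock frequency) can be sketched as follows; the quad-pumped FSB example is my own addition:

```python
def bus_capacity_mb(width_bits, clock_mhz, transfers_per_clock=1):
    """Theoretical capacity in MB/sec: bytes per transfer x clock x transfers."""
    return (width_bits // 8) * clock_mhz * transfers_per_clock

print(bus_capacity_mb(32, 33))      # PCI: 132 MB/sec
print(bus_capacity_mb(64, 200, 4))  # e.g. a quad-pumped 800 MHz FSB: 6400 MB/sec
```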

    Function of the I/O buses

    As I already mentioned, the south bridge was introduced back in 1987 (see Fig. 119 on page 49). The reason for this was that the I/O devices couldn’t keep up with the high clock frequencies which the CPU and RAM could work at. There was electrical noise and other problems on the motherboard when the signals had to be carried to the plug-in cards, etc. Very few plug-in cards work at anything higher than 40 MHz – the electronics can’t cope; the chips and conductors simply can’t react that quickly. The I/O speed therefore had to be scaled down in relation to the system bus.
    Since then, a number of different standards for I/O buses have been developed, which all emanate from the south bridge controllers. These include the older ISA, MCA, EISA and VL buses, and the current PCI, PCI Express and USB buses. The differences lie in their construction and architecture – in their connection to the motherboard.

    I/O devices

    The I/O buses run all over the motherboard, and connect a large number of different I/O devices. The south bridge controllers connect all the I/O devices to the CPU and RAM. Fig. 159 shows a brief overview of the various types of I/O devices. Note that AGP is not included, as it is tied to the north bridge.
    Name                  Devices
    KBD, PS2, FDC, Game   Keyboard, mouse, floppy disk drive, joystick, etc.
    ROM, CMOS             BIOS, setup, POST.
    ATA                   Hard disk, CD-ROM/RW, DVD, etc.
    PCI and PCI Express   Network cards, SCSI controllers, video grabber cards, sound cards and lots of other adapters.
    USB                   Mouse, scanner, printers, modem, external hard disks and much more.
    FireWire              Scanner, DV camera, external hard disk, etc.
    SCSI                  Hard disks, CD-ROM drives, scanners, tape devices, etc. (older)
    LPT, COM              Parallel and serial devices such as printers, modems, etc.
    Fig. 159. Various types of I/O devices. The last two are not used much anymore.

    The south bridge combines many functions

    Originally, the various I/O devices could have their own controller mounted on the motherboard. If you look at a motherboard from the 1980’s, there are dozens of individual chips – each with a particular function. As the years have passed, the controller functions have been gathered together into fewer and larger chips. The modern south bridge is now a large multi-controller, combining a number of functions previously managed by independent chips.
    The south bridge is normally supplemented by a small Super I/O controller which takes care of a number of less critical functions that also used to be allotted to separate controller chips in the past. The Super I/O controller will be described in more detail later. It used to be connected to the south bridge via the ISA bus; in modern architectures the LPC (Low Pin Count) interface is used:
    Fig. 160. The south bridge is part of the chipset, but is supplemented by the small Super I/O controller.

    Several types of I/O bus

    Throughout the years, several different I/O buses have been developed. These are the proper I/O buses:
  • The ISA bus – an old, low-speed bus which is not used much any more.
  • The MCA, EISA and VL buses – faster buses which are also not used any more.
  • The PCI bus – a general I/O bus used in more modern PC’s.
  • The PCI Express – the most modern bus.
    These ”real” I/O buses are designed for mounting plug-in cards (adapters) inside the actual PC box. They therefore connect to a series of sockets (slots) on the motherboard:

    Fig. 161. Plug-in cards are mounted in the I/O slots on the motherboard.
    The various I/O buses have gradually replaced each other throughout the PC’s history. Motherboards often used to have several buses, but today basically only the PCI bus is used for installing adapters such as network cards, etc.

    Fig. 162. The features of this motherboard are all determined by the choice of chipset. Here the south bridge delivers many nice features.

    Other types of bus

    The PC always needs a low-speed bus, used for the less intensive I/O devices. For many years that was the job of the ISA bus, but it has been replaced today by USB (Universal Serial Bus). USB is not a traditional motherboard bus, as it doesn’t have slots for mounting plug-in cards. With USB, external devices are connected in a series. More on this later.
    SCSI and FireWire are other types of high-speed bus which give the PC more expansion options. They are not part of the standard PC architecture, but they can be integrated into any PC. This is normally done using a plug-in card for the PCI bus, on which the SCSI or FireWire controller is mounted. Thus the two interfaces draw on the capacity of the PCI bus:

    Fig. 163. SCSI and FireWire controllers are both normally connected to the PCI bus.
    Finally, I should also mention the ATA hard disk interface, which is not normally called a bus, but which really could be called one. The ATA interface is only used for drives, and the standard configuration allows four devices to be connected directly to the motherboard, where the hard disk’s wide cable, for example, fits inside an ATA connector.

    Fig. 164. The ATA interface works like a bus in relation to the south bridge.


  • Chapter 25. From ISA to PCI Express

    From about 1984 on, every PC had a standard bus which was used for I/O tasks. That was the ISA (Industry Standard Architecture) bus.
    Right up until about 1999 there were still ISA slots in most PC’s. In the later years, however, they were only kept for compatibility, so that plug-in cards of the old ISA type could be re-used. This was particularly the case for Sound Blaster sound cards; they worked quite well on the ISA bus, and many games were programmed to directly exploit this type of hardware. It therefore took many years to get away from the ISA bus, but we have now managed it.
    Fig. 165. Motherboard from 1998 with three (black) ISA slots and four (white) PCI slots.

    History of the ISA bus

    The ISA bus is thus the I/O bus which survived the longest. Here is some information about it:
    ISA was an improvement of IBM’s original XT bus (which was only 8 bits wide). IBM used the protected name ”AT Bus”, but in everyday conversation it was called the ISA bus.
    The ISA bus was 16 bits wide, and ran at a maximum clock frequency of 8 MHz.
    The bus has a theoretical bandwidth of about 8 MB per second. In practice, however, it never exceeded about 1-2 MB/sec – partly because it takes 2-3 of the processor’s clock pulses to move a packet (16 bits) of data.
    Fig. 166. The ISA bus is not used much today, but it had enormous significance in the years up until the middle of the 1990’s.
    The ISA bus had two ”faces” in the old PC architecture:
  • An internal ISA bus, which the simple ports (the keyboard, floppy disk and serial/parallel ports) were connected to.
  • An external expansion bus, to which 16-bit ISA adapters could be connected.

    Sluggish performance

    One of the reasons the ISA bus was slow was that it only had 16 data channels. The 486 processor, once it was introduced, worked with 32 bits each clock pulse. When it sent data to the ISA bus, these 32-bit packets (dwords, or double words) had to be split into two 16-bit packets (two words), which were sent one at a time, and this slowed down the flow of data.
    Bus    Time per packet    Amount of data per packet
    ISA    375 ns             16 bits
    PCI    30 ns              32 bits
    Fig. 167. The PCI bus was a huge step forward.
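    The word-splitting described above is plain bit arithmetic; a small sketch:

```python
def split_dword(dword):
    """Split a 32-bit value (dword) into the two 16-bit words the ISA bus needs."""
    low_word = dword & 0xFFFF            # bits 0-15
    high_word = (dword >> 16) & 0xFFFF   # bits 16-31
    return low_word, high_word

low, high = split_dword(0x12345678)
print(hex(low), hex(high))  # 0x5678 0x1234 – two bus cycles instead of one
```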
    The ISA bus was not ”intelligent” either, since in principle it was the CPU which controlled all the work the bus was doing. This meant the CPU could only begin a new task when the transfer was complete. You may have experienced this yourself: when your PC works with the floppy disk drive, the rest of the PC virtually grinds to a halt. That is the ISA bus’s fault, and it therefore only happens on older PC’s.
    These small delays are called wait states. If an ISA adapter cannot keep up with the data it is receiving, it sends wait states to the CPU. These are signals to the CPU telling it to do nothing. A wait state is a wasted clock pulse – the CPU skips over a clock pulse, without doing anything. Thus a slow ISA adapter could choke any PC.
    Another problem was that the ISA bus often played up when you were installing an expansion card (e.g. a sound card). Many of the problems were associated with the handling of IRQ’s and DMA channels (I will explain these terms later), which often had to be done ”by hand” with the old ISA bus.
    Every device takes up one particular IRQ, and possibly a DMA channel, and conflicts often arose with other devices. It was a big relief when Intel, in the late 1990’s, finally dropped the ISA bus and replaced it with the smart USB bus.
    Fig. 168. ISA based Sound Blaster sound card.

    The MCA, EISA and VL buses

    The ISA bus was too slow, and the solution was to develop new standards for I/O devices. In 1987-88, two new I/O buses were put forward. First, IBM brought out their technologically advanced MCA bus. But since it was patented, a number of other companies pooled together to create the equivalent EISA bus.
    But neither MCA nor EISA had a big impact on the clone market. We were stuck with the ISA bus up until 1993, when the VL bus finally became reasonably widespread. It was a genuine local bus, which means it worked synchronously with the system bus. The VL bus was very primitive; it was really just an extension of the system bus.
    The VL bus never managed to have a big impact, because almost at the same time, the robust and efficient PCI bus broke through. The various I/O buses are summarised below:
    Bus                           Description
    PC-XT, from 1981              Synchronous 8-bit bus which followed the CPU clock frequency of 4.77 or 6 MHz. Bandwidth: 4-6 MB/sec.
    ISA (PC-AT), from 1984        Simple, cheap I/O bus. Synchronous with the CPU. Bandwidth: 8 MB/sec.
    MCA, from 1987                Advanced I/O bus from IBM (patented). Asynchronous, 32-bit, at 10 MHz. Bandwidth: 40 MB/sec.
    EISA, from 1988               Advanced I/O bus (non-IBM), used especially in network servers. Asynchronous, 32-bit, at 8.33 MHz. Bandwidth: 32 MB/sec.
    VESA Local Bus, from 1993     Simple, high-speed I/O bus. 32-bit, synchronised with the CPU’s clock frequency: 33, 40, 50 MHz. Bandwidth: up to 160 MB/sec.
    PCI, from 1993                Advanced, general, high-speed I/O bus. 32-bit, asynchronous, at 33 MHz. Bandwidth: 133 MB/sec.
    USB and FireWire, from 1998   Serial buses for external equipment.
    PCI Express, from 2004        A very high-speed serial bus for I/O cards. Replaces PCI and AGP. 500 MB/sec per channel.
    Fig. 169. The PC’s I/O buses throughout the years.

    The PCI bus

    PCI stands for Peripheral Component Interconnect. The bus is an Intel product which is used in all PC’s today, and also in other computers, as the PCI bus is processor independent. It can be used with all 32-bit and 64-bit processors, and is therefore found in many different computer architectures.
    Fig. 170. PCI bus adapter.
    At the same time, the bus is compatible with the ISA bus to a certain extent, since PCI devices can react to ISA bus signals, create the same IRQ’s etc. One consequence of this was that Sound Blaster compatible sound cards could be developed, which was very important in the middle of the 1990’s. In optimal conditions, the PCI bus sends one packet of data (32 bits) each clock pulse. The PCI bus therefore has a maximum bandwidth of 132 MB per second, as shown below:
    Clock frequency:  33 MHz
    Bus width:        32 bits
    Bandwidth:        32 bits x 33,333,333 clock pulses/second =
                      4 bytes x 33,333,333 clock pulses/second =
                      132 MB per second
    Fig. 171. The maximum bandwidth of the PCI bus.
    There are also more powerful versions of the PCI standard which provide greater bandwidth, but most motherboards still use the original version. The PCI bus has a buffer which operates between the CPU and the peripheral devices (a kind of cache RAM). This allows the CPU to deliver its data to the buffer and then perform other tasks; the bus looks after the rest of the delivery itself, at its own pace. Alternatively, PCI adapters can also deliver data to the buffer, whether or not the CPU has time to process it. The data simply waits in a queue until there is room on the system bus, which then relays it to the CPU.
    As a result of all this, the peripheral PCI devices operate asynchronously – at their own pace – in relation to the CPU. Thus the PCI bus (in contrast to the VL bus) is not a local bus from a technical perspective.
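    The buffering described above can be caricatured as a simple queue. All names here are hypothetical, and a real bridge does this in hardware, not software; the sketch only illustrates why the CPU and the PCI devices can run at their own pace:

```python
from collections import deque

buffer = deque()  # stands in for the PCI bridge's buffer

def cpu_write(packet):
    # The CPU is done as soon as the packet is buffered; it can move on.
    buffer.append(packet)

def deliver_to_device(packet):
    print("delivered:", packet)

def bus_drain():
    # The bus empties the queue at its own clock, independently of the CPU.
    while buffer:
        deliver_to_device(buffer.popleft())

for p in ("pkt1", "pkt2", "pkt3"):
    cpu_write(p)
bus_drain()
```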
    Fig. 172. The PCI bus is being refined by a Special Interest Group. You can follow their progress yourself on the Net (www.pcisig.com).

    Plug and Play

    The Plug and Play standard is part of the PCI specification. It means that all PCI adapter cards are self-configuring. The specification for Plug and Play was developed by Microsoft and Intel, among others, and the idea was (as the name suggests) to provide a system where one can simply install an adapter and it will work. It is not quite that simple in practice; a software driver normally has to be installed before an adapter will work. But the actual cooperation between adapter, motherboard and operating system happens automatically. During startup, communication takes place between the PC’s startup programs, the PCI controller and each PCI device (adapter).
    The adapter has to be able to inform the I/O bus which I/O addresses and IRQ’s it can operate with. And it has to be able to configure itself to use the resources allocated to it by the I/O bus. When the exercise is successful, the adapter is configured automatically, and is ready to be used by the operating system.
    All the components involved (adapter, motherboard and Windows) have to be Plug and Play compatible for the system to work.
    Fig. 173. Schematic overview of Plug and Play.

    ESCD

    The ESCD (Extended System Configuration Data) store is used to save adapter configuration information. This means the motherboard doesn’t have to go through the whole Plug and Play operation at each startup – information about the PC’s configuration can simply be read from the CMOS storage.
    Fig. 174. Using the CMOS setup program, the user can directly allocate resources for PCI adapters.

    See the devices during startup

    All I/O devices have small, built-in registers (in ROM circuits), which contain various information. The registers can, for example, describe the nature of the device (e.g. a network card, video card or SCSI controller, etc.) and the manufacturer. You can see this information for yourself during PC startup, when the devices are configured:
    Fig. 175. The PCI devices ”identify themselves” during startup as they are configured.

    PCI Express development

    In 2004 a new PCI bus was introduced. The PCI Special Interest Group (see www.pcisig.com) consists of the most important companies (Intel, IBM, Apple, etc.), who coordinate and standardize the bus via this forum. PCI Express is the successor to the PCI bus. It is a completely new type of I/O bus using a serial connection (like the USB, FireWire and SATA buses).
    This new I/O bus will be extremely scalable, as it works with a large number of channels (X1, X2, X16 etc.), each of which has a bandwidth of about 250 MB/second in both directions, simultaneously.
    The standard provides for plug-in cards and devices in various categories, with varying bandwidths and power consumption. A 16X video card, for example, will be able to pull a total of about 8 GB/sec.
    PCI Express is based on a serial architecture, making it possible to develop smaller and cheaper devices with far fewer pins. The new I/O bus will initially co-exist with the PCI interface, as we see in motherboards with Intel’s i900-series chipsets. But the goal is that PCI Express should replace both PCI and AGP.
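    The lane arithmetic behind these figures can be sketched directly, using the 250 MB/sec per-direction figure quoted above:

```python
LANE_MB_PER_DIRECTION = 250  # first-generation PCI Express, as quoted above

def pcie_bandwidth_mb(lanes, both_directions=True):
    """Aggregate bandwidth of an X1/X2/X16 link in MB/sec."""
    return lanes * LANE_MB_PER_DIRECTION * (2 if both_directions else 1)

print(pcie_bandwidth_mb(1))   # X1:  500 MB/sec, the per-channel figure above
print(pcie_bandwidth_mb(16))  # X16: 8000 MB/sec, roughly the 8 GB/sec quoted above
```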
    Fig. 176. The two black slots are for PCI Express cards: to the left a 16X slot for the graphics controller, and to the right a smaller 1X slot. In between you see two traditional PCI slots.

  • Chapter 26. The CPU and the motherboard

    The heart and soul of the PC’s data processing is the CPU. But the processor is not alone in the world, it communicates with the rest of the motherboard. There will be many new terms introduced in the following sections, so remember that you can find definitions for all the abbreviations in the back of the guide.

    Busses do the transfers

    Data packets (of 8, 16, 32, 64 or more bits at a time) are constantly being moved back and forth between the CPU and all the other components (RAM, hard disk, etc.). These transfers are all done using busses.
    The motherboard is designed around some very powerful data channels (or pathways, as they are also called). It is these busses which connect all the components to each other.
    Figure 41. The busses are the data channels which connect the PC’s components together. Some are designed for small transfers, others for large ones.

    Busses with varying capacities

    There is not just one bus on a motherboard; there are several. But they are all connected, so that data can run from one to another, and hence reach the farthest corners of the motherboard.
    We can say that a bus system is subdivided into several branches. Some of the PC components work with enormous amounts of data, while others manage with much less. For example, the keyboard only sends very few bytes per second, whereas the working storage (RAM) can send and receive several gigabytes per second. So you can’t attach RAM and the keyboard to the same bus.
    Two busses with different capacities (bandwidths) can be connected if we place a controller between them. Such a controller is often called a bridge, since it functions as a bridge between the two different traffic systems.
    Figure 42. Bridges connect the various busses together.
    The entire bus system starts close to the CPU, where the load (traffic) is greatest. From here, the busses work outwards towards the other components. Closest to the CPU we find the working storage. RAM is the component which has the very greatest data traffic, and is therefore connected directly to the CPU by a particularly powerful bus. It is called the front side bus (FSB) or (in older systems) the system bus.
    Figure 43. The PC’s most important bus looks after the “heavy” traffic between the CPU and RAM.
    The busses connecting the motherboard to the PC’s peripheral devices are called I/O busses. They are managed by the controllers.

    The chipset

    The motherboard’s busses are regulated by a number of controllers. These are small circuits which have been designed to look after a particular job, like moving data to and from EIDE devices (hard disks, etc.).
    A number of controllers are needed on a motherboard, as there are many different types of hardware devices which all need to be able to communicate with each other. Most of these controller functions are grouped together into a couple of large chips, which together comprise the chip set.
    Figure 44. The two chips which make up the chipset, and which connect the motherboard’s busses.
    The most widespread chipset architecture consists of two chips, usually called the north and south bridges. This division applies to the most popular chipsets from VIA and Intel. The north bridge and south bridge are connected by a powerful bus, which sometimes is called a link channel:
    Figure 45. The north bridge and south bridge share the work of managing the data traffic on the motherboard.

    The north bridge

    The north bridge is a controller which controls the flow of data between the CPU and RAM, and to the AGP port.
    In Fig. 46  you can see the north bridge, which has a large heat sink attached to it. It gets hot because of the often very large amounts of data traffic which pass through it. All around the north bridge you can see the devices it connects:
    Figure 46. The north bridge and its immediate surroundings. A lot of traffic runs through the north bridge, hence the heat sink.
    The AGP is actually an I/O port. It is used for the video card. In contrast to the other I/O devices, the AGP port is connected directly to the north bridge, because it has to be as close to the RAM as possible. The same goes for the PCI Express x16 port, which is the replacement of AGP in new motherboards. But more on that later.

    The south bridge

    The south bridge incorporates a number of different controller functions. It looks after the transfer of data to and from the hard disk and all the other I/O devices, and passes this data into the link channel which connects to the north bridge.
    In Fig. 44 you can clearly see that the south bridge is physically located close to the PCI slots, which are used for I/O devices.
    Figure 47. The chipset’s south bridge combines a number of controller functions into a single chip.

    The various chipset manufacturers

    Originally it was basically only Intel who supplied the chipsets to be used in motherboards. This was quite natural, since Intel knows everything about their own CPU’s and can therefore produce chipsets which match them. But at the time the Pentium II and III came out, other companies began to get involved in this market. The Taiwanese company, VIA, today produces chipsets for both AMD and Intel processors, and these are used in a large number of motherboards.
    Other companies (like SiS, nVidia, ATI and ALi) also produce chipsets, but these haven’t (yet?) achieved widespread use. The CPU manufacturer, AMD, produces some chipsets for their own CPU’s, but they also work together closely with VIA as the main supplier for Athlon motherboards.
    Figure 48. The Taiwanese company, VIA, has been a leader in the development of new chipsets in recent years.
    Since all data transfers are managed by the chipset’s two bridges, the chipset is the most important individual component on the motherboard, and new chipsets are constantly being developed.
    The chipset determines the limits for clock frequencies, bus widths, etc. The chipset’s built-in controllers are also responsible for connecting I/O devices like hard disks and USB ports; thus the chipset also determines, in practice, which types of devices can be connected to the PC.
    Figure 49. The two chips which make up a typical chipset. Here we have VIA’s model P4X266A, which was used in early motherboards for Pentium 4 processors.

    Sound, network, and graphics in chipsets

    Developments in recent years have led chipset manufacturers to attempt to place more and more functions in the chipset.
    These extra functions are typically:
  •         Video card (integrated into the north bridge)
  •         Sound card (in the south bridge)
  •         Modem (in the south bridge)
  •         Network and Firewire (in the south bridge)
    All these functions have traditionally been managed by separate devices, usually plug-in cards, which connect to the PC. But it has been found that these functions can definitely be incorporated into the chipset.
    Figure 50. Motherboard with built-in sound functionality.
    Intel has, for many years, managed to produce excellent network cards (Ethernet 10/100 Mbps); so it is only natural that they should integrate this functionality into their chipsets.
    Sound facilities in a chipset cannot be compared with “real” sound cards (like, for example, Sound Blaster Audigy). But the sound functions work satisfactorily if you only want to connect a couple of small speakers to the PC, and don’t expect perfect quality.
    Figure 51. This PC has two sound cards installed, as shown in this Windows XP dialog box. The VIA AC’97 is a sound card emulation which is built into the chipset.
    Many chipsets also come with a built-in video card. The advantage is clear; you can save having a separate video card, which can cost $100 or more.
    Again, the quality can’t be compared with what you get with a separate, high quality, video card. But if you don’t particularly need support for multiple screens, DVI (for flat screens), super 3D performance for games, or TV-out, the integrated graphics controller can certainly do the job.
    Figure 52. This PC uses a video card which is built into the Intel i810 chipset.
    It is important that the integrated sound and graphics functions can be disabled, so that you can replace them with a real sound or video card. The sound functions won’t cause any problems; you can always ask Windows to use a particular sound card instead of another one.
    But the first Intel chipset with integrated graphics (the i810) did not allow for an extra video card to be installed. That wasn’t very smart, because it meant users were locked into using the built-in video card. In the subsequent chipset (i815), the problem was resolved.

    Buying a motherboard

    If you want to build a PC yourself, you have to start by choosing a motherboard. It is the foundation for the entire PC.
    Most of the motherboards on the market are produced in Taiwan, where manufacturers like Microstar, Asus, Epox, Soltek and many others supply a wide range of different models. Note that a producer like Microstar supplies motherboards to brand name manufacturers like Fujitsu-Siemens, so you can comfortably trust in the quality. Taiwan is the leader in the area of motherboards.
    The first issue to work out is, which CPU you want to use. For example, if you want to use a Pentium 4 from Intel, there is one line of motherboards you can choose between. If you choose an AthlonXP, there is another line. And the difference lies in which chipset is being used in the motherboard.
    Figure 53. A typical (technical) advertisement for a motherboard.
    Once you have decided on a processor, you should try to get a motherboard with the latest chipset available, because new versions of chipsets continue to be released, with greater functionality. At the time of writing, for example, chipsets often include these functions:
  •         USB version 2.0.
  •         Dual channel RAM.
  •         Support for the latest RAM like DDR2.
  •         Integrated Firewire ports.
  •         Serial ATA.
  •         Surround sound.
  •         Gigabit Ethernet.
    You will most likely want to have these facilities (which are described later in the guide) on your PC. That is why it is important to choose the right motherboard with the latest generation chipset.

    Extra facilities

  •         Built-in RAID or (seldom) SCSI controller.
  •         Other network, screen and sound facilities.
  •         Wireless LAN.
  •         SmartCard/MemoryStick/etc. readers.
    One of the advantages of building your own PC is that you can choose a really exciting motherboard.
    Development is taking place rapidly, and by choosing the right motherboard, you can design the absolute latest PC on the market.
    You can also find hundreds of articles on the Internet about each motherboard and chipset. So I can comfortably recommend you build your own PC, as long as you do your homework first! Make sure you read the rest of the guide before you start choosing a new motherboard!

  • Chapter 27. Inside and around the CPU

    In this and the following chapters, I will focus on a detailed look at the CPU. One of the goals is to help you understand why manufacturers keep releasing new and more powerful processors. In order to explain that, we will have to go through what will at times be a quite detailed analysis of the CPU’s inner workings.
    Some of the chapters will probably be fairly hard to understand; I have spent a lot of time myself on my “research”, but I hope that what I present in these chapters will shed some light on these topics.
    Naturally, I will spend most of my time on the latest processors (the Athlon XP and Pentium 4). But we need to examine their internal architectures in light of the older CPU architectures, if we want to understand them properly. For this reason I will continually make comparisons across the various generations of CPU’s.

    Inside the CPU

    I will now take you on a trip inside the CPU. We will start by looking at how companies like Intel and AMD can continue to develop faster processors.

    Two ways to greater speed

    Of course faster CPU’s are developed as a result of hard work and lots of research. But there are two quite different directions in this work:
  • More power and speed in the CPU, for example, from higher clock frequencies.
  • Better exploitation of existing processor power.
    Both approaches are used. It is a well-known fact that bottlenecks of various types drain the CPU of up to 75 % of its power. So if these can be removed or reduced, the PC can become significantly faster without having to raise the clock frequency dramatically.
    It’s just that it is very complicated to remove, for example, the bottleneck surrounding the front side bus, which I will show you later. So the manufacturers are forced to continue to raise the working rate (clock frequency), and hence to develop new process technology, so that CPU’s with more power can come onto the market.

    Clock frequencies

    If we look at a CPU, the first thing we notice is the clock frequency. All CPU’s have a working speed, which is regulated by a tiny crystal.
    The crystal is constantly vibrating at a very large number of “beats” per second. For each clock tick, an impulse is sent to the CPU, and each pulse can, in principle, cause the CPU to perform one (or more) actions.
    Figure 54. The CPU’s working speed is regulated by a crystal which “oscillates” millions of times each second.
    The number of clock ticks per second is measured in Hertz. Since the CPU’s crystal vibrates millions of times each second, the clock speed is measured in millions of oscillations (megahertz or MHz). Modern CPU’s actually have clock speeds running into billions of ticks per second, so we have started having to use gigahertz (GHz).
    These are unbelievable speeds. See for yourself how short the period of time is between individual clock ticks at these frequencies. We are talking about billionths of a second:
    Clock frequency | Time period per clock tick
    133 MHz | 0.000 000 007 500 seconds
    1200 MHz | 0.000 000 000 830 seconds
    2 GHz | 0.000 000 000 500 seconds
    Figure 55. The CPU works at an incredible speed.
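    These periods follow from a one-line calculation: the duration of a tick is simply the reciprocal of the clock frequency. A minimal sketch in Python (not from the guide):

    ```python
    MHZ = 1_000_000

    def tick_period(hz):
        """Return the duration of a single clock tick, in seconds."""
        return 1.0 / hz

    print(tick_period(133 * MHZ))    # roughly 7.5 nanoseconds
    print(tick_period(1200 * MHZ))   # roughly 0.83 nanoseconds
    print(tick_period(2000 * MHZ))   # 0.5 nanoseconds
    ```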
    The trend is towards ever increasing clock frequencies. Let’s take a closer look at how this is possible.

    More transistors

    New types of processors are constantly being developed, for which the clock frequency keeps getting pushed up a notch. The original PC from 1981 worked at a modest 4.77 MHz, whereas the clock frequency 20 years later was up to 2 GHz.
    In Fig. 56 you can see an overview of the last 20 years of development in this area. The table shows the seven generations of Intel processors which have brought about the PC revolution. The latest version of the Pentium 4 is known under the code name Prescott.
    Gen. | CPU | Year intr. | Clock frequency | No. of transistors
    1 | 8088 | 1979 | 4.77-8 MHz | 29,000
    2 | 80286 | 1982 | 6-12.5 MHz | 134,000
    3 | 80386 | 1985 | 16-33 MHz | 275,000
    4 | 80486 | 1989 | 25-100 MHz | 1,200,000
    5 | Pentium | 1993 | 60-200 MHz | 3,100,000
    5 | Pentium MMX | 1997 | 166-300 MHz | 4,500,000
    6 | Pentium Pro | 1995 | 150-200 MHz | 5,500,000
    6 | Pentium II | 1997 | 233-450 MHz | 7,500,000
    6 | Pentium III | 1999 | 450-1200 MHz | 28,000,000
    7 | Pentium 4 | 2000 | 1400-2200 MHz | 42,000,000
    7 | Pentium 4 | 2002 | 2200-2800 MHz | 55,000,000
    7 | Pentium 4 | 2003 | 2600-3200 MHz | 55,000,000
    7 | Pentium 4 “Prescott” | 2004 | 2800-3600 MHz | 125,000,000
    Figure 56. Seven generations of CPU’s from Intel. The number of transistors in the Pentium III and 4 includes the L2 cache.
    Each processor has been on the market for several years, during which time the clock frequency has increased. Some of the processors were later released in improved versions with higher clock frequencies. I haven’t included the Celeron in the overview; Celerons are special discount versions of the Pentium II, III, and 4 processors.
    Anyone can see that there has been an unbelievable development. Modern CPU’s are one thousand times more powerful than the very first ones.
    In order for the industry to be able to develop faster CPU’s each year, new manufacturing methods are required. More and more transistors have to be squeezed into smaller and smaller chips.
    Figure 57.
    A photograph from one of Intel’s factories, in which a technician displays the Pentium 4 processor core. It is a tiny piece of silicon which contains 42 million transistors.

    Moore’s Law

    This development was actually described many years ago, in what we call Moore’s Law.
    Right back in 1965, Gordon Moore predicted (in the Electronics journal), that the number of transistors in processors (and hence their speed) would be able to be doubled every 18 months.
    Moore expected that this regularity would at least apply up until 1975. But he was too cautious; we can see that the development continues to follow Moore’s Law today, as is shown in Fig. 59.
    Figure 58. In 1968, Gordon Moore helped found Intel.
    If we try to look ahead in time, we can work out that in 2010 we should have processors containing 3 billion transistors. And with what clock frequencies? You’ll have to guess that for yourself.
    Figure 59. Moore’s Law (from Intel’s website).
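    The doubling rule is easy to sketch as a calculation. A back-of-the-envelope Python illustration, starting from the 55 million transistors of the 2002 Pentium 4 in Fig. 56 (the projection itself is only a rough estimate, not a figure from the guide):

    ```python
    def projected_transistors(count_now, years_ahead, doubling_months=18):
        """Project a transistor count under Moore's Law-style doubling."""
        doublings = (years_ahead * 12) / doubling_months
        return count_now * 2 ** doublings

    # Projecting 8 years ahead from 2002 lands in the billions:
    estimate_2010 = projected_transistors(55_000_000, 8)
    print(f"{estimate_2010:,.0f}")
    ```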

    Process technology

    The many millions of transistors inside the CPU are made of, and connected by, ultra thin electronic tracks. By making these electronic tracks even narrower, even more transistors can be squeezed into a small slice of silicon.
    The width of these electronic tracks is measured in microns (or micrometers), which are millionths of a metre.
    For each new CPU generation, the track width is reduced, based on new technologies which the chip manufacturers keep developing. At the time of writing, CPU’s are being produced with a track width of 0.13 microns, and this will be reduced to 0.09 and 0.06 microns in the next generations.
    Figure 60. CPU’s are produced in extremely high-technology environments (“clean rooms”). Photo courtesy of AMD.
    In earlier generations aluminium was used for the current carrying tracks in the chips. With the change to 0.18 and 0.13-micron technology, aluminium began to be replaced with copper. Copper is cheaper, and it carries current better than aluminium. It had previously been impossible to insulate the copper tracks from the surrounding silicon, but IBM solved this problem in the late 1990’s.
    AMD became the first manufacturer to mass-produce CPU’s with copper tracks, in their Fab 30 chip factory in Dresden, Germany. A new generation of chips requires new chip factories (fabs) to produce it, and these cost billions of dollars to build. That’s why manufacturers like a few years to pass between successive generations: the old factories have to have time to pay for themselves before new ones start to be used.
    Figure 61. AMD’s Fab 30 in Dresden, which was the first factory to mass-produce copper-based CPU’s.

  • A grand new world …

    We can expect a number of new CPU’s in this decade, all produced in the same way as they are now – just with smaller track widths. But there is no doubt that we are nearing the physical limits for how small the transistors produced using the existing technology can be. So intense research is underway to find new materials, and it appears that nanotransistors, produced using organic (carbon-based) semiconductors, could take over the baton from the existing process technology.
    Bell Labs in the USA has produced nanotransistors with widths of just one molecule. It is claimed that this process can be used to produce both CPU’s and RAM circuits up to 1000 times smaller than what we have today!

    Less power consumption

    The types of CPU’s we have today use a fairly large amount of electricity when the PC is turned on and is processing data. The processor, as you know, is installed in the motherboard, from which it receives power. There are actually two different voltage levels, which are both supplied by the motherboard:
  • One voltage level which powers the CPU core (kernel voltage).
  • Another voltage level which powers the CPU’s I/O ports, which is typically 3.3 volts.
    As the track width is reduced, more transistors can be placed within the same area, and hence the voltage can be reduced.
    As a consequence of the narrower process technology, the kernel voltage has been reduced from 3 volts to about 1 volt in recent years. This leads to lower power consumption per transistor. But since the number of transistors increases by a corresponding amount in each new CPU generation, the end result is often that the total power consumption is unchanged.
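    This reasoning can be illustrated with the standard first-order formula for dynamic (switching) power, P = C × V² × f. Both the formula and the values below are textbook approximations, not figures from the guide:

    ```python
    def dynamic_power(capacitance, voltage, frequency):
        """First-order approximation of dynamic switching power: P = C * V^2 * f."""
        return capacitance * voltage ** 2 * frequency

    # Halving the core voltage cuts the power per transistor to a quarter,
    # even if the clock frequency stays the same (illustrative values):
    p_old = dynamic_power(1e-9, 3.0, 1e9)
    p_new = dynamic_power(1e-9, 1.5, 1e9)
    print(p_new / p_old)  # 0.25
    ```

    Because the transistor count roughly doubles per generation, this quadratic saving is what keeps total chip power from exploding.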
    Figure 62. A powerful fan. Modern CPU’s require something like this.
    It is very important to cool the processor; a CPU can easily burn 50-120 Watts. This produces a fair amount of heat in a very small area, so without the right cooling fan and motherboard design, a Gigahertz processor could quickly burn out.
    Modern processors contain a thermal diode which can raise the alarm if the CPU gets too hot. If the motherboard and BIOS are designed to pay attention to the diode’s signal, the processor can be shut down temporarily so that it can cool down.
    Figure 63. The temperatures on the motherboard are constantly reported to this program.
    Cooling is a whole science in itself. Many “nerds” try to push CPU’s to work at higher clock speeds than they are designed for. This is often possible, but it requires very good cooling – and hence often huge cooling units.

    30 years development

    Higher processor speeds require more transistors and narrower electronic tracks in the silicon chip. In the overview in Fig. 64 you can see the course of developments in this area.
    Note that the 4004 processor was never used for PC’s. The 4004 was Intel’s first commercial product in 1971, and it laid the foundation for all their later CPU’s. It was a 4-bit processor which worked at 108 KHz (0.1 MHz), and contained 2,250 transistors. It was used in the first pocket calculators, which I can personally remember from around 1973-74 when I was at high school. No-one could have predicted that the device which replaced the slide rule, could develop, in just 30 years, into a Pentium 4 based super PC.
    If, for example, the development in automobile technology had been just as fast, we would today be able to drive from Copenhagen to Paris in just 2.8 seconds!
    Year | Intel CPU | Technology (track width)
    1971 | 4004 | 10 microns
    1979 | 8088 | 3 microns
    1982 | 80286 | 1.5 microns
    1985 | 80386 | 1 micron
    1989 | 80486 | 1.0/0.8 microns
    1993 | Pentium | 0.8/0.5/0.35 microns
    1997 | Pentium II | 0.28/0.25 microns
    1999 | Pentium III | 0.25/0.18/0.13 microns
    2000-2003 | Pentium 4 | 0.18/0.13 microns
    2004-2005 | Pentium 4 “Prescott” | 0.09 microns
    Figure 64. The high clock frequencies are the result of new process technology with smaller electronic “tracks”.
    A conductor which is 0.09 microns (90 nanometres) wide is about 1,150 times thinner than a normal human hair. These are tiny things we are talking about here.

    Wafers and die size

    Another CPU measurement is its die size. This is the size of the actual silicon sheet containing all the transistors (the tiny area in the middle of Fig. 33 on page 15).
    At the chip factories, the CPU cores are produced in so-called wafers. These are round silicon sheets which typically contain 150-200 processor cores (dies).
    The smaller one can make each die, the more economical production can become. A big die is also normally associated with greater power consumption and hence also requires cooling with a powerful fan (e.g. see Fig. 63 on page 25 and Fig. 124 on page 50).
    Figure 65. A technician from Intel holding a wafer. This slice of silicon contains hundreds of tiny processor cores, which end up as CPU’s in everyday PC’s.
    You can see the measurements for a number of CPU’s below. Note the difference, for example, between a Pentium and a Pentium II. The latter is much smaller, and yet still contains nearly 2½ times as many transistors. Every reduction in die size is welcome, since the smaller this is, the more processors can fit on a wafer. And that makes production cheaper.
    CPU | Track width | Die size | Number of transistors
    Pentium | 0.80 | 294 mm2 | 3.1 mil.
    Pentium MMX | 0.28 | 140 mm2 | 4.5 mil.
    Pentium II | 0.25 | 131 mm2 | 7.5 mil.
    Athlon | 0.25 | 184 mm2 | 22 mil.
    Pentium III | 0.18 | 106 mm2 | 28 mil.
    Pentium III | 0.13 | 80 mm2 | 28 mil.
    Athlon XP | 0.18 | 128 mm2 | 38 mil.
    Pentium 4 | 0.18 | 217 mm2 | 42 mil.
    Pentium 4 | 0.13 | 145 mm2 | 55 mil.
    Athlon XP+ | 0.13 | 115 mm2 | 54 mil.
    Athlon 64 FX | 0.13 | 193 mm2 | 106 mil.
    Pentium 4 | 0.09 | 112 mm2 | 125 mil.
    Figure 66. The smaller the area of each processor core, the more economical chip production can be.
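    The economics can be sketched with a rough dies-per-wafer estimate. The edge-loss correction below is a common first-order formula, and the 200 mm wafer diameter is an assumption; only the die areas come from Fig. 66:

    ```python
    import math

    def dies_per_wafer(wafer_diameter_mm, die_area_mm2):
        """Rough estimate: usable wafer area divided by die area,
        minus a correction for partial dies lost at the wafer's edge."""
        radius = wafer_diameter_mm / 2
        wafer_area = math.pi * radius ** 2
        edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
        return int(wafer_area / die_area_mm2 - edge_loss)

    # A 200 mm wafer with Pentium III dies (106 mm2) vs. Pentium 4 dies (217 mm2):
    print(dies_per_wafer(200, 106))   # roughly 250 dies
    print(dies_per_wafer(200, 217))   # roughly half as many
    ```

    Shrinking the die roughly doubles the number of processors per wafer, which is exactly why every reduction in die size is welcome.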

    The modern CPU generations

    As mentioned earlier, the various CPU’s are divided into generations (see also Fig. 56 on page 23).
    At the time of writing, we have started on the seventh generation. Below you can see the latest processors from Intel and AMD, divided into these generations. The transitions can be a bit hazy. For example, I’m not sure whether AMD’s K6 belongs to the 5th or the 6th generation. But as a whole, the picture is as follows:
    Generation | CPU’s
    5th | Pentium, Pentium MMX, K5, K6
    6th | Pentium Pro, K6-II, Pentium II, K6-3, Athlon, Pentium III
    7th | Pentium 4, Athlon XP
    8th | Athlon 64 FX, Pentium 5
    Figure 67. The latest generations of CPU’s.



  • Chapter 28. The cache

    In the previous chapter, I described two aspects of the ongoing development of new CPU’s – increased clock frequencies and the increasing number of transistors being used. Now it is time to look at a very different yet related technology – the processor’s connection to the RAM, and the use of the L1 and L2 caches.

    Speed conflict

    The CPU works internally at very high clock frequencies (like 3200 MHz), and no RAM can keep up with these.
    The most common RAM speeds are between 266 and 533 MHz. And these are just a fraction of the CPU’s working speed. So there is a great chasm between the machine (the CPU) which slaves away at perhaps 3200 MHz, and the “conveyor belt”, which might only work at 333 MHz, and which has to ship the data to and from the RAM. These two subsystems are simply poorly matched to each other.
    If nothing could be done about this problem, there would be no reason to develop faster CPU’s. If the CPU had to wait for a bus, which worked at one sixth of its speed, the CPU would be idle five sixths of the time. And that would be pure waste.
    The solution is to insert small, intermediate stores of high-speed RAM. These buffers (cache RAM) provide a much more efficient transition between the fast CPU and the slow RAM. Cache RAM operates at higher clock frequencies than normal RAM. Data can therefore be read more quickly from the cache.

    Data is constantly being moved

    The cache delivers its data to the CPU registers. These are tiny storage units which are placed right inside the processor core, and they are the absolute fastest RAM there is. The size and number of the registers is designed very specifically for each type of CPU.
    Figure 68. Cache RAM is much faster than normal RAM.
    The CPU can move data in different sized packets, such as bytes (8 bits), words (16 bits), dwords (32 bits) or blocks(larger groups of bits), and this often involves the registers. The different data packets are constantly moving back and forth:
  • from the CPU registers to the Level 1 cache.
  • from the L1 cache to the registers.
  • from one register to another
  • from L1 cache to L2 cache, and so on…
    The cache stores are a central bridge between the RAM and the registers which exchange data with the processor’s execution units.
    The optimal situation is if the CPU is able to constantly work and fully utilize all clock ticks. This would mean that the registers would always have to be able to fetch the data which the execution units require. But this is not the reality, as the CPU typically only utilizes 35% of its clock ticks. However, without a cache, this utilization would be even lower.

    Bottlenecks

    CPU caches are a remedy for a very specific set of “bottleneck” problems. There are lots of bottlenecks in the PC – transitions between fast and slower systems, where the fast device has to wait before it can deliver or receive its data. These bottlenecks can have a very detrimental effect on the PC’s total performance, so they must be minimised.
    Figure 69. A cache increases the CPU’s capacity to fetch the right data from RAM.
    The absolute worst bottleneck exists between the CPU and RAM. It is here that we have the heaviest data traffic, and it is in this area that PC manufacturers are expending a lot of energy on new development. Every new generation of CPU brings improvements relating to the front side bus.
    The CPU’s cache is “intelligent”, so that it can reduce the data traffic on the front side bus. The cache controller constantly monitors the CPU’s work, and always tries to read in precisely the data the CPU needs. When it is successful, this is called a cache hit. When the cache does not contain the desired data, this is called a cache miss.
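    The hit/miss mechanism can be illustrated with a toy direct-mapped cache. This is a deliberately simplified sketch (real cache controllers are far more sophisticated), and the sizes are arbitrary illustrative choices:

    ```python
    def simulate_cache(accesses, cache_lines=8, line_size=64):
        """Count hits and misses for a tiny direct-mapped cache (byte addresses)."""
        cache = [None] * cache_lines          # each line holds a tag, or None
        hits = misses = 0
        for addr in accesses:
            block = addr // line_size         # which memory block the address is in
            index = block % cache_lines       # which cache line the block maps to
            tag = block // cache_lines
            if cache[index] == tag:
                hits += 1
            else:
                misses += 1
                cache[index] = tag            # fetch the block into the cache
        return hits, misses

    # Sequential reads have good locality: only the first access to each
    # 64-byte block misses; the rest are served from the cache.
    hits, misses = simulate_cache(range(0, 512, 4))
    print(hits, misses)  # 120 hits, 8 misses
    ```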

    Two levels of cache

    The idea behind cache is that it should function as a “near store” of fast RAM. A store which the CPU can always be supplied from.
    In practice there are always at least two close stores. They are called Level 1, Level 2, and (if applicable) Level 3 cache. Some processors (like the Intel Itanium) have three levels of cache, but these are only used for very special server applications. In standard PC’s we find processors with L1 and L2 cache.
    Figure 70. The cache system tries to ensure that relevant data is constantly being fetched from RAM, so that the CPU (ideally) never has to wait for data.

    L1 cache

    Level 1 cache is built into the actual processor core. It is a piece of RAM, typically 8, 16, 20, 32, 64 or 128 Kbytes, which operates at the same clock frequency as the rest of the CPU. Thus you could say the L1 cache is part of the processor.
    L1 cache is normally divided into two sections, one for data and one for instructions. For example, an Athlon processor may have a 32 KB data cache and a 32 KB instruction cache. If the cache is common for both data and instructions, it is called a unified cache.

    L2 cache

    The level 2 cache is normally much bigger (and unified), such as 256, 512 or 1024 KB. The purpose of the L2 cache is to constantly read in slightly larger quantities of data from RAM, so that these are available to the L1 cache.
    In the earlier processor generations, the L2 cache was placed outside the chip: either on the motherboard (as in the original Pentium processors), or on a special module together with the CPU (as in the first Pentium II’s).
    Figure 71. An old Pentium II module. The CPU is mounted on a rectangular printed circuit board, together with the L2 cache, which is two chips here. The whole module is installed in a socket on the motherboard. But this design is no longer used.
    As process technology has developed, it has become possible to make room for the L2 cache inside the actual processor chip. Thus the L2 cache has been integrated and that makes it function much better in relation to the L1 cache and the processor core.
    The L2 cache is not as fast as the L1 cache, but it is still much faster than normal RAM.
    CPU | L2 cache
    Pentium, K5, K6 | External, on the motherboard
    Pentium Pro | Internal, in the CPU
    Pentium II, Athlon | External, in a module close to the CPU
    Celeron (1st generation) | None
    Celeron (later gen.), Pentium III, Athlon XP, Duron, Pentium 4 | Internal, in the CPU
    Figure 72. It has only been during the last few CPU generations that the level 2 cache has found its place, integrated into the actual CPU.
    Traditionally the L2 cache is connected to the front side bus, through which it connects to the chipset’s north bridge and RAM:
    Figure 73. The way the processor uses the L1 and L2 cache has crucial significance for its utilisation of the high clock frequencies.
    The level 2 cache takes up a lot of the chip’s die, as millions of transistors are needed to make a large cache. The integrated cache is made using SRAM (static RAM), as opposed to normal RAM which is dynamic (DRAM).

    Powerful bus

    The bus between the L1 and L2 cache is presumably THE place in the processor architecture which has the greatest need for high bandwidth. We can calculate the theoretical maximum bandwidth by multiplying the bus width by the clock frequency. Here are some examples:
    CPU | Bus width | Clock frequency | Theoretical bandwidth
    Intel Pentium III | 64 bits | 1400 MHz | 11.2 GB/sec
    AMD Athlon XP+ | 64 bits | 2167 MHz | 17.3 GB/sec
    AMD Athlon 64 | 64 bits | 2200 MHz | 17.6 GB/sec
    AMD Athlon 64 FX | 128 bits | 2200 MHz | 35.2 GB/sec
    Intel Pentium 4 | 256 bits | 3200 MHz | 102 GB/sec
    Figure 74. Theoretical calculations of the bandwidth between the L1 and L2 cache.
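    The figures in Fig. 74 follow directly from the bus width (converted to bytes) multiplied by the clock frequency. A small Python sketch of the calculation:

    ```python
    def bandwidth_gb_per_sec(bus_width_bits, clock_mhz):
        """Theoretical bandwidth = bus width in bytes * clock frequency."""
        bytes_per_tick = bus_width_bits / 8
        return bytes_per_tick * clock_mhz / 1000   # MB/sec -> GB/sec

    print(bandwidth_gb_per_sec(64, 1400))   # 11.2  (Pentium III)
    print(bandwidth_gb_per_sec(128, 2200))  # 35.2  (Athlon 64 FX)
    print(bandwidth_gb_per_sec(256, 3200))  # 102.4 (Pentium 4)
    ```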

    Different systems

    There are a number of different ways of using caches. Both Intel and AMD have saved on L2 cache in some series, in order to make cheaper products. But there is no doubt, that the better the cache – both L1 and L2 – the more efficient the CPU will be and the higher its performance.
    AMD have settled on a fairly large L1 cache of 128 KB, while Intel continue to use relatively small (but efficient) L1 caches.
    On the other hand, Intel uses a 256 bit wide bus on the “inside edge” of the L2 cache in the Pentium 4, while AMD only has a 64-bit bus (see Fig. 74).
     
    Figure 75. Competing CPU’s with very different designs.
    AMD uses exclusive caches in all their CPU’s. That means the same data can’t be present in both caches at the same time, which is a clear advantage. That is not the case with Intel’s caches.
    However, the Pentium 4 has a more advanced cache design, with the Execution Trace Cache making up 12 KB of the 20 KB Level 1 cache. This instruction cache works with decoded instructions, as described on page 35.
    CPU | L1 cache | L2 cache
    Athlon XP | 128 KB | 256 KB
    Athlon XP+ | 128 KB | 512 KB
    Pentium 4 (I) | 20 KB | 256 KB
    Pentium 4 (II, “Northwood”) | 20 KB | 512 KB
    Athlon 64 | 128 KB | 512 KB
    Athlon 64 FX | 128 KB | 1024 KB
    Pentium 4 (III, “Prescott”) | 28 KB | 1024 KB
    Figure 76. The most common processors and their caches.

    Latency

    A very important aspect of all RAM – cache included – is latency. All RAM storage has a certain latency, which means that a certain number of clock ticks (cycles) must pass between, for example, two reads. L1 cache has less latency than L2; which is why it is so efficient.
    When the cache is bypassed to read directly from RAM, the latency is many times greater. In Fig. 77 the number of wasted clock ticks is shown for various CPU’s. Note that when the processor core has to fetch data from the actual RAM (when both L1 and L2 have missed), it costs around 150 clock ticks. This situation is called stalling and needs to be avoided.
    Note that the Pentium 4 has a much smaller L1 cache than the Athlon XP, but it is significantly faster. It simply takes fewer clock ticks (cycles) to fetch data:
    Latency | Pentium II | Athlon | Pentium 4
    L1 cache | 3 cycles | 3 cycles | 2 cycles
    L2 cache | 18 cycles | 6 cycles | 5 cycles
    Figure 77. Latency leads to wasted clock ticks; the fewer there are of these, the faster the processor will appear to be.
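    Combining latencies like those in Fig. 77 with hit rates gives the average cost of a memory access. A simplified expected-value sketch in Python; the 97% and 95% hit rates are illustrative assumptions, not figures from the guide:

    ```python
    def average_access_cycles(l1_hit_rate, l1_cycles, l2_hit_rate, l2_cycles, ram_cycles):
        """Average cost, in clock ticks, of one memory access through a
        two-level cache (simplified expected-value model)."""
        l1_miss = 1 - l1_hit_rate
        l2_miss = 1 - l2_hit_rate
        return (l1_hit_rate * l1_cycles                 # served by L1
                + l1_miss * l2_hit_rate * l2_cycles     # L1 miss, served by L2
                + l1_miss * l2_miss * ram_cycles)       # both miss: stall on RAM

    # Pentium 4-like latencies (2, 5 and ~150 cycles) with assumed hit rates:
    print(average_access_cycles(0.97, 2, 0.95, 5, 150))  # roughly 2.3 cycles
    ```

    Even with RAM costing around 150 cycles per stall, high hit rates keep the average close to the L1 latency, which is why the cache hierarchy works so well.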

    Intelligent ”data prefetch”

    In CPU’s like the Pentium 4 and Athlon XP, a handful of support mechanisms are also used which work in parallel with the cache. One of these is the hardware auto data prefetch unit, which attempts to guess which data should be read into the cache. This unit monitors the instructions being processed and predicts what data the next job will need.
    Related to this is the Translation Look-aside Buffer, which is also a kind of cache. It contains information which constantly supports the supply of data to the L1 cache, and this buffer is also being optimised in new processor designs. Both systems contribute to improved exploitation of the limited bandwidth in the memory system.
    Figure 78. The WCPUID program reports on cache in an Athlon processor.

    Conclusion

    L1 and L2 cache are important components in modern processor design. The cache is crucial for the utilisation of the high clock frequencies which modern process technology allows. Modern L1 caches are extremely effective: in about 96-98% of cases, the processor can find the data and instructions it needs in the cache. In the future, we can expect to keep seeing CPU’s with larger L2 caches and more advanced memory management, as this is the way forward if we want to achieve more effective utilisation of the CPU’s clock ticks. Here is a concrete example:
    In January 2002 Intel released a new version of their top processor, the Pentium 4 (with the codename, “Northwood”). The clock frequency had been increased by 10%, so one might expect a 10% improvement in performance. But because the integrated L2 cache was also doubled from 256 to 512 KB, the gain was found to be all of 30%.
    CPU                              L2 cache    Clock freq.    Improvement
    Intel Pentium 4 (0.18 micron)    256 KB      2000 MHz       –
    Intel Pentium 4 (0.13 micron)    512 KB      2200 MHz       +30%
    Figure 79. Because of the larger L2 cache, performance increased significantly.
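A quick calculation shows how much of the Northwood gain the cache must account for. The clock frequency alone explains a 10% improvement, so the remaining factor comes from the doubled L2 cache. The figures below are taken directly from the text.

```python
# Rough breakdown of the Northwood improvement described in the text:
# clock went from 2000 to 2200 MHz, total performance rose about 30%.
clock_gain = 2200 / 2000          # 1.10 → +10% from frequency alone
total_gain = 1.30                 # +30% measured overall
cache_gain = total_gain / clock_gain  # the part the larger L2 cache explains

print(f"{(cache_gain - 1) * 100:.0f}%")  # → 18%
```

In other words, roughly 18% of the improvement came from better cache utilisation rather than from the higher clock frequency.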
    In 2002 AMD updated the Athlon processor with the new ”Barton” core. Here the L2 cache was also doubled from 256 to 512 KB in some models. In 2004 Intel introduced the “Prescott” core with 1024 KB L2 cache, which is the same size as in AMD’s Athlon 64 processors. Some Extreme Editions of the Pentium 4 even use 2 MB of L2 cache.

    Xeon for servers

    Intel produces special server models of their Pentium III and Pentium 4 processors. These are called Xeon, and are characterised by very large L2 caches. In an Intel Xeon the 2 MB L2 cache uses 149,000,000 transistors.
    Xeon processors are incredibly expensive (about Euro 4,000 for the top models), so they have never achieved widespread distribution.
    They are used in high-end servers, in which the CPU only accounts for a small part of the total price.
    There is also Intel’s 64 bit server CPU, the Itanium. This processor is supplied in modules which include 4 MB of L3 cache, consisting of 300 million transistors.

    Multiprocessors

    Several Xeon processors can be installed on the same motherboard, using special chipsets. By connecting 2, 4 or even 8 processors together, you can build a very powerful computer.
    These MP (Multiprocessor) machines are typically used as servers, but can also be used as powerful workstations, for example, to perform demanding 3D graphics and animation tasks. AMD has the Opteron processors, which are server-versions of the Athlon 64. Not all software can make use of the PC’s extra processors; the programs have to be designed to do so. For example, there are professional versions of Windows NT, 2000 and XP, which support the use of several processors in one PC.
    See also the discussion of Hyper Threading, which allows a Pentium 4 processor to appear as an MP system. Both Intel and AMD are also working on dual-core processors.
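The point that software must be written to exploit extra processors can be illustrated with a small sketch. This example uses Python's standard multiprocessing module purely as an illustration of the principle; the work is explicitly divided among worker processes, which is exactly what MP-unaware programs fail to do.

```python
# A program only benefits from several processors if it is written to use
# them. Here the work is explicitly split across a pool of worker processes.
from multiprocessing import Pool

def square(n):
    """A stand-in for some CPU-intensive piece of work."""
    return n * n

if __name__ == "__main__":
    # e.g. a 4-way machine: four worker processes share the job list.
    with Pool(processes=4) as pool:
        print(pool.map(square, [1, 2, 3, 4]))  # → [1, 4, 9, 16]
```

A single-threaded program would leave the extra processors idle, which is why MP support has to be designed into the software, as with the professional versions of Windows mentioned above.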