two instructions. The actual execution would look something like what Figure 4.11 shows.
                        T3        T4         T5        T6         T7        ...
mov( SomeVar, ebx );    Address   Load from  Compute   Store
                        ***       SomeVar              into ebx
mov( [ebx], eax );      Operand   Address    Delay     Delay      Load      Load      Store
                                                                  ebx       [ebx]     into eax
Figure 4.11 How the 80x86 Handles a Data Hazard
By delaying the second instruction two clock cycles, the CPU guarantees that the load instruction will load EAX from the
proper address. Unfortunately, the second load instruction now executes in three clock cycles rather than one. However,
requiring two extra clock cycles is better than producing incorrect results. Fortunately, you can reduce the impact of
hazards on execution speed within your software.
Note that the data hazard occurs when the source operand of one instruction is the destination operand of a previous
instruction. There is nothing wrong with loading EBX from SomeVar and then loading EAX from [EBX], unless the two
instructions occur one right after the other. Suppose the code sequence had been:
mov( 2000, ecx );
mov( SomeVar, ebx );
mov( [ebx], eax );
We could reduce the effect of the hazard that exists in this code sequence by simply rearranging the instructions. Let's do
that and obtain the following:
mov( SomeVar, ebx );
mov( 2000, ecx );
mov( [ebx], eax );
Now the "mov( [ebx], eax );" instruction requires only one additional clock cycle rather than two. By inserting yet another
instruction between the "mov( SomeVar, ebx );" and the "mov( [ebx], eax );" instructions you can eliminate the effects of the
hazard altogether (see footnote 18).
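As a minimal sketch of that last step, the program below places two independent instructions between the load of EBX and
the load through [EBX]. The names HazardSpacing and Target, the constant 3000, and the use of EDX as the second filler
are assumptions made for illustration; they do not come from the text. Any filler that leaves EAX and EBX untouched would
serve, and the cycle counts are those of the simplified pipeline model used in this section, not of a real 80x86.

```hla
program HazardSpacing;
#include( "stdlib.hhf" )

static
    Target:  dword := 1234;     // hypothetical value reached through the pointer
    SomeVar: dword;             // will hold the address of Target

begin HazardSpacing;

    // Set up the pointer. This is scaffolding, not part of the sequence under study.
    mov( &Target, eax );
    mov( eax, SomeVar );

    mov( SomeVar, ebx );        // writes ebx
    mov( 2000, ecx );           // independent: touches neither eax nor ebx
    mov( 3000, edx );           // hypothetical second filler, also independent
    mov( [ebx], eax );          // reads ebx with two instructions in between

    stdout.put( "eax = ", (type uns32 eax), nl );

end HazardSpacing;
```

The result in EAX is the same as in the unordered version; only the spacing between the write to EBX and its first use
has changed.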
17. Some RISC chips do not. If you tried this sequence on certain RISC chips you would get an incorrect answer.
On a pipelined processor, the order of instructions in a program may dramatically affect the performance of that program.
Always look for possible hazards in your instruction sequences. Eliminate them wherever possible by rearranging the
instructions.
In addition to data hazards, there are also control hazards. We've actually discussed control hazards already, although we
did not refer to them by that name. A control hazard occurs whenever the CPU branches to some new location in memory and
has to flush the instructions following the branch that are already in various stages of execution.
4.8.5 Superscalar Operation: Executing Instructions in Parallel
With the pipelined architecture we could achieve, at best, execution times of one CPI (clock per instruction). Is it possible
to execute instructions faster than this? At first glance you might think, "Of course not, we can do at most one operation per
clock cycle. So there is no way we can execute more than one instruction per clock cycle." Keep in mind, however, that a single
instruction is not a single operation. In the examples presented earlier each instruction has taken between six and eight
operations to complete. By adding seven or eight separate units to the CPU, we could effectively execute these eight operations
in one clock cycle, yielding one CPI. If we add more hardware and execute, say, 16 operations at once, can we achieve 0.5 CPI?
The answer is a qualified yes. A CPU including this additional hardware is a superscalar CPU and can execute more than
one instruction during a single clock cycle. The 80x86 family began supporting superscalar execution with the introduction of
the Pentium processor.
A superscalar CPU has, essentially, several execution units (see Figure 4.12). If it encounters two or more instructions in
the instruction stream (i.e., the prefetch queue) that can execute independently, it executes them in parallel.
[Figure: block diagram of a superscalar CPU containing Execution Unit #1, Execution Unit #2, a data cache, an instruction
cache, a prefetch queue, and a BIU attached to the data/address busses.]
Figure 4.12 A CPU that Supports Superscalar Operation
There are a couple of advantages to going superscalar. Suppose you have the following instructions in the instruction
stream:
mov( 1000, eax );
mov( 2000, ebx );
18. Of course, any instruction you insert at this point must not modify the values in the eax and ebx registers. Also note that
the examples in this section are contrived to demonstrate pipeline stalls. Actual 80x86 CPUs have additional circuitry to help
reduce the effect of pipeline stalls on the execution time.
If there are no other problems or hazards in the surrounding code, and all six bytes for these two instructions are currently in
the prefetch queue, there is no reason why the CPU cannot fetch and execute both instructions in parallel. All it takes is extra
silicon on the CPU chip to implement two execution units.
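To make "independent" concrete, the sketch below contrasts the pair above with a pair that is not independent. The second
pair, the program name PairingSketch, and the constant 3000 are assumptions added for illustration. In the second pair the
later instruction reads the register the earlier one writes, so it is the kind of sequence the CPU cannot simply run side
by side.

```hla
program PairingSketch;
#include( "stdlib.hhf" )

begin PairingSketch;

    // Independent pair: neither instruction reads or writes
    // a register that the other one uses.
    mov( 1000, eax );
    mov( 2000, ebx );

    // Dependent pair (hypothetical): the second instruction
    // reads ecx, which the first instruction writes.
    mov( 3000, ecx );
    mov( ecx, edx );

    stdout.put( "eax = ", (type uns32 eax), "  ebx = ", (type uns32 ebx), nl );
    stdout.put( "ecx = ", (type uns32 ecx), "  edx = ", (type uns32 edx), nl );

end PairingSketch;
```

Both pairs produce correct results on any 80x86; the difference lies only in how much of the work the hardware can overlap.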
Besides speeding up independent instructions, a superscalar CPU can also speed up program sequences that have hazards.
One limitation of a pipelined CPU without superscalar support is that once a hazard occurs, the offending instruction
completely stalls the pipeline.