In a fully associative cache, a block of memory can be in any location in the cache. This means that you have to look everywhere to find anything.
So what's the problem?
Instead of looking everywhere, you're going to look for at most a single cache line. In a direct mapped cache, caches partition memory into as many regions as there are cache lines. Each memory region maps to a single cache line in which data can be placed, and then you only need to check a single tag - the one associated with the region the reference is in.
This also means that LRU goes away. If a block isn't in the cache, you have to kick whatever is in the cache line out. Two blocks in memory that map to the same index in the cache cannot be present in the cache in the same time.
This can lead to a 0% hit rate if more than one block accessed in an interleaved manner map to the same index. Let's say A and B map to the same line. If you had:
This would be called a conflict miss. Even though you may have two lines in the cache, this causes a problem.
Let's say you have 32-bit memory addresses, byte addressed, direct-mapped 32kb cache, 128 byte block size, and write-back.
What are the overheads for this cache?
32KB/128B = 256 lines
1 valid bit + 1 dirty bit + 17 address bits = 19 bits
Since we have 256 lines, this is 19 * 256 = 4864 bits for all of the cache.
4864 bits / 32KB = about 1.9%.
If all loads and stores are really fast, but it takes a long time to load the instruction, then what's the point? There are two choices:
How are caches integrated into a pipelined implementation? First, replace instruction memory with ICache (I$), and replace data memory with DCache (D$).
There are some issues, though. First, memory accesses now have variable latency: When you have a miss in your data cache, you need to tell the pipeline to stall. Also, you need to put a noop in the ICache and the Mem/WB stage. Also, both caches may miss at the same time. How would you fix that?
Another problem: let's say you do branch prediction, and you have a miss. This causes a memory access, and stalls. It needs to be able to throw away the address so it can correctly continue execution.
The grinder application runs on the LC2k with full data forwarding and all branches predicted not-taken has the following instruction frequencies:
In grinder, 40% of the branches are taken and 50% of the LWs are followed by an immediate use.
The I-cache has a miss rate of 3%, and the D-cache has a miss rate of 6% (no overlaps). On a miss, the main memory is accessed and has a latency of 100ns. The clock frequency is 500MHz.
First, you know that:
Since the main memory access latency is 100ns, this is 50 missed cycles. Therefore, 3 percent of the time, there are 50 missed cycles and 6 percent of the time, there are 50 missed cycles. However, this 6% only occurs during loads and stores, which is 35% of the time. Hence, this happens 2.1% of the time. The total memory access latency time is then 5.1% of the time, and 50 cycles delay each. The total CPI delay is 2.55.
Since 20% are branches, and 40% of these branches are taken, this means 40% are mispredicts. Therefore, you have to insert 2 noops (additional cycles) per mispredict - this happens 8% of the time, with a 3 cycle delay each. The total CPI stalls is then 0.24.
Since 50% of LWs are followed by an immediate use, you have to insert a noop after half of the loads. This happens 7.5% of the time, with a 1 cycle delay each. The total CPI stalls is then 0.075.
Therefore, the total CPI is:
$$\text = 1 + (2.55 + 0.24 + 0.075) = 3.87$$