Saturday, April 06, 2013

A Circuit Minimizing Multicore Shared Memory Latency

A massively multicore system on chip (SOC) can be built that executes current code without modification, and with good utilization of the cores, if the real estate normally assigned to cache memories and cache-coherence logic can be used for interleaved banks of shared memory -- but only if the mutual exclusion circuitry that resolves bank contention between cores does not impose too much latency or real estate overhead.

If you're still with me, you're the audience I want for the following disclosure of my invention of such a mutex circuit.

To place the idea in context, consider this 8-core chip (arbitrarily choosing the Cray-1a vector processor as the CPU standard cell), possessing 16 banks of shared memory, each bank divided into 8 memory standard cells with an associated crossbar switch to resolve contention between CPUs:

[Figure: 8-core SOC layout; 16 shared-memory banks, each bank a column of 8 crossbar standard cells, one per core]

The problem to solve is low-latency, low-area, low-power mutually exclusive access to a memory bank by a CPU. 

That is what the "Mutex" does.

Its primary characteristics are:

1) Its real estate requirements are proportional to C*B, where C is the number of cores and B is the number of banks. In other words, a standard cell crossbar switch containing both shared memory elements and the mutex circuit is all that is needed in addition to the standard cell for the core -- regardless of the values of B and C. Each core directly sees B crossbar standard cells that are aligned in a row seen directly by no other core. Each bank consists of C crossbar standard cells that are aligned in a column. (A small sketch of this grid follows the list.)

2) The latency introduced by mutual exclusion is O(log(C)), where the base of the log is a large number -- much larger than 2.

3) Its power requirements are slightly lower* than those of specialized processor chips of equivalent scale.
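
To make the grid concrete, here is a minimal sketch in Python (my own illustration; the C and B values match the example chip above, but the low-order-bit interleaving is an assumption on my part):

```python
# Sketch of the C x B crossbar grid from point 1: core c owns row c,
# bank b owns column b, and cell (c, b) is the only crossbar standard
# cell core c ever uses to reach bank b.
C, B = 8, 16  # cores and banks, as in the example chip

def bank_of(address):
    """Which bank serves this address (assumed low-order-bit interleaving)."""
    return address % B

def crossbar_cell(core, address):
    """The single (row, column) crossbar cell a core uses for an address."""
    return (core, bank_of(address))

# Mutex plus crossbar area is just this C * B grid of identical cells.
cells = [(c, b) for c in range(C) for b in range(B)]
print(len(cells))                 # 128 cells
print(crossbar_cell(3, 0x1234))  # core 3 reaches bank 4 for this address
```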

Here's a brief disclosure:

Let's say you have 4 voltage sources, set to V1, V2, V3, and V4 with respect to ground, feeding the anodes of respective diodes D1, D2, D3 and D4, whose cathodes are all tied together. What is the voltage across each of the diodes?

Consider this circuit.

[Figure: schematic of the four sources V1-V4, each feeding a diode, with all four cathodes joined at a common node]
In this circuit, each voltage source is producing a sine wave of a different frequency. Its transient analysis looks like:

[Figure: transient analysis; colored traces are the four source voltages, the dark blue trace is the common cathode node]
The dark blue line represents the voltage on the wire that connects all the diodes together (at their cathodes). The other colors represent the voltages of the respective voltage sources -- hence the voltage on each diode's input (its anode). Therefore the voltage across each of the diodes is the distance from the dark blue line to its respective colored line.

The thing to notice is that the dark blue line is always just below, or right on top of, the highest voltage at any point in time: the common cathode node tracks the maximum anode voltage, one diode drop below it. That means at almost any given point in time there is only one "winning" diode -- the one with a positive voltage across it. If a positive voltage sensor is placed across each diode, and that sensor outputs a binary 1 or 0 declaring whether its sensed voltage is positive, we have a way to exclude all but one of a number of "supplicants" from access to a shared resource such as a bank of memory.
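
Here is a minimal numeric sketch of that behavior (my own illustration; it uses an idealized diode with a fixed forward drop VF instead of a real SPICE model, and the frequencies and VF value are made up):

```python
import math

VF = 0.6  # assumed diode forward drop, in volts (illustrative)

def sensor_outputs(t, freqs, amps):
    """Binary sensor outputs at time t for sine sources sharing a cathode node."""
    v = [a * math.sin(2 * math.pi * f * t) for f, a in zip(freqs, amps)]
    cathode = max(v) - VF  # common node tracks the highest anode, one drop below
    # A sensor reports 1 when the voltage across its diode is positive.
    # Sources within one diode drop of the maximum also read positive --
    # exactly the "collision" case discussed below.
    return [1 if vi - cathode > 0 else 0 for vi in v]

freqs = [1.0, 1.3, 1.7, 2.1]  # four sources at different frequencies
amps = [1.0, 1.0, 1.0, 1.0]
for t in (0.05, 0.15, 0.25, 0.35):
    print(round(t, 2), sensor_outputs(t, freqs, amps))
```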

So now, we place one of these voltage-source/diode pairs in each crossbar switch and connect their cathodes in a line that reaches across the cores to provide mutual exclusion for each memory bank.

That is where we get near constant-time arbitration, regardless of the number of cores.

We might want to use filtered-noise voltage sources instead of sine waves, to make the contention more random, but the principle is the same.

However, what happens in the "unlikely" event that two sensors report they see a positive voltage? (I scare-quote "unlikely" because the more cores you have, the more voltage sources you have, and hence the more likely such a collision becomes.)
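
A quick Monte Carlo sketch of that growth (my own, with made-up numbers: instantaneous source voltages modeled as i.i.d. uniform on [0, 1], and a sensor tripping whenever its source is within one normalized diode drop of the maximum):

```python
import random

def collision_prob(n_cores, drop=0.06, trials=20000):
    """Estimate P(more than one sensor reads positive) among n contenders."""
    hits = 0
    for _ in range(trials):
        v = [random.random() for _ in range(n_cores)]
        top = max(v)
        # The maximum itself always trips, so a count above 1 is a collision.
        if sum(1 for x in v if top - x <= drop) > 1:
            hits += 1
    return hits / trials

for n in (2, 8, 64, 512):
    print(n, round(collision_prob(n), 3))  # collision probability rises with n
```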

Answer:

Go ahead and settle for O(log(C)).  How?

A way to do this, off the top of my head: let's say about 10% of the requests for a given bank end up with voltages that are sensed as "winners". That means 10% of the cores accessing that memory bank will have their sensors falsely report exclusive access to that interleaved bank. The other 90% of the cores' crossbar sensors are blocked from further contention, while the remaining 10% continue to generate changing voltages; at some point the remaining contenders' voltages diverge sufficiently to distinguish them, and the round repeats. Terminating this tournament depends on being able to detect when there is exactly one op amp for the interleaved bank reporting itself winner -- in the unlikely event the count goes to 0, the mutex is restarted with all requesting crossbar sensors active. Since each round eliminates roughly 90% of the contenders, this results in a total mutex time that is, on average, proportional to log base 10 of the number of cores.
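
Here is a sketch of that tournament (my own model of the paragraph above, under the stated ~10% survival assumption; the restart-on-zero policy adds a roughly constant overhead, so it is the growth with C that tracks log base 10):

```python
import math
import random

def tournament_rounds(n_contenders, survive=0.10):
    """Rounds until exactly one contender remains.

    Each round, every remaining contender independently stays in with
    probability `survive` (its sensor falsely reports "winner"); if the
    field ever empties, the mutex restarts with all contenders active.
    """
    rounds, field = 0, n_contenders
    while field > 1:
        rounds += 1
        survivors = sum(1 for _ in range(field) if random.random() < survive)
        field = survivors if survivors > 0 else n_contenders
    return rounds

for C in (10, 100, 1000, 10000):
    avg = sum(tournament_rounds(C) for _ in range(300)) / 300
    print(C, round(avg, 1), "vs log10(C) =", math.log10(C))
```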

In this way, in System On Chip layouts -- where shared memory is on the same chip as the cores -- an exceedingly small latency for shared main-memory access can be achieved, which obviates much of the per-core real estate for caches and cache-coherence logic. This leaves more real estate for main shared memory.

One might object that the main memory would be of inadequate size for most practical applications; however, keep in mind that the feature sizes now being achieved are below 20nm. Moreover, if the cores are limited to 32-bit rather than 64-bit, the number of banks and cores can be increased to the point that quite substantial applications can fit within the shared main-memory constraints.

I leave it as an academic exercise how many cores and how much memory can fit on a single chip -- and what the average main-memory latency would be (including suspending execution while waiting for bank access), assuming all cores are executing threads.
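
As a starting point for that exercise, here is a crude contention model (all parameters are my own placeholder guesses, not figures from this post):

```python
def p_uncontended(C, B, p_access=0.3):
    """P(no other core touches my bank this cycle).

    Assumes each of the other C - 1 cores independently issues a memory
    access with probability p_access, spread uniformly over the B banks.
    """
    return (1 - p_access / B) ** (C - 1)

# Keeping B comfortably larger than C keeps contention low as both scale.
for C, B in ((8, 16), (64, 128), (1024, 4096)):
    print(C, B, round(p_uncontended(C, B), 3))
```

Under a simple retry model, the average bank-access latency would then be roughly the uncontended access time divided by that probability, before adding the mutex rounds estimated above.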

*The argument for slightly lower power is that in processor chips with multi-billion transistor counts, a lot more transistors are dedicated to CPU cores than would be the case with the present SOC, since far more of the real estate of the proposed SOC is in RAM banks. RAM banks are, per transistor, far less likely to be activated in any given clock cycle than CPU cores.