A Circuit Minimizing Multicore Shared Memory Latency
A massively multicore system on chip (SoC) can be built that executes current code without modification, and with good utilization of the cores, if the real estate normally assigned to cache memories and to maintaining cache coherence can instead be used for phased banks of shared memory -- but only if the mutual-exclusion circuitry that resolves bank contention between cores does not grow, in either real estate or latency, faster than the number of cores times the number of banks of shared memory.
If you're still with me, you're the audience I want for the following disclosure of my invention of such a mutex circuit.
Its primary characteristics are:
1) Its real-estate requirements are proportional to C*B, where C is the number of cores and B is the number of banks. In other words, a standard-cell crossbar switch containing both shared-memory elements and the mutex circuit is all that is needed in addition to the standard cell for the core -- regardless of the values of B and C. Each core directly sees a row of B crossbar standard cells that no other core sees directly. Each bank consists of a column of C crossbar standard cells.
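The scaling claim above can be sketched trivially in code. The C and B values here are illustrative examples of mine, not figures from this disclosure:

```python
# Sketch of the area-scaling claim: the crossbar is a C-by-B grid of
# identical standard cells, so total cell count grows exactly as C*B.
def crossbar_cells(cores: int, banks: int) -> int:
    """Each core sees a private row of B cells; each bank is a column of
    C cells shared across cores. The grid holds cores * banks cells."""
    return cores * banks

for cores, banks in [(16, 16), (256, 64), (1024, 128)]:
    print(f"{cores} cores x {banks} banks -> {crossbar_cells(cores, banks)} crossbar cells")
```

Doubling either the core count or the bank count doubles the cell count; nothing grows quadratically in C alone or B alone.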
2) The latency introduced by mutual exclusion is log(C) where the base of the log is a large number -- much larger than 2.
3) Its power requirements are minimal.
Here's a brief disclosure:
Let's say you have 5 voltage sources, set to V1, V2, V3, V4 and V5 with respect to ground. What is the voltage, with respect to ground, of a wire that connects all five? (Look at the Vd across each of the diodes in this circuit.)
If the wires are low-resistance (as they are on a chip), then current flows in the same direction in every voltage source but one.
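Under idealized assumptions (ideal wires, a fixed forward drop per diode -- my simplification, not a SPICE model), the behavior of that shared wire can be sketched numerically: the wire settles near the maximum source voltage minus one diode drop, and only the highest source's diode conducts.

```python
# Idealized sketch of the diode-OR "winner" behavior.
# Each source Vi drives the shared wire through its own diode with
# forward drop VD. The wire settles to max(Vi) - VD, so only the
# highest source's diode is forward-biased -- current flows "the same
# direction in every voltage source but one".
VD = 0.7  # assumed silicon-diode forward drop, volts

def wire_voltage(sources):
    return max(sources) - VD

def conducting(sources, margin=1e-9):
    vw = wire_voltage(sources)
    # a diode conducts only if its source exceeds the wire by a full drop
    return [v - vw >= VD - margin for v in sources]

vs = [1.0, 2.5, 3.3, 0.7, 2.9]
print(wire_voltage(vs))   # ~2.6 (max source minus one diode drop)
print(conducting(vs))     # only the 3.3 V source conducts
```

This maximum-selecting behavior is what makes the contention resolution near constant-time: the wire itself "computes" the winner in one settling interval.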
That's where we get near constant-time.
However, what happens in the unlikely event that two voltages are indistinguishable, despite being generated by independent noise sources incorporated, one per crossbar cell, into the mutex circuits?
Go ahead and settle for O(log(C)) but do it within an analog context so that the effective base of the log is the SNR of the noise source with respect to the signal level of the overall circuit.
A way to do this off the top of my head: each crossbar's random voltage is the appropriately scaled voltage out of a zener-diode noise source at that instant. The fraction of op amps detecting that they are the winner of the mutex will be some function of the SNR of the circuit. Hopefully the zener-diode noise will not raise the noise floor of the chip substantially; I don't know whether it does, so another kind of random voltage source may be necessary.
Let's say about 10% of the requests for a given bank end up with voltages indistinguishable from the winner's. That means 10% of the cores accessing that memory bank will have their sensor op amps falsely report exclusive access to that interleaved bank. A disqualified core's crossbar blocks its op amp from further contention, but the noise source continues to generate random voltages. At some point the remaining 10% of contenders' voltages will diverge sufficiently to distinguish them. Terminating this tournament depends on being able to detect when exactly one op amp for the interleaved bank reports itself the winner -- in the unlikely event the count goes to 0, the mutex restarts with all requesting crossbars active. This results in a total mutex time that is, on average, log base 10 of the number of cores.
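The elimination tournament above can be checked with a quick Monte Carlo. The distributional assumptions here are mine, not the disclosure's: each contender's noise voltage is uniform on [0, 1), and any voltage within `resolution` of the round's maximum is "indistinguishable from the winner" and so survives to the next round. With `resolution = 0.1`, roughly 10% of contenders survive each round, so the round count should grow like log base 10 of the core count:

```python
# Monte Carlo sketch of the noise-driven elimination tournament.
# Survivors of a round are contenders whose random voltage is within
# `resolution` of that round's maximum; all others are disqualified.
import random

def rounds_to_win(contenders: int, resolution: float = 0.1, rng=random) -> int:
    rounds = 0
    while contenders > 1:
        vs = [rng.random() for _ in range(contenders)]
        top = max(vs)
        # contenders indistinguishable from the winner stay in
        contenders = sum(1 for v in vs if top - v < resolution)
        rounds += 1
    return rounds

random.seed(1)
for c in (10, 100, 1000):
    avg = sum(rounds_to_win(c) for _ in range(2000)) / 2000
    print(f"{c} contenders: avg {avg:.2f} rounds")  # grows roughly like log10(c)
```

Note that in this simulation the survivor count can never reach 0 (the round's maximum always survives); the restart case in the text arises only from real-circuit detection errors, which this sketch does not model.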
In this way, in system-on-chip layouts -- where shared memory is on the same chip as the cores -- an exceedingly small latency for shared main-memory access can be achieved, which obviates much of the per-core real estate for cache and cache-coherence logic. That leaves more real estate for main shared memory.
One might object that the main memory would be of inadequate size for most practical applications; however, keep in mind that feature sizes below 20 nm are now being achieved. Moreover, if the cores are limited to 32-bit rather than 64-bit, the number of banks and cores can be increased to the point that quite substantial applications fit within the shared-main-memory constraints.
I leave it as an academic exercise how many cores and how much memory can fit on a single chip -- and how long the average main-memory latency would be (including suspending execution while waiting for bank access), assuming all cores are executing threads.