Mehrfädigkeit (Multithread)

Grundsätzliche Aufgabe beim Prozessorentwurf

Reduzierung der Untätigkeits- oder Latenzzeiten (Entstehen bei Speicherzugriffen, insbesondere bei Cache-Fehlzugriffen)

Mehrfädige Prozessortechnik

Gegeben: mehrere ausführbereite Kontrollfäden (Threads)
🎯 Ziel: Parallele Ausführung mehrerer Kontrollfäden
Prinzip
- Mehrere Kontrollfäden sind geladen
- Kontext muss für jeden Thread gesichert werden können
- Mehrere getrennte Registersätze auf Prozessorchip
- Mehrere Befehlszähler
- Getrennte Seitentabellen
- Threadwechsel, wenn gewartet werden muss

Cycle-by-cycle Interleaving (feingranulares Multithreading)

Eine Anzahl von Kontrollfäden ist geladen.
Der Prozessor wählt in jedem Takt einen der ausführungsbereiten Kontrollfäden aus.
Der nächste Befehle in der Befehlsreihenfolge des ausgewählten Kontrollfadens wird zur Ausführung ausgewählt.
Bsp: Multiprozessorsysteme HEP, Tera
👎 Nachteil:
- Die Verarbeitung eines Threads kann erheblich verlangsamt werden, wenn er ohne Wartezeiten ausgeführt werden kann

The purpose of interleaved multithreading is to remove all data dependency stalls from the execution pipeline. Since one thread is relatively independent from other threads, there is less chance of one instruction in one pipelining stage needing an output from an older instruction in the pipeline. Conceptually, it is similar to preemptive multitasking used in operating systems; an analogy would be that the time slice given to each active thread is one CPU cycle.
For example:
Cycle i + 1: an instruction from thread B is issued.
Cycle i + 2: an instruction from thread C is issued.

Block Interleaving (block, cooperative or Coarse-grained multithreading)

Befehle eines Kontrollfadens werden so lange ausgeführt, bis eine Instruktion mit einer langen Latenzzeit ausgeführt wird. Dann wird zu einem anderen ausführbaren Kontrollfaden gewechselt.
👍 Vorteil
- Die Bearbeitung eines Threads wird nicht verlangsamt, da beim Warten ausführungsbereiter Thread gestartet wird
👎 Nachteil
- Bei Thread-Wechsel Leeren und Neustarten der Pipeline,
- Nur bei langen Wartezeiten sinnvoll

The simplest type of multithreading occurs when one thread runs until it is blocked by an event that normally would create a long-latency stall. Such a stall might be a cache miss that has to access off-chip memory, which might take hundreds of CPU cycles for the data to return. Instead of waiting for the stall to resolve, a threaded processor would switch execution to another thread that was ready to run. Only when the data for the previous thread had arrived, would the previous thread be placed back on the list of ready-to-run threads.
For example:
Cycle i: instruction j from thread A is issued.
Cycle i + 1: instruction j + 1 from thread A is issued.
Cycle i + 2: instruction j + 2 from thread A is issued, which is a load instruction that misses in all caches.
Cycle i + 3: thread scheduler invoked, switches to thread B.
Cycle i + 4: instruction k from thread B is issued.
Cycle i + 5: instruction k + 1 from thread B is issued.

Simultaneous Multithreading

Mehrfach superskalarer Prozessor
Die Ausführungseinheiten werden über eine Zuordnungseinheit aus mehreren Befehlspuffern versorgt.
Jeder Befehlspuffer stellt einen anderen Befehlsstrom dar.
Jedem Befehlsstrom ist eigener Registersatz zugeordnet.
Diskussion:
- Abwägen zwischen Geschwindigkeit eines Threads und dem Durchsatz vieler Threads
- Mischen vieler Threads: Geht möglicherweise zu Lasten der Leistung der einzelnen Threads

The most advanced type of multithreading applies to superscalar processors. Whereas a normal superscalar processor issues multiple instructions from a single thread every CPU cycle, in simultaneous multithreading (SMT) a superscalar processor can issue instructions from multiple threads every CPU cycle. Recognizing that any single thread has a limited amount of instruction-level parallelism, this type of multithreading tries to exploit parallelism available across multiple threads to decrease the waste associated with unused issue slots.
For example:
Cycle i: instructions j and j + 1 from thread A and instruction k from thread B are simultaneously issued.
Cycle i + 1: instruction j + 2 from thread A, instruction k + 1 from thread B, and instruction m from thread C are all simultaneously issued.
Cycle i + 2: instruction j + 3 from thread A and instructions m + 1 and m + 2 from thread C are all simultaneously issued.
More see: Wiki

Vergleich

Vorlesung Rechnerstruktur

Last updated on Apr 6, 2022