Why are newer generations of processors faster at the same clock speed?

6829
agz

Why, for example, would a 2.66 GHz dual-core Core i5 be faster than a 2.66 GHz Core 2 Duo, which is also dual-core?

Is it because of newer instructions that can process information in fewer clock cycles? What other architectural changes are involved?

This question comes up often, and the answers are usually the same. This post is meant to provide a definitive, canonical answer to it. Feel free to edit the answers to add additional details.

34
Related: [Instruction per cycle vs. increased cycle count](http://superuser.com/questions/363856/instruction-per-cycle-vs-increased-cycle-count) Ƭᴇcʜιᴇ007 11 years ago 0
Wow, both Breakthrough's and David's answers are excellent... I don't know which one to pick as correct :P agz 11 years ago 0
Also a better instruction set and more registers, e.g. MMX (very old now) and x86_64 (when AMD invented x86_64 they added some compatibility-breaking improvements in 64-bit mode; they realized compatibility would be broken anyway). ctrl-alt-delor 8 years ago 0
For really significant improvements to the x86 architecture, a new instruction set would be needed, but then it would no longer be x86. It would be PowerPC, MIPS, Alpha, ... or ARM. ctrl-alt-delor 8 years ago 0

4 Answers

38
bwDraco

Designing a processor to deliver high performance is far more than just increasing the clock rate. There are numerous other ways to increase performance, enabled through Moore's law and instrumental to the design of modern processors.

Clock rates can't increase indefinitely.

  • At first glance, it may seem that a processor simply executes a stream of instructions one after another, with performance increases attained through higher clock rates. However, increasing clock rate alone isn't enough. Power consumption and heat output increase as clock rates go up.

  • With very high clock rates, significant increases in the CPU core voltage become necessary. Because TDP increases with the square of Vcore, we eventually reach a point where excessive power consumption, heat output, and cooling requirements prevent further increases in clock rate. This limit was reached in 2004, in the days of the Pentium 4 Prescott. While recent improvements in power efficiency have helped, significant increases in clock rate are no longer feasible. See: Why have CPU manufacturers stopped increasing the clock speeds of their processors? (A first-order power model follows this list.)

Graph of stock clock speeds in cutting-edge enthusiast PCs over the years. (Image source)

  • Through Moore's law, an observation which states that the number of transistors on an integrated circuit doubles every 18 to 24 months, primarily as a result of die shrinks, a variety of techniques which increase performance have been implemented. These techniques have been refined and perfected over the years, enabling more instructions to be executed over a given period of time. These techniques are discussed below.
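To make the relationship between voltage, frequency, and power concrete, here is the standard first-order model for CMOS dynamic power (textbook background, not part of the original answer):

```latex
P_{\text{dynamic}} \approx \alpha \, C \, V^{2} \, f
```

where α is the activity factor, C the switched capacitance, V the core voltage, and f the clock frequency. Because pushing f higher eventually requires raising V as well, power grows much faster than linearly with clock rate.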

Seemingly sequential instruction streams can often be parallelized.

  • Although a program may simply consist of a series of instructions to execute one after another, these instructions, or parts thereof, can very often be executed simultaneously. This is called instruction-level parallelism (ILP). Exploiting ILP is vital to attaining high performance, and modern processors use numerous techniques to do so.
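A minimal C sketch of this idea (the function and variable names are illustrative, not from the original answer):

```c
/* A dependent pair of statements vs. an independent pair. */
int ilp_demo(int a, int b, int d, int g) {
    int c = a + b;   /* produces c...                                   */
    int e = c + d;   /* ...which this statement reads, so these two     */
                     /* must effectively execute in order               */

    int x = a * b;   /* x and y share no data with each other, so the   */
    int y = d * g;   /* hardware is free to compute them simultaneously */
    return c + e + x + y;
}
```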

Pipelining breaks instructions into smaller pieces which can be executed in parallel.

  • Each instruction can be broken down into a sequence of steps, each of which is executed by a separate part of the processor. Instruction pipelining allows multiple instructions to go through these steps one after another without having to wait for each instruction to finish completely. Pipelining enables higher clock rates: by having one step of each instruction complete in each clock cycle, less time would be needed for each cycle than if entire instructions had to be completed one at a time.

  • The classic RISC pipeline contains five stages: instruction fetch, instruction decode, instruction execution, memory access, and writeback. Modern processors break execution down into many more steps, producing a deeper pipeline with more stages (and increasing attainable clock rates, as each stage is smaller and takes less time to complete), but this model should provide a basic understanding of how pipelining works. A small cycle-count sketch follows the diagram below.

Diagram of a five-stage instruction pipeline. (Image source)
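To see why overlapping stages raises throughput, here is a back-of-the-envelope C sketch using the idealized textbook cycle counts (no hazards or stalls assumed):

```c
#include <stdio.h>

int main(void) {
    const long stages = 5, n = 1000000;    /* 5-stage pipeline, n instructions */
    long serial    = n * stages;           /* each instruction runs start to finish alone   */
    long pipelined = stages + (n - 1);     /* fill the pipe once, then retire one per cycle */
    printf("serial:    %ld cycles\n", serial);
    printf("pipelined: %ld cycles (%.2fx speedup)\n",
           pipelined, (double)serial / pipelined);
    return 0;
}
```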

However, pipelining can introduce hazards which must be resolved to ensure correct program execution.

  • Because different parts of each instruction are being executed at the same time, it is possible for conflicts to occur which interfere with correct execution. These are called hazards. There are three types of hazards: data, structural, and control.

  • Data hazards occur when instructions read and modify the same data at the same time or in the wrong order, potentially leading to incorrect results. Structural hazards occur when multiple instructions need to use a particular part of the processor at the same time. Control hazards occur when a conditional branch instruction is encountered.

  • These hazards may be resolved in various ways. The simplest solution is to simply stall the pipeline, temporarily putting execution of one or more instructions on hold to ensure correct results. This is avoided whenever possible because it reduces performance. For data hazards, techniques such as operand forwarding are used to reduce stalls. Control hazards are handled through branch prediction, which requires special treatment and is covered in the next section.
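Here is a small C fragment with the hazards a simple in-order five-stage pipeline would encounter annotated in comments (a sketch; the variable names are my own):

```c
int hazards(int a, int b, int d) {
    int c = a + b;   /* instruction 1: produces c                        */
    int e = c + d;   /* instruction 2: reads c -> read-after-write (RAW) */
                     /* data hazard; without operand forwarding, the     */
                     /* pipeline stalls until instruction 1 writes back  */
    if (e > 0)       /* conditional branch -> control hazard: the next   */
        e = -e;      /* instruction isn't known until the condition is   */
                     /* evaluated (or predicted)                         */
    return e;
}
```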

Branch prediction is used to resolve control hazards which can disrupt the entire pipeline.

  • Control hazards, which occur when a conditional branch is encountered, are particularly serious. Branches introduce the possibility that execution will continue elsewhere in the program rather than simply the next instruction in the instruction stream, based on whether a particular condition is true or false.

  • Because the next instruction to execute cannot be determined until the branch condition is evaluated, it is not possible to insert any instructions into the pipeline after a branch in the absence of branch prediction. The pipeline is therefore emptied (flushed), which can waste nearly as many clock cycles as there are stages in the pipeline. Branches tend to occur very often in programs, so control hazards can severely impact processor performance.

  • Branch prediction addresses this issue by guessing whether a branch will be taken. The simplest way to do this is simply to assume that branches are always taken or never taken. However, modern processors use much more sophisticated techniques for higher prediction accuracy. In essence, the processor keeps track of previous branches and uses this information in any of several ways to predict the next instruction to execute. The pipeline can then be fed with instructions from the correct location based on the prediction.

  • Of course, if the prediction is wrong, whatever instructions were put through the pipeline after the branch must be dropped, thereby flushing the pipeline. As a result, the accuracy of the branch predictor becomes increasingly critical as pipelines get longer and longer. Specific branch prediction techniques are beyond the scope of this answer.
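The cost of misprediction is easy to observe from ordinary code. In this well-known C experiment (a sketch of the example popularized by a famous Stack Overflow question), the loop typically runs much faster over sorted data, because the branch becomes almost perfectly predictable:

```c
#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void) {
    enum { N = 1 << 20 };
    static int data[N];
    for (int i = 0; i < N; i++)
        data[i] = rand() % 256;

    /* With this line, the branch below settles into a long run of
     * "not taken" followed by a long run of "taken", which the
     * predictor learns almost perfectly; comment it out and the
     * outcome is random, causing frequent mispredictions and
     * (typically) a much slower loop. */
    qsort(data, N, sizeof data[0], cmp_int);

    long long sum = 0;
    for (int pass = 0; pass < 100; pass++)
        for (int i = 0; i < N; i++)
            if (data[i] >= 128)      /* the branch being predicted */
                sum += data[i];

    printf("sum = %lld\n", sum);
    return 0;
}
```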

Caches are used to speed up memory accesses.

  • Modern processors can execute instructions and process data far faster than they can be accessed in main memory. When the processor must access RAM, execution can stall for long periods of time until the data is available. To mitigate this effect, small high-speed memory areas called caches are included on the processor.

  • Because of the limited space available on the processor die, caches are of very limited size. To make the most of this limited capacity, caches store only the most recently or frequently accessed data (temporal locality). As memory accesses tend to be clustered within particular areas (spatial locality), blocks of data near what is recently accessed are also stored in the cache. See: Locality of reference

  • Caches are also organized into multiple levels of varying size to optimize performance, as larger caches tend to be slower than smaller caches. For example, a processor may have a level 1 (L1) cache that is only 32 KB in size, while its level 3 (L3) cache can be several megabytes large. The size of the cache, as well as its associativity (which determines how the processor chooses what to replace when the cache is full), significantly affects the performance gains obtained from a cache.
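A classic C illustration of spatial locality (a sketch; actual speedups depend on the machine): summing a matrix row by row walks memory sequentially, while summing column by column jumps a full row between accesses and misses the cache far more often.

```c
#define N 2048
static double m[N][N];          /* 32 MB: far larger than any cache */

/* Row-major order: consecutive iterations touch adjacent memory,
 * so each cache line is fully used before moving on. */
double sum_rows(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Column-major order: consecutive iterations are N*8 bytes apart,
 * so nearly every access lands on a different cache line and the
 * loop is dominated by cache misses. */
double sum_cols(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
```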

Out-of-order execution reduces stalls due to hazards by allowing independent instructions to execute first.

  • Not all instructions in an instruction stream depend on one another. For example, although a + b = c must be executed before c + d = e (the second reads the first's result), a + b = c and f + g = h are independent and can be executed at the same time.

  • Out-of-order execution takes advantage of this fact to allow other, independent instructions to execute while one instruction is stalled. Instead of requiring instructions to execute one after another in lockstep, scheduling hardware is added to allow independent instructions to be executed in any order. Instructions are dispatched to an instruction queue and issued to the appropriate part of the processor when the required data becomes available. That way, instructions that are stuck waiting for data from an earlier instruction do not tie up later instructions that are independent.

Diagram of out-of-order execution. (Image source)

  • Several new and expanded data structures are required to perform out-of-order execution. The aforementioned instruction queue, the reservation station, is used to hold instructions until the data required for execution becomes available. The re-order buffer (ROB) is used to keep track of the state of instructions in progress, in the order in which they were received, so that instructions are completed in the correct order. A register file which extends beyond the number of registers provided by the architecture itself is needed for register renaming, which helps prevent otherwise independent instructions from becoming dependent due to the need to share the limited set of registers provided by the architecture.
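Out-of-order hardware can only exploit parallelism that the instruction stream exposes. A common C illustration (a sketch; measured gains vary by CPU): summing with one accumulator forms a single serial dependency chain, while splitting the sum across several accumulators gives the scheduler independent chains to overlap.

```c
/* One accumulator: each add reads the previous add's result, so the
 * adds form a serial chain limited by the adder's latency. */
double sum_one(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four accumulators: the four chains are mutually independent, so an
 * out-of-order core can keep several additions in flight at once. */
double sum_four(const double *a, int n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)           /* handle any leftover elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```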

Superscalar architectures allow multiple instructions within an instruction stream to execute at the same time.

  • The techniques discussed above only increase the performance of the instruction pipeline. These techniques alone do not allow more than one instruction to be completed per clock cycle. However, it is often possible to execute individual instructions within an instruction stream in parallel, such as when they do not depend on each other (as discussed in the out-of-order execution section above).

  • Superscalar architectures take advantage of this instruction-level parallelism by allowing instructions to be sent to multiple functional units at once. The processor may have multiple functional units of a particular type (such as integer ALUs) and/or different types of functional units (such as floating-point and integer units) to which instructions may be concurrently sent.

  • In a superscalar processor, instructions are scheduled as in an out-of-order design, but there are now multiple issue ports, allowing different instructions to be issued and executed at the same time. Expanded instruction decoding circuitry allows the processor to read several instructions at a time in each clock cycle and determine the relationships among them. A modern high-performance processor can schedule up to eight instructions per clock cycle, depending on what each instruction does. This is how processors can complete multiple instructions per clock cycle. See: Haswell execution engine on AnandTech

Diagram of the Haswell execution engine. (Image source)

  • However, superscalar architectures are very difficult to design and optimize. Checking for dependencies among instructions requires very complex logic, whose size grows rapidly (roughly quadratically, since every pair of in-flight instructions must be compared) as the number of simultaneous instructions increases. Also, depending on the application, there is only a limited number of instructions within each instruction stream that can be executed at the same time, so efforts to take greater advantage of ILP suffer from diminishing returns.

More advanced instructions are added which perform complex operations in less time.

  • As transistor budgets increase, it becomes possible to implement more advanced instructions that allow complex operations to be performed in a fraction of the time they would otherwise take. Examples include vector instruction sets such as SSE and AVX, which perform computations on multiple pieces of data at the same time, and the AES instruction set, which accelerates data encryption and decryption. (A short AVX sketch follows this list.)

  • To perform these complex operations, modern processors use micro-operations (μops). Complex instructions are decoded into sequences of μops, which are stored inside a dedicated buffer and scheduled for execution individually (to the extent allowed by data dependencies). This provides more room to the processor to exploit ILP. To further enhance performance, a special μop cache can be used to store recently decoded μops, so that the μops for recently executed instructions can be looked up quickly.

  • However, the addition of these instructions does not automatically boost performance. New instructions can increase performance only if an application is written to use them. Adoption of these instructions is hampered by the fact that applications using them will not work on older processors which do not support them.
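As a hedged sketch of what using such instructions from C looks like (assuming GCC or Clang, whose `__builtin_cpu_supports` builtin and `target` attribute are used here; the intrinsics are Intel's documented AVX intrinsics), including the runtime check that works around the adoption problem just described:

```c
#include <immintrin.h>   /* Intel SIMD intrinsics */

/* AVX path: processes eight single-precision floats per instruction.
 * The target attribute lets GCC/Clang emit AVX code for this one
 * function even if the rest of the file targets a baseline CPU. */
__attribute__((target("avx")))
static void add_avx(float *dst, const float *a, const float *b, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++)           /* remainder elements */
        dst[i] = a[i] + b[i];
}

static void add_scalar(float *dst, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* Dispatch at runtime so the binary still works on pre-AVX CPUs --
 * exactly the compatibility concern described above. */
void add_arrays(float *dst, const float *a, const float *b, int n) {
    if (__builtin_cpu_supports("avx"))
        add_avx(dst, a, b, n);
    else
        add_scalar(dst, a, b, n);
}
```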


So how do these techniques improve processor performance over time?

  • Pipelines have become longer over the years, reducing the amount of time needed to complete each stage and therefore enabling higher clock rates. However, among other things, longer pipelines increase the penalty for an incorrect branch prediction, so a pipeline can't be too long. In trying to reach very high clock speeds, the Pentium 4 processor used very long pipelines, up to 31 stages in Prescott. To reduce performance deficits, the processor would try to execute instructions even if they might fail, and would keep trying until they succeeded. This led to very high power consumption and reduced the performance gained from hyper-threading. Newer processors no longer use pipelines this long, especially since clock rate scaling has reached a wall; Haswell uses a pipeline which varies between 14 and 19 stages long, and lower-power architectures use shorter pipelines (Intel Atom Silvermont has 12 to 14 stages).

  • The accuracy of branch prediction has improved with more advanced architectures, reducing the frequency of pipeline flushes caused by misprediction and allowing more instructions to be executed concurrently. Considering the length of pipelines in today's processors, this is critical to maintaining high performance.

  • With increasing transistor budgets, larger and more effective caches can be embedded in the processor, reducing stalls due to memory access. Memory accesses can require more than 200 cycles to complete on modern systems, so it is important to reduce the need to access main memory as much as possible.

  • Newer processors are better able to take advantage of ILP through more advanced superscalar execution logic and "wider" designs that allow more instructions to be decoded and executed concurrently. The Haswell architecture can decode four instructions and dispatch 8 micro-operations per clock cycle. Increasing transistor budgets allow more functional units such as integer ALUs to be included in the processor core. Key data structures used in out-of-order and superscalar execution, such as the reservation station, reorder buffer, and register file, are expanded in newer designs, which allows the processor to search a wider window of instructions to exploit their ILP. This is a major driving force behind performance increases in today's processors.

  • More complex instructions are included in newer processors, and an increasing number of applications use these instructions to enhance performance. Advances in compiler technology, including improvements in instruction selection and automatic vectorization, enable more effective use of these instructions.

  • In addition to the above, greater integration of parts previously external to the CPU, such as the northbridge, memory controller, and PCIe lanes, reduces I/O and memory latency. This increases throughput by reducing stalls caused by delays in accessing data from other devices.

This would make a good blog post. Mokubai 9 years ago 6
Improvements in power efficiency are also a factor, given the power wall. Paul A. Clayton 9 years ago 0
It's not about how fast the clock is, but how many instructions per clock cycle can be processed. If a processor has the bandwidth to move 4 times as much data from memory to cache, it is effectively 4 times faster even with a slower clock. That's why AMD has had so much trouble trying to match the performance of Intel's products. Ramhound 9 years ago 0
29
David Schwartz

It's usually not because of new instructions. It's simply because the processor requires fewer instruction cycles to execute the same instructions. This can be for a large number of reasons:

  1. Larger caches mean less time wasted waiting for memory.

  2. More execution units mean less time waiting to begin operating on an instruction.

  3. Better branch prediction means less time wasted speculatively executing instructions that never actually need to execute.

  4. Execution unit improvements mean less time waiting for instructions to complete.

  5. Shorter pipelines mean the pipeline fills up faster.

And so on.

I believe the Core architecture has a 14-15 stage pipeline ([ref](http://www.bit-tech.net/hardware/cpus/2006/07/14/intel_core_2_duo_processors/2)), and Nehalem/Sandy Bridge has roughly a 14-17 stage pipeline ([link](http://www.anandtech.com/show/5057/the-bulldozer-aftermath-delving-even-deeper/2)). Breakthrough 11 years ago 0
Shorter pipelines are easier to keep full and reduce the penalty of pipeline flushes. Longer pipelines generally allow higher clock speeds. David Schwartz 11 years ago 0
That's what I mean: I think the pipeline depth itself has stayed the same or *increased*. Also, in the [Intel 64 and IA-32 Software Developer's Manuals](http://www.intel.com/products/processor/manuals/index.htm), the last mention of a pipeline change is in Vol. 1, Ch. 2.2.3/2.2.4 (the Intel Core/Atom microarchitectures). Breakthrough 11 years ago 0
The push for higher clock rates led to longer pipelines. It got ridiculous (a full 31 stages!) by the end of the NetBurst era. These days it's a delicate engineering decision, with advantages and disadvantages either way. David Schwartz 11 years ago 2
Also branch prediction improvements, instruction reordering/optimization, multiplexer unit improvements, miniaturization (reduced heat), and die design (improved single-die traces/circuits, etc.), ... Shaun Wilson 7 years ago 0
18
Breakthrough

The absolute definitive reference is the Intel 64 and IA-32 Architectures Software Developer's Manuals. They detail the changes between architectures and are an excellent resource for understanding the x86 architecture.

I would recommend that you download the combined volumes 1 through 3C (the first download link on that page). Volume 1, Chapter 2.2 contains the information you want.


Some general differences listed in that chapter, going from the Core to the Nehalem/Sandy Bridge microarchitecture, are:

  • improved branch prediction and faster recovery from misprediction
  • Hyper-Threading Technology
  • an integrated memory controller and a new cache hierarchy
  • faster floating-point exception handling (Sandy Bridge only)
  • LEA bandwidth improvements (Sandy Bridge only)
  • AVX instruction extensions (Sandy Bridge only)

The full list can be found in the link provided above (Vol. 1, Ch. 2.2).

0
Ale..chenski

Everything said above is true, but only to a degree. My answer is short: new-generation processors are "faster" primarily because they have larger and better-organized caches. This is the major factor in computer performance. For more quantitative details, see the Linux Magazine article on what affects performance in HPC.

In short, for most common applications (such as those in the SPEC collection), the limiting factor is memory. When a real sustained computation is running, the caches are all loaded with data, but every cache miss causes the CPU execution pipe to stall and wait. The problem is that no matter how sophisticated the CPU pipeline is, or how much better the instructions are, instruction-level parallelism is still quite limited (aside from some special, highly optimized, prefetched cases). Once a critical dependency is hit, all the parallelism is exhausted within five to ten CPU clocks, while evicting a cacheline and loading a fresh one from main memory takes hundreds of CPU clocks. So the processor sits waiting, doing nothing. This whole picture holds for multicore systems as well.
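A minimal C sketch of the dependent-load pattern described above (the types and names are my own): each pointer dereference must finish before the next can begin, so every cache miss exposes the full main-memory latency.

```c
struct node {
    struct node *next;
    long         payload;
};

/* Each load of p->next depends on the value just loaded, so the
 * loads cannot overlap: a cache miss anywhere in the chain stalls
 * the core for the full memory latency, with no independent work
 * available to hide it. */
long chase(const struct node *p) {
    long sum = 0;
    while (p) {
        sum += p->payload;
        p = p->next;         /* serially dependent load */
    }
    return sum;
}
```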

So, if you want a "faster" PC, buy one with the CPU that has the largest cache you can afford.
