AVR32 32-bit MCU/DSP Innovation
Innovating Processing Techniques
Atmel's AVR group has achieved the AVR32 core's exceptional computational throughput with a number of cycle-saving features that:
• reduce the number of load/store cycles,
• maximize the utilization of computational resources,
• provide zero-penalty branches, and
• reduce the number of cache "misses".
In addition, the AVR32 core is architected specifically to minimize both active power consumption and current leakage.
Pointer Arithmetic Minimizes Load/Store Cycles
On average, 30% of a processor's cycles are spent not on operations but on load/store instructions. The AVR32 reduces the required number of load/store instructions by providing byte (8-bit), half-word (16-bit), word (32-bit) and double-word (64-bit) load/store instructions that are combined with various forms of pointer arithmetic to access tables, data structures and random data in the fewest cycles.
An example of an innovative instruction is the novel "load with extracted index". Among the most popular algorithms for cryptography are the block cipher algorithms, of which Blowfish, Triple-DES and Rijndael are examples. All these algorithms use a special array addressing operation, which on current microprocessors requires a long instruction sequence to execute.
The AVR32 instruction set supports these algorithms with a new load instruction that loads a word using an extracted index. The operation is as follows:
result = pointer0[offset0 >> 24] ^ pointer1[(offset1 >> 16) & 0xff] ^ pointer2[(offset2 >> 8) & 0xff] ^ pointer3[offset3 & 0xff];
Four memory access operations are dominant in this operation, which extracts one of the four bytes in a 32-bit word, zero-extends it and adds it to a base pointer. The result of this operation generates the memory address to be accessed.
A conventional architecture would need fourteen cycles to execute this operation. The AVR32 can execute it in just seven clock cycles. The AVR32 load with extracted index instruction can perform all four memory accesses in four cycles, while keeping all four offsets in one register.
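The lookup described above can be sketched in portable C. This is only an illustration of what the operation computes, not the AVR32 instruction itself. The function names `load_extracted_index` and `round_lookup` are hypothetical, and the sketch assumes each table holds 256 32-bit words and that all four offsets are packed into one 32-bit word.

```c
#include <stdint.h>

/* Hypothetical helper: returns table[b], where b is byte `pos` of `index`.
   pos 3 selects bits 31..24 and pos 0 selects bits 7..0. */
static uint32_t load_extracted_index(const uint32_t *table,
                                     uint32_t index, unsigned pos)
{
    /* Extract one byte of the 32-bit word and zero-extend it. */
    uint32_t byte = (index >> (8u * pos)) & 0xffu;
    /* Indexing adds the (scaled) byte to the base pointer. */
    return table[byte];
}

/* The four-table XOR from the text, with the four offsets held in a
   single word (an assumption of this sketch). */
static uint32_t round_lookup(const uint32_t *t0, const uint32_t *t1,
                             const uint32_t *t2, const uint32_t *t3,
                             uint32_t offsets)
{
    return load_extracted_index(t0, offsets, 3)
         ^ load_extracted_index(t1, offsets, 2)
         ^ load_extracted_index(t2, offsets, 1)
         ^ load_extracted_index(t3, offsets, 0);
}
```

On a conventional core, each call to the helper typically expands into separate shift, mask and load instructions, which is the sequence the single instruction is meant to replace.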
By reducing the number of load/store instructions to be executed, the AVR32 core increases the throughput per clock cycle. Altogether, the AVR32 core has 28 instructions that increase the efficiency of load/store operations.
Single-Instruction Multiple Data (SIMD)
SIMD instructions in the AVR32 architecture can quadruple the throughput of certain DSP algorithms that require the same operation to be executed on a stream of data (e.g. motion estimation for MPEG decoding). An 8-bit sum of absolute differences (SAD) calculation is executed by loading four 8-bit pixels from memory in a single load operation, then executing a packed subtraction of unsigned bytes with saturation, adding together the high and low pairs of packed bytes and unpacking them into packed half-words. These are then added together to get the SAD value.
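The SAD steps can be emulated in plain C to show the data flow. This is a scalar sketch, not AVR32 code: `psub_sat_u8` and `sad4_u8` are hypothetical names, and combining two saturating subtractions (a-b and b-a) to obtain the absolute difference is an assumption of this sketch rather than something the text spells out.

```c
#include <stdint.h>
#include <string.h>

/* Emulates a packed unsigned saturating subtract on four bytes:
   each result byte is max(a_i - b_i, 0). */
static uint32_t psub_sat_u8(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (unsigned i = 0; i < 4; i++) {
        uint32_t x = (a >> (8u * i)) & 0xffu;
        uint32_t y = (b >> (8u * i)) & 0xffu;
        uint32_t d = (x > y) ? (x - y) : 0u;
        r |= d << (8u * i);
    }
    return r;
}

/* Sum of absolute differences over four 8-bit pixels. */
static uint32_t sad4_u8(const uint8_t *p, const uint8_t *q)
{
    uint32_t a, b;
    memcpy(&a, p, 4);   /* four pixels fetched as one 32-bit word */
    memcpy(&b, q, 4);

    /* In each byte lane one of the two differences saturates to zero,
       so OR-ing them yields |a_i - b_i|. */
    uint32_t d = psub_sat_u8(a, b) | psub_sat_u8(b, a);

    /* Add neighbouring bytes into two 16-bit lanes (packed half-words). */
    uint32_t pairs = (d & 0x00ff00ffu) + ((d >> 8) & 0x00ff00ffu);

    /* Add the two half-words to get the final SAD value. */
    return (pairs & 0xffffu) + (pairs >> 16);
}
```

Because the final sum is symmetric in the four lanes, the result does not depend on the byte order of the host machine.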
Multiple Pipelines Support Out of Order Execution
The AVR32 AP CPU has a 7-stage pipeline with three sub-pipelines (multiplication/MAC, load/store, and ALU) that allow arithmetic operations on non-dependent data to be executed out of order and in parallel. A conventional architecture has a single pipeline that stalls the code until each instruction is completed, which can waste valuable computational resources during multi-cycle instructions. Logic in the AVR32 AP pipeline allows non-dependent instructions to be executed simultaneously, using available pipeline resources, so out-of-order execution can increase the throughput per cycle. Hazard detection logic detects dependent instructions and holds them at the beginning of the pipeline until the operation on which they depend is complete.
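The dependency check behind hazard detection can be illustrated with a toy model in C. This is a deliberately simplified sketch, not a description of the actual AVR32 AP logic: the `Instr` type and `must_hold` function are hypothetical, and only the read-after-write case (a younger instruction reading a register that an older, unfinished instruction will write) is modelled.

```c
#include <stdbool.h>

/* Hypothetical instruction record: register numbers, -1 means "unused". */
typedef struct {
    int dest;   /* register written by the instruction */
    int src1;   /* first source register */
    int src2;   /* second source register */
} Instr;

/* Returns true if `younger` reads the register that `older`, still in
   flight, has yet to write, so `younger` must be held back. */
static bool must_hold(const Instr *older, const Instr *younger)
{
    if (older->dest < 0) {
        return false;   /* older instruction writes no register */
    }
    return younger->src1 == older->dest || younger->src2 == older->dest;
}
```

For example, with a multi-cycle multiply writing r3 in flight, an add that reads r3 would be held, while an add that reads only r7 and r8 could proceed in another sub-pipeline.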
Data Forwarding within Pipeline Stages
The AVR32 AP pipeline eliminates many of the cycles
used to write to and read from register files by forwarding data between
the pipeline stages. The results of instructions that finish execution before
the writeback stage are immediately forwarded to the beginning of the pipelines
for use by the instructions waiting for them. By minimizing
the number of register file accesses, this feature saves both cycles
and power consumption.
Hardware Branch Prediction
Although deep pipelines enable higher clock frequencies,
they introduce significant cycle penalties whenever there are jumps
in the program flow. These branch penalties are particularly harsh for
small inner loops. To address this problem, the AVR32 AP pipeline has
branch prediction logic that can accurately predict all change-of-flow
instructions. In addition, branches are "folded" with the
target instruction, resulting in a zero-cycle branch penalty.
Exceptional Code Density Reduces Cache Misses and Program Storage Cost
The AVR32 instruction set evolved from extensive benchmarking
and refinement done in parallel with the compiler vendor, IAR Systems
AB. The result is code that is 50% denser than that of comparable
32-bit cores, as measured with the EEMBC benchmark suite. Denser code allows more
instructions to be stored in the processor cache, thereby reducing the
number of cache misses and increasing overall processor throughput per
Instruction Set Support for Advanced Operating Systems
The majority of CPU architectures were developed before
operating system (OS) use became as pervasive as it is today. As a result,
CPU cores tend to waste cycles calling the OS or external applications.
The AVR32 architecture specifically supports the use of operating systems,
in particular the Linux® OS, with cycle saving instructions that
include an application call (ACALL) to a subroutine and a system call
(SCALL) that calls the operating system routine. The AVR32's advanced
MMU and security modes also support advanced operating systems such
Slower Clock Provides Ultra Low Power Consumption
The superior throughput of the AVR32 core allows a
slower clock frequency and a linear reduction in power consumption.
In addition, the AVR32 is designed to minimize active power consumption
at any clock rate by keeping data close to the CPU and minimizing
unnecessary movement of data on buses, which consumes considerable power.
For example, older MCU architectures copy the return address of a subroutine
call to a memory stack, consuming unnecessary power. The AVR32 eliminates
this need by including a link register in the register file. Another
power-saving feature is to keep the status register and the return address
for interrupts and exceptions in system registers, rather than moving
data to and from the system stack.