

AVR32 32-bit MCU/DSP Innovation

Innovating Processing Techniques

Atmel's AVR group has achieved the AVR32 core's exceptional computational throughput with a number of cycle-saving features that:
• reduce the number of load/store cycles,
• maximize the utilization of computational resources,
• provide zero-penalty branches, and
• reduce the number of cache "misses".
In addition, the AVR32 core is architected specifically to minimize both active power consumption and current leakage.

Pointer Arithmetic Minimizes Load/Store Cycles

On average, 30% of a processor's cycles are spent not on computation but on load/store instructions. The AVR32 reduces the required number of load/store instructions by providing byte (8-bit), half-word (16-bit), word (32-bit) and double-word (64-bit) load/store instructions that are combined with various forms of pointer arithmetic to access tables, data structures and random data in the fewest possible cycles.

An example of an innovative instruction is the novel "load with extracted index". Among the most popular algorithms for cryptography are the block cipher algorithms, of which Blowfish, Triple-DES and Rijndael are examples. All these algorithms use a special array addressing operation, which on current microprocessors requires a long instruction sequence to execute.

The AVR32 instruction set supports these algorithms with a new load instruction that loads a word using an extracted index. The operation is as follows:

result = pointer0[offset0 >> 24] ^ pointer1[(offset1 >> 16) & 0xff] ^
pointer2[(offset2 >> 8) & 0xff] ^ pointer3[offset3 & 0xff];

This operation is dominated by four memory accesses. For each access, one of the four bytes of a 32-bit word is extracted, zero-extended and added to a base pointer. The result is the memory address to be accessed.

A conventional architecture would need fourteen cycles to execute this operation. The AVR32 can execute it in just seven clock cycles: the load with extracted index instruction performs all four memory accesses in four cycles while keeping all four offsets in one register.
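
The table-lookup operation described above can be sketched in portable C as follows. This is only an illustration of the addressing pattern, assuming all four 8-bit offsets are packed into one 32-bit word and each table has 256 word entries. The function name lookup_xor and its parameters are hypothetical and are not part of any Atmel library.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not an Atmel API): XOR together four table entries,
 * each selected by one byte of a single packed 32-bit offset word.
 * Each table is assumed to hold 256 32-bit entries. */
static uint32_t lookup_xor(const uint32_t *t0, const uint32_t *t1,
                           const uint32_t *t2, const uint32_t *t3,
                           uint32_t offsets)
{
    assert(t0 && t1 && t2 && t3);

    /* Each term extracts one byte, zero-extends it and uses it as an index;
     * this is the step a load with extracted index would perform directly. */
    return t0[offsets >> 24]            /* top byte     */
         ^ t1[(offsets >> 16) & 0xff]   /* upper middle */
         ^ t2[(offsets >> 8) & 0xff]    /* lower middle */
         ^ t3[offsets & 0xff];          /* bottom byte  */
}
```

On a processor without such an instruction, each term typically costs a shift, a mask and a load, which is where the longer instruction sequence comes from.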

By reducing the number of load/store instructions to be executed, the AVR32 core increases its throughput per clock cycle. Altogether, the AVR32 core has 28 instructions that increase the efficiency of load/store operations.

Single-Instruction Multiple Data (SIMD)

SIMD instructions in the AVR32 architecture can quadruple the throughput of certain DSP algorithms that require the same operation to be executed on a stream of data (e.g. motion estimation for MPEG video encoding). An 8-bit sum of absolute differences (SAD) calculation is executed by loading four 8-bit pixels from memory in a single load operation, then executing a packed subtraction of unsigned bytes with saturation, adding together the high and low pairs of packed bytes and unpacking them into packed half-words. These are then added together to get the SAD value.
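
The SAD steps can be emulated in portable C to show the data flow. This is a sketch under stated assumptions, not AVR32 code: the packed operations are imitated with ordinary integer arithmetic, the helper names psub_sat_u8 and sad4_u8 are hypothetical, and obtaining the absolute difference by OR-ing two saturating subtractions is a common technique assumed here rather than something the text above specifies.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Emulates a packed subtraction of unsigned bytes with saturation:
 * each of the four byte lanes yields a - b, clamped at 0. */
static uint32_t psub_sat_u8(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xff;
        uint32_t y = (b >> (8 * lane)) & 0xff;
        uint32_t d = x > y ? x - y : 0;
        r |= d << (8 * lane);
    }
    return r;
}

/* Sum of absolute differences over four 8-bit pixels (hypothetical helper). */
static uint32_t sad4_u8(const uint8_t *p, const uint8_t *q)
{
    assert(p && q);

    uint32_t a, b;
    memcpy(&a, p, 4);   /* one 32-bit load fetches four pixels */
    memcpy(&b, q, 4);

    /* In each lane one of the two saturated results is zero, so OR-ing
     * them gives the absolute difference |a - b|. */
    uint32_t d = psub_sat_u8(a, b) | psub_sat_u8(b, a);

    /* Add the high and low byte of each pair, widening to two half-words. */
    uint32_t pairs = (d & 0x00ff00ffu) + ((d >> 8) & 0x00ff00ffu);

    /* Add the two half-words to obtain the SAD value. */
    return (pairs & 0xffffu) + (pairs >> 16);
}
```

Because the final sum is symmetric across lanes, the result does not depend on the byte order of the host.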

Multiple Pipelines Support Out of Order Execution

The AVR32 AP CPU has a 7-stage pipeline with 3 sub-pipelines (multiplication/MAC, load/store, and ALU) that allow arithmetic operations on non-dependent data to be executed out of order and in parallel. A conventional architecture has a single pipeline that stalls until each instruction is completed, which can waste valuable computational resources during multi-cycle instructions. Logic in the AVR32 AP pipeline allows non-dependent instructions to be executed simultaneously, using whichever pipeline resources are available, and this out-of-order execution can increase the throughput per cycle. Hazard detection logic detects dependent instructions and holds them at the beginning of the pipeline until the operation on which they depend is complete.
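
The dependency check performed by hazard detection can be illustrated with a small software model. This is a simplified sketch, not a description of the actual AVR32 AP logic: the Instr record and the functions depends_on and can_issue are hypothetical, and only read-after-write dependencies are modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical instruction record: one destination register and two
 * source registers; -1 marks an unused operand. */
typedef struct {
    int dst;
    int src1;
    int src2;
} Instr;

/* True when `later` reads a register that `earlier` writes
 * (a read-after-write dependency). */
static bool depends_on(const Instr *later, const Instr *earlier)
{
    assert(later && earlier);
    if (earlier->dst < 0)
        return false;
    return later->src1 == earlier->dst || later->src2 == earlier->dst;
}

/* May `next` start while the n instructions in `in_flight` are still
 * executing? It is held back if it depends on any of them. */
static bool can_issue(const Instr *next, const Instr *in_flight, int n)
{
    for (int i = 0; i < n; i++)
        if (depends_on(next, &in_flight[i]))
            return false;
    return true;
}
```

In this model an addition on unrelated registers can proceed while a multi-cycle multiply is in flight, whereas an instruction that reads the multiply's destination register is held.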

Data Forwarding within Pipeline Stages

The AVR32 AP pipeline eliminates many of the cycles used to write to and read from the register file by forwarding data between the pipeline stages. The results of instructions that finish execution before the writeback stage are immediately forwarded to the beginning of the pipelines, where they are used by instructions waiting for them. By minimizing the number of register file accesses, this feature saves both cycles and power.

Hardware Branch Prediction

Although deep pipelines enable higher clock frequencies, they introduce significant cycle penalties whenever there are jumps in the program flow. These branch penalties are particularly harsh for small inner loops. To address this problem, the AVR32 AP pipeline has branch prediction logic that can accurately predict all change-of-flow instructions. In addition, branches are "folded" with the target instruction, resulting in a zero-cycle branch penalty.

Exceptional Code Density Reduces Cache Misses and Program Storage Cost

The AVR32 instruction set evolved from extensive benchmarking and refinement done in parallel with the compiler vendor, IAR Systems AB. The result is code that is 50% denser than that of comparable 32-bit cores, as measured with the EEMBC benchmark suite. Denser code allows more instructions to be stored in the processor cache, thereby reducing the number of cache misses and increasing overall processor throughput per cycle.

Instruction Set Support for Advanced Operating Systems

The majority of CPU architectures were developed before operating system (OS) use became as pervasive as it is today. As a result, CPU cores tend to waste cycles calling the OS or external applications. The AVR32 architecture specifically supports the use of operating systems, in particular the Linux® OS, with cycle-saving instructions that include an application call (ACALL) to a subroutine and a system call (SCALL) that calls an operating system routine. The AVR32's advanced MMU and security modes also support advanced operating systems such as Linux.

Slower Clock Provides Ultra Low Power Consumption

The superior throughput of the AVR32 core allows the same work to be done at a slower clock frequency, with a corresponding linear reduction in power consumption. In addition, the AVR32 is designed to minimize active power consumption at any clock rate by keeping data close to the CPU and minimizing the unnecessary, power-hungry movement of data on buses. For example, older MCU architectures copy the return address of a subroutine call to a memory stack, consuming unnecessary power. The AVR32 eliminates this need by including a link register in the register file. Another power-saving feature is that the status register and the return address for interrupts and exceptions are kept in system registers, rather than being moved to and from the system stack.
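
The linear relationship between clock frequency and power can be made explicit with the standard first-order model of dynamic power in CMOS logic. This model is a general textbook approximation added for illustration, not a figure from Atmel; the symbols (activity factor \alpha, switched capacitance C, supply voltage V_{DD}, clock frequency f) are assumptions introduced here.

```latex
P_{\text{dyn}} \approx \alpha \, C \, V_{DD}^{2} \, f
```

At a fixed supply voltage, a core that completes the same workload at half the clock frequency therefore dissipates roughly half the dynamic power.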