Field programmable gate arrays (FPGAs) are no longer merely convenient interconnect layers between chips in a system. In software defined radios, FPGAs are increasingly being used as a general-purpose computational fabric to implement hardware acceleration units that boost performance while lowering cost and power requirements.
Typical implementations of software defined radio (SDR) modems include a general-purpose processor (GPP), a digital signal processor (DSP) and an FPGA. However, the FPGA fabric can be used to offload the GPP or DSP with application-specific hardware acceleration units. Soft-core microprocessors can have their cores extended with custom logic, or separate hardware acceleration coprocessors can be added to the system. Furthermore, with general-purpose routing resources available in the FPGA, these hardware acceleration units can run in parallel to further enhance the total computational output of the system. It is instructive to compare three distinct types of hardware acceleration units and their performance against software implementations.
Software Defined Radio
With the proliferation of wireless standards, future wireless devices will need to support multiple air interfaces and modulation formats. SDR technology enables such functionality in wireless devices by using a reconfigurable hardware platform across multiple standards.
SDR is the underlying technology behind the Joint Tactical Radio System (JTRS) initiative to develop software programmable radios that can enable seamless, real-time communication across the United States military services, and with coalition forces and allies. The functionality and expandability of JTRS is built upon an open architecture framework called the software communications architecture (SCA). JTRS terminals must support dynamic loading of any one of more than 25 specified air interfaces or waveforms that are typically more complex than those used in the civilian sector. To achieve all these requirements in a reasonable form-factor requires extensive processing power of different kinds. For that reason most architectures utilize a GPP, a DSP and an FPGA.
SDR System Architecture
The GPP, DSP and FPGA are general-purpose processing resources that can be used for different parts of the overall SDR system. Figure 1 shows an architecture example with the typical functions found in SDR divided across each of these devices. However, there is a significant amount of overlap between each of these elements. For example, an algorithm running on the DSP could be implemented in the GPP, albeit more slowly, or rewritten in HDL and run in an FPGA as a coprocessor or hardware acceleration unit.
Using FPGA resources for hardware acceleration can be done in several ways. However, there are three basic architectures: custom instructions, custom peripherals as coprocessors and dynamically reconfigurable application-specific processors. These hardware acceleration methods have different features and unique benefits. Understanding how and where to use each of these helps the system architect better use the FPGA resources for offloading the DSP and GPP in an SDR application.
Soft-Core Processors and Custom Instructions
With the advent of large FPGAs, small, powerful processors that can be embedded in an FPGA have appeared. These “soft-core” processors are configurable pieces of intellectual property (IP) that can be downloaded into an FPGA and used like any other embedded microprocessor. They even come with industry-standard toolchains including compilers, instruction-set simulators, a full suite of software debug tools and an integrated development environment. This toolset is familiar to any embedded software engineer, so much so that it hardly matters that the processor is downloaded to the FPGA as a bitstream. These soft-core processors are also highly flexible: before downloading the processor, a designer can choose among configuration options, trading off size for speed, and can add a myriad of peripherals for memory control, communications, I/O and so forth.
Custom instructions, which take the flexibility of soft-core processors one step further, are algorithm-specific additions of hardware to the soft-core microprocessor’s arithmetic logic unit (ALU). These new hardware instructions are used in place of a time-critical piece of an algorithm, recasting the software algorithm into a hardware block. A RISC microprocessor with a custom instruction blurs the division between RISC and CISC, because the custom instruction units can be multi-cycle hardware blocks implementing quite complex algorithms embedded in a RISC processor whose “standard instructions” take a single clock cycle. Furthermore, several custom instructions can be added to an ALU, limited only by the FPGA resources and the number of open positions in the soft-core processor’s op-code table. Figure 2 depicts the use of a custom instruction to extend the ALU of Altera’s Nios II soft-core microprocessor.
When should custom instructions be used? The most efficient use occurs when the algorithm to be accelerated is a relatively atomic operation that is called often and operates on data stored in local registers. Floating-point instructions are good examples. Floating-point arithmetic instructions can be implemented as library subroutines that the compiler automatically invokes on processors without dedicated floating-point instruction hardware. These floating-point algorithms take many clock cycles to execute. In an application they are typically used throughout the software code rather than localized to a few function calls. However, these algorithms can also be implemented as custom instructions extending a soft-core microprocessor’s ALU. Table 1 provides a comparison between several software library routines and the same function using a custom instruction. Note that even in this case the results may vary dramatically, depending on the design considerations for the custom instruction such as the amount of pipelining that is chosen in the hardware implementation.
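As a concrete illustration of an operation that fits the custom-instruction profile (small, register-to-register, and called frequently), the sketch below shows a saturating Q1.15 fixed-point multiply in C. This is a software reference model only: the function name and its use here are illustrative, not from the article, and in a real design the same datapath would be implemented in logic attached to the soft-core ALU and invoked through a compiler intrinsic.

```c
#include <stdint.h>

/* Software reference for a saturating Q1.15 fixed-point multiply --
 * the kind of small, register-to-register operation that makes a good
 * custom-instruction candidate.  (The name q15_mul is illustrative.) */
static int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;   /* full 32-bit product        */
    p >>= 15;                              /* renormalize Q1.15 * Q1.15  */
    if (p >  32767) p =  32767;            /* saturate instead of wrap   */
    if (p < -32768) p = -32768;
    return (int16_t)p;
}
```

In software this costs a multiply, a shift and two compares per call; as a custom instruction the whole sequence collapses into one op-code, which is where the speedups in Table 1 come from.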
The cyclic redundancy check (CRC) algorithm is also included in the table for comparison. Although the CRC as a custom instruction does provide some advantage over a software-only implementation, when the operation is executed on a large block of memory there are other ways of implementing the hardware acceleration unit that are more efficient and provide better overall throughput.
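To make the CRC comparison concrete, below is a bit-serial software CRC-32 (the IEEE 802.3 polynomial in reflected form) of the kind a custom instruction would replace; the function name is illustrative. Each input byte costs eight shift/XOR iterations in software, which is roughly the per-bit work a hardware CRC unit typically retires in a single clock.

```c
#include <stdint.h>
#include <stddef.h>

/* Bit-serial CRC-32, reflected IEEE 802.3 polynomial 0xEDB88320,
 * initial value 0xFFFFFFFF, final XOR 0xFFFFFFFF.  This is the
 * software baseline a hardware CRC unit is measured against. */
static uint32_t crc32_sw(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];                     /* fold in the next byte     */
        for (int b = 0; b < 8; b++)        /* eight shift/XOR steps     */
            crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1u));
    }
    return ~crc;                           /* final inversion           */
}
```

The standard check value for the ASCII string "123456789" is 0xCBF43926, which makes the routine easy to verify against any other CRC-32 implementation.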
Hardware Acceleration Coprocessors
Whereas custom instructions are an extension of an ALU, and are therefore available only with a soft-core microprocessor, hardware acceleration coprocessors can be used to accelerate processors implemented either on or off the FPGA. Figure 3 depicts an architecture using a coprocessor. In this design, the processor could be an off-FPGA GPP or DSP, or it could be an on-FPGA processor (either a hard-core or a soft-core processor). One of the key advantages of the coprocessor is that it is wrapped in a direct memory access (DMA) engine, so it can fetch and write back entire blocks of data without processor intervention.
Situations where hardware acceleration coprocessors could be used over a custom instruction have one or more of the following common characteristics:
- Algorithms do not operate solely on register variables (non-atomic)
- Operations are more complex (often a subroutine in software)
- Transformation of data is done on a large data block
Table 2 provides several examples of hardware acceleration coprocessor performance over software-only implementations for some algorithms used in SDR applications. SDRs can benefit from coprocessors for various DSP functions as well as for higher-level, application-level hardware acceleration.
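The control flow that a DMA-wrapped coprocessor presents to software can be mocked in a few lines of C: the processor fills in one descriptor, starts the unit, and later checks a done flag, rather than touching every sample itself. The sketch below is a software model under assumed names (dma_desc, coproc_start); in real hardware the processing loop would run concurrently with the CPU.

```c
#include <stdint.h>
#include <stddef.h>

/* Software mock of a DMA-wrapped coprocessor (names illustrative).
 * The CPU's only work is filling this descriptor and polling done. */
typedef struct {
    const int16_t *src;    /* source buffer in memory            */
    int16_t       *dst;    /* destination buffer                 */
    size_t         len;    /* number of samples to process       */
    int            shift;  /* per-sample attenuation, in bits    */
    volatile int   done;   /* status flag the CPU polls          */
} dma_desc;

static void coproc_start(dma_desc *d)
{
    /* In hardware this loop is the coprocessor's datapath and runs
     * in parallel with the CPU; here it simply executes inline. */
    for (size_t i = 0; i < d->len; i++)
        d->dst[i] = (int16_t)(d->src[i] >> d->shift);
    d->done = 1;           /* signal completion                  */
}
```

The block-oriented descriptor is what distinguishes this style from a custom instruction: the cost of starting the unit is paid once per buffer, not once per sample, which is why it wins on large memory blocks such as the CRC case above.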
Application-Specific Instruction-Set Processors
Application-specific instruction-set processors (ASIPs) are a special type of hardware acceleration coprocessor. An ASIP combines the flexibility of a software approach with the efficiency and performance of dedicated hardware. An ASIP is a processor that has been targeted to perform a specific task or set of related tasks. One ASIP implementation style allows the internal topology of the processor to be changed by reconfiguring the functional interconnect between its larger building blocks.
Software defined radios implement algorithms in software to improve portability, lifetime costs and retargetability. However, achieving cost and performance requirements necessitates the use of application-specific hardware. The value of ASIPs on an FPGA is that they are composed of smaller building blocks that can be reconfigured on the fly to implement more than one high-level function. An example relevant to SDRs is the pairing of fast Fourier transform (FFT) blocks and finite impulse response (FIR) filters. These two high-level algorithms share many common sub-blocks. By changing the interconnect between these sub-blocks, the ASIP can be altered to implement the FFT instead of the FIR in hardware. Figure 4 shows an ASIP architecture implementing an FFT/FIR ASIP. A simple microcode instruction set is used to configure the hardware blocks to perform either the FIR or the FFT as needed.
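The sub-block sharing can be sketched in C: both the FIR inner loop and the radix-2 butterfly below reduce to the same multiply-accumulate helper, which is the property an FFT/FIR ASIP exploits by re-routing one set of MAC blocks. All names are illustrative, and the butterfly uses a real-valued twiddle factor for brevity.

```c
#include <stdint.h>

/* The shared sub-block: one multiply-accumulate step. */
static int32_t mac(int32_t acc, int16_t x, int16_t c)
{
    return acc + (int32_t)x * (int32_t)c;
}

/* Direct-form FIR, y[m] = sum_k h[k]*x[n-k], for n >= taps-1.
 * Every output sample is just repeated use of mac(). */
static void fir(const int16_t *x, const int16_t *h, int taps,
                int n, int32_t *y)
{
    for (int i = taps - 1; i < n; i++) {
        int32_t acc = 0;
        for (int k = 0; k < taps; k++)
            acc = mac(acc, x[i - k], h[k]);
        y[i - (taps - 1)] = acc;
    }
}

/* Radix-2 DIT butterfly with a real twiddle factor w:
 * a' = a + w*b, b' = a - w*b -- the same mac() sub-block, re-routed.
 * (b is assumed to fit in 16 bits in this sketch.) */
static void butterfly(int32_t *a, int32_t *b, int16_t w)
{
    int32_t t  = mac(0, (int16_t)*b, w);
    int32_t a2 = *a + t;
    int32_t b2 = *a - t;
    *a = a2;
    *b = b2;
}
```

In the ASIP, choosing between these two "programs" is a matter of microcode selecting how the MAC blocks are interconnected, not of synthesizing new hardware.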
A software/hardware comparison was made between running a 1024-point radix-2 FFT on a TI C62x DSP and performing the same transform on the FIR/FFT ASIP. The TI implementation took 20,840 clock cycles and the ASIP took 21,850 clock cycles, so the overall throughput of the two implementations was near parity. However, the relative size, power and cost savings of the ASIP approach are superior to dedicating an entire DSP to the same algorithms. In situations where specific SDR algorithms can be offloaded from the DSP to the FPGA to decrease the processing power needed in the DSP, the result often strongly favors the ASIP approach.
Software defined radios require extensive processing power to realize the portability of waveforms and reconfigurability that has been promised. The use of FPGAs for hardware acceleration offers promising architectural options that are helping to make SDRs a reality.
San Jose, CA.