Implementation of Soft Error-Resilient Built-In 2-D Hamming Product Code Using Verilog

Enhancing Reliability and Availability of SRAM-based FPGAs using 2-D Hamming Product Code

by Prof. Pramod V. Patil*, Miss. Triveni V. Hegade, Miss. Ashwini K. Lele

- Published in Journal of Advances in Science and Technology, E-ISSN: 2230-9659

Volume 12, Issue No. 25, Dec 2016, Pages 198 - 202 (5)

Published by: Ignited Minds Journals


ABSTRACT

The radiation-induced soft error rate (SER) degrades the reliability of static random access memory (SRAM)-based field programmable gate arrays (FPGAs). This paper presents a new built-in 2-D Hamming product code (2-D HPC) scheme to provide reliable operation of SRAM-based FPGAs in hostile operating environments such as space. The multibit error correction capability of our built-in 2-D HPC can improve the reliability, and hence the system availability, by orders of magnitude. Simulation results show that the large error correction capability of 2-D HPC can recover configuration bits without depending on an external memory that preserves a golden copy of the configuration bits. To provide efficient 2-D HPC in built-in logic, we also propose a new 2-D SRAM buffer. Using the proposed multibit error correction scheme, the system availability of an SRAM-based FPGA can be more than 99.9999999% with an SRAM cell failure rate of 7 failures in 1 billion hours of operation (i.e., 7 FIT).

KEYWORD

soft error-resilient, built-in, 2-D Hamming product code, Verilog, radiation-induced, soft error rate, SRAM-based, field programmable gate arrays, reliable operation, hostile operating environments, multibit error correction capability, system availability, simulation results, configuration bits, external memory, golden copy, efficient 2-D SRAM buffer, SRAM cell failure

I. INTRODUCTION

Reconfigurable static random access memory (SRAM)-based field programmable gate arrays (FPGAs) offer high density, programmability, and cost-effectiveness. Despite the aforementioned advantages of SRAM-based FPGAs, their widespread use in mission-critical space applications is limited by their susceptibility to single-event upsets (SEUs). The impact of highly energized particles (e.g., protons, neutrons, and alpha particles) on sensitive locations of circuits results in SEUs. In SRAM-based FPGAs, the functionality is specified by the contents of the configuration memories. However, an SEU can alter a configuration bit in an FPGA, which may result in a permanent malfunction of the mapped program. Researchers have experimentally investigated the impact of radiation effects on FPGAs both on the surface of the earth and in space. A rapid elevation of the soft error rate (SER) has been observed in high-altitude experiments. For example, experiments in a 1000-km orbit have reported multiple failures of FPGA systems in a day.

II. HAMMING CODES

In telecommunication, Hamming codes are a family of linear error-correcting codes that generalize the Hamming (7,4) code invented by Richard Hamming in 1950. Hamming codes can detect up to two-bit errors or correct one-bit errors without detection of uncorrected errors. By contrast, the simple parity code cannot correct errors and can detect only an odd number of bits in error. Hamming codes are perfect codes, that is, they achieve the highest possible rate for codes with their block length and a minimum distance of 3. In mathematical terms, Hamming codes are a class of binary linear codes. For each integer r ≥ 2 there is a code with block length n = 2^r − 1 and message length k = 2^r − r − 1. Hence the rate of Hamming codes is R = k/n = 1 − r/(2^r − 1), which is the highest possible for codes with minimum distance 3 (i.e., the minimal number of bit changes needed to go from any code word to any other code word is 3) and block length 2^r − 1. The parity-check matrix of a Hamming code is constructed by listing all columns of length r that are non-zero, which means that the dual code of the Hamming code is the punctured Hadamard code. The parity-check matrix has the property that any two columns are pairwise linearly independent. Due to the limited redundancy that Hamming codes add to the data, they can only detect and correct errors when the error rate is low. This is the case in computer memory (ECC memory), where bit errors are extremely rare and Hamming codes are widely used. In this context, an extended Hamming code having one extra parity bit is often used. Extended Hamming codes achieve a Hamming distance of 4, which allows the decoder to distinguish between when at most one 1-bit error occurs and when any 2-bit errors occur. In this sense, extended Hamming codes are single-error correcting and double-error detecting, abbreviated as SECDED. If more error-correcting bits are included with a message, and if those bits can be arranged such that different incorrect bits produce different error results, then bad bits can be identified.


In a seven-bit message, there are seven possible single-bit errors, so three error control bits could potentially specify not only that an error occurred but also which bit caused the error. Hamming studied the existing coding schemes, including two-of-five, and generalized their concepts. To start with, he developed a nomenclature to describe the system, including the number of data bits and error-correction bits in a block. For instance, parity includes a single bit for any data word, so assuming ASCII words with 7 bits, Hamming described this as an (8,7) code, with eight bits in total, of which 7 are data. The repetition example would be (3,1), following the same logic. The code rate is the second number divided by the first; for our repetition example, 1/3. Hamming also noticed the problems with flipping two or more bits, and described this as the "distance" (it is now called the Hamming distance, after him). Parity has a distance of 2, so one bit flip can be detected but not corrected, and any two bit flips will be invisible. The (3,1) repetition has a distance of 3, as three bits need to be flipped in the same triple to obtain another code word with no visible errors. It can correct one-bit errors or detect, but not correct, two-bit errors. A (4,1) repetition (each bit is repeated four times) has a distance of 4, so flipping three bits can be detected, but not corrected. When three bits flip in the same group there can be situations where attempting to correct will produce the wrong code word. In general, a code with distance k can detect, but not correct, k − 1 errors.
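To make the (7,4) construction concrete, the following is a minimal Verilog sketch of a Hamming (7,4) encoder and single-error-correcting decoder. The module and signal names are illustrative assumptions, not taken from the paper's implementation.

// Minimal Hamming (7,4) encoder/decoder sketch (illustrative only).
// Codeword bit positions 1..7 are ordered p1 p2 d1 p3 d2 d3 d4, so a
// nonzero syndrome value equals the position of the single flipped bit.
module hamming74 (
    input  [3:0] data_in,   // d1..d4 to encode
    output [6:0] code_out,  // encoded 7-bit codeword
    input  [6:0] code_in,   // possibly corrupted codeword
    output [3:0] data_out   // corrected data bits
);
    // Encoder: each parity bit covers the positions whose index has
    // the corresponding bit set
    wire p1 = data_in[0] ^ data_in[1] ^ data_in[3]; // covers positions 3,5,7
    wire p2 = data_in[0] ^ data_in[2] ^ data_in[3]; // covers positions 3,6,7
    wire p3 = data_in[1] ^ data_in[2] ^ data_in[3]; // covers positions 5,6,7
    // code_out[0] = position 1 ... code_out[6] = position 7
    assign code_out = {data_in[3], data_in[2], data_in[1], p3,
                       data_in[0], p2, p1};

    // Decoder: recompute the parities; the syndrome is the error position
    wire [2:0] syn;
    assign syn[0] = code_in[0] ^ code_in[2] ^ code_in[4] ^ code_in[6];
    assign syn[1] = code_in[1] ^ code_in[2] ^ code_in[5] ^ code_in[6];
    assign syn[2] = code_in[3] ^ code_in[4] ^ code_in[5] ^ code_in[6];

    // Flip the addressed bit (syndrome 0 means no error detected)
    wire [6:0] fixed = (syn == 3'd0) ? code_in
                                     : code_in ^ (7'b1 << (syn - 1));
    assign data_out = {fixed[6], fixed[5], fixed[4], fixed[2]};
endmodule

An extended (8,4) SECDED variant would add one overall parity bit across all seven positions, letting the decoder distinguish a single correctable error from an uncorrectable double error.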

III. SOFT ERRORS IN SCALED TECHNOLOGIES

In the past decades, scaling of devices has been driven by demands for higher functionality, higher density, lower cost, and lower power. Aggressive feature size and supply voltage scaling has resulted in a reduction of the critical charge (Qcrit) in memory cells. Qcrit is defined as the minimum charge capable of flipping the stored bit in a memory cell. Intuitively, a smaller Qcrit would result in a higher SER. However, it has been observed that the SRAM bit SER has started to saturate and is expected to decrease in deep-submicrometer regimes (Fig. 3.1). Saturation of supply voltage scaling and a decrease in junction collection efficiency have resulted in the saturation of the SRAM bit SER. Notwithstanding the saturation in SRAM bit SER, the SRAM system SER has increased dramatically with each technology generation. The increase in system SER can be attributed to the exponential growth in SRAM integration density with device scaling, and it has become a great concern for future technology nodes. Cosmic radiation is a significant factor in accelerating SER in high-altitude and space applications. While SER induced by alpha particles can be suppressed by purification of packaging materials, it is challenging to address SER due to cosmic neutrons. The intensity of the cosmic ray flux increases by an order of magnitude every 10,000 ft. The same trend has been experimentally shown in the Rosetta experiment using SRAM-based FPGAs. Also, a TMR experiment for a low-earth orbital path has shown upsets at a rate of one per hour, spread across three Xilinx V1000 devices, with each device having about 6.5 Mbits of configuration memory.

Fig. 3.1 Normalized SRAM bit and array SERs as a function of technology scaling.

IV. ALGORITHM OF HAMMING PRODUCT CODE

Fig. 4.1 Example of the 2-D HPC scheme. D: configuration bit; ER: row EC parity bit; EC: column EC parity bit; X: bit error. (a) 2-D EC with a 7×7 window. (b) Successful 2-D EC. (c) Successful 2-D EC in 1.5 iterations. (d) Nonrepairable and nondetectable case.


Also note that there are cases where 2-D HPC fails to recover the data, as shown in Fig. 5.1. When there are multibit errors in both directions, 2-D HPC may not be able to recover the data. The probability of such occurrences depends on the window size of the 2-D HPC and the expected number of errors in an array. There are cases in which errors that are nonrepairable with a 64×64 window can be corrected with a 32×32 window. For instance, a 2-D HPC with a 64×64 window cannot correct the cases of 4-bit errors shown in Fig. 5.1. However, it is possible that the four errors are distributed into four 32×32 array portions within a 64×64 window. Then, all four errors can be corrected by four iterations of 2-D HPC with a 32×32 window. In the simulations, errors are uniformly distributed in the array. The result of conventional SECDED with a word size of 1 Kb is also provided as a reference, since the frame size of commercial FPGAs is about 1 Kb. Conventional SECDED shows low EC performance, mainly due to its incapability of multibit EC in a single frame. On the other hand, 2-D HPC with a 32×32 window size can correct ten errors in a frame with a 99% success rate using a perfect decision-making algorithm (or a large number of iterations). The simulation result in Fig. 6.1 shows that one iteration of 2-D HPC can correct ten errors with a slightly degraded success rate of 95%. The results have been obtained from one million random samples with consideration of failures in parity data. Errors in parity data may not directly result in a system failure; however, they can increase the chances of nonrepairable errors. Hence, we assume the same importance for all errors.
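As an illustration of the row-then-column iteration described above, the following Verilog sketch performs one iteration of 2-D Hamming product decoding on a reduced 4×4 data window, where each row and each column is assumed to carry the 3 check bits of a (7,4) Hamming code. The window size, port names, and code parameters are illustrative assumptions, not the paper's 7×7 or 32×32 configuration.

// One iteration (row pass, then column pass) of a 2-D Hamming product
// decoder on a 4x4 window. Each row/column carries (7,4) Hamming parity.
// All names and sizes are illustrative assumptions.
module hpc2d_iter (
    input      [15:0] data_in,  // 4x4 window, row-major: bit [4*r+c]
    input      [11:0] row_par,  // 3 check bits per row:    [3*r +: 3]
    input      [11:0] col_par,  // 3 check bits per column: [3*c +: 3]
    output reg [15:0] data_out  // window after one 2-D pass
);
    // Correct a single-bit error in 4 data bits using 3 Hamming checks.
    // Syndrome values 3,5,6,7 address data bits; 1,2,4 are check-bit errors.
    function [3:0] ham74_fix;
        input [3:0] d;
        input [2:0] p;
        reg   [2:0] s;
        begin
            s[0] = p[0] ^ d[0] ^ d[1] ^ d[3];
            s[1] = p[1] ^ d[0] ^ d[2] ^ d[3];
            s[2] = p[2] ^ d[1] ^ d[2] ^ d[3];
            ham74_fix = d;
            case (s)
                3'd3: ham74_fix[0] = ~d[0];
                3'd5: ham74_fix[1] = ~d[1];
                3'd6: ham74_fix[2] = ~d[2];
                3'd7: ham74_fix[3] = ~d[3];
                default: ; // no error, or error confined to a check bit
            endcase
        end
    endfunction

    integer r, c;
    reg [15:0] tmp;
    reg [3:0]  col_w;
    always @* begin
        tmp = data_in;
        // Pass 1: row-wise single-error correction
        for (r = 0; r < 4; r = r + 1)
            tmp[4*r +: 4] = ham74_fix(tmp[4*r +: 4], row_par[3*r +: 3]);
        // Pass 2: column-wise correction on the row-corrected window
        for (c = 0; c < 4; c = c + 1) begin
            col_w = {tmp[12+c], tmp[8+c], tmp[4+c], tmp[c]};
            col_w = ham74_fix(col_w, col_par[3*c +: 3]);
            {tmp[12+c], tmp[8+c], tmp[4+c], tmp[c]} = col_w;
        end
        data_out = tmp;
    end
endmodule

A further iteration simply repeats the two passes: errors left over or miscorrected by the row decoders can often be caught by the subsequent column pass, which is what gives the product code its multibit correction capability.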

V. ADVANTAGES

The proposed 2-D HPC technique offers several benefits, such as faster reconfiguration, increased system throughput, and higher error tolerance, as explained here.

• Faster reconfiguration: The time required to reprogram a couple of frames is very short. For instance, partial reconfiguration using I/O blocks takes only a few microseconds in commercial FPGAs. Hence, the proposed built-in EC circuit can provide faster reconfiguration using the internal bus only.

• Higher FPGA system throughput: The read-back can be performed as a background operation. Hence, the 2-D HPC is performed without stalling the system function. Only the short reconfiguration of erroneous frames results in an interruption of the system operation.

• High error tolerance: The physical dimension of the frame buffer is very small, and the maintenance period is short with an internal FPGA clock frequency of 200 MHz, for example. Hence, the probability of having soft errors in the buffer during the maintenance period is negligible.

VI. SIMULATION RESULTS

Fig. 6.1 Input data signals

Fig. 6.1 above represents the simulation result of the input original matrix, the parity matrix, and the received message. The original message is a 7×7 matrix window, i.e., a 49-bit message, and is sent from the transmitter. The parity matrix of 28 binary bits is given as an input. Various intermediate signals are used: parity_sig, from which the parity matrix is generated; ex_sig, which is used for the XOR operation; and con_sig, with which the bits are then concatenated.
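Since the encoder source is not listed in the paper, the following is a hedged Verilog sketch of how the parity generation just described could look. It assumes each 7-bit row of the 7×7 window is protected by 4 Hamming check bits (7 rows × 4 bits = 28 bits, matching the 28-bit parity matrix); the signal names parity_sig and con_sig follow the text, but their bit layout here is an assumption.

// Hedged sketch of row-parity generation for the 7x7 window (49 data
// bits). Assumes an (11,7) Hamming code per row: 4 check bits x 7 rows
// = 28 parity bits. parity_sig / con_sig names follow the text; their
// internal structure here is assumed.
module row_parity_gen (
    input  [48:0] data,        // 7x7 window, row-major: bit [7*r+c]
    output [27:0] parity_sig,  // 4 Hamming check bits per row
    output [76:0] con_sig      // parity concatenated with data
);
    genvar r;
    generate
        for (r = 0; r < 7; r = r + 1) begin : g_row
            wire [6:0] d = data[7*r +: 7];
            // (11,7) Hamming check equations (data at codeword
            // positions 3,5,6,7,9,10,11)
            assign parity_sig[4*r]     = d[0]^d[1]^d[3]^d[4]^d[6];
            assign parity_sig[4*r + 1] = d[0]^d[2]^d[3]^d[5]^d[6];
            assign parity_sig[4*r + 2] = d[1]^d[2]^d[3];
            assign parity_sig[4*r + 3] = d[4]^d[5]^d[6];
        end
    endgenerate
    assign con_sig = {parity_sig, data}; // concatenation, cf. con_sig in text
endmodule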

Fig.6.2 Parity signal generation


Fig.6.3 Modified error detection and correction

Fig.6.4 Modified error correction and detection

RTL VIEW

Register-transfer-level abstraction is used in HDLs such as Verilog and VHDL to create high-level representations of a circuit, from which lower-level representations and ultimately the actual wiring can be derived.

VII. APPLICATIONS

In a satellite system, the FPGA performs programmed functions while it interacts with other system components. Due to the susceptibility of SRAM-based FPGAs to soft errors, frequent repair processes have to be performed to ensure reliable operation. Thus, repairing FPGAs in such an architecture involves interruptions of multiple system components. In that case, synchronization with other blocks, known as the coherence problem, can result in further degradation of system performance. Therefore, the actual cost of correcting errors in the FPGA and restarting the entire system is much higher than that of repairing the FPGA itself. To prevent frequent system interruptions due to SEUs in the FPGA, TMR is widely used in mission-critical space applications. TMR can ensure correct and continuous FPGA operation while tolerating both temporal transient errors and hard faults (configuration bit upsets), based on its majority voting mechanism. The error tolerance of TMR improves system reliability.
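The majority voting mechanism mentioned above reduces to a simple combinational vote across three module replicas. A minimal generic sketch follows; it is a textbook construction, not taken from the paper.

// Minimal TMR majority voter: the output follows the value on which at
// least two of the three replica outputs agree, so any single upset
// replica is out-voted. Generic illustration, not from the paper.
module tmr_voter #(
    parameter WIDTH = 1
) (
    input  [WIDTH-1:0] a, b, c,  // outputs of the three replicas
    output [WIDTH-1:0] y         // majority-voted result
);
    assign y = (a & b) | (b & c) | (a & c);
endmodule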

VIII. CONCLUSION

In this work, a simple built-in 2-D HPC architecture that provides long survival of FPGAs in mission-critical space applications is presented. The simulation results show that 2-D single-bit corrections can provide a very high multibit EC capability. This work also includes an optimized hardware implementation of the proposed scheme using a new 2-D SRAM array. Furthermore, due to the extremely large multibit error-handling capability of the proposed design, self-healing of SRAM-based FPGAs can be achieved even in harsh radiation environments.

IX. REFERENCES

A. Lesea et al., “The Rosetta experiment: Atmospheric soft error rate testing in differing technology FPGAs,” IEEE Trans. Device Mater. Rel., vol. 5, no. 3, pp. 317–328, Sep. 2005.

E. Fuller et al., “Radiation testing update, SEU mitigation, and availability analysis of the Virtex FPGA for space reconfigurable computing,” in Proc. MAPLD, 2000.

J. F. Ziegler, “Terrestrial cosmic ray intensities,” IBM J. Res. Develop., vol. 42, no. 1, pp. 117–139, 1998.


474, Apr. 1996.

R. Baumann, “Radiation-induced soft errors in advanced semiconductor technologies,” IEEE Trans. Device Mater. Rel., vol. 5, no. 3, pp. 305–316, Sep. 2005.

H. Asadi et al., “Analytical techniques for soft error rate modeling and mitigation of FPGA-based designs,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 12, pp. 1320–1331, Dec. 2007.

C. Carmichael et al., “SEU mitigation techniques for Virtex FPGAs in space applications,” in Proc. MAPLD, 1999.

K. Chapman et al., “SEU strategies for Virtex-5 devices,” Xilinx Application Note 864, 2009.

M. Garvie et al., “Scrubbing away transients and jiggling around the permanent: Long survival of FPGA systems through evolutionary self-repair,” in Proc. 10th IOLTS, pp. 155–155, 2004.

Corresponding Author

Prof. Pramod V. Patil*

Professor, Department of ECE, Hirasugar Institute of Technology Nidasoshi, India

E-Mail – patil.pramod080@gmail.com