# Research of Fault-Tolerant Computing Using COTS Elements

B. Świercz, D. Makowski, A. Napieralski

Department of Microelectronics and Computer Science, Technical University of Lodz, Al. Politechniki 11, 93-590 Lodz, Poland. swierczu@dmcs.pl, dmakow@dmcs.pl, napier@dmcs.pl

## ABSTRACT

Designing radiation-tolerant systems is a crucial problem for modern physics: reliable control systems able to operate in a radiation environment are necessary to carry out physics experiments. This paper presents an operating system used to develop a software fault-tolerant environment for reliable computing. The main idea is to implement fault-tolerant algorithms in the kernel instead of in user-space programs. A new real-time kernel, called sCore, is proposed. sCore is a modern multitasking kernel designed for real-time applications.

**Keywords**: Single Event Upset, Radiation Environment, Fault Tolerant System, Hardened, Real-Time System

## 1 BACKGROUND OF THE RESEARCH

The X-Ray Free Electron Laser (X-FEL) [1, 2], designed at the DESY Research Center in Hamburg, belongs to the 4<sup>th</sup> generation of synchrotron accelerators [3, 4]. The total length of the X-FEL will be 3.4 km, including 2.1 km of accelerator tunnel. A distributed control system is necessary to control all subsystems of the X-FEL modules. Part of the electronic equipment designed for control and data acquisition (DSP and FPGA boards, embedded computers and microprocessor systems) will be placed inside the accelerator tunnel, close to the cavities, which are the main part of the accelerator. During normal operation the cavities produce gamma and neutron radiation. Neutrons (especially slow ones, called thermal neutrons) can affect memory and registers in digital circuits, cause illegal operations and thus lead to loss of system functionality [5, 6]. A single memory malfunction of this kind is known as a Single Event Upset (SEU) [7–9]. Microprocessor systems placed inside the accelerator tunnel are subjected to neutron radiation, so to ensure reliable operation it is necessary to protect digital circuits against radiation [10].

The simple way to protect control systems against SEUs is to use hardened computer systems designed for space and military applications, but given the total length of the accelerator tunnel this would be too expensive for the X-FEL project. Another solution, much cheaper but more complex, is to use Commercial Off-The-Shelf (COTS) components combined with system redundancy and safety algorithms. Protective algorithms can be used at the hardware layer (for example embedded inside FPGA chips) or at the software layer. The authors decided to implement SEU-tolerant algorithms inside the sCore operating system kernel. sCore is designed to work with standard computer architectures such as PCs or the embedded versions used in industry. Many frameworks and libraries supporting fault-tolerant environments are known, but that approach requires application programmers to have a good knowledge of the nature of SEUs. The alternative is a special kernel with embedded protection algorithms, which makes it possible to develop and run applications that are not themselves able to tolerate SEUs.

## 2 THE SCORE KERNEL

sCore is a multitasking, preemptive kernel based on the microkernel architecture [11]. It is designed to work on the IA-32 architecture (the standard PC platform and embedded industrial computers), but thanks to a C++ abstraction layer the sCore code could be ported to different platforms (ports to the ARM and PowerPC architectures are planned). sCore can be used to develop real-time applications with a constant task-switching time. It has a built-in Round Robin scheduler with a queue of 256 priority levels. All internal structures are static and are initialised during system bootstrap, which ensures stability and predictable timing. The kernel does not use dynamic memory allocation internally.
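The combination of static structures and round-robin scheduling over 256 priority levels can be illustrated with a minimal sketch. It is not the sCore implementation: all names (`ReadyQueue`, `Scheduler`, `kMaxTasksPerLevel`), the per-level capacity, and the convention that level 0 is the highest priority are assumptions made for illustration only.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical names and sizes; only the 256 priority levels come from the text.
constexpr std::size_t kPriorityLevels = 256;
constexpr std::size_t kMaxTasksPerLevel = 8;   // assumed capacity per level

using TaskId = std::uint16_t;
constexpr TaskId kNoTask = 0xFFFF;

// Fixed-capacity ring buffer: storage is static, no heap allocation.
struct ReadyQueue {
    std::array<TaskId, kMaxTasksPerLevel> slots{};
    std::size_t head = 0;
    std::size_t count = 0;

    bool push(TaskId id) {
        if (count == kMaxTasksPerLevel) return false;  // queue full
        slots[(head + count) % kMaxTasksPerLevel] = id;
        ++count;
        return true;
    }

    TaskId pop() {
        assert(count > 0);
        const TaskId id = slots[head];
        head = (head + 1) % kMaxTasksPerLevel;
        --count;
        return id;
    }
};

class Scheduler {
public:
    bool make_ready(TaskId id, std::uint8_t priority) {
        return queues_[priority].push(id);
    }

    // Returns the next task to run: the first non-empty level wins
    // (level 0 is assumed to be the highest priority), and tasks on the
    // same level take turns. The scan is bounded by kPriorityLevels, so
    // the worst-case selection time is fixed.
    TaskId pick_next() {
        for (std::size_t p = 0; p < kPriorityLevels; ++p) {
            ReadyQueue& q = queues_[p];
            if (q.count > 0) {
                const TaskId id = q.pop();
                q.push(id);  // rotate to the back: round robin within the level
                return id;
            }
        }
        return kNoTask;
    }

private:
    std::array<ReadyQueue, kPriorityLevels> queues_{};
};
```

With two tasks ready on the same level, successive calls to `pick_next()` alternate between them, and a task on a lower-priority level runs only when every higher level is empty.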

### 2.1 The EDAC Task

A radiation-tolerant sCore should provide a mechanism to find and correct bit-flip errors in memory. Static Random Access Memory (SRAM) technology, used to build fast buffer memories and registers, is susceptible to SEU effects. Dynamic Random Access Memory (DRAM), used for the large main system memory, is also sensitive to SEU effects. Soft errors in memory can bring the system down. To protect system memory, a special task called the EDAC Task was implemented. The EDAC Task (fig. 1) divides system memory into three parts.



Figure 1: EDAC Task and memory organisation scheme.

The first memory region is used by the system (the sCore kernel) and by running applications. The second and third regions are used exclusively by the EDAC Task to store copies of the read-only data from the first region. Read-only data (machine code and constants) are placed by the linker in a special program section called the text section. When the system starts, the EDAC Task copies the text sections of the kernel and of all applications into the second and third memory regions. During system operation the EDAC Task periodically compares the first and second memory regions. If it finds a difference between them, it compares all three regions to determine the correct content and restores the first region using a voting technique. The EDAC Task runs periodically and concurrently with other applications.
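The compare-then-vote step can be sketched as follows. This is a minimal illustration under stated assumptions, not the EDAC Task itself: the function names are hypothetical, the vote is taken bit by bit on each byte (one possible form of the voting technique), and the sketch also rewrites the two copies, which goes beyond the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bitwise 2-of-3 majority: each output bit takes the value held by at
// least two of the three inputs.
inline std::uint8_t majority(std::uint8_t a, std::uint8_t b, std::uint8_t c) {
    return static_cast<std::uint8_t>((a & b) | (a & c) | (b & c));
}

// One scrub pass over a protected region (hypothetical interface).
// `working` is the first memory region, `copy1` and `copy2` are the two
// redundant copies. Returns the number of bytes on which a mismatch
// between `working` and `copy1` was found and voted on.
std::size_t edac_scrub(std::uint8_t* working, std::uint8_t* copy1,
                       std::uint8_t* copy2, std::size_t size) {
    std::size_t repaired = 0;
    for (std::size_t i = 0; i < size; ++i) {
        // Fast path: only the first and second regions are compared.
        if (working[i] == copy1[i]) continue;

        // Mismatch: consult the third region and take the majority.
        const std::uint8_t voted = majority(working[i], copy1[i], copy2[i]);
        working[i] = voted;
        copy1[i] = voted;  // assumption: the copies are refreshed as well
        copy2[i] = voted;
        ++repaired;
    }
    return repaired;
}
```

Because the fast path compares only two regions, a flip confined to the third region goes unnoticed until one of the other two is also hit. The vote is also only reliable while no more than one of the three copies is corrupted at a given bit position.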

## 3 EXPERIMENTS WITH THE SCORE KERNEL

Two experiments were carried out with the sCore kernel running on a standard PC placed near a neutron source (fig. 2). The first experiment used the <sup>241</sup>AmBe neutron source and the second was performed inside the Liniac2 accelerator tunnel, near the electron-positron converter. Before the experiments, IArad-Sim [12], a special version of the Bochs IA-32 architecture emulator, was used to simulate the sensitivity of the kernel to SEUs. A standard PC with a Pentium III 500 MHz processor and 128 MB of RAM was used for the experiments. Apart from the EDAC Task, two other applications were run on sCore: a simple shell and an MD5 calculation application. The MD5 application calculated the MD5 sum of an initialised memory region in an infinite loop and compared the result with a constant value calculated during kernel compilation. If the MD5 sum equals the constant, either there was no error in memory or the errors were corrected by the EDAC Task.
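The structure of such a self-checking application can be sketched as below. Two deliberate substitutions are made to keep the example short: the 32-bit FNV-1a hash stands in for MD5, and the loop is bounded rather than infinite. The names `fnv1a`, `run_checks` and `CheckStats` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// 32-bit FNV-1a, used here only as a compact stand-in for MD5.
inline std::uint32_t fnv1a(const std::uint8_t* data, std::size_t size) {
    std::uint32_t h = 2166136261u;   // FNV offset basis
    for (std::size_t i = 0; i < size; ++i) {
        h ^= data[i];
        h *= 16777619u;              // FNV prime
    }
    return h;
}

struct CheckStats {
    unsigned passes;
    unsigned failures;
};

// Recomputes the checksum of `region` `iterations` times and compares it
// with `expected`, the reference value fixed ahead of time (in the paper
// this constant is calculated during kernel compilation).
CheckStats run_checks(const std::uint8_t* region, std::size_t size,
                      std::uint32_t expected, unsigned iterations) {
    CheckStats stats{0, 0};
    for (unsigned n = 0; n < iterations; ++n) {
        if (fnv1a(region, size) == expected) {
            ++stats.passes;    // no error, or the error was already repaired
        } else {
            ++stats.failures;  // an uncorrected upset is visible to the application
        }
    }
    return stats;
}
```

A failed comparison only reports that the region differed from the reference at the moment of the check. It does not locate the flipped bit, which is why the check serves as an independent monitor of the correction mechanism rather than as a replacement for it.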

### 3.1 Radiation environment around the <sup>241</sup>AmBe

The <sup>241</sup>AmBe isotope was used as a neutron source (fig. 2). The <sup>241</sup>AmBe was placed inside a water moderator to slow the neutrons down (thermal neutrons generate more SEUs than fast neutrons).



Figure 2: Experiment with <sup>241</sup>AmBe.

The experiment lasted 22 hours. System activity during the experiment is shown in fig. 3. Nine errors (SEUs) occurred in memory (levels 6 and 4 in fig. 3) and were corrected by the EDAC Task. Only one SEU was detected by the MD5 application and was not corrected by the EDAC Task.



Figure 3: System activity diagram during experiment with <sup>241</sup>AmBe.

### 3.2 Radiation environment inside the Liniac2 tunnel

The Liniac2 accelerator was used for the in-tunnel experiments because the VUV-FEL and X-FEL accelerators are still under development. The computer running sCore was installed close to the electron-positron converter (fig. 4). The converter generates positrons for physics experiments, but gamma radiation and neutrons are also generated as a parasitic effect. Two experiments were carried out inside Liniac2: the first lasted 26 hours and the second 24 hours. When the electron-positron converter was turned on, neutron radiation was generated and sCore received an unexpected interrupt, number 13 (level 3 in fig. 5). Interrupt number 13 is the General Protection Fault in Intel nomenclature. The sCore kernel was not able to work inside the Liniac2 tunnel while the electron-positron converter was running.

## 4 CURRENT AND FUTURE WORK

Current work focuses on exploiting the virtual memory management provided by modern processors. Instead of scanning all system memory (the first memory region scanned by the EDAC Task), the dirty bit is set on all memory pages, so that every access to memory generates an interrupt. The ISR that services the interrupt compares the requested page with its copy and clears the dirty bit. This technique is faster and more effective than the EDAC Task, but it still requires memory redundancy to keep the working copies. A more sophisticated technique is to use the hardware debugging mode provided by processors. Support for debug mode is currently being implemented in the sCore kernel. In debug mode it is possible to run an application step by step and to watch every access to a memory cell. Moreover, sCore with built-in debug mode support will be able to repeat every process instruction three times and compare the results (this technique is based on a transaction mechanism). In addition, sCore should be ported to other architectures (such as ARM and PowerPC) so that it can work with many types of microprocessor systems and embedded computers.

Figure 4: Experiment inside the Liniac2 tunnel.

Figure 5: System activity diagram during experiment inside the Liniac2 tunnel.
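The repeat-three-times-and-compare idea can be illustrated at the level of a whole operation rather than a single processor instruction. The sketch below is an assumption-laden illustration, not the planned debug-mode mechanism: `execute_voted` is a hypothetical helper, it requires C++17 for `std::optional`, and it assumes the operation has no side effects so that it can safely be repeated.

```cpp
#include <cassert>
#include <optional>

// Runs a side-effect-free operation three times and returns the value
// produced by at least two of the runs. If all three results differ,
// no majority exists and std::nullopt is returned, so the caller can
// retry or report the fault.
template <typename T, typename Op>
std::optional<T> execute_voted(Op op) {
    const T r1 = op();
    const T r2 = op();
    const T r3 = op();
    if (r1 == r2 || r1 == r3) return r1;
    if (r2 == r3) return r2;
    return std::nullopt;
}
```

For example, `execute_voted<int>([&] { return a + b; })` yields the sum as long as at most one of the three evaluations is disturbed. Treating the operation as a transaction, whose result is committed only after the vote, is what makes the repetition safe.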

## 5 CONCLUSION

The experiments made it possible to observe the behaviour of the PC architecture under neutron radiation. The experiment with <sup>241</sup>AmBe showed that a simple redundancy technique (the EDAC Task) can protect user applications and limit the influence of SEUs on system memory. On the other hand, the experiment inside the Liniac2 tunnel close to the electron-positron converter showed that, because of the high radiation level, the EDAC Task is almost useless there. While the electron-positron converter was working, an average of six SEU events per second was observed in memory. The neutron intensity inside Liniac2 caused errors not only in main memory but also in processor registers. The EDAC Task is therefore useful only in weak radiation environments. Future work will focus on fault-tolerant techniques that protect microprocessor systems at radiation levels comparable to those generated by the electron-positron converter.
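As a rough comparison of the two environments, using only the figures reported above and assuming that the nine memory errors were spread evenly over the 22-hour <sup>241</sup>AmBe run (the symbols $r_{\mathrm{AmBe}}$ and $r_{\mathrm{conv}}$ are introduced here for illustration):

$$
r_{\mathrm{AmBe}} = \frac{9}{22 \times 3600\ \mathrm{s}} \approx 1.1 \times 10^{-4}\ \mathrm{s^{-1}},
\qquad
\frac{r_{\mathrm{conv}}}{r_{\mathrm{AmBe}}} = \frac{6\ \mathrm{s^{-1}}}{1.1 \times 10^{-4}\ \mathrm{s^{-1}}} \approx 5 \times 10^{4}.
$$

Under these assumptions the upset rate near the converter is more than four orders of magnitude higher than near the isotope source, which is consistent with a periodic scrubbing task being unable to keep up.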

## 6 ACKNOWLEDGEMENTS

We acknowledge the support of the European Community-Research Infrastructure Activity under the FP6 "Structuring the European Research Area" program (CARE, contract number RII3-CT-2003-506395), and Polish National Science Council Grant "138/E-370/SPB/6.PR UE-DIE 354/2004-2007".

## REFERENCES

- [1] A. Schwarz, "The European X-Ray free electron laser project at DESY", 26th International Free-Electron Laser Conference, pp. 85–89, August 2004.
- [2] W. Shi, "SASE X-Ray Free Electron Laser In DESY", Journal of the Society of Chinese Physicists, vol. 6, pp. 5–16, December 2000.
- [3] R. Brinkmann, K. Flottmann, J. Rosbach, P. Schmuser, N. Walker, H. Weise, "TESLA Technical Design Report. The Accelerator, part II", DESY, 2001.
- [4] G. Materlik, T. Tschentscher, "TESLA Technical Design Report. The X-Ray Free Electron Laser, part V", DESY, 2001.
- [5] D. Makowski, M. Grecki, B. Mukherjee, B. Świercz, S. Simrock, "Radiation Tolerant System for Neutrons Measurement", 12th Mixed Design of Integrated Circuits and Systems, MIXDES, July 2005.
- [6] G. Messenger, M. Ash, "The Effects of Radiation on Electronic Systems", ISBN 0-442-25417-2, Van Nostrand Reinhold Company Inc., 1986.
- [7] R. Peterson, "Radiation-induced errors in memory chips", Brazilian Journal of Physics, vol. 33, no. 2, pp. 246–249, June 2003.
- [8] F. Giustino, "Radiation Effects on Semiconductor Devices", PhD thesis, Politecnico di Torino, March 2001.
- [9] D. M. Fleetwood, H. A. Eisen, "Total-dose radiation hardness assurance", IEEE Transactions on Nuclear Science, vol. 50, pp. 552–564, June 2003.
- [10] D. Makowski, B. Świercz, M. Grecki, A. Napieralski, (in Polish) "Projektowanie systemów niewrażliwych na wpływ promieniowania na potrzeby akceleratora X-FEL", Elektronika Konstrukcje, Technologie, Zastosowania, no. 7/2005.
- [11] B. Świercz, D. Makowski, A. Napieralski, "The sCore - Operating System for Research of Fault-Tolerant Computing", 12th Mixed Design of Integrated Circuits and Systems, MIXDES, July 2005.
- [12] B. Świercz, D. Makowski, A. Napieralski, "IArad-Sim - IA32 architecture under high radiation environment simulator", 2005 NSTI Nanotechnology Conference and Trade Show, Nanotech 2005, Smart Sensors and Systems