Low overhead fault tolerant networking in Myrinet

Citation data:

2003 International Conference on Dependable Systems and Networks (DSN 2003), Proceedings, Pages: 193-202

Publication Year:
2003
Repository URL:
https://scholarworks.umass.edu/ece_faculty_pubs/766
DOI:
10.1109/dsn.2003.1209930
Author(s):
Vijay Lakamraju; Israel Koren; C. M. Krishna
Publisher(s):
Institute of Electrical and Electronics Engineers (IEEE)
Tags:
Computer Science
Conference paper description:
Emerging networking technologies have complex network interfaces that have renewed concerns about network reliability. In this paper, we present an effective low-overhead fault tolerance technique to recover from network interface failures, in particular network processor hangs. We demonstrate the technique in the context of Myrinet. Fault recovery is achieved by restoring the state of the network interface from a small backup copy that contains only the information required for complete recovery. Fault detection is based on a software watchdog that detects network processor hangs. Results on the Myrinet platform show that complete fault recovery can be achieved in under 2 s while incurring a latency overhead of just 1.5 μs during normal operation. The paper also shows how this fault recovery can be made completely transparent to the user.
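
The two mechanisms named in the description, a software watchdog for hang detection and state restoration from a small backup copy, can be illustrated with a toy simulation. This is only a minimal sketch and not the paper's implementation: the class names (InterfaceState, SimulatedNic, Watchdog), the heartbeat counter, the fields kept in the backup, and the timeout_checks parameter are all assumptions made for illustration. The description does not say what the backup actually contains or how the watchdog is triggered on Myrinet.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class InterfaceState:
    """Hypothetical minimal state needed for recovery (fields are assumptions)."""
    next_send_seq: int = 0
    next_recv_seq: int = 0
    pending_sends: list = field(default_factory=list)


class SimulatedNic:
    """Stand-in for a network interface whose processor can hang."""

    def __init__(self):
        self.state = InterfaceState()
        self.heartbeat = 0
        self.hung = False

    def tick(self):
        # A healthy network processor bumps the heartbeat on every loop pass;
        # a hung one stops making progress.
        if not self.hung:
            self.heartbeat += 1

    def reset_and_restore(self, backup):
        # Reset the processor, then reload its state from the host-side backup.
        self.state = copy.deepcopy(backup)
        self.heartbeat = 0
        self.hung = False


class Watchdog:
    """Host-side software watchdog: declares a hang after N stale checks."""

    def __init__(self, nic, timeout_checks=3):
        self.nic = nic
        self.timeout_checks = timeout_checks
        self.last_seen = nic.heartbeat
        self.stale_checks = 0
        self.backup = copy.deepcopy(nic.state)
        self.recoveries = 0

    def checkpoint(self):
        # Refresh the backup copy; in this sketch it is a full deep copy.
        self.backup = copy.deepcopy(self.nic.state)

    def check(self):
        """Return True if a hang was detected and recovery was triggered."""
        if self.nic.heartbeat != self.last_seen:
            # Progress was made since the last check, so the interface is alive.
            self.last_seen = self.nic.heartbeat
            self.stale_checks = 0
            return False
        self.stale_checks += 1
        if self.stale_checks < self.timeout_checks:
            return False
        # Heartbeat has been stale for too long: treat it as a processor hang.
        self.nic.reset_and_restore(self.backup)
        self.last_seen = self.nic.heartbeat
        self.stale_checks = 0
        self.recoveries += 1
        return True


if __name__ == "__main__":
    nic = SimulatedNic()
    wd = Watchdog(nic, timeout_checks=3)
    nic.state.next_send_seq = 7
    wd.checkpoint()
    nic.hung = True  # inject a hang
    results = []
    for _ in range(3):
        nic.tick()
        results.append(wd.check())
    print(results, nic.state.next_send_seq)
```

In this sketch the watchdog polls a counter, whereas a real interface would more likely rely on a timer or interrupt. The point is only the division of labor: detection is cheap during normal operation, and recovery reloads a previously saved copy of the interface state instead of restarting the application.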