Self-stabilization

Self-stabilization is a concept of fault tolerance in distributed computing. A distributed system that is self-stabilizing will end up in a correct state no matter what state it is initialized with. That correct state is reached after a finite number of execution steps. At first glance, the guarantee of self-stabilization may seem less promising than that of the more traditional fault tolerance of algorithms, which aims to guarantee that the system always remains in a correct state under certain kinds of state transitions. However, that traditional fault tolerance cannot always be achieved. For example, it cannot be achieved when the system is started in an incorrect state or is corrupted by an intruder. Moreover, because of their complexity, distributed systems are very hard to debug and to analyze. Hence, it is very hard to prevent a distributed system from reaching an incorrect state. Indeed, some forms of self-stabilization are incorporated into many modern computer and telecommunications networks, since self-stabilization gives them the ability to cope with faults that were not foreseen in the design of the algorithm. Many years after the seminal paper of Edsger Dijkstra in 1974, this concept remains important as a foundation for self-managing computer systems and fault-tolerant systems. As a result, Dijkstra's paper received the 2002 ACM PODC Influential-Paper Award, one of the highest recognitions in the distributed computing community. Moreover, after Dijkstra's death, the award was renamed and is now called the Dijkstra Award.

E.W. Dijkstra presented the concept of self-stabilization in 1974, prompting further research in this area. His demonstration involved the presentation of self-stabilizing mutual exclusion algorithms. It also showed the first self-stabilizing algorithms that did not rely on strong assumptions about the system. Some previous protocols used in practice did actually stabilize, but only assuming the existence of a clock that was global to the system, and assuming a known upper bound on the duration of each system transition. Only about ten years later, when Leslie Lamport pointed out the importance of Dijkstra's work at the 1983 Symposium on Principles of Distributed Computing (PODC), did researchers direct their attention to this elegant fault-tolerance concept. In his talk, Lamport stated: "I regard this as Dijkstra's most brilliant work - at least, his most brilliant published paper. It's almost completely unknown. I regard it to be a milestone in work on fault tolerance... I regard self-stabilization to be a very important concept in fault tolerance and to be a very fertile field for research." The Influential-Paper Award that Dijkstra's work later received became the ACM (Association for Computing Machinery) Dijkstra Prize in Distributed Computing, given at the annual ACM PODC symposium.

A distributed algorithm is self-stabilizing if, starting from an arbitrary state, it is guaranteed to converge to a legitimate state and to remain in a legitimate set of states thereafter. A state is legitimate if, starting from this state, the algorithm satisfies its specification. The property of self-stabilization enables a distributed algorithm to recover from a transient fault regardless of its nature. Moreover, a self-stabilizing algorithm does not have to be initialized, as it eventually starts to behave correctly regardless of its initial state.

Dijkstra's paper, which introduces the concept of self-stabilization, presents an example in the context of a 'token ring', a network of computers ordered in a circle.
Here, each computer or processor can 'see' the whole state of the one processor that immediately precedes it, and this state may imply that the processor 'has a token' or 'does not have a token'. The first requirement is that exactly one processor must 'hold a token' at any given time. The second requirement is that each node 'passes the token' to the processor succeeding it, so that the token eventually circulates the ring.

The first self-stabilizing algorithms did not detect errors explicitly in order to subsequently repair them. Instead, they constantly pushed the system towards a legitimate state. Since traditional methods for detecting an error were often very difficult and time-consuming, such behavior was considered desirable. (A traditional detection method of this kind collects a huge amount of information from the whole network in one place; it then attempts to determine whether the collected global state is correct, and even that determination alone can be a hard task.)

More recently, researchers have presented newer methods for lightweight error detection in self-stabilizing systems using local checking. The term local refers to a part of a computer network. When local detection is used, a computer in a network is not required to communicate with the entire network in order to detect an error; the error can be detected by having each computer communicate only with its nearest neighbors. These local detection methods simplified the task of designing self-stabilizing algorithms considerably, because the error detection mechanism and the recovery mechanism can be designed separately. Newer algorithms based on these detection methods also turned out to be much more efficient. Moreover, the papers presenting these methods suggested rather efficient general transformers that turn non-self-stabilizing algorithms into self-stabilizing ones. The idea is to,
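
To make the token ring discussed earlier concrete, the following is a minimal Python sketch modeled on the K-state scheme usually attributed to Dijkstra's 1974 paper. The text above does not spell out the update rules, so the details here are assumptions: each machine stores a counter in the range 0..K-1 with K larger than the number of machines, machine 0 'holds a token' when its counter equals that of its predecessor (the last machine), and every other machine holds a token when its counter differs from its predecessor's. The function names (`privileged`, `tokens`, `step`, `stabilize`) and the random scheduler that activates one token holder at a time are hypothetical choices made for this illustration.

```python
import random


def privileged(states, i):
    """Return True if machine i currently 'holds a token'.

    Each machine looks only at its own counter and its predecessor's.
    """
    if i == 0:
        # Machine 0's predecessor is the last machine in the ring.
        return states[0] == states[-1]
    return states[i] != states[i - 1]


def tokens(states):
    """List the indices of all machines that currently hold a token."""
    return [i for i in range(len(states)) if privileged(states, i)]


def step(states, k, rng):
    """Let one randomly chosen token holder make its move; return its index.

    At least one machine always holds a token: if no machine i > 0 does,
    all counters are equal, so machine 0 does.
    """
    i = rng.choice(tokens(states))
    if i == 0:
        states[0] = (states[0] + 1) % k   # machine 0 advances its counter
    else:
        states[i] = states[i - 1]         # others copy their predecessor
    return i


def stabilize(states, k, rng, max_steps=10_000):
    """Run moves until exactly one token remains; return the step count."""
    steps = 0
    while len(tokens(states)) != 1:
        if steps >= max_steps:
            raise RuntimeError("did not stabilize within max_steps")
        step(states, k, rng)
        steps += 1
    return steps
```

Starting from any list of counters, for example `stabilize([3, 0, 2, 2, 1], k=6, rng=random.Random(1))`, the loop ends once a single token is left; repeated calls to `step` afterwards pass that token to the next machine in the ring. Note that no machine ever checks whether the global state is faulty, which matches the description of the first self-stabilizing algorithms as constantly pushing the system towards a legitimate state rather than detecting errors explicitly.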
