Reliable communication in the presence of failures

Authors:
Kenneth P. Birman, Thomas A. Joseph
Published:
ACM Transactions on Computer Systems, volume 5, issue 1, pp. 47-76. January 5, 1987.
Abstract:

The design and correctness of a communication facility for a distributed computer system are reported on. The facility provides support for fault-tolerant process groups in the form of a family of reliable multicast protocols that can be used in both local- and wide-area networks. These protocols attain high levels of concurrency, while respecting application-specific delivery ordering constraints, and have varying cost and performance that depend on the degree of ordering desired. In particular, a protocol that enforces causal delivery orderings is introduced and shown to be a valuable alternative to conventional asynchronous communication protocols. The facility also ensures that the processes belonging to a fault-tolerant process group will observe consistent orderings of events affecting the group as a whole, including process failures, recoveries, migration, and dynamic changes to group properties like member rankings. A review of several uses for the protocols in the ISIS system, which supports fault-tolerant resilient objects and bulletin boards, illustrates the significant simplification of higher level algorithms made possible by our approach.

BibTeX:
@article{birman1987,
  title = {{Reliable communication in the presence of failures}},
  author = {Kenneth P. Birman and Thomas A. Joseph},
  journal = {ACM Transactions on Computer Systems},
  volume = {5},
  number = {1},
  pages = {47--76},
  year = 1987,
  month = 1,
  day = 5,
  doi = {10.1145/7351.7478},
}