Linux Clusters

Consider the following scenario. A group of eight roughly equal Pentiums running modified Linux kernels is networked together with 100BASE-T Ethernet cards. Given a clean network environment, they have the potential to form a single computing unit halfway between a multi-processor (SMP) machine and an interlocking set of NFS servers. The modified kernel is what allows them to do this and form a so-called Linux cluster.

A Linux cluster is an organic entity, flexible enough to come into existence on its own by sensing the immediate network environment and determining network latencies and processor speeds. It is meant to exist in a locally friendly network environment where latency is low, processors are roughly comparable in speed, and users (owners) are not averse to their neighbors stealing idle processor cycles.

History

Unix clusters are not a new idea. At MIT's Project Athena, VAX clusters were the norm. However, it is not clear why the concept of processor sharing never really took hold. Multi-processor architectures also have a significant history, but only rare success in the applied world. Most computers continue to be based around a single, fast, general-purpose processor, with specialized processors helping out on the peripherals.

It is clear, however, that a typical small office contains many PCs similar in capability, devoted to single users, but connected on a medium-speed network. So far such networks have been fast enough to support reliable file sharing, perhaps through a central disk server. The operating systems these PCs run have only recently become multi-tasking, which might make clustering possible.

Critical Concerns

- Why Do It?
- Normal Function
- Cluster Formation and Synchronization
- Failure Modes

Why Do It?

Everyone recognizes that a great deal of potential computing power is wasted when a PC's user is not actively running an intensive application (that is what screen savers are for).
However, the logistics of putting that idle power to use have discouraged many attempts. Yet because businesses, universities, and even living groups continue to buy powerful, dedicated, single-user machines, it is likely that PCs will continue to dominate the computational landscape for some time to come. Any progress in organizing these already networked machines into more powerful computational units may improve the response any given user sees to an individual request.

Normal Function

When all is functioning smoothly on a Linux cluster, an active user should notice only a speed-up in normal computing tasks. This speed-up comes from some tasks being parcelled out to other processors in the cluster. Certain disadvantages, such as a user having his local machine tied up by another user's task, need to be moderated by strict monitoring of activity and a willingness to terminate non-local jobs with prejudice when a local user returns. Techniques from a multi-processing version of the operating system may be useful for basic distribution of tasks, but certain features will have to be added to allow for network latency, failure modes, and user demands.

A simple example should illustrate this better. Jeff, Jerry, and John are users of a three-PC cluster. Each has a Linux box on his desk, and the machines are connected by a fast network. Certain working disks are NFS cross-mounted. Jeff leaves his desk to talk to someone in the hall. Jerry is on the phone, and may be in the editor occasionally typing a few sentences. John is compiling and testing a program. Clearly it makes sense for the cluster to utilize both Jeff's and John's machines for the compilation task, especially since many modules can be compiled as tasks independently of one another. Since Jerry's machine is not completely inactive, it is less of a candidate for intensive computational use. The cluster may decide to exclude Jerry's machine or to utilize it in a less intensive fashion.
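
The decision in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Machine record, the share and assign functions, and the thresholds IDLE_AFTER_S and LIGHT_LOAD are hypothetical names and values chosen for the sketch, not part of any existing kernel or daemon. An away owner's machine is offered fully, a lightly used machine is offered at a reduced weight, and a busy machine is left alone.

```python
from dataclasses import dataclass

# Both thresholds are assumptions made for this sketch.
IDLE_AFTER_S = 60.0  # owner counts as away after this many seconds without input
LIGHT_LOAD = 0.2     # below this CPU fraction the owner is only lightly active


@dataclass
class Machine:
    owner: str
    idle_seconds: float  # time since the owner's last keyboard or mouse input
    load: float          # fraction of CPU used by the owner's own work, 0.0 to 1.0


def share(machine, requester):
    """Return the weight (0.0 to 1.0) this machine offers to the requester's job."""
    if machine.owner == requester:
        return 1.0   # the requester's own machine always takes part
    if machine.idle_seconds >= IDLE_AFTER_S:
        return 1.0   # owner is away: offer the whole machine
    if machine.load < LIGHT_LOAD:
        return 0.25  # owner is lightly active: offer a reduced share
    return 0.0       # owner is busy: exclude the machine


def assign(modules, machines, requester):
    """Distribute independent modules across machines in proportion to their weight."""
    eligible = [(m.owner, share(m, requester)) for m in machines]
    eligible = [(owner, w) for owner, w in eligible if w > 0]
    plan = {owner: [] for owner, _ in eligible}
    for module in modules:
        # Pick the machine whose queue would stay shortest relative to its weight.
        owner, _ = min(eligible, key=lambda ow: (len(plan[ow[0]]) + 1) / ow[1])
        plan[owner].append(module)
    return plan


if __name__ == "__main__":
    cluster = [
        Machine("Jeff", idle_seconds=300.0, load=0.0),   # away from his desk
        Machine("Jerry", idle_seconds=5.0, load=0.05),   # typing now and then
        Machine("John", idle_seconds=0.0, load=0.9),     # running the compile
    ]
    modules = [f"module{i}.c" for i in range(9)]
    for owner, work in assign(modules, cluster, requester="John").items():
        print(owner, len(work))
```

With these weights, Jeff's and John's machines carry most of the modules and Jerry's receives only an occasional one. Setting the reduced share to zero corresponds to the other option mentioned above, excluding Jerry's machine entirely.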
Cluster Formation and Synchronization

Each Linux box starts as its own stand-alone machine. Much as NFS starts late in the boot-up process, it is reasonable for a machine to join a cluster only after the other OS basics have been accomplished. Certain elements will be necessary for the cluster to form. First, a suitable network environment must be present, both in the existence and in the speed of connections between compatible cluster components. Second, a set of handshakes must take place between the component machines to initiate an active cluster.

Much like NFS or FTP, a daemon would run on each component machine to mediate the interaction of the cluster components. This daemon is responsible for parcelling out task requests, initiating local responses to non-local requests, verifying cluster integrity, and closing out tasks (completions or aborts).

It is also reasonable to think about cluster synchronization. Although a cluster of any size may be formed and be decentralized in nature, some synchronization of cluster integrity may be useful. Since the machines on a PC network are hardwired together, synchronization may occur when a quorum of these machines (or some required file servers) is operationally ready. There may be a simple hierarchy among the machines about who joins whom (a pecking order), or it may be first-come, first-served. Verification of function (for shared file systems, network performance, and processor performance) is an essential part of this start-up sequence.

Failure Modes

The Linux cluster should be fail-safe. Since each machine maintains its own local environment, the worst-case scenario should be for a machine to return to its own local tasks. Component machines should be smart enough to make their own determinations of cluster integrity, and be willing to bail out of the cluster if necessary.
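
A minimal sketch of how one machine might make that determination for itself is given below. It assumes that a quorum means a strict majority of the configured cluster size and that a peer counts as healthy only if it is reachable and its measured latency is under a fixed limit. The text does not define either rule, and the names PeerStatus, has_quorum, next_mode, and MAX_LATENCY_MS are hypothetical.

```python
from dataclasses import dataclass

MAX_LATENCY_MS = 5.0  # assumed limit; the text only asks that latency be "low"


@dataclass
class PeerStatus:
    name: str
    reachable: bool    # did the peer answer the last handshake?
    latency_ms: float  # measured round-trip time to the peer


def healthy_peers(peers, max_latency_ms=MAX_LATENCY_MS):
    """Return the peers that answered and are close enough on the network."""
    return [p for p in peers if p.reachable and p.latency_ms <= max_latency_ms]


def has_quorum(peers, cluster_size, max_latency_ms=MAX_LATENCY_MS):
    """True if this machine plus its healthy peers form a strict majority.

    Majority voting is an assumption of this sketch, not a rule from the text.
    """
    members = 1 + len(healthy_peers(peers, max_latency_ms))  # count ourselves
    return members > cluster_size // 2


def next_mode(peers, cluster_size):
    """Decide locally whether to take part in the cluster or run stand-alone."""
    return "CLUSTERED" if has_quorum(peers, cluster_size) else "STANDALONE"


if __name__ == "__main__":
    # An eight-machine cluster as seen from one member: four peers look
    # healthy, one is too slow, and two did not answer.
    seen = [
        PeerStatus("pc2", True, 0.8),
        PeerStatus("pc3", True, 1.1),
        PeerStatus("pc4", True, 0.9),
        PeerStatus("pc5", True, 1.4),
        PeerStatus("pc6", True, 40.0),
        PeerStatus("pc7", False, 0.0),
        PeerStatus("pc8", False, 0.0),
    ]
    print(next_mode(seen, cluster_size=8))
```

Each machine runs the check independently, so a machine that loses sight of the quorum simply drops back to stand-alone operation without needing permission from the others.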
Tasks which may have been running non-locally need to be closed out without making a total mess. The most difficult part is the modification of files, and the rolling back of changes made during a partially completed task. A certain amount of risk is inherent any time a job is run on a flaky network, but users should not be penalized for a flaky implementation of a distributed scheme. Load distribution and guarantees of responsiveness to local users are also important, and may lead to the cancellation or preemption of otherwise healthy jobs. The management of these issues will be one of the keys to making the cluster concept succeed.

Conclusions

The possibility of Linux clusters in the near future is an exciting avenue for operating system development. The hardware to make this widespread is either already in place or will soon become so. A Linux cluster could be organic in formation and flexible in configuration. Strategies exist which could make it robust in function, responsive to local PC users, and fail-safe when failures occur. In summation, this is a logical and nifty extension to an already powerful operating system.