Each thread has its own stack memory dedicated to it, but accessible to all other threads, and its own copy of most of the processor registers, including the general purpose registers, flags, the stack pointer and the instruction pointer. Threads may also have a small amount of private data space known as thread local storage (TLS).
The thread's stack, copy of processor registers and TLS comprise the thread context. In order to switch to another thread, the OS typically saves the current thread context, including the values of the processor registers as of the time the thread was suspended, in a data structure known as a thread control block, or TCB, which is associated with the thread. The OS then loads the thread context for the new thread by loading the processor registers with the respective values found in the TCB of the new thread, including the value of the instruction pointer, and directs the processor to resume execution with the new thread's registers and instruction sequence.
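As a rough illustration (the structure and function names below are hypothetical and greatly simplified, not those of any particular operating system), a thread control block and a conceptual context switch might be sketched in C as follows; a real kernel performs the register save and restore in assembly:

#include <stdint.h>

/* Saved processor state for one thread (simplified model). */
typedef struct cpu_context {
    uint64_t general_regs[16];    /* general purpose registers    */
    uint64_t flags;               /* processor flags              */
    uint64_t stack_pointer;       /* top of the thread's stack    */
    uint64_t instruction_pointer; /* where execution resumes      */
} cpu_context_t;

/* Hypothetical thread control block (TCB). */
typedef struct tcb {
    int           thread_id;
    cpu_context_t context;        /* processor state saved at suspension */
    void         *stack_base;     /* this thread's dedicated stack       */
    void         *tls;            /* thread local storage                */
} tcb_t;

/* Conceptual thread switch: save the outgoing thread's context into its
 * TCB, then load the incoming thread's context from its TCB, after which
 * execution continues at next->context.instruction_pointer. */
void switch_thread(tcb_t *current, tcb_t *next, cpu_context_t *cpu)
{
    current->context = *cpu;      /* save outgoing thread state    */
    *cpu = next->context;         /* restore incoming thread state */
}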
In computers with multiple processors, the OS has the ability to schedule threads for execution on each of the processors. This makes it possible to execute more than one thread truly concurrently.
On such systems, the OS typically uses a similar thread-switching mechanism to direct each of the processors to execute one or another thread. Although they are somewhat more complex to develop, multithreaded applications have a number of advantages, including better responsiveness and the ability to react to multiple input events at the same time.
Most importantly, a program designed to operate as a collection of a large number of relatively independent threads can easily be scaled to run on multiple processors and thus achieve higher performance without further modifications to the program.
Despite the continuing rapid advancement in processor performance, the need for computing power tends to outpace the available processor performance. Increased performance is generally needed for one or more of the following three reasons: (a) to facilitate faster completion of programs; (b) to make it possible to process ever-growing amounts of data within the same time window; and (c) to make it possible to respond to external events, such as network communication, in real time.
To provide such an increased level of performance, many modern computers include more than one processor. One such architecture is symmetric multiprocessing (SMP), also known as uniform memory access (UMA). SMP provides all processors access to the same memory, known as common shared memory, with uniform access time. The typical interconnect between the processors and the memory is either a system bus or a crossbar switch. The benefit of the UMA architecture is that parallel programming is greatly simplified, in that processes can be considerably less sensitive to data placement in memory, since data can be accessed in the same amount of time regardless of the memory location used to hold it.
As more processors are added to an SMP system, both the shared memory and the interconnect through which processors access the memory quickly become a bottleneck. This bottleneck severely limits the scalability of the SMP architecture to a fairly low number of processors, typically between 2 and 8.
In NUMA (non-uniform memory access) computers, each processor is closely coupled with a dedicated memory device which forms its local memory. Additionally, an interconnect between the nodes ensures that each processor can access not only its local memory, but the memory attached to any other node (remote memory) as well, in a manner transparent to the executing program.
Access to such remote memory is typically slower than access to local memory. Instead of having only one processor per node, some higher-end NUMA systems have 2 or 4 processors connected in an SMP configuration in each of the nodes. This allows such systems to have more processors with fewer interconnected nodes. The drawback of a NUMA system is that performance-sensitive programs perform very differently depending on where data is placed in memory, i.e., on whether the data accessed by a processor resides in its local memory or in remote memory.
This is particularly critical for parallel computation and server programs, which may need to share large amounts of data and state between many threads. In both architectures, the granularity of the processor access to memory (local or remote) is an individual cache line. For certain applications, and with appropriate tuning of the application code, the NUMA architecture allows more processors to efficiently share the computer's memory than SMP does.
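As an illustrative sketch of this data-placement sensitivity (assuming a Linux system with the libnuma library installed and compiled with -lnuma; this example is not part of the original text), a program can explicitly place the same working set either on the local NUMA node or on a remote node, and timing the accesses would expose the local/remote latency difference:

#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    size_t size = 64 * 1024 * 1024;       /* 64 MB working set        */
    int    last = numa_max_node();        /* highest NUMA node number */

    void *local_buf  = numa_alloc_local(size);         /* placed on this node   */
    void *remote_buf = numa_alloc_onnode(size, last);  /* placed on node 'last' */
    if (!local_buf || !remote_buf) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    /* Touching the buffers faults the pages in on their respective nodes;
     * timing these loops would show the local vs. remote access cost. */
    memset(local_buf,  0, size);
    memset(remote_buf, 0, size);

    numa_free(local_buf,  size);
    numa_free(remote_buf, size);
    return 0;
}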
Both SMP and NUMA computers, however, are limited in the number of processors that can be included in a single computer and share its memory by the huge impact of the interconnect latency on the aggregate performance of the system. While NUMA installations with on the order of 1,000 processors exist, such as the SGI Altix, they depend on an expensive, proprietary interconnect that typically costs more per processor than the actual processors it connects, and they scale to only a very limited number of processors.
Further growth of such systems is limited by a number of fundamental factors, the most significant of which is the need for a store-and-forward switching architecture in the interconnect, which immediately introduces latencies on the order of microseconds or more.
There are several approaches to improving the performance of NUMA computers, including modifying the programming model to make programs aware of the non-uniform memory, designing faster and lower-latency interconnects, and using large caches. A software approach to further improving the performance of multiprocessing systems is described in a prior U.S. patent. One aspect of that system is locality management mechanisms for increasing the system throughput.
A hardware approach to further improving the performance of multiprocessing systems is described in another U.S. patent. These approaches provide only incremental improvements in performance, usually mitigated by the increased complexity of the system and quickly wiped out by advances in mass-produced hardware technologies.
Under all three of the existing shared-memory multiprocessor architectures (SMP, NUMA and COMA), the interconnect remains a bottleneck, and the extreme sensitivity of the system to the latency of the interconnect severely limits the scalability of multiprocessor computer systems.
Further, significant enhancements to computer performance, particularly the ability to radically increase the number of processors that work concurrently in the same system, are needed in order to meet the needs of high-performance computing systems.
An alternative to shared-memory multiprocessors is the cluster. One such clustered system architecture is known as the Beowulf cluster. In a Beowulf cluster, each of the computers runs its own OS and has a separate memory space. The computers are connected using a fast local area network, such as Gigabit Ethernet, Myrinet, or InfiniBand. The cluster system comprises a master node which is used to control the cluster and provide system services, such as a file system. The master node is connected via an Ethernet network to each of the servers; the master node may also be connected to other computers outside of the cluster. The servers perform data processing in the cluster; they may be conventional computers, such as the computer illustrated in the accompanying figures.
The servers are further connected to an InfiniBand interconnect using an InfiniBand switch. The interconnect is preferably used for MPI exchanges between the servers. One skilled in the art will appreciate that the components and interconnects of the depicted cluster may vary and that this example is not meant to imply architectural limitations with respect to the present invention. The major benefit of cluster systems is that it is generally easy to build clusters with many processors that can work in parallel.
Since a large number of nodes can easily be put together into a single system, low-cost, commodity nodes are preferred, radically reducing the total cost of the system. Clusters, therefore, provide large numbers of processors at a very aggressive price point. Since programs cannot count on being able to access all the memory in the system transparently, the requirements placed on the cluster interconnect are lower than in the shared-memory architectures, allowing the number of processors participating in the system to be increased beyond what is available with SMP and NUMA.
Clusters work best for applications that can be partitioned into largely independent units of work. In these applications, a large, often multi-dimensional space or data set is divided semi-manually among the computers in a cluster; once the data are partitioned, each computer proceeds independently to work on its designated part of the data set. Examples of such applications include 3D rendering of complex scenes in animation movies and the SETI@home project, which utilizes the unused processing time on thousands of desktop PCs to search for extra-terrestrial intelligence.
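As an illustrative sketch of this manual decomposition and the explicit communication it entails (using MPI, which the cluster described above uses for exchanges between servers; this is an illustration only, and a real cluster application would be far more involved), each rank works on its own explicitly computed slice of the data, and results must be gathered with explicit send and receive calls; compile with mpicc and launch with mpirun:

#include <mpi.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Manual partitioning: each rank owns an explicit slice of the data. */
    int chunk = N / nprocs;
    int start = rank * chunk;
    int end   = (rank == nprocs - 1) ? N : start + chunk;

    double partial = 0.0;
    for (int i = start; i < end; i++)
        partial += (double)i;                 /* stand-in for real work */

    /* Explicit communication: worker ranks send their results to rank 0. */
    if (rank != 0) {
        MPI_Send(&partial, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else {
        double total = partial, incoming;
        for (int src = 1; src < nprocs; src++) {
            MPI_Recv(&incoming, 1, MPI_DOUBLE, src, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            total += incoming;
        }
        printf("total = %f\n", total);
    }

    MPI_Finalize();
    return 0;
}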
The major drawback of cluster systems is that applications which cannot be partitioned in this way are unable to take advantage of the parallel operation for achieving a common goal. This includes most computation- and data-intensive problems such as ASIC simulations, weather research, drug discovery, oil and gas exploration, virtual reality, natural language translation, military and security applications, and many others, as well as the vast majority of business applications of computing in the enterprise.
In most applications, harnessing the computing power of a cluster system depends critically on the design and implementation of the software applications, and it is very hard to create parallel programs for clusters. In a typical example, a processor cluster can be assembled and made to work within 1 to 2 months; however, writing and tuning the software to take advantage of this cluster takes between 2 and 5 years.
Cluster programming typically relies on explicit message-passing interfaces such as MPI, as sketched above. All these interfaces require the programmer to manually break down the target application into a number of cooperating applications that can execute in parallel and that need to call the interfaces explicitly for each access to remote data or for synchronization with another node. The primary approach to making it easier to write parallel programs for clusters has been to provide cluster management and load-balancing solutions.
Single system image (SSI) encompasses a number of management programs and technologies whose goal is to provide the same environment for any application, no matter on which node it is running, including inter-process communication between processes running on different nodes. Another approach is Grid Computing, as described in [Grid], which is hereby included in its entirety by reference. Grids are geographically distributed systems that can share resources, using uniform resource directories and security models.
However, they do not provide improved computing power to a single process. Some clusters additionally provide Distributed Shared Memory (DSM) mechanisms, allowing processes that run on different nodes to interact using regions of shared memory, similar to the way processes on the same computer interact using IPC shared memory.
In practice, however, the performance of DSM is often severely limited by the interconnect and by excessive page thrashing. In addition, DSM introduces subtle and hard-to-diagnose side effects. Because of this, despite having been a hot research topic for a number of years, DSM systems are rarely used today, with users reporting generally negative experiences.
While these systems make it possible to write clustered applications and to virtualize computing resources to a certain degree, they fail to make it any easier to write parallel applications that can take advantage of the combined computing power of the nodes in a cluster.
As a result, cluster systems still require special, distributed implementations in order to execute parallel programs. Such implementations are extremely hard and costly to develop, in many cases far exceeding the cost of the hardware on which they run. Also, such implementations are commercially feasible for only a very limited set of computational applications, and specifically are not viable in cases where the parallel program needs to share a large amount of state and can have unpredictable loads, both of which are characteristic of most parallel applications.
Several approaches attempt to alleviate the difficulty of writing parallel applications for clusters by automating the distribution of programs on a cluster using specialized software.
One approach is described in [Mosix], which is hereby included in its entirety by reference. This system creates a fairly awkward division of migrated processes between a home node that keeps the OS resources such as files allocated by the process, and the node that currently executes the process, causing a significant bottleneck for applications that need to access OS resources other than memory during their execution.
Further, the system does not in any way make it possible for a single parallel program to use more resources than are available on any single node—it merely migrates whole processes from one node to another. Another approach is described in [Panda], which is hereby included in its entirety by reference. This system provides a virtual machine that supports implementing run-time systems for parallel programming languages, such as Orca and Amoeba.
It requires hard-to-use message passing and complex explicit coordination of threads running on different nodes of the cluster, and it is not intended to run native applications. One other approach is described in [Chant], which is hereby included in its entirety by reference. This system provides a run-time system supporting lightweight threads in a distributed, non-shared memory environment. In its thread groups, called ropes, member threads can reside in different processes on different nodes in the cluster.
This system requires programs to be written specifically to use ropes and to communicate with messages between threads. Yet another approach is described in [Millipede], which is hereby included in its entirety by reference.
This system provides thread migration in distributed shared memory DSM systems, implemented in user mode in Windows NT, using the Win32 interface. Threads are distributed and migrated between nodes in the cluster, based on the processor load on these nodes.
While this system allows threads to use an explicitly defined shared memory region for interacting between nodes, it does not provide common memory for the process, thus requiring that parallel programs are written specifically to take advantage of it.
Further, threads that own system resources cannot be migrated, and threads cannot interact with each other using OS-provided synchronization objects. This combination of limitations makes the system unsuitable for practical applications.
One other approach is described in another U.S. patent. That system provides execution of tasks (processes) by distributing user-mode threads within a distributed task sharing a virtual storage space. It also has a user-level thread control mechanism within a distributed task and uses context switching in a user distributed virtual shared memory space in order to distribute the load among the nodes without actually transferring threads within the distributed task.
This system deals only with balancing the load among the nodes for memory access, and it requires that programs be specifically written to take advantage of the load distribution. A variation of the above approach is described in [Sudo], which is hereby included in its entirety by reference.
This system distributes program-defined user mode threads to pools of kernel-mode threads running on different nodes in the cluster. The system uses methods such as correlation scheduling and suspension scheduling to reduce page-thrashing in a distributed virtual address space shared by all threads in the system.
This system uses very coarse-grained load balancing, and it performs thread migration primarily when a thread is not yet running, either because it has not been scheduled or because it has been suspended from scheduling to reduce page thrashing in its future execution. The system deals only with the issue of load balancing between the nodes for memory access, and it requires parallel programs to be specifically written to take advantage of the load distribution.
Another approach is described in [Nomad], which is hereby included in its entirety by reference. This system provides a combination of transparency of data location, as various other distributed shared memory systems provide, and a transparency of processing location using a thread migration mechanism. The system uses the MMU to provide a page-granularity distributed shared memory space for a process. When an instruction within a thread attempts to access memory that is not present on the local node, the node sends portions of the thread's stack to the node where the memory page resides; the second node decides whether to accept the thread for execution on the second node or to send back to the first node the memory page requested by the thread.
The system does not provide a solution to the issue of routing access to OS resources other than memory, such as synchronization objects and files. As a result, its applicability is limited to parallel threads that don't use synchronization objects, files and other key OS resources, which makes the system unusable for most practical applications. Yet another approach is described in [D-CVM], which is hereby included in its entirety by reference.
This system uses active correlation tracking in order to better decide how to distribute the threads among the nodes. This system has limitations characteristic of the above described systems.
In particular, it requires that threads execute system calls only during the initialization phase, in which thread migration is disabled. This significant limitation prevents the use of the system for most practical applications. In yet another system, a global name server assigns a unique identification number to each process and establishes a distributed process context which represents a shared address space for the process, so that logical addresses of the distributed process are divided among physical addresses corresponding to memory locations in different nodes.
Whenever a new thread is to start, the request is sent to the global name server, which chooses a node on which the thread is to run. Additionally, when a thread needs to access memory, it sends a request to the global name server, which forwards the request to the node where the physical memory resides. This system provides only static distribution of threads at thread start time, which causes an unbalanced load on the system as the execution proceeds. In addition, the system's reliance on a global name server for frequent operations, such as memory access, is likely to become a significant bottleneck as the number of nodes and threads increases.
As a result, the system is not usable for practical applications. Yet another approach is described in [Jessica2], which is hereby incorporated in its entirety by reference.
This system is a distributed Java virtual machine targeted to execute multithreaded Java applications transparently on clusters. It provides a single-system-image illusion to Java applications using a global object space layer. The system further includes a thread migration mechanism to enable dynamic load balancing by migrating Java threads between cluster nodes; such migration is transparent to the application, i.e., the application code need not be aware of it.
The mechanisms used in this system are specific to the Java language and virtual machine, and cannot be applied to native programs. This prevents the use of existing native applications on the system and requires that all distributed applications be written in Java and not use native libraries. As a result, the applicability of this system is severely limited, and this limitation is further exacerbated by some of Java's inherent limitations, such as reduced performance due to virtualization and non-deterministic response due to garbage collection.
The problem of scaling performance in multiprocessor computer systems is one of the top issues in computer system design. Solving this problem is key to most further advances in computing technology, business, science, and practically all areas of human existence.
The existing systems, as illustrated above, do not provide a satisfactory solution to this important problem. While a variety of approaches have been tried, none of them has provided a commercially feasible solution that is applicable to a significant majority of applications, or that scales sufficiently to even approach the estimated needs for the next five to seven years. The performance of single-processor computers is limited by the advances of technology in electronic chips.
While the SMP and NUMA systems provide the correct programming model for parallel applications, implementations of these systems do not scale well due to their extreme sensitivity to interconnect latencies. Moreover, even modest configurations of such systems are very expensive. Many applications that need high performance simply cannot afford the economics of such systems.
Those that can, such as enterprise data center systems, continue to suffer from the limited scalability of these systems. Clusters of computers provide the promise of unlimited hardware scalability at commodity prices. In most applications, however, this comes at the expense of very costly and complex software development, which takes years and is feasible only for a very limited set of applications. Even after such efforts are undertaken, cluster applications often exhibit poor speed-up, with saturation points at relatively small numbers of processors.
Due to the any-to-any connection between nodes, larger clusters suffer from exponential complexity as new nodes are added to the cluster. Clusters do not provide the ability to run SMP applications in a distributed fashion. Process migration and SSI systems deal only with whole-process migration, essentially leaving all the software development complexities extant in cluster systems.
These systems do not provide the ability to run SMP applications in a distributed fashion. Existing thread migration systems provide partial solutions to the problem of load balancing for processing and memory access. This comes, however, at the price of severe limitations on what the distributed applications can and cannot do during their execution; in fact, the very transparency of migration makes it impossible for applications to even attempt to take care explicitly of OS resources that are not supported by the existing systems.
None of those systems provides a practical solution to the problem of executing typical SMP applications, technical or commercial, using the aggregated computing resources of multiple computers in a network. Virtual machines provide a more complete illusion of a common computer. However, each works only for a specific programming language, e.g., Java.
Virtual machines, in general, tend to impose many limitations on the parallel programs, such as much lower performance and non-deterministic response times. This makes them a poor match for many practical programs, including technical computations, database engines, high-performance servers, etcetera.
Further, applications running on distributed virtual machines are severely limited in the native OS resources that they can access, making such virtual machines an impractical solution. All of the described approaches are significantly limited in their scalability. The fundamental mechanisms on which those systems are based, such as cache snooping in SMP and NUMA systems, and explicit, any-to-any software connections in cluster systems, result in a quadratic increase of complexity as the number of nodes increases.
This leads to rapid saturation of the performance capacity, where the addition of more processors or nodes does not result in a notable increase in performance. There is a clear need for significant improvements in several key areas. These areas include: the ability to run native SMP programs transparently on a network of computers while allowing them to use native OS objects and interfaces; the ability to run native programs written for SMP computers on a distributed network of computers with little or no modification to the programs; and, most importantly, the ability to provide incremental and practically unlimited scalability in the number of nodes in such systems, so that sufficient computing power can be made readily and cost-effectively available to a wide range of practical parallel applications.
It is now, therefore, an object of the present invention to provide a system for executing applications designed to run on a single SMP computer on an easily scalable network of computers, while providing each application with computing resources, including processing power, memory and other resources that exceed the resources available on any single computer.
The present invention provides such a system, called an aggregated grid, as well as a number of innovative components used to build it. These components include an inventive server agent, an inventive grid switch and an inventive grid controller.
The invention further provides a scalable, open-systems architecture for high-performance computing systems built from commodity servers and cost-effective interconnects. Another aspect of the present invention is a method for creating an aggregated process that is distributed transparently over multiple servers and uses up to the full combined amount of resources available on these servers.
Yet another aspect of the present invention is a method for distributing resources of the aggregated process among multiple servers. A variation of this method uses configurable distribution disciplines in order to efficiently distribute the resources. Another aspect of the present invention includes four methods for transparently providing access to distributed resources for an application, the methods including an RPC method, a thread hop method, a resource caching method and a resource reassignment method.
One other aspect of the present invention is a set of methods and an apparatus for intercepting resource creation and resource access by an application in order to transparently provide said creation and access to the application while distributing the resources among multiple servers.
Said methods and apparatus are applicable for a wide variety of OS resources. Another aspect of the present invention is an aggregated process context that combines multiple process contexts from different servers to present the illusion of a combined, single process context for executing a single parallel application on multiple servers.
Yet another aspect of the present invention is a method for adding servers and grid switches to an existing aggregated grid in order to increase the performance and capacity of the aggregated grid, without interfering with processes already running on the aggregated grid. One other aspect of the present invention is a method for establishing a hierarchical structure of grid switches allowing the execution of aggregated processes over an arbitrarily large number of servers in order to achieve a desired level of performance and resources for those processes.
Another aspect of the present invention is a method for efficiently assigning unique process identifiers in the system, which is compatible with existing process identification schemes on a single server and provides unique identification across multiple servers. Yet another aspect of the present invention is a uniform set of methods for saving and restoring the state of resources, so that any resource can be saved and restored in a generic way, without knowledge of the specifics of the resource type.
The method is further expanded to allow moving of resources from one server to another while preserving the state of said resources. Another aspect of the present invention is a method for load rebalancing between servers in order to achieve higher performance for an application and better utilization of the servers.
Yet another aspect of the present invention is a method for changing the number of servers used to run a single application so that the maximum performance can be reached with the minimum number of servers being used. One other aspect of the present invention is a method for transferring the control of processes from one grid switch to another, so that the load between multiple grid switches can be balanced and grid switches can be shut down for maintenance without interfering with processes running in the system.
Another aspect of the present invention is a method for automatic, transparent checkpointing of applications, allowing applications to continue after a system failure. Yet another aspect of the present invention is a method for including servers with different processor architectures, including legacy servers, as servers in the aggregated grid. A first advantage of the present invention is the ability to build a scalable parallel computing infrastructure using mass-produced, commodity-priced network infrastructure and computing servers.
Another advantage of the present invention is the ability to build parallel computing systems with a very large amount of memory that can be available, and appear as the memory of a single SMP computer, to even a single application. Another advantage of the present invention is the ability to manage the network of computers that make up the aggregated grid as a single server, using standard management tools and applications. This results in a lower entry barrier, easier management, and a lower total cost of ownership of the system.
Yet another advantage of the present invention is the provision of a scalable parallel computer system whose performance and capacity can be easily changed across a wide dynamic range of values. This capability allows a lower entry barrier for such systems, starting with as little as two computing servers and a single grid switch and growing the aggregated grid to thousands of servers and hundreds of switches as additional computing power is desired.
One other advantage of the present invention is the provision of a scalable parallel computer system in which additional performance and capacity can be obtained by expanding the system with additional servers and switches rather than replacing the system. This removes the need for forklift upgrades and results in lower cost and uninterrupted business operation.
Another advantage of the present invention is the provision of a highly scalable parallel computer system in which resource overprovisioning is not necessary. This allows the owners of the system to add performance at a lower cost (since additional servers can be purchased at a later time at lower cost), resulting in a lower system price and a better return on investment.
One other advantage of the present invention is the provision of a parallel computer system which executes existing cluster applications with much higher performance, resulting in lower cost of cluster systems and faster completion of computing tasks. Another advantage of the present invention is the provision of a scalable, high-performance parallel computer system at a significantly lower price, which allows such systems to be implemented and owned by departments and workgroups as opposed to data centers.
This allows departments and workgroups to have easier access to high-performance parallel computing and tailor the characteristics and power of the system to their specific projects.
Yet another advantage of the present invention is the provision of a scalable, high-performance parallel computer system at a significantly lower price, which allows existing business models to be executed with lower capital requirements.
For example, biotechnology, simulation and design startups will require less capital investment in computer systems without reducing their computing capacity, and will have the ability to achieve their results faster.
Further, such lower-cost systems enable new business models that were previously not feasible due to high entry barriers or capital requirements exceeding the business opportunity; for example, individuals or companies that develop applications can provide access to these applications using the application service provider model on a subscription or pay-per-use basis, starting with very small systems and scaling their infrastructure quickly to match sales demand.
One other advantage of the present invention is enabling the vendors of mass-produced computer systems to sell a greatly increased amount of their products into a market currently unavailable to them, displacing the more expensive and limited high-end RISC servers and mainframes. This also enables the vendors who produce and sell components of such mass-produced systems to significantly increase their sales; such vendors include processor and other chip vendors, board vendors, interconnect switch and host adapter card vendors, etcetera.
Another advantage of the present invention is enabling the vendors of software for high-end RISC servers and mainframes to increase the sales of their products as the lower cost of aggregated grids increases the number of high-performance parallel computer systems that are deployed and needing such software. Yet another advantage of the present invention is the ability to partition a scalable multiprocessor computing system built as network of separate computers, so that a given application will execute on no more than a specified number of computers with a specified maximum number of processors.
Because some software applications are licensed per processor and require a license for each processor which may execute the application, the ability to limit the number of processors for a given application makes it possible to match the software licensing costs to the performance needed by the application rather than to the total number of processors present in the computing system. One other advantage of the present invention is an apparatus for aggregating computing resources using the above-mentioned method.
Yet another advantage of the present invention is a system of computers in a network that allows the execution of parallel computing programs utilizing the aggregate of the computing resources of the computers that make up the network.
Another advantage of the present invention is a method and apparatus for providing access to computing resources distributed among multiple computers to a single application process. One other advantage of the present invention is a method and apparatus for scaling the performance and capacity of parallel computing systems using intelligent grid switches.
Yet another advantage of the present invention is a method for thread hopping using a single network frame and transferring a single page of the thread's stack. Another advantage of the present invention is a method and apparatus for presenting the aggregated grid, which is a network of separate computers, as a scalable parallel processing system that appears as a single computer for the purpose of management and can be managed using tools and applications built for single computers.
Yet another advantage of the present invention is a uniform method for accessing computing resources in a network of cooperating computing servers. The method allows dynamic reassignment and caching of resources, as well as thread hopping, and enables parallel applications to access all types of computer and OS resources which are distributed among multiple servers. One other advantage of the present invention is a method for providing access to remote resources for the purpose of executing binary instructions in CISC computers, where the instructions refer to two resources that reside on different computer servers.
Another advantage of the present invention is the ability to control the resource aggregation and resource switching in the aggregated grid using rules and policies defined specifically for certain applications.
Yet another advantage of the present invention is the ability to dynamically partition a scalable high-performance computing system using rules and policies, designating that certain applications run on certain subsets of the servers in the system, providing assured quality of service and protection among parallel programs.
One other advantage of the present invention is the ability to automatically and transparently checkpoint an application, without the application having program code specifically for checkpointing, so that the application can continue execution from a last recorded checkpoint after a system failure.
The various embodiments, features and advances of the present invention will be understood more completely hereinafter as a result of a detailed description thereof, in which reference will be made to the accompanying drawings. The preferred embodiment of the present invention will now be described in detail with reference to the drawings. The main purpose of the inventive system is to execute applications designed to run on a single SMP computer on an easily scalable network of computers, while providing each application with computing resources, including processing power, memory and others, that exceed the resources available on any single computer.
The symmetric multiprocessing (SMP) model is the only programming model known in the art in which the complexity and cost of developing software that can scale linearly on a system with hundreds of processors is not much different from the complexity and cost of building software that scales to two processors. Since modern servers and high-end workstations routinely include at least two processors, the symmetric multiprocessing model is already widely adopted and well understood.
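As a minimal sketch of why the SMP model is attractive (POSIX threads; illustrative only and not part of the original text), all threads of a process share the entire address space, so worker threads can be added without any explicit data partitioning or message passing; compile with -pthread:

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define N           (1 << 20)

static double data[N];                 /* shared by all threads transparently */
static double partial[NUM_THREADS];    /* per-thread partial results          */

static void *worker(void *arg)
{
    long id = (long)arg;
    long chunk = N / NUM_THREADS;
    double sum = 0.0;
    for (long i = id * chunk; i < (id + 1) * chunk; i++)
        sum += data[i];                /* direct access to shared memory */
    partial[id] = sum;
    return NULL;
}

int main(void)
{
    for (long i = 0; i < N; i++)
        data[i] = 1.0;

    pthread_t threads[NUM_THREADS];
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, worker, (void *)t);

    double total = 0.0;
    for (long t = 0; t < NUM_THREADS; t++) {
        pthread_join(threads[t], NULL);
        total += partial[t];
    }
    printf("total = %f\n", total);
    return 0;
}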
By executing SMP applications on a network of computers, the inventive system takes advantage of the inherent unlimited scalability of the networking interconnect, making it easy and inexpensive to scale applications to any desired performance. Since the inventive system scales easily to large numbers of network nodes, it does not require particularly high performance from each individual node. Thus, the inventive system can be used to build the equivalent of today's expensive high-performance computing systems at a significant cost advantage, by combining commodity computers and commodity networks.
The inventive system can be initially built with a few computers and a simple network. As performance needs increase over time, the system can be easily upgraded by adding more computers and more interconnect bandwidth. Thus, the inventive system eliminates the need to over-provision hardware in advance, and significantly lowers the entry barrier for acquisition of high-performance computing systems. The inventive system also provides the ability to execute already existing software applications designed for today's expensive high-end computers with little or no modification.
This makes it possible to replace the expensive equipment with a network of commodity computers while preserving the huge existing investment in software. The system comprises a plurality of servers; a grid switch; and a grid interconnect formed by network connections between the servers and the grid switch. The connections are preferably direct connections, such as cable or fiber, between the servers and the grid switch. Each of the servers is further equipped with a host adapter for the grid interconnect.
Grid Switch. The grid switch is preferably a computing appliance specially programmed according to the present invention. The grid switch is further equipped with a high-performance peripheral interconnect and multiple host adapters for the grid interconnect. The grid interconnect is a network that provides communications and data transfer between the servers and the grid switch, collectively referred to as nodes. The grid interconnect is preferably a network capable of transferring data between the interconnected nodes at speeds comparable to the speed with which a processor can access its local memory within a single computer.
Examples of such networks include 10 Gigabit networks, e.g., 10 Gigabit Ethernet. Interconnects with lower latency are preferred for the network connections. However, unlike prior art systems, the design of the inventive system is specifically tailored to work efficiently even with higher-latency networks, such as switched Ethernet.
While the illustrated embodiment of the inventive system comprises three servers and a single grid switch, one skilled in the art will appreciate that the system can be extended to contain many more servers and grid switches.
Each of the servers is programmed with and executes an operating system; an inventive agent; one or more applications; and any other software that a server is typically programmed to execute.
Further, each of the servers executes one or more applications. One of the applications is a standard application and executes as a conventional process on a single server. Another application executes as the inventive aggregated process, running contemporaneously on two of the servers. The operating system on each of these two servers sees that application as a conventional process running locally; in the inventive system, these local processes are called member processes of the aggregated process. The conventional process interacts with its operating system through the OS API; likewise, each member process interacts with the operating system of its respective server through that server's API.
From the standpoint of the application and its program code, the application executes as if it were running on a single server that has the combined computing resources of the two servers. For example, if each of the servers has 2 processors, the application can have up to 4 threads running concurrently, as if they were executing on a single SMP server with 4 processors.
Further in this example, if each of the servers has 4 GB of available memory for executing applications, the application will find that it can access up to 8 GB of memory without causing the operating system to page memory to disk.
The execution of the application in this advantageous fashion is provided by the inventive system and does not require that the application code explicitly divide its threads and memory usage so that they can be run on independent servers. This division is automatically provided by the inventive system, transparently to the application, thus simplifying the programming of the application and allowing dynamic determination of the number of servers on which the application will be executed.
The agents preferably intercept the operating system API on their respective servers in order to track resource requests made by processes that run on those servers. When a process makes a resource request to the operating system, the agent running on the same server intercepts the request and may pass it either to the operating system or to the grid switch. Agents pass requests to the grid switch by forming messages and sending them to the grid switch. An agent may further receive requests from the grid switch in the form of messages; in response to such requests, the agent interacts with the operating system on its server regarding access to computing resources or regarding the context of the processes running on the server.
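As a purely conceptual sketch of this interception (all type, field and function names below are hypothetical illustrations, not the actual implementation of the inventive agent), an intercepted API call is either satisfied by the local operating system or packaged into a message for the grid switch:

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint32_t request_type;    /* e.g. allocate, open, lock            */
    uint64_t process_id;      /* system-wide process ID of the caller */
    uint64_t payload[4];      /* request-specific arguments           */
} grid_message_t;

/* Stub policy and transport helpers, standing in for real agent logic. */
static bool resource_should_stay_local(uint32_t type, uint64_t pid)
{
    (void)pid;
    return type == 0;          /* stub policy: request type 0 stays local */
}
static long local_os_handle(uint32_t type, uint64_t pid, const uint64_t *args)
{
    (void)type; (void)pid; (void)args;
    return 0;                  /* stub: pretend the local OS succeeded */
}
static long send_to_grid_switch(const grid_message_t *msg)
{
    (void)msg;
    return 0;                  /* stub: pretend the grid switch replied OK */
}

/* Intercepted entry point, called in place of the native OS API.
 * 'args' is assumed to point to at least four request arguments. */
long agent_intercept(uint32_t request_type, uint64_t pid, const uint64_t *args)
{
    if (resource_should_stay_local(request_type, pid)) {
        return local_os_handle(request_type, pid, args);   /* handled locally */
    }

    /* Otherwise forward the request to the grid switch as a message. */
    grid_message_t msg = { .request_type = request_type, .process_id = pid };
    for (int i = 0; i < 4; i++)
        msg.payload[i] = args ? args[i] : 0;
    return send_to_grid_switch(&msg);
}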
The purpose of the agents is to provide transparency of resource access and distribution for the application code of applications running on the servers, and transparency of application distribution for the operating systems on these servers, so that (a) the applications' code doesn't need to be modified to execute on the inventive system, and (b) the operating system code doesn't need to be modified, other than by installing the agent, to execute on the inventive system. The grid switch interacts with the agents: it receives messages containing resource requests, and it sends messages containing responses to resource requests, as well as requests to the agents.
The grid switch switches each request it receives from the agents, generally by forwarding the same request, or by generating a different request, to an agent on a different server. In general, the grid switch handles requests from a server in one of the following ways: (a) satisfy the request itself; (b) forward the request to another server; (c) form a new request and send it to another server; or (d) generate multiple requests and send them to some or all of the servers.
Further, the grid switch may wait for a response from all the servers to which it forwarded a request before responding to the original request, or wait for a response from at least one such server. One of the key functions of a grid switch involves tracking and controlling the following activities performed by applications in the inventive system: (a) process creation; (b) resource allocation; and (c) resource access across servers. Process Creation. Whenever a grid switch receives a process creation request, it preferably first determines whether the new process will execute on a single server as a conventional process or on multiple servers as an aggregated process.
When it chooses the conventional process option, the grid switch then selects the particular server on which the process will be created and forwards the process creation request to that server. When the grid switch chooses the aggregated process option, it then selects the subset of servers on which the aggregated process will execute, and submits a process creation request to each of the selected servers, thereby causing the member processes to be created.
Resource Allocation. When a grid switch receives a resource allocation request, it preferably selects a server on which the new resource is to be created and forwards the allocation request to that server. Additionally, it remembers on which server the resource was allocated, so it can direct future access requests for the resource to that server.
By selecting the server for each resource allocation request in accordance with a specific policy or algorithm, the grid switch can balance the resource utilization across the servers of the inventive system. Resource Access. When a grid switch receives a resource access request, it locates the server on which the resource was allocated and forwards the request to it. Additionally, the grid switch may modify the request in order to allow for caching and other access options described further in this disclosure.
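The following is a conceptual sketch of this switching behavior (hypothetical names and a deliberately simplified round-robin placement policy, not the actual grid switch implementation): the switch records the owning server for each allocated resource and routes later access requests to that server, while process creation requests are placed on a selected server or, for an aggregated process, on a selected subset of servers:

#include <stdint.h>

#define MAX_RESOURCES 4096
#define NUM_SERVERS   3

typedef enum { REQ_PROCESS_CREATE, REQ_RESOURCE_ALLOC, REQ_RESOURCE_ACCESS } req_type_t;

typedef struct {
    req_type_t type;
    uint64_t   resource_id;
    uint64_t   process_id;
} request_t;

static int resource_owner[MAX_RESOURCES];  /* resource slot -> server index */
static int next_server;                    /* simple round-robin placement  */

/* Hypothetical transport stub: deliver a request to a server's agent. */
static void forward_to_server(int server, const request_t *req)
{
    (void)server; (void)req;  /* in a real system: send over the grid interconnect */
}

void grid_switch_handle(const request_t *req)
{
    switch (req->type) {
    case REQ_PROCESS_CREATE:
        /* Conventional process: pick one server.  For an aggregated process
         * the switch would instead send creation requests to a chosen subset
         * of servers, creating the member processes. */
        forward_to_server(next_server, req);
        next_server = (next_server + 1) % NUM_SERVERS;
        break;

    case REQ_RESOURCE_ALLOC:
        /* Place the resource and remember its owner for later accesses. */
        resource_owner[req->resource_id % MAX_RESOURCES] = next_server;
        forward_to_server(next_server, req);
        next_server = (next_server + 1) % NUM_SERVERS;
        break;

    case REQ_RESOURCE_ACCESS:
        /* Route the access to the server on which the resource resides. */
        forward_to_server(resource_owner[req->resource_id % MAX_RESOURCES], req);
        break;
    }
}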
Although the described system runs only three applications, one skilled in the art will appreciate that each of the servers can run any reasonable number of applications, each application running either as a conventional process or as an aggregated process. Operating systems conventionally assign each process a process identifier (process ID); this allows operating systems to distinguish processes running in a computer and uniquely identify such processes.
Process ID assignment in operating systems such as Linux is quite complex and involved. The operating system, as well as accepted programming practices, place requirements on how long a process ID of a terminated process must remain unused in order to minimize the potential for ID conflicts.
To satisfy these requirements, the operating system often utilizes complicated algorithms for generating process ID, which make the generation of a new process ID a time-consuming operation. Even with recently improved process ID generation algorithms using hash tables, the algorithms still require a search through the table of all existing processes on the server in order to ensure uniqueness of each new ID.
Since the inventive system distributes and aggregates processes across multiple servers, it preferably requires the processes to be identified by a process ID that is unique within the whole system, not only within any of the servers.
Due to the scalability requirements of the system, generating the process ID on a system-wide basis using an algorithm similar to the one used by an operating system on a single server would result in a prohibitively long time for creating new processes. For this reason, the inventive system provides means for generating process IDs that (a) ensure that process identifiers are unique within the system, (b) do not require a search through all existing process identifiers in order to generate a new process ID, and (c) do not require a centralized service for generating process IDs in the system, which could quickly become a bottleneck.
The server on which the process creation originates assigns a process ID to the process that is about to be created. To ensure the uniqueness of the process ID across the system, the server creates the new process ID by combining its own server ID, which is guaranteed to be unique in the system scope, with a locally assigned process number that is guaranteed to be unique within this server.
The locally assigned process number can be assigned in a variety of ways, preferably using the existing process ID generation mechanism in the operating system, which already guarantees uniqueness within the single server and compliance with the requirements for ID reuse. A 64-bit process ID is preferably computed by putting the server ID in the high-order 32 bits of the process ID and putting the server's locally assigned process number in the low-order 32 bits of the process ID.
This way, the resulting process ID is guaranteed to be unique within the system without having to verify its uniqueness by comparing it to all other process identifiers of running processes. Since each server generates its own process identifiers for the processes it initiates, there is no central service that can become a bottleneck.
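A minimal sketch of this ID scheme (the function names are illustrative, not part of the original text) is shown below: the server ID occupies the high-order 32 bits and the locally assigned process number occupies the low-order 32 bits, so no global lookup is needed to guarantee uniqueness:

#include <stdint.h>
#include <stdio.h>

/* Compose a system-wide 64-bit process ID from a unique server ID and a
 * locally assigned process number (e.g. the OS-assigned local PID). */
static inline uint64_t make_grid_pid(uint32_t server_id, uint32_t local_pid)
{
    return ((uint64_t)server_id << 32) | (uint64_t)local_pid;
}

static inline uint32_t grid_pid_server(uint64_t pid) { return (uint32_t)(pid >> 32); }
static inline uint32_t grid_pid_local(uint64_t pid)  { return (uint32_t)(pid & 0xFFFFFFFFu); }

int main(void)
{
    /* Example: server 7 creates a process whose local process number is 8. */
    uint64_t pid = make_grid_pid(7, 8);
    printf("pid=%llu server=%u local=%u\n",
           (unsigned long long)pid, grid_pid_server(pid), grid_pid_local(pid));
    return 0;
}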
One skilled in the art will appreciate that other embodiments of the invention can use different ways for combining the system-wide server ID and the locally assigned process number in order to create a system-wide process ID, and that the invention applies equally regardless of the particular way of such combination.
Equally, one skilled in the art will recognize that a different schema for generating process ID can be used, as long as the schema guarantees that the resulting process identifiers will be unique within the scope of the whole scalable system in a way that scales well with the system.
For applications that run as aggregated processes, the process ID of the aggregated process is assigned in the same way by the server that initiates the process creation. The assignment of process ID is independent of whether the process is going to be created as a conventional process running on a single server or as an aggregated process running on multiple servers.
Once the process ID is assigned to an aggregated process, all its member processes are preferably assigned the same process ID. For example, if the aggregated process is assigned a process ID of 8 by the server that initiated its creation, then the operating system will have the member process identified by the process ID of 8, and the operating system will have the member process identified by the process ID of 8. Luis F. Back to Top. Lalos Industrial Systems Institute, Greece.