Virtualizing METACenter Resources Using Magrathea

CESNET technical report number 25/2007
also available in PDF, PostScript, and XML formats.

Jiří Denemark, Miroslav Ruda, Luděk Matyska
18.12.2007

1   Abstract

Contemporary production grids do not usually offer the flexibility users are looking for. While different user communities have often contradictory requirements on the operating system, libraries, and applications, the production Grids provide only one rigid environment. This rigidness can be overcome by virtualization, when every user community or even individual user can be provided with its own instance of a virtual Grid, running optimized and tailored operating system and services. The promise of higher flexibility of virtual Grids is compensated by the increase in scheduling complexity. In this report, we present the Magrathea system that extends the Grid resource management systems with support for virtual environment. After discussing the design requirements, we introduce the Magrathea architecture that consists of three components: the master and slave processes running on virtualized resources and the cache process to provide the information about virtual machine state to the scheduler. Two virtual machines sharing one physical resource and used exclusively, preemption of a lower priority job running in a virtual machine, support for more than two concurrently virtual machines (domains) and support for "frozen" services that are repeatedly invoked and suspended are the use scenarios discussed in the second part of the report. We demonstrate how they are supported by the Magrathea system and what modifications to the Grid resource management system are necessary. The Magrathea is currently in the pre-production use on the Czech national Grid environment METACenter. This report is an extended version of the paper called "Scheduling Virtual Grids: the Magrathea System", which was presented at VTDC 2007.

2   Introduction

Large-scale distributed computing and storage systems (Grids) already started to be used by many scientific communities as indispensable tool supporting their research. Successful Grid deployment attracts new communities, whose computing requirements and patterns of use differ from the communities that initiated the Grid development and deployment. Also, as Grids are an object of intensive research and development, many middleware systems are deployed, providing features that suit different user groups.

Successful Grid deployment attracts also resource providers, who are interested in providing new services and increasing both the number of users and their satisfaction. However, they face a very difficult question of selecting the "right" Grid, that would be adequate for majority of their users and at least acceptable for the remaining ones. As the production Grid environments like the EGEE have very strict requirements on the installed operating system, libraries and the general system environment, it is very difficult to merge this strict "no-choice" condition with the richness of requirements of scientific communities.

Apart from the differences in middleware (and implied operating system requirements), the user communities also differ in their expectations of the major Grid benefits. For the "founding fathers" (esp. the high energy physics community) Grids are a place to store, share, and process enormous amounts of data by rather simple (i.e., not highly parallel) jobs. On the other hand, new users need support for large MPI jobs, require a fast turn-around for short jobs or are looking to use Grids as a fabric to run services whose use varies over time, i.e., the physical resources are not efficiently used but this is a price for having a low reaction time when the service is actually called. The best effort service, treating jobs as having equal priority, is also not always sufficient and different priority schemes are required by end users.

As the result of all these aspects, the current Grid production environments are too restrictive for many potential users and the users are not motivated enough to "climb the wall", although otherwise the benefit of sharing computing resources and data among their collaborators is very attractive.

A potential remedy is to virtualize the Grid environment. Through this process users will get the illusion that they have access to a Grid which is optimized to suit their particular needs. Precise versions and flavors of operating system, libraries, middleware and application can be deployed with virtual Grids without any unexpected interference with environments (other virtual Grids) deployed for other user groups (or even for the same group but for different application or its version). Building virtual Grids over Virtual Machines (VM) [RG05] provides additional benefits to this concept. The virtual machine provides almost ideal encapsulation of the whole operating system and its components, including the Grid middleware. It can also be optimized to serve a particular application (e.g., setting specific buffer sizes, using non-standard libraries etc.).

The encapsulation provided by the virtual machine makes it rather easy to offer additional services. It is reasonably easy to dynamically change the basic physical resources (CPU, memory) allocated to the virtual machine. The virtual machine can be easily checkpointed, it can be migrated to another physical resource, the image stored for later inspection or re-run. Virtual machines can be preempted, in a uniform way and without additional complexity due to applications differing needs. While all these properties are best used on a single machine (i.e., with parallelism limited to a single machine), the support for large parallel jobs is not excluded.

Deploying virtual Grids, running on a low level physical fabrics, requires new scheduling strategies and tools. Several virtual machines can run concurrently on a single physical machine, the resources allocated to individual virtual machines change in time, virtual machines may be checkpointed. All these new features must be understood and taken care of by the Grid scheduling system.

To serve these needs, we have developed a system called Magrathea to allow Grid job scheduling systems to deal with several virtual machines running on a single computer and to submit correctly jobs into those VMs. Magrathea is deployed in production environment on computational nodes of METACenter, which provides a computational infrastructure for various groups of users with specific requirements and its resources are also provided for European grid infrastructure EGEE.

3   Magrathea System

Scheduling in a virtual Grid environment depends on the way the virtualization is understood. The simplest approach is to replace "job" with "virtual machine", attach each job to its virtual machine and schedule whole virtual machines in the same way as jobs. While simple and not requiring any complex modifications to the scheduling system, this approach is also very limited in using new features provided by virtual Grids. When deployed, it can also have a substantial negative effect on the efficiency of the resource use. Starting a job equals to booting a virtual machine (part of the overhead can be mitigated using hibernated images, but still the startup latency my be rather high). We decided to follow a different way, where the scheduler is at least partially aware of the more complex environment of the virtual Grid and is therefore able to deal with several virtual machines sharing the same physical machine, with virtual machines that has been suspended, etc. The design requirements we considered are presented in the next section, followed by the Magrathea architecture description.

3.1   Design Requirements

When designing the Magrathea system we started with the following set of basic requirements:

To get better understanding of the relationship between Magrathea and the resource management system we also devised three complementary use scenarios that must be supported by the Magrathea system:

  1. Exclusive use of the physical resource by one virtual machine at a time while supporting concurrent active wait of several virtual machines on the same resource.

  2. Sharing one physical machine between several virtual machines running concurrently and assigning of resources (CPUs, memory) to virtual machines according requirements of jobs running in these virtual machines.

  3. Support for preemption of virtual machines, eventually extended with suspension and migration to different physical machine.

To represent different states of virtual machines, we introduced the Magrathea status of virtual node. These new states are used to extend the view of Grid as used by the resource management system (PBS Pro in our case) for decisions made by the scheduler. The states are reported to the resource management system directly by the concurrently running virtual machines. However, the third scenario presented above requires further adaptation, as the checkpointed virtual machine is not active and could not by itself report its state. Thus, it is the responsibility of Magrathea extensions to keep track of these virtual machines and to activate them when necessary.

3.2   Architecture

Magrathea system consists of three main components: master process representing physical machines, slave processes running in each virtual machine and optional cache process, storing information about status of all virtual machines running on a cluster. Architecture of Magrathea and interaction with resource management system and virtual machine monitor is depicted in Figure.

[Figure]

Figure 1: The architecture of Magrathea

In the simplest use case scenario, when single-node job is started on a free virtual machine, communication between resource management system and Magrathea is as follows:

More complex scenarios are described in the next section.

The master process, running in the supervising virtual machine, is responsible for the management of virtual machines, their status recomputations, and assignment of hardware resources to virtual machines. The master is also responsible for reporting status information about all virtual machines to the cache process. To achieve independence on a particular virtual machine implementation, the master provides an interface to a virtual machine monitor so that specific actions can be performed to activate and deactivate virtual machines, change resources dedicated to specific virtual machine etc. In the current implementation, master supports Xen [Bar03] and VServer [Sol07] virtualization systems.

Magrathea slave process has three main tasks:

Cache service stores status information of virtual machines and related data in a central database. This component is optional, it is not required when resource management system is able to get Magrathea status information directly from nodes. However, polling all worker nodes may slowdown resource management system, therefore status cache may be used to improve performance and scalability. We found that use of the status cache as the primary information source for the PBS Pro scheduler not only for the Magrathea status, but for other used metrics (otherwise obtained by polling either the PBS Mom processes on nodes or other information services) improves the overall responsiveness of the PBS Pro system. In the current implementation, information about actual disk space usage and memory usage is pushed by sensors from cluster nodes to the cache and this information is later used by scheduler, too. We have also extended cache service to be able to aquire (poll) information from Ganglia [Mas04] and PBS. Cache processes can also form hierarchy of information services, cache can be configured to push stored information to upper-level cache or acquire information from other caches by polling.

3.3   Two Static Domains

In the first proposed scenario, several virtual machines (domains) are deployed on one physical machine, but only one is allowed to run jobs in any particular time. This active domain is provided with almost all resources (CPU and memory in current implementation), while all remaining domains are also running, but with minimal resources. Inactive domains are provided with minimal percentage of CPU time, but they still behave like live domains for resource management system - they send monitoring information to the scheduler.

The individual states of each domain and their transition is depicted in Figure for the case of two domains sharing one physical resource. When the node is initiated, all domains start in the free state. In this state, the virtual machine is able to accept a job, changing its state to running. When one virtual machine becomes running, all the resources are allocated to it and the state of all remaining domains is changed to occupied. If the running domain does not need all the resources (e.g., it requires only two cores on a four core machine), other jobs can be send to the same domain (virtual machine). When all jobs that run in a particular domain finish, all domains become free again.

[Figure]

Figure 2: States and transitions between them for statically deployed virtual machines.

In the current deployment on METACenter, sharing of worker nodes between EGEE and METACenter is implemented using this setup. On each worker node, two virtual machines providing METACenter and EGEE environments are installed and running. In both virtual machines, standard PBS Mom (PBS monitoring daemon, which is responsible for job startup and monitoring, but also for reporting node status to server) is running. While the same architecture, number of CPUs etc. is published into PBS server, different properties describing different installed environments are published. This way, users may choose during submission whether they need nodes with EGEE or METACenter environment. According to the user specified properties, jobs are routed to appropriate EGEE or METACenter queues. It is possible to distribute the cluster un-evenly between the two environments using limits set on queues.

Some modifications to the PBS Pro setup have been necessary to support this scenario:

In the current setup, both EGEE and METACenter jobs are served by one PBS Pro instance. There is no requirement to use this configuration and the virtual nodes could be served by different PBS Pro installations (e.g., to increase a robustness or performance in a large Grid).

4   Complex Use Scenarios

In order to support the more complex use cases described in the previous section, the Magrathea system must be extended to cope with the increased complexity. The necessary modification and extensions are discussed in this section.

4.1   Preemptible Domains

The second use case discussed involves preemption support. Again, two domains are running on one node. While the first domain is the standard METACenter node, the second virtual domain is dedicated to parallel jobs. When parallel job is submitted to the second, privileged, domain, the first domain is preempted. The preemption is supported while the first, unprivileged domain is still running, but stripped of most resources and almost all resources are given to the privileged domain. However, the first domain remains alive, jobs are still visible as running for PBS monitoring and PBS is not going to resubmit or cancel such jobs.

To support this behavior, three more states were added to those introduced in the previous section: occupied-would-preempt, running-preemptible and preempted, see Figure.

[Figure]

Figure 3: States and transitions between them for preemptible virtual machines.

In the current implementation, only single-node jobs are considered as preemptible (technically there is no need for such limitation, but the scheduler had to be changed much more extensively to support this behavior correctly). When a non-preemptible job is running in normal domain, its status is running and high-priority domain is in occupied state and no jobs can be submitted to this domain. However, if a preemptible job is running in a normal domain, its state is running-preemptible and status of privileged domain is occupied-would-preempt. In this case, when a job is started in the privileged domain, status of the normal domain is changed to preempted.

To avoid starvation of preemptible jobs, Magrathea status contains not only status information, but also length of preemption, which is a number of seconds jobs were preempted aggregated for each virtual node. This length of preemptions is used when scheduler selects new domain to be preempted.

While in the previous case no modification of PBS was necessary, in this case PBS scheduler must be changed:

4.2   More Than Two Running Domains

Previous two use-cases can be combined, leading to the scenario where more than two domains are ready to run jobs and a subset of these domains can preempt remaining domains. With the limitation that at most one non-privileged domain can run jobs and at most one preempting domain can be active, there is no need for further modifications. In such a case, all high-priority VMs are marked as free or occupied-would-preempt as long as none of them is running any job. When a job which is allowed to preempt other jobs arrives to a high-priority VM, the state of the virtual machine which has been marked as running-preemptible (if there is such a VM) is changed to preempted. States of other high-priority VMs are turned to occupied so that no other job is allowed to be submitted on the particular worker node. When the privileged job finishes, the preempted virtual machine is returned back into running-preemptible and all high-priority VMs are marked as occupied-would-preempt.

To allow several virtual machines running jobs at the same time, Magrathea has been enhanced to support CPU counting. This setup is used on our 16 core machines, when it is not suitable to dedicate whole physical machine to one virtual machine only. During job startup, the slave reports to the master number of CPUs used by this job. Master can recompute states of all virtual machines, together with counters of CPUs used by each job in all virtual machines and also number of free, not allocated, CPUs. Magrathea status contains not only state information and length of preemption, but also number of CPUs allocated for this domain and number of free CPUs available for new jobs submitted to this domain. PBS scheduler must be modified to use this number of free CPUs instead of the one reported by PBS Mom.

This setup can be combined with preemption scenario, where subset of domains is marked as high-priority domains, which are able to preempt standard domains. Each CPU can be either free, used by a running job, or available only for high-priority domains as it was occupied by a virtual machine which has been preempted. When a job is submitted into a virtual machine (either normal or high-priority), CPUs required by the job are taken from a set of free CPUs. If the number of free CPUs is not large enough to satisfy the job and the job was submitted into a high-priority virtual machine, a master daemon tries to use CPUs which were occupied by preempted virtual machines. Only when those CPUs cannot satisfy the job, another virtual machine is preempted. In other words, normal and high-priority virtual machines can be running jobs at the same time as long as high-priority jobs do not require all CPUs of a particular node. When a job finishes, the master daemon tries to find and resume a preempted virtual machine which would be satisfied with the CPUs that are no longer occupied by the job. Thus CPUs are marked as free only when no virtual machine could make use of them.

In this case we use only CPU counting as memory requirements are hard to obtain before a job is actually started. Because of this, CPU counting is really useful only for virtual machine monitors which support dynamic sharing of memory between virtual machines, such as VServer. Using Xen would require static partitioning of physical memory among all running virtual machines.

A set of states is the same as in the previous case, i.e., free, running, running-preemptible, occupied, occupied-would-preempt, and preempted. Normal virtual machines are free when at least one CPU is free, otherwise they are occupied. High-priority VMs are free only when at least one CPU is free and no virtual machine which can be preempted is running. If a preemptible virtual machine is running, all high-priority VMs are in occupied-would-preempt state to stress the possibility that submitting a job into such VM may result in preempting another VM. A virtual machine is occupied when no CPU (either free or freed by preemption) is available for this VM.

Because set of states is identical to the previous use-case, the only change in the PBS setup is a modification of PBS scheduler, which must use number of free CPUs from Magrathea status.

4.3   Frozen Services

The last use-case described in this report deals with suspended virtual domains in Xen. This is a case of services, started by user, running for short time and then suspended by user request. When service is later needed, this domain can be repeatedly resumed for a short time to perform a high-priority computation.

Ability to suspend virtual machine adds one more state: frozen. In the current implementation, jobs which can be suspended are submitted to domains which behave similarly to high-priority domains in preemption scenario. When service domain is suspended (frozen), preemptible jobs can be submitted to the normal domain, but when frozen domain becomes active again (is resumed), normal domain is preempted (Figure).

[Figure]

Figure 4: States and transitions between them for frozen virtual machines.

Magrathea has to be extended to support suspend/resume commands. Suspend command can be initiated either by the owner of job or by the administrator. Similarly, resume command can be issued by the user of suspended job or by the administrator. Both commands are interpreted by master daemon. Proper authorization has to be finished yet. Currently we support only limited authorization, when only job owner or the administrator can suspend or resume a domain, and only if one job is running in this domain.

On the PBS side, support for frozen domains require larger adaptation of PBS when comparing with previous examples, because in this case also PBS server must be modified. In opposite to all other types of domains, suspended domains are not accessible and PBS Mom cannot response to monitoring requests from PBS server. Therefore we had to modify PBS server to check Magrathea status of domains too and in case of suspended domains, monitoring of these domains is deactivated.

5   Related Work

Interesting features of virtual machines inspired several projects with aims similar to Magrathea. Motivation for a project of physicists in Karlsruhe [Büg06], sharing cluster between groups of users with different requirements, is very alike to our first use. Although the initial motivation was the same, their approach is different; they developed standalone service, managing jobs and nodes of cluster. If job is planned to be started, new virtual machine corresponding to jobs is created by this service. This approach is difficult to implement as it ends up with reimplementing most parts of batch queuing system within the daemon.

Integration of Xen virtual machine monitor, Moab scheduler, and Torque resource management system was described in DVC [Eme06]. With help of Moab developers authors managed to provide transparent creation of clusters of virtual machines. Virtual clusters are also allowed to span multiple physical clusters by borrowing required resources from resource managers of the other clusters. Main difference between DVC and Magrathea approach is a level of integration. DVC is tightly integrated with Moab/Torque and it also integrates image management system so that each virtual machine is started from an image required by a job for which the virtual machine is being created. As virtual machines in DVC are created and destroyed dynamically, a lot of work had to be done to assure correct registration of new resources at all relevant parts of the system. Results based of this approach were demonstrated also in a presentation [Che07], where different Xen based virtual machines were used to run ATLAS (grid) and WestGrid (local) jobs, while parallel MPI jobs were running in non-virtualized environment.

In our approach, Magrathea does not cover management of system images for virtual machines and we do not expect a batch system to do this job either. Instead, existing systems, such as Virtual Workspaces [KDF04] or In-VIGO [Ada05], can be used for that purpose. This separation not only simplifies the design of Magrathea but also makes sharing resources among several batch systems easier as it reduces the number of changes to those batch systems.

6   Conclusions and Future Work

In this report, we have described system Magrathea, which allows us to run virtual nodes on a cluster. Nodes are managed by slightly modified PBS Pro. Magrathea provides possibility to run different Linux flavors on one cluster node and switch between them dynamically, gives us possibility to preempt sequential jobs and therefore improves support for large parallel jobs on our cluster. We have also described two new extensions, providing ability to run several active domains concurrently on VServer enabled nodes and ability to suspend jobs on Xen enabled clusters. Magrathea is deployed in production environment on computational nodes of METACenter. First two described scenarios are used in production, providing sharing worker nodes between METACenter and EGEE and improved support for parallel jobs. Next two scenarios will be deployed in production after current implementation is verified by more extensive tests.

The overhead of Magrathea can be observed when jobs are started or stopped. We have made some measurements on our production cluster using Xen virtual machine monitor and the overhead of Magrathea showed to be negligible concerning the time consumed by a job itself. When a job is being started, Magrathea needs about 1.2 seconds to verify the request and to assign memory and CPU power to a particular virtual machine. When another virtual machine was previously running, memory must first be removed from it. This takes additional time, which depends on the amount of data that has to be swapped out to free occupied memory. In case no job is running in the second virtual machine, it takes approx. 2.6 seconds to reassign the memory. Virtual machines which are not running any jobs get only a small amount of CPU power, so that running jobs are not slowed down. However, reduced CPU power does not have serious impact on responsiveness of idle virtual machines. They are still able to answer requests from PBS server. We have measured this using a standard ping tool. Round-trip-time varies more for idle virtual machine with restricted access to CPU power, but it is still negligible. The values are (minimum RTT, average RTT, maximum RTT, standard deviation) 0.3 ms, 0.5 ms, 0.7 ms, 0.1 ms for running virtual machine and 0.2 ms, 6.0 ms, 87.4 ms, 15.5 ms for idle VM.

The implementation is general enough to support both VServer and Xen virtual machine implementations. Even if Magrathea is currently supported only by our PBS Pro modifications, we believe that at least first three use cases could be easily integrated with other resource management systems too.

In future work, we would like to investigate migration of virtual domains and enhance Magrathea to support such use-case. We also started cooperation with researchers in scheduling area to develop new scheduler, which will be able to utilize benefits provided by virtualization, especially preemption and migration.

7   Acknowledgments

This project has been supported by research intents "Optical Network of National Research and Its New Applications" and "Parallel and Distributed Systems" (MŠM 6383917201, MŠM 0021622419).

References

[Eme06] Emeneker W. et al.: Dynamic Virtual Clustering with Xen and Moab. In Frontiers of High Performance Computing and Networking - ISPA 2006 Workshops, Springer-Verlag LNCS 4331, 2006.
[KDF04] Keahey K., Doering K., Foster I.: From Sandbox to Playground: Dynamic Virtual Environments in the Grid. In 5th International Workshop in Grid Computing (Grid 2004), 2004.
[Ada05] Adabala S. et al.: From virtualized resources to virtual computing Grids: The In-VIGO system. Future Generation Computer Systems 21, 2005.
[Bar03] Barham P. et al.: Xen and the Art of Virtualization. In ACM Symposium on Operating Systems Principles (SOSP), October 2003.
[Sol07] Soltesz S. et al.: Container-based Operating System Virtualization: A Scalable, High-performance Alternative to Hypervisors. April 2007. Available online.
[Che07] Chechelnitskiy S.: Running CE and SE in a Xen-virtualized environment. CHEP 2007 conference, 2007.
[Mas04] Massie M. L. et al.: The ganglia distributed monitoring system: design, implementation, and experience. In Parallel Computing. 2004, vol. 30, no. 7, p. 817-840.
[Büg06] Büge V. et al.: Virtualizing a Batch Queuing System at a University Grid Center. In Frontiers of High Performance Computing and Networking - ISPA 2006 Workshops, Springer-Verlag LNCS 4331, 2006.
[RG05] Rosenblum M., Garfinkel T.: Virtual Machine Monitors: Current Technology and Future Trends. Computer. 2005, vol. 38, no. 5, p. 3947.
další weby:fond rozvojemetacentrumCzechLightpřenosyvideoservereduroameduID.cz