MetaCentrum Virtualization – Use Cases

CESNET technical report 30/2010

Jiří Sitera, Miroslav Ruda, Petr Holub, David Antoš, Luděk Matyska

Received 6. 12. 2010


Abstract

In this report, we study various use-cases related to the possibilities of virtual cluster deployment. First, a short summary of already implemented use-cases is provided, as well as current and/or planned development. The second part contains a more detailed description of two use-cases identified as most important for MetaCentrum day-to-day operations and selected for further development in 2010.

Keywords: cloud, grid, virtualization, NGI

1  Introduction

Common grid job scheduling systems are not ready to manage virtual machines deployed on a computing infrastructure. To support scheduling of virtual machines on MetaCentrum clusters, we have developed the Magrathea system. It allows scheduling systems to deal with several virtual machines deployed on a single physical cluster node, to build them according to user or system requirements, and to schedule jobs to such virtual machines with minimal modification of the batch system. The original architecture of the system is described in [1]. We have extended the system to support management of virtual clusters (groups of virtual machines); see, e.g., [3]. Report [3] also introduced a new service called Booot, which the Magrathea system uses for online installation of virtual nodes from images and node metadata stored in a newly created image repository. Paper [4] introduced the SBF service, which is capable of managing VLANs across the CESNET network, together with the possibility of encapsulating a virtual cluster in such a dynamically created VLAN. The infrastructure is thus ready to build clusters of virtual machines out of user-supplied operating system images and to handle the clusters in a way very similar to dealing with ordinary jobs.

While the set of use-cases considered when the system was originally designed was quite limited (mostly based on the idea of seamlessly connecting a cluster hosted on the grid to the user’s own resources), we have realised that the techniques developed can be applied in many other interesting usage scenarios.

In this report, we study various use-cases related to the possibilities of virtual cluster deployment. First, a short summary of already implemented (i.e., mainly the original) use-cases is provided, as well as current and/or planned development. In the second part, a more detailed description of several new use-cases is given. Those use-cases extend the idea of a virtual cluster connected to the local network environment and advance the virtualization effort to a stage where it can be used to develop new system images that can be deployed across the whole MetaCentrum infrastructure, allowing users to run strongly customised environments.

2  Original Use Cases

In recent years, we have implemented the following usage scenarios of the virtual cluster system in MetaCentrum.

Two static domains. Two virtual machines run on one physical machine, and only one of them is allowed to run a user job at a time. The active domain is given most of the available hardware resources (CPUs, memory), while the other domain runs with minimal resources, just to stay visible to the scheduling system. This setup can be used for dynamic switching between two different Linux flavours with minimal modification of the batch system. The batch system only has to handle correctly the situation in which a virtual machine is usable according to batch system monitoring but not available for job submission, because the physical machine is occupied by the second virtual machine.

The virtual machines can be statically installed on the physical substrate. A new extension supporting more than two virtual machines on one physical machine is planned for development in the next year. Moreover, current Magrathea versions are capable of installing virtual machines on the fly, so virtual machine images can also be injected into the physical resources on demand.
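The availability rule of the two-domain setup can be illustrated with a small model. This is only a sketch under stated assumptions and not Magrathea code: the class and method names, the single-job-per-domain simplification, and the rule that an inactive domain keeps a fixed minimal CPU share are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Domain:
    """One virtual machine (domain) as seen by the batch system."""
    name: str
    running_job: Optional[str] = None


class PhysicalNode:
    """Hypothetical model of a physical node hosting two static domains."""

    def __init__(self, first: str, second: str, total_cpus: int, idle_cpus: int = 1) -> None:
        if first == second:
            raise ValueError("the two domains need distinct names")
        self.domains: Dict[str, Domain] = {n: Domain(n) for n in (first, second)}
        self.total_cpus = total_cpus
        self.idle_cpus = idle_cpus  # assumed minimal share of an inactive domain

    def _sibling(self, name: str) -> Domain:
        (other,) = [d for n, d in self.domains.items() if n != name]
        return other

    def available(self, name: str) -> bool:
        # Both domains stay visible to monitoring, but a domain accepts jobs
        # only while the other domain does not occupy the physical machine.
        return self._sibling(name).running_job is None

    def start_job(self, name: str, job_id: str) -> None:
        if not self.available(name):
            raise RuntimeError(f"{name}: physical machine occupied by the other domain")
        if self.domains[name].running_job is not None:
            raise RuntimeError(f"{name}: already runs a job (one job per domain assumed)")
        self.domains[name].running_job = job_id

    def cpus_for(self, name: str) -> int:
        # The active domain gets most CPUs; an inactive one keeps the minimum.
        if self.domains[name].running_job is not None:
            return self.total_cpus - self.idle_cpus
        return self.idle_cpus
```

With a node created as `PhysicalNode("flavour-a", "flavour-b", total_cpus=8)`, starting a job in one domain makes `available()` return False for the other one, which is the case the batch system has to treat as "usable but not available for submission".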

Preemption. Preemption is an extension of the first scenario. One domain is set as “privileged” and can preempt a job running in the second, “ordinary” domain. This setup is used in MetaCentrum to provide privileged access for cluster owners. Many cluster owners allow the cluster to be used by other infrastructure users on the condition that the owners have priority to run their jobs on the hardware.

In this setup, jobs submitted by the cluster owners can be started in the privileged domain on the cluster with only a negligible delay while the domains are rescheduled. Jobs in the ordinary domain are suspended while the owners’ jobs run and are resumed after the privileged job has finished.
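The suspend and resume cycle described above can be written down as a tiny state machine. The sketch below is illustrative only; the names `JobState` and `PreemptibleNode` are hypothetical, and it tracks at most one job per domain.

```python
from enum import Enum
from typing import Optional


class JobState(Enum):
    RUNNING = "running"
    SUSPENDED = "suspended"
    FINISHED = "finished"


class PreemptibleNode:
    """Hypothetical model of a node with an ordinary and a privileged domain."""

    def __init__(self) -> None:
        self.ordinary: Optional[JobState] = None
        self.privileged: Optional[JobState] = None

    def start_ordinary(self) -> None:
        # An ordinary job cannot start while the owners' job holds the node.
        if self.privileged is JobState.RUNNING:
            raise RuntimeError("node is occupied by a privileged job")
        self.ordinary = JobState.RUNNING

    def start_privileged(self) -> None:
        # Preemption: a running ordinary job is suspended, not killed.
        if self.ordinary is JobState.RUNNING:
            self.ordinary = JobState.SUSPENDED
        self.privileged = JobState.RUNNING

    def finish_privileged(self) -> None:
        # When the privileged job ends, the suspended job is resumed.
        self.privileged = JobState.FINISHED
        if self.ordinary is JobState.SUSPENDED:
            self.ordinary = JobState.RUNNING
```

The model deliberately keeps the ordinary job's state instead of discarding it, which mirrors the point that preempted jobs continue after the owners' job has finished.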

A similar setup is also used for large parallel jobs that require co-allocation of several cluster nodes. It mitigates the difficulty of scheduling jobs that are strongly heterogeneous in the number of required cluster nodes or CPU cores. When the infrastructure is utilised to the limit, suspending jobs that block several cluster nodes creates an opportunity for a large parallel job to run.

Virtual cluster. A cluster consisting of nodes on virtual machines, built from node images stored in a central image repository. The cluster is built using the standard batch system interface, with only minimal differences from standard job submission: from the scheduling point of view, the cluster is an ordinary job. Virtual cluster nodes are built on request, according to user requirements on node properties (which may include, e.g., the Linux flavour). Several usage scenarios are supported. In the first scenario, cluster nodes are based on the standard MetaCentrum node image; the cluster then serves as a reservation of nodes for a user or user group, and jobs submitted later to the cluster are managed by the central batch system. In the second scenario, cluster nodes are built from non-default images (which are nevertheless supported by MetaCentrum), providing just the possibility of running sets of jobs or a parallel job on a different cluster image. In the last scenario, the node image is supplied by the user; the OS installed on the cluster nodes is independent of the management tools provided by MetaCentrum, and job management on these nodes is completely under user control.

Private cluster. A virtual cluster encapsulated in a private VLAN. Two basic motivations were identified in [3]: (1) separating user-supplied images from the CESNET IP network, so that the infrastructure provider does not take responsibility for user-supplied images in its own network, and (2) hiding network complexity: two clusters separated by national networks can be connected by a single virtual layer 2 network, which provides the clusters with a logical local network. In the first scenario, the cluster consists of user-supplied images, potentially managed only by the user (i.e., insecure from the infrastructure administration point of view). Such a cluster is encapsulated in a private VLAN and enriched with an additional node that serves as a VPN access point. The user can propagate the cluster under his/her own address space (thus taking responsibility for its network traffic). In the second scenario, nodes of private clusters serve as an extension of the user’s local network environment, conforming to local addressing and naming schemes.

3  Current Developments in Virtual Cluster Support

The virtualization use-cases described in the previous section were supported in a PBSPro installation, which has served as the central batch system for MetaCentrum in recent years. In 2010, we started switching from PBSPro to the Torque batch system. All MetaCentrum-specific functionality has been ported from PBSPro to Torque, all scenarios originally supported by PBSPro are now implemented in Torque, and all new development described in this section is already based on the Torque installation.

3.1  Transparent Expansion of a Private Cluster

It is very common that MetaCentrum users have their own capacities to run their computations and solve common tasks. Such users use MetaCentrum only from time to time, for exceptionally resource-demanding projects.

3.1.1  User Environment and Requirements

Users run their own workstations or small clusters shared with their local groups. This use case is aimed at the latter: a user running a local cluster, an environment typically managed by a local administrator and strongly customized to the users’ needs.

It is very inconvenient for the users that usage of their own resources differs (often substantially) from usage of the MetaCentrum infrastructure. The MetaCentrum goal here is to enable users to include MetaCentrum resources in the pool of their own machines so that the whole infrastructure is managed in the same way, allowing them to use their local job scheduling system, their local storage, etc. MetaCentrum resources would be used on demand, to add computing power during peaks in demand.

Let us study the specifics of users’ own systems in relation to the standard MetaCentrum modus operandi.

3.1.2  The Architecture

The solution is based on our previously published work [3] and its current development. The architecture contains the following services/components:

3.1.3  Adding Grid Resources to the Local Cluster

In order to add a variable amount of resources to the local cluster, the administrator performs the following steps.

3.1.4  A Practical Example

We tested this use case scenario on a departmental cluster located in Pilsen, owned and managed by the New Technology Center (NTC) department of the University of West Bohemia (UWB). The center perfectly fits the use case described above. It has its own cluster (12 nodes) connected to a dedicated local IP segment behind a firewall. The cluster uses a local NFS filesystem and a non-scalable security and system configuration (a small number of users with very low fluctuation). The cluster is administered by one of the users.

User Image. The node setup is suitable for direct transition into a virtual cluster image: no local data resource is used except the scratch area, and all data, including software and home directories, are on NFS. The user image for virtual machines was created as a one-to-one copy of the real node, with added configuration and scripts that automate basic first-boot configuration (getting an IP address and hostname, setting up scratch). The system is based on Debian Linux, and its configuration was easily changed to use Xen paravirtualization.

Network Setup. By using a private virtual cluster directly connected to the departmental network, we meet the requirement that virtual machines must follow the local security policy. From that point of view, we just extended the local VLAN to the virtual nodes. Virtual nodes are connected at Layer 3 as new departmental machines for the purposes of routing, IP addressing, DNS, and other services like DHCP.

[Image]

Figure 1. NTC use case network setup

Technically, connecting the MetaCentrum VLAN dedicated to the virtual cluster with the UWB VLAN assigned to the NTC cluster required a non-trivial network setup. We plan to automate this setup procedure by designing and implementing new features of the MetaCentrum network manager SBF [4].

The machines hosting the virtual nodes were physically located at the Brno MetaCentrum site. This poses two issues:

User’s Point of View. Setting up new virtual nodes provided by MetaCentrum follows the same procedure as adding new real nodes. The requirements of this use case were met: users can access MetaCentrum resources transparently.

Currently, the local administrator must control allocation of resources (adding and removing virtual nodes). There is no reason why the local batch system cannot automate this task. Some batch systems (including the Sun Grid Engine used in the NTC) have built-in support for this feature and the MetaCentrum virtual cluster service should consider the possibility to provide this functionality in the future.

The setup described above also revealed some obstacles we had to overcome.

We solved the problem of IP address assignment for newly created domains with a special DHCP setup in which we reserved a pool of IP addresses for the NTC cluster extension. Each domain identifies itself to the DHCP server through the dhcp-client-identifier option, using a fixed prefix and a random suffix. The prefix identifies the pool; the suffix ensures IP address uniqueness.
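The identifier scheme can be sketched as follows. This is a minimal illustration, not the script used in the deployment: the prefix value, the suffix length, and the function names are assumptions, and the generated line uses the `send` statement syntax of ISC dhclient.conf, which may differ from the client actually used on the nodes.

```python
import secrets

DHCP_OPTION = "dhcp-client-identifier"


def make_client_identifier(pool_prefix: str, suffix_bytes: int = 4) -> str:
    """Build an identifier from a fixed pool prefix and a random hex suffix."""
    if not pool_prefix:
        raise ValueError("pool prefix must not be empty")
    # The random suffix makes collisions between domains very unlikely.
    return pool_prefix + secrets.token_hex(suffix_bytes)


def belongs_to_pool(identifier: str, pool_prefix: str) -> bool:
    """Server-side view: the prefix alone selects the reserved address pool."""
    return identifier.startswith(pool_prefix)


def dhclient_send_line(identifier: str) -> str:
    """Render the client configuration line (ISC dhclient.conf syntax assumed)."""
    return f'send {DHCP_OPTION} "{identifier}";'
```

A first-boot script could call `make_client_identifier("ntc-ext-")` (the prefix is a made-up example) and write the result of `dhclient_send_line()` into the client configuration before requesting a lease.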

Another issue concerns ssh host keys: login problems caused by references to known hosts stored in .ssh/known_hosts files must be prevented. Because the hostname of a newly created domain is not known during domain installation, the only alternative would be to set up the keys during domain boot, once the IP address has been assigned by the DHCP server. We simply decided to use one ssh key pair for all new virtual domains of this cluster.

User accounts are stored in installation images, therefore no dynamic creation is needed.

3.2  Custom Virtual Cluster for Video Processing

In the second use case, we focused on a particular application group working mainly in the video processing business (Kit Digital, KitD). The solution is built on top of the previously described components and services (a private virtual cluster connected to the users’ own network), and its most interesting aspects are:

[Image]

Figure 2. KitD use case network setup

The networking part of this use case led to a prototype of new functionality in SBF, the network configuration system. We plan to support client ports as a standard feature in the next version of SBF and to implement them in a more generic manner than in the prototype. A client port should have manageable attributes such as port type (tagged/untagged) and encapsulation type.
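One possible shape of such a client port description is sketched below. It is purely illustrative and does not reflect the SBF data model: the names `PortType`, `ClientPort`, and `parse_client_port` are hypothetical, and the encapsulation type is kept as a free-form string because the report does not enumerate the supported values.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Mapping


class PortType(Enum):
    TAGGED = "tagged"
    UNTAGGED = "untagged"


@dataclass(frozen=True)
class ClientPort:
    """Hypothetical record of the manageable attributes of a client port."""
    name: str
    port_type: PortType
    encapsulation: str  # free-form; the set of valid values is not specified


def parse_client_port(raw: Mapping[str, str]) -> ClientPort:
    """Validate a plain mapping and turn it into a ClientPort."""
    encapsulation = raw["encapsulation"].strip()
    if not encapsulation:
        raise ValueError("encapsulation type must not be empty")
    # PortType(...) raises ValueError for anything but tagged/untagged.
    port_type = PortType(raw["port_type"].strip().lower())
    return ClientPort(name=raw["name"], port_type=port_type, encapsulation=encapsulation)
```

Keeping the port type as an enumeration rejects unknown values early, while the open string for the encapsulation leaves room for whatever types a generic implementation ends up supporting.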

4  Future Work

In recent months, several use cases have appeared that should simplify day-to-day management of MetaCentrum clusters. Their implementation is left as future work.

4.1  User Selectable Environments for the Czech NGI

In the current setup, a virtual machine can be built on user or scheduler request only as part of a virtual cluster setup. This covers the scenario in which a single job requires a specific environment (i.e., a specific cluster node image); on the other hand, running a single-node cluster just to run a single job is very inconvenient.

To automate this case, we have prepared a simple script that provides this functionality on top of the virtual cluster setup while preserving the original PBS command line interface (the qsub command). The script creates a single-node cluster, submits the specified job to the node, and deletes the cluster when the job is finished. Although this solution addresses the original motivation, it is not generic enough for other anticipated scenarios. In the next phase, it will be incorporated into the batch system directly, to provide a better connection between the cluster representation and the job inside the cluster.
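The control flow of such a wrapper (create the cluster, submit, wait, always clean up) might look like the sketch below. It is not the actual script: the four callables are hypothetical stand-ins for the real batch system commands, whose exact options are not given in this report, and are passed in so that no command line syntax has to be assumed.

```python
from typing import Callable


def run_job_in_single_node_cluster(
    image: str,
    job_script: str,
    create_cluster: Callable[[str, int], str],
    submit_job: Callable[[str, str], str],
    wait_for_job: Callable[[str], int],
    delete_cluster: Callable[[str], None],
) -> int:
    """Run one job inside a temporary single-node virtual cluster.

    All four callables are hypothetical hooks:
      create_cluster(image, node_count) -> cluster id
      submit_job(cluster_id, job_script) -> job id
      wait_for_job(job_id)               -> exit status of the job
      delete_cluster(cluster_id)         -> None
    """
    cluster_id = create_cluster(image, 1)  # exactly one node
    try:
        job_id = submit_job(cluster_id, job_script)
        return wait_for_job(job_id)
    finally:
        # The cluster is removed even if submission or waiting fails,
        # so a failed job does not leave a reserved node behind.
        delete_cluster(cluster_id)
```

The `try`/`finally` is the main design choice here: it ties the lifetime of the cluster to the lifetime of the job, which is the coupling that a direct integration into the batch system would provide natively.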

4.2  Virtualization in Day-to-day Management of MetaCentrum Resources

The first scenario is motivated by tuning cluster images for a particular application domain. The image will be based on the node image provided by MetaCentrum, but modified to fulfil the requirements of one application domain or user group. In some cases, the modified image cannot be used by other groups (when the changes are not backward compatible) or could be less stable (e.g., when testing a new NFSv4 client). The new image could be stored in the image repository and used on the specific request of a particular user. In some cases the standard virtual cluster scenario can be used, but in most cases the virtual machine should be created on the fly only for that job. This requirement leads directly to the scenario described in the previous section.

The second scenario concerns updates of the standard cluster image. Currently, the default cluster image is preinstalled on all nodes, and each update of the cluster image must be coordinated and supervised by a cluster administrator. We will investigate the possibility of reusing the Booot functionality in this scenario, which could allow us to reinstall a cluster node before a new job is started on it.

5  Conclusion

This year, development in the area of virtual clusters was mainly oriented towards usability improvements, motivated by the needs of real user groups. In the next phase, we will investigate more closely the scenarios motivated by production group requirements described in the previous section. New development is also planned in the area of virtual machine management: the Magrathea system will be extended to support more than two concurrently running virtual machines. This, together with support for new hardware (especially large NUMA machines, which are currently not handled efficiently by Xen, and GPGPU-enabled machines), will require modification of the complete virtualization layer, while a possible installation of a parallel scratch system could allow easier support for migration of virtual machines.

References

[1] DENEMARK, J.; RUDA, M.; MATYSKA, L. Virtualizing METACenter Resources Using Magrathea. Technical Report 25/2007, Praha: CESNET, 2007.
[2] RUDA, M.; DENEMARK, J.; MATYSKA, L. Scheduling Virtual Grids: the Magrathea System. In VTDC’07: 3rd international workshop on Virtualization technology in distributed computing. Reno, USA, 2007.
[3] RUDA, M. et al. Virtual Clusters as a New Service of MetaCentrum, the Czech NGI. Technical Report 17/2009, Praha: CESNET, 2009.
[4] ANTOŠ, D.; MATYSKA, L.; HOLUB, P.; SITERA, J. VirtCloud: Virtualising Network for Grid Environments – First Experiences. In 23rd IEEE International Conference on Advanced Information Networking and Applications, AINA 2009, 26.06.2009. ISBN 978-0-7695-3638-5.
[5] NOVÁK, V.; ŠMRHA, P.; VERICH, J. Deployment of CESNET2+ E2E Services. Technical Report 18/2007, Praha: CESNET, 2007.
[6] RUDA, M. et al. Virtual Clusters as a New Service of MetaCentrum, the Czech NGI. In Cracow Grid Workshop, Krakow, Poland, October 12–14, 2009.