Operation of a Distiller System in a Distributed Environment - The DistillerGRID

CESNET technical report number 1/2007
also available in PDF, PostScript, and XML formats.

The report is also available in Czech.

Jan Havlíček, Michal Krsek
February 12th, 2007

1   Abstract

We use a distiller system to extract metadata from multimedia files available on the Internet. At the beginning of 2006, we had millions of URLs to process and could not process all of the content in time. We therefore developed a parallel system, DistillerGRID, which uses computers in university computer labs.

2   Current situation

Within the framework of the Virtual Environment for Cooperation project, we operate a distiller [Dol03] of metadata and still images from multimedia content. The distiller connects to a URL with multimedia content, downloads metadata and thumbnails, and saves both to a database.

The process is quite time-consuming: processing one entry takes tens of seconds on average (depending on the accessibility of the particular files). Ordinary parallelization within one operating system is not effective, because the technologies used do not allow different processes to share resources (e.g. a sound card) effectively.

However, parallelization across several computers is easy, because the processing of each URL is entirely independent of the processing of the other URLs. The original idea was to use university computer labs for parallel (more precisely, semi-parallel) data processing at times when they are not used for their original purpose, that is, at night, on weekends, or during holidays.

The aim of the work was to design an environment for operating the distiller, together with the required supporting software, so that it affects the software installation of the lab computers as little as possible. Another requirement was quick and easy distribution of the final solution to the computers, regardless of the operating system used and the management tools available on the end stations. A further problem to be solved was the collection of the gathered data.

3   Description of the environment

We worked according to the following assignment:

This assignment can be solved quite elegantly through virtualization, that is, by launching an independent installation of the operating system and the required software in a virtual environment created with products such as VMware or Microsoft Virtual PC.

Using virtualization software and installing the required programs in the virtual environment has quite a few advantages. We use the sandbox method, in which the sandbox can be easily replaced. That is why we can take advantage of the following:

4   Testing

At the beginning of the year, we carried out a set of test installations in the VMware and Microsoft Virtual PC environments with the available distiller version. We chose MS Windows XP Professional as the host system. The most important reason was that it is the platform on which the distiller software runs. We also took into account the broad experience with operating this OS at the University of Economics (UoE), where the experiments took place. Finally, the latest versions of players and other required software are available for MS Windows XP Professional.

During testing, it became evident that Microsoft Virtual PC is technically more suitable for our purposes, especially because of its trouble-free access to the sound hardware. We also took into consideration that Microsoft Virtual PC 2004 was released as freeware in the middle of 2006.

Data storage was solved by modifying the distiller so that the gathered data is inserted into an SQL database on a server. Thus only log files are created on the local machine, and those are relevant only for troubleshooting the distiller.
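The division of work between a central database and the stations can be illustrated with a minimal Python sketch. It is not the distiller's actual implementation: the table layout, the function names, and the placeholder fake_distill function are hypothetical, and an in-memory SQLite database stands in for the SQL server, whose product and schema this report does not describe.

```python
import sqlite3


def create_task_table(conn, urls):
    """Create a hypothetical task table and load the URLs to be processed."""
    conn.execute(
        "CREATE TABLE tasks ("
        " id INTEGER PRIMARY KEY,"
        " url TEXT NOT NULL,"
        " station TEXT,"     # name of the station that claimed the URL
        " metadata TEXT)"    # result written back by that station
    )
    conn.executemany("INSERT INTO tasks (url) VALUES (?)", [(u,) for u in urls])
    conn.commit()


def claim_next(conn, station):
    """Claim one unclaimed URL for this station; return (id, url) or None."""
    while True:
        row = conn.execute(
            "SELECT id, url FROM tasks WHERE station IS NULL ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None  # nothing left to claim
        with conn:  # commits the UPDATE as one transaction
            cur = conn.execute(
                "UPDATE tasks SET station = ? WHERE id = ? AND station IS NULL",
                (station, row[0]),
            )
        if cur.rowcount == 1:
            return row  # the claim succeeded
        # Otherwise another station claimed the row first; try the next one.


def store_result(conn, task_id, metadata):
    """Write the gathered metadata back to the central table."""
    with conn:
        conn.execute(
            "UPDATE tasks SET metadata = ? WHERE id = ?", (metadata, task_id)
        )


def run_station(conn, station, distill):
    """Process URLs until none are left; return how many were processed."""
    processed = 0
    while True:
        task = claim_next(conn, station)
        if task is None:
            return processed
        task_id, url = task
        store_result(conn, task_id, distill(url))
        processed += 1


def fake_distill(url):
    """Placeholder for the real distiller, which is not shown in this report."""
    return "metadata for " + url
```

Because every URL is claimed and processed independently, any number of stations could run the same loop against a shared server. The conditional UPDATE is one simple way to keep two stations from taking the same URL.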

5   Virtual machine environment

When installing the guest operating system in Virtual PC, we had to solve the following problems, which distinguish the installation from a typical operating system installation on a workstation:

Automatic logon is implemented through the AutoAdminLogon function in Windows XP and is set up by the following registry modification:

HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon\AutoAdminLogon = 1
HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon\DefaultPassword = NTpassword
HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon\Autologon = YES
HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon\AutoLogonCount = %number_of_logons%

In the last value, it is advisable to replace %number_of_logons% with a very large number (hundreds or thousands).
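The four values above could also be prepared as the text of a .reg file for import into the guest system. The following Python sketch only builds that text. The function name is hypothetical, and the value types are assumptions: AutoLogonCount is written as a DWORD and the other three values as strings. Writing the file and choosing its encoding are left out.

```python
# Registry key under which Windows keeps the logon settings listed above.
WINLOGON_KEY = (
    r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon"
)


def _escape(value):
    """Escape backslashes and double quotes for a .reg string value."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def build_autologon_reg(password, logon_count):
    """Return .reg file text that sets the four values listed in the report."""
    if logon_count < 0 or logon_count > 0xFFFFFFFF:
        raise ValueError("logon_count must fit into a 32-bit DWORD")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        "[" + WINLOGON_KEY + "]",
        '"AutoAdminLogon"="1"',
        '"DefaultPassword"="' + _escape(password) + '"',
        '"Autologon"="YES"',
        # Assumption: the counter is stored as a DWORD, written in hexadecimal.
        '"AutoLogonCount"=dword:{:08x}'.format(logon_count),
        "",
    ]
    return "\r\n".join(lines)
```

Note that DefaultPassword is stored in plain text. That is tolerable only because the guest system is a disposable sandbox.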

A name defined in the destilator.exe.config file is used to identify the distilling stations. Since the names are all the same at the beginning, the easiest solution is to overwrite this name with a random string at start-up. We used the following trivial Windows Script Host program for this purpose:

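' Replace the placeholder station name in the distiller configuration
' with a random string, so that every virtual station has a unique name.
Option Explicit
Const ForReading = 1
Const ForWriting = 2
Dim objFSO, objFile, strTempFile, strText, strNewText

Set objFSO = CreateObject("Scripting.FileSystemObject")
' Two generated temporary file names are concatenated into one random name.
strTempFile = objFSO.GetTempName & objFSO.GetTempName

' Read the template, which contains the placeholder NAHRAD.
Set objFile = objFSO.OpenTextFile("C:\destilator\destilator2\destilator.exe.config.orig", ForReading)
strText = objFile.ReadAll
objFile.Close

strNewText = Replace(strText, "NAHRAD", strTempFile)

' Write the active configuration; True creates the file if it is missing.
Set objFile = objFSO.OpenTextFile("C:\destilator\destilator2\destilator.exe.config", ForWriting, True)
objFile.Write strNewText
objFile.Close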
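The same start-up step can be sketched in Python, for example for testing the idea outside Windows Script Host. The function name and its parameters are hypothetical, and a random UUID is used as the station name instead of the concatenated temporary file names of the script above. Only the NAHRAD placeholder comes from the original script.

```python
import uuid
from pathlib import Path

PLACEHOLDER = "NAHRAD"  # placeholder string used in the .orig template


def write_station_config(template_path, output_path, station_name=None):
    """Copy the template to output_path with the placeholder replaced.

    Returns the station name that was written.
    """
    # Assumption: any sufficiently random string is acceptable as a name.
    name = station_name or uuid.uuid4().hex
    text = Path(template_path).read_text(encoding="utf-8")
    if PLACEHOLDER not in text:
        raise ValueError("template does not contain the placeholder")
    Path(output_path).write_text(text.replace(PLACEHOLDER, name), encoding="utf-8")
    return name
```

Raising an error when the placeholder is missing is a deliberate choice: a station that starts with an unchanged name would otherwise be indistinguishable from the others.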

6   Distribution on the stations

As the next step, we had to solve how to get the installation of Microsoft Virtual PC and the installed and configured virtual machine onto the target computers. In the UoE environment, the installation of Microsoft Virtual PC was distributed through Novell's ZENworks program, which is commonly used there. Any program for remote management of end stations can be used instead of ZENworks.

We also had to solve the distribution of the configured virtual stations. A virtual station consists of two files: one holds the definition of the station properties, and the other holds the emulated hard drive with the installed operating system and applications. The latter file is 2 GB in size for an installation of Windows XP Professional SP2 with updates and all required programs, even with virtual memory switched off.

We found that distributing a 2 GB file to approx. 40 stations in parallel takes hours; it took four hours in the test operation.
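To put the measured time into perspective, the figures above can be converted into an average transfer rate. The following Python sketch assumes that 2 GB means 2 × 10^9 bytes and that all 40 stations received the complete file within the four hours. Under these assumptions, the result is roughly 44 Mbit/s in total, or about 1.1 Mbit/s per station.

```python
def average_rate_mbit_s(file_bytes, stations, seconds):
    """Average rate in Mbit/s needed to deliver file_bytes to every station."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    total_bits = file_bytes * stations * 8
    return total_bits / seconds / 1e6


IMAGE_BYTES = 2 * 10**9        # assumption: 2 GB taken as 2 * 10^9 bytes
STATIONS = 40                  # approximate number of stations in the test
DURATION_SECONDS = 4 * 3600    # four hours measured in the test operation

if __name__ == "__main__":
    total = average_rate_mbit_s(IMAGE_BYTES, STATIONS, DURATION_SECONDS)
    per_station = average_rate_mbit_s(IMAGE_BYTES, 1, DURATION_SECONDS)
    print("aggregate: %.1f Mbit/s, per station: %.2f Mbit/s" % (total, per_station))
```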

Further experiments were carried out with the option of distributing only a station with a ready MS-DOS system, network drivers, and the Norton Ghost program. It was configured so that the program runs automatically at start-up and connects to a prepared session, through which the installation of Windows XP is sent to all of the computers at once by means of multicast. We encountered two problems while testing this procedure:

The final, successful method of distributing virtual stations to the end computers is based on the following scheme:

7   Experimental operation

The experimental operation was carried out with the following configurations:

  1. PC Dell OptiPlex 170L, Intel Celeron 2.8 GHz, 512 MB RAM
  2. Fujitsu Siemens Scenico P320, AMD Athlon 3200+, 512 MB RAM

The operation of the virtual stations was satisfactory in both cases. We also experimented with running two virtual stations in parallel within one host OS, aiming at a higher level of parallelization in the available environment. In this case, the guest operating systems on the virtual stations have at most 128 MB of RAM at their disposal, which leads to heavier use of virtual memory.

The experimental operation was carried out on 40-230 stations. We found that the present configuration of server resources can continuously serve about 50-60 simultaneously working stations. With a higher number of stations, idle periods gradually appear in the processing of database transactions. When the number of working stations reaches 130, the present configuration of the database server is no longer capable of assigning tasks effectively. At the end of 2006 (when the computer labs are closed to the public), we operate 40 stations.
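The effect of the number of stations on the overall output can be estimated with a simple calculation. The following Python sketch rests on assumptions that are not measurements from this report: one URL is taken to need 30 seconds (the report only states "tens of seconds"), the stations are taken to work around the clock, and the server is assumed not to slow them down. The function names are hypothetical.

```python
SECONDS_PER_DAY = 86400


def urls_per_day(stations, seconds_per_url):
    """Estimated number of URLs processed per day by all stations together."""
    if stations < 0 or seconds_per_url <= 0:
        raise ValueError("stations must be >= 0 and seconds_per_url > 0")
    return stations * SECONDS_PER_DAY / seconds_per_url


def days_to_process(total_urls, stations, seconds_per_url):
    """Estimated number of days needed to process total_urls once."""
    return total_urls / urls_per_day(stations, seconds_per_url)


if __name__ == "__main__":
    # Assumed values for illustration only: 30 s per URL, one million URLs.
    for stations in (1, 50):
        days = days_to_process(1000000, stations, 30)
        print("%d station(s): about %.0f days" % (stations, days))
```

Under these assumptions, one station would need almost a year for one million URLs, while 50 stations would need about a week. In practice the lab computers are available only part of the time, so the real figures would be less favourable.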

The number of new addresses increases very fast (we are indexing millions of addresses). It is impossible to cope with such growth without a parallel system. Without an increase in distiller output, it will not be possible to keep up with the further rate of increase (harvesting of RSS streams and podcast addresses).

References

[Dol03] Doležal I., Illich M., Krsek M.: Internet search in multimedia data. Technical report 19/2003, Praha: CESNET, 2003.