Operation of a Distiller System in a Distributed Environment - The DistillerGRID
CESNET
technical report number 1/2007
Also available in PDF, PostScript, and XML formats.
The report is also available in Czech.
Jan Havlíček, Michal Krsek
February 12th, 2007
1 Abstract
We use a distiller system to extract metadata from multimedia files available on the Internet. At the beginning of 2006, we had millions of URLs to process and could not process all of the content in time. We therefore developed a parallel system, DestilatorGRID, which uses computers in university computer labs.
2 Current situation
Within the framework of the Virtual Environment for Cooperation project, we operate a distiller [Dol03] of metadata and still images from multimedia content. The distiller connects to a URL with multimedia content, downloads metadata and thumbnails, and saves both to the database.
The process is quite time-consuming: processing one entry takes tens of seconds on average (depending on the accessibility of the particular files). Ordinary parallelization within one operating system is not effective, because the technologies used do not allow different processes to share resources (e.g. a sound card) effectively.
However, parallelization across several computers is easy, because the processing of each URL is completely independent of the processing of the others. The original idea was to use university computer labs for parallel (more precisely, semi-parallel) data processing when they are not being used for their original purpose, that is, at night, on weekends, and during holidays.
The aim of the work was to design an environment for operating the distiller, together with the requisite software, so that it affects the software installation of the lab computers as little as possible. Another requirement was quick and easy distribution of the final solution to the computers, regardless of the operating system and the management tools used on the end stations. A further problem to be solved was the collection of the gathered data.
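A rough back-of-envelope estimate illustrates why spreading the work over many computers pays off. The sketch below assumes 30 seconds per URL (one value within the "tens of seconds" mentioned above) and a workload of one million URLs. Both numbers and the function name are illustrative assumptions, not measurements from the project.

```python
def processing_days(url_count: int, seconds_per_url: float, stations: int) -> float:
    """Wall-clock days needed when URLs are processed independently on `stations` machines."""
    if stations < 1:
        raise ValueError("at least one station is required")
    total_seconds = url_count * seconds_per_url / stations
    return total_seconds / 86_400  # seconds per day


# Assumed workload: 1,000,000 URLs at 30 s each (illustrative values only).
single = processing_days(1_000_000, 30, stations=1)
lab = processing_days(1_000_000, 30, stations=40)
print(f"1 station: {single:.1f} days, 40 stations: {lab:.1f} days")
# prints "1 station: 347.2 days, 40 stations: 8.7 days"
```

The estimate assumes uninterrupted operation. Lab computers are available only at night, on weekends, and during holidays, so the real elapsed time would be longer.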
3 Description of the environment
We worked according to the following assignment:
- Not to install the distiller environment (the latest versions of .NET Framework, mplayer, QuickTime Player, RealPlayer, and Windows Media Player) into the operating system used for classes and for the students' individual work
- Not to save the gathered data to the local file system, but to send them to central storage by means of a suitable network protocol
- Not to consider the environment of the host operating system confidential
This assignment can be solved quite elegantly through virtualization, that is, by launching an independent installation of the operating system and the requisite software in a virtual environment created with a product such as VMware or Microsoft Virtual PC.
Installing the requisite programs in a virtual environment has several advantages. The virtual machine acts as a sandbox that can be easily replaced, which gives us the following:
- Minimal demands on the installation of additional software on stations that are normally used for other purposes
- Simple updating of the software used
- Relative independence from the hardware of the end stations
- Security of end stations that are normally used for other purposes: if the distilling machine is attacked through malformed content at a processed URL, only the virtual OS in the sandbox is affected. Moreover, on each start this virtual OS is loaded from a clean read-only image stored on the network, so a possible virus infection cannot spread.
4 Testing
At the beginning of the year, we carried out a set of test installations in the VMware and Microsoft Virtual PC environments with the distiller version available at the time. We chose MS Windows XP Professional as the host system. The most important reason was that it is the platform on which the distiller software runs. We also took into account the broad experience with operating this OS at the University of Economics (UoE), where the experiments took place. Finally, the latest versions of the players and other requisite software are available for MS Windows XP Professional.
In the course of testing, it became evident that Microsoft Virtual PC is technically more suitable for our purposes, especially because of its trouble-free access to the sound hardware. We also took into consideration that Microsoft Virtual PC 2004 was released as freeware in the middle of 2006.
Data storage was solved by modifying the distiller so that the gathered data is inserted into an SQL database on a server. Only log files are created on the local machine, and these are relevant only for troubleshooting the distiller.
5 Virtual machine environment
When installing the guest operating system in Virtual PC, we had to solve the following problems, which distinguish the installation from a typical operating system on a workstation:
- The selected user must be logged on to the station automatically, and the requisite processes must run with that user's privileges
- The station must be locked
- Each workstation must have a different name, both to identify the distillers working in parallel in the database and to prevent collisions in URL processing (initially the names are identical, since all of the guest operating systems boot from the same image)
Automatic logon is implemented through the AutoAdminLogon feature of Windows XP and is set up by the following registry modifications:
HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\WinLogon\AutoAdminLogon = 1
HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\WinLogon\DefaultPassword = NTpassword
HKLM\Software\Microsoft\Windows NT\CurrentVersion\Winlogon\Autologon = YES
HKLM\Software\Microsoft\Windows NT\CurrentVersion\Winlogon\AutoLogonCount = %number_of_logons%
In the last value, it is advisable to replace %number_of_logons% with a very large number (hundreds or thousands).
The distilling stations are identified by a name defined in the destilator.exe.config file. Since the names are all identical at first, the easiest solution is to overwrite the name with a random string at start-up. For this purpose we used the following trivial program, written for Windows Script Host:
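' Replace the placeholder station name in the distiller configuration with a random string.
Const ForReading = 1
Const ForWriting = 2
Set objFSO = CreateObject("Scripting.FileSystemObject")
' Two generated temporary file names, concatenated, serve as the random station name.
strTempFile = objFSO.GetTempName & objFSO.GetTempName
' Read the configuration template.
Set objFile = objFSO.OpenTextFile("C:\destilator\destilator2\destilator.exe.config.orig", ForReading)
strText = objFile.ReadAll
objFile.Close
' NAHRAD is the placeholder string in the template.
strNewText = Replace(strText, "NAHRAD", strTempFile)
' Write the live configuration; the third argument creates the file if it does not exist yet.
Set objFile = objFSO.OpenTextFile("C:\destilator\destilator2\destilator.exe.config", ForWriting, True)
objFile.WriteLine strNewText
objFile.Close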
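To illustrate why distinct station names matter for avoiding collisions, the following Python sketch shows one possible way a station could claim a URL in a shared database. The report does not describe the database schema or the claiming mechanism, so the table `urls`, the column `claimed_by`, and both function names are hypothetical, and an in-memory SQLite database stands in for the project's SQL server.

```python
import sqlite3
import uuid


def make_station_name() -> str:
    """Return a random station name (stands in for the random string written to the config)."""
    return "station-" + uuid.uuid4().hex


def claim_next_url(conn: sqlite3.Connection, station: str):
    """Try to claim one unclaimed URL for `station`.

    Returns the URL, or None if nothing is left or another station won the race.
    """
    with conn:  # commit on success, roll back on error
        row = conn.execute(
            "SELECT id, url FROM urls WHERE claimed_by IS NULL ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # Guarded update: it only succeeds while the row is still unclaimed.
        updated = conn.execute(
            "UPDATE urls SET claimed_by = ? WHERE id = ? AND claimed_by IS NULL",
            (station, row[0]),
        )
        return row[1] if updated.rowcount == 1 else None


# Hypothetical schema and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (id INTEGER PRIMARY KEY, url TEXT, claimed_by TEXT)")
conn.executemany(
    "INSERT INTO urls (url) VALUES (?)",
    [("http://example.org/a.avi",), ("http://example.org/b.mp3",)],
)
conn.commit()

first = claim_next_url(conn, make_station_name())
second = claim_next_url(conn, make_station_name())
third = claim_next_url(conn, make_station_name())
print(first, second, third)
# prints "http://example.org/a.avi http://example.org/b.mp3 None"
```

If two stations shared the same name, the `claimed_by` column could no longer tell them apart, which is the situation the random name is meant to prevent.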
6 Distribution on the stations
As the next step, we had to work out how to get the Microsoft Virtual PC installation and the installed and configured virtual machine onto the target computers. In the UoE environment, the installation of Microsoft Virtual PC was distributed through Novell ZENworks, which is commonly used there. Any other program for remote management of end stations can be used instead.
We also had to solve the distribution of the configured virtual stations. A virtual station consists of two files: one holds the definition of the station's properties, the other the emulated hard drive with the installed operating system and applications. For an installation of Windows XP Professional SP2 with updates and all requisite programs, the disk file is 2 GB in size, even with virtual memory switched off.
We found that distributing a 2 GB file to approx. 40 stations in parallel takes hours; in the test run it took four hours.
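The figures above translate into a modest average transfer rate. The short calculation below uses the numbers from the text (2 GB, 40 stations, four hours) and assumes decimal gigabytes and a separate copy sent to every station. The function name is only illustrative.

```python
def aggregate_throughput_mbit(file_gb: float, stations: int, hours: float) -> float:
    """Average aggregate rate in Mbit/s when every station receives its own copy."""
    total_bits = file_gb * 1e9 * 8 * stations  # decimal gigabytes (assumption)
    return total_bits / (hours * 3600) / 1e6


rate = aggregate_throughput_mbit(file_gb=2, stations=40, hours=4)
per_station = rate / 40
print(f"aggregate: {rate:.1f} Mbit/s, per station: {per_station:.2f} Mbit/s")
# prints "aggregate: 44.4 Mbit/s, per station: 1.11 Mbit/s"
```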
Further experiments were carried out with the option of distributing only a station with a prepared MS-DOS system, network drivers, and the Norton Ghost program. It was configured so that the program runs automatically at start-up and connects to a prepared session, through which the Windows XP installation is sent to all of the computers at once by means of multicast. We encountered two problems while testing this procedure:
- Norton Ghost does not work when the virtual station's network is set up as NAT (so that Microsoft Virtual PC performs the address translation)
- Multicast distribution through Norton Ghost does not work when the Client for Microsoft Networks (SMB) is not installed on the host computer
The final, successful method of distributing virtual stations to the end computers is based on the following scheme:
- A virtual station is created and installed with these parameters: 256 MB of RAM, networking through NAT, and the Microsoft Windows XP operating system. The image of the station is placed on a shared network drive that can be read by the end stations (or, more precisely, by the users logged on to them)
- A new virtual station is created whose disk file is of the differencing type, based on the file created in the previous step. This file is quite small (approx. 20 kB) and can easily be copied, along with the definition file, to a large number of computers. A batch file, run e.g. from a login script, then copies these files to the end stations and starts the virtual station.
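The copy step of this scheme can be sketched as follows. The report uses a batch file; the Python version below is a stand-in for illustration only. The file names, the directory layout, and the function `deploy_station` are hypothetical, and the demo uses temporary directories in place of the network share and the end station. Starting Virtual PC itself is not shown, since the launch command depends on the local installation.

```python
import shutil
import tempfile
from pathlib import Path


def deploy_station(definition: Path, diff_disk: Path, local_dir: Path) -> list[Path]:
    """Copy the machine definition and the small differencing disk to the end station."""
    local_dir.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy2(src, local_dir)) for src in (definition, diff_disk)]


# Demo with placeholder files; the 2 GB base image stays on the share and is never copied.
with tempfile.TemporaryDirectory() as tmp:
    share = Path(tmp) / "share"
    share.mkdir()
    (share / "distiller.vmc").write_text("<!-- machine definition placeholder -->")
    (share / "distiller-diff.vhd").write_bytes(b"\0" * 20_000)  # stand-in for the ~20 kB disk
    copied = deploy_station(
        share / "distiller.vmc",
        share / "distiller-diff.vhd",
        Path(tmp) / "local",
    )
    copied_names = [p.name for p in copied]
    print(copied_names)
    # prints "['distiller.vmc', 'distiller-diff.vhd']"
```

Only the two small files travel to each station, which is why this approach avoids the hours-long transfer of the full disk image.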
7 Experimental operation
The experimental operation was carried out with the following configurations:
- PC Dell OptiPlex 170L, Intel Celeron 2.8 GHz, 512 MB RAM
- Fujitsu Siemens Scenico P320, AMD Athlon 3200+, 512 MB RAM
The operation of the virtual stations is satisfactory in both cases. We also experimented with running two virtual stations in parallel within one host OS, aiming for a higher level of parallelization in the available environment. In this case, the guest operating systems on the virtual stations have at most 128 MB of RAM at their disposal, which leads to heavier use of virtual memory.
The experimental operation was carried out on 40-230 stations. We found that the present configuration of server resources can continuously serve about 50-60 simultaneously working stations. With more stations, database transaction processing gradually begins to stall. Once the number of working stations reaches 130, the present configuration of the database server can no longer assign tasks effectively. At the end of 2006 (when the computer labs were closed to the public), we operated 40 stations.
The number of new addresses is increasing very fast (we are indexing millions of addresses), and such growth cannot be handled without a parallel system. Unless the distiller's throughput grows as well, it will not be possible to keep up with further increases (harvesting of RSS feeds and podcast addresses).
References
[Dol03] Doležal I., Illich M., Krsek M.: Internet search in multimedia data. Technical report 19/2003, Praha: CESNET, 2003.