Internet search in multimedia data
CESNET
technical report number 19/2003
Also available in PDF, PostScript, and XML formats, and in Czech.
Ivan Doležal, CESNET, z. s. p. o.
Michal Illich, Jyxo, s. r. o.
Michal Krsek, CESNET, z. s. p. o.
20.11.2003
1 Motivation
The spread of broadband Internet access gives users more opportunities to use advanced forms of multimedia content such as audio and video. To exploit the full potential of this content on the Internet, users need to be able to search for it, which is not yet possible. The aim of our project is to bridge this gap.
As of October 2003, as many as 23,000 audio and video files are accessible under public URLs in the .cz ccTLD.
2 Present situation
Owing to the development of broadband access, large content owners have started to realize the importance of the Internet. However, their portals are designed for users who passively consume a TV program within a single provider. Where a search engine exists, it covers only that one portal.
In contrast to the large media companies, many users publish audio and video material to enrich their own pages. Such material cannot be found with classic search methods.
The situation is similar in the world of the WWW (that is, HTML). For the WWW, however, search engines that find data on the basis of textual information do exist. As far as we know, no such system operates for audio and video search.
The Internet also hosts vast peer-to-peer networks made up of applications designed for sharing data among users. These networks have search mechanisms built in as an integral part of their functionality and are therefore outside the scope of the project.
3 Solution design
There are two methods of searching audio and video files.
The first method compares the content with a search pattern, e.g. a word against an audio recording, a picture against a film, or text against subtitles. At present this kind of search cannot be applied because of the low quality of recordings, the relatively low performance of search algorithms on generally available hardware, and the heterogeneity of the material (a great number of codecs and formats).
The second method searches the metadata, i.e. textual data stored so that they are accessible together with the material itself. On the Internet, most of these data are stored directly in the multimedia files or on the web pages that link to the files. Textual information can be processed in the same way as in fulltext search. A great deal of fulltext search software exists, including packages that are free of charge. We chose to cooperate with the Jyxo fulltext search engine, so the research team did not have to concern itself with operating the system or with the user front-end. Cooperating with a running system also gave us the opportunity to collect a reasonably large volume of material.
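To illustrate how metadata can be treated like ordinary fulltext, the following minimal sketch builds an inverted index over metadata fields and answers conjunctive queries. It is not the Jyxo implementation; the field names (title, author) and the example URLs are assumptions made for the illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(records):
    """Map each token to the set of URLs whose metadata contains it.

    `records` maps a URL to a dict of metadata fields (hypothetical names).
    """
    index = defaultdict(set)
    for url, fields in records.items():
        for value in fields.values():
            for token in tokenize(value):
                index[token].add(url)
    return index


def search(index, query):
    """Return the URLs whose metadata contains every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical metadata records, as a distiller-like step might deliver them.
records = {
    "http://example.cz/a.wmv": {"title": "Hockey final 2003", "author": "Example TV"},
    "http://example.cz/b.rm": {"title": "Evening news", "author": "Example TV"},
}
index = build_index(records)
```

For example, search(index, "hockey") returns only the first URL, while search(index, "example tv") returns both. A production engine would add ranking, stemming and diacritics handling, which this sketch leaves out.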
4 Description of the system
The system consists of the standard components of a fulltext Internet search engine (crawler, indexer, front-end), into which the distiller component is integrated. The distiller extracts metadata from the specified multimedia files. It communicates offline with the other system components through standard protocols (ssh/scp) and interfaces (plain text and XML), so it can easily be integrated into any environment.
The crawler collects the addresses of audio and video files from the pages it retrieves while crawling the web and saves them in a text file, one URL per line. This file is transferred via SCP to the server where the distiller can access it. When processing the file, the distiller goes through the individual URLs and creates XML files from the metadata it finds (see the appendix for the format). The files are placed in the output directory and then transferred, again via SCP, to the system where the indexer is running. The indexer creates an ordinary fulltext database that users can search.
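The file-based hand-off described above can be sketched as follows. This is only an illustration of the data flow, not the actual distiller: the element names (document, url, title, author) are hypothetical and are not taken from the real DTD, extract_metadata is a placeholder for the real extraction step, and the SCP transfers are left out.

```python
import os
import xml.etree.ElementTree as ET


def read_url_list(path):
    """Return the non-empty lines of the crawler's URL file (one URL per line)."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def extract_metadata(url):
    """Placeholder for the real extraction step; returns hypothetical fields."""
    return {"title": "", "author": ""}


def write_record(url, metadata, out_dir, number):
    """Write one XML file describing one URL and return the file's path."""
    root = ET.Element("document")          # hypothetical element names,
    ET.SubElement(root, "url").text = url  # not taken from the real DTD
    for name, value in metadata.items():
        ET.SubElement(root, name).text = value
    path = os.path.join(out_dir, f"{number:06d}.xml")
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
    return path


def distill(url_file, out_dir, extractor=extract_metadata):
    """Process every URL in the list and return the paths of the XML files."""
    os.makedirs(out_dir, exist_ok=True)
    return [
        write_record(url, extractor(url), out_dir, number)
        for number, url in enumerate(read_url_list(url_file), start=1)
    ]
```

Calling distill("urls.txt", "out") would produce out/000001.xml, out/000002.xml and so on, one file per URL, which a separate step could then copy to the indexer host.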
5 Distiller
The distiller is the key component of the system. We abandoned the idea of developing our own decoder because of the need to index the widest possible range of formats and codecs (and because of the rapid development in this field). We tested several single-purpose utilities available on the Internet, but they delivered neither satisfactory data quality nor a stable system.
The final form of the distiller is a Win32 application written in Visual Basic. The application passes individual URLs to ActiveX (OLE) objects that are part of multimedia players (RealOne Player and Windows Media Player). These objects try to open the URL using either the codecs offered by the operating system (Windows Media Player) or the codecs supplied with RealOne Player. The data obtained by comparing the outputs of the objects are then transformed into XML.
The players determine whether the material is accessible and whether its format is correct; if the file cannot be read, the ActiveX object returns an error condition.
Thanks to its open inputs and outputs, the distiller can be plugged into practically any fulltext search engine on the Internet.
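The report does not describe how the outputs of the two players are compared, so the following is only one plausible sketch of that step. It assumes each player yields either a dict of metadata fields or None when its object reported an error; the player names, the field names and the readable_by field are all hypothetical.

```python
def merge_metadata(results):
    """Combine per-player results into a single record.

    `results` maps a player name to a metadata dict, or to None when that
    player reported an error for the URL. Returns None when no player
    could open the material.
    """
    opened = {name: data for name, data in results.items() if data is not None}
    if not opened:
        return None
    merged = {}
    for name in sorted(opened):                # deterministic player order
        for field, value in opened[name].items():
            if value and not merged.get(field):
                merged[field] = value          # first non-empty value wins
    merged["readable_by"] = ",".join(sorted(opened))  # hypothetical field
    return merged
```

A URL that neither player can open yields None and can be skipped, which mirrors the way the players act as the accessibility and format check.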
6 Problems
We have found three problems that partially restrict the use of the system.
The first problem is that the owners of the files hardly ever fill in the metadata. They probably rely on the material being accessible only from their own WWW portal, or they omit the metadata by mistake when publishing. The researchers cannot change this attitude.
The second problem is the low stability of the ActiveX objects when the codec chosen to play the multimedia data encounters a version of the objects it cannot cope with. The instability shows not only as a crash of the player but also as "freezing" of the operating system (we are using Windows 2000). Although the problem affects only a few per mille of the files, it becomes significant when there are tens of thousands of them. It can be worked around by processing a small amount of data at a time and, if necessary, restarting the system; however, such a solution does not allow full automation of the service. We assume that the problem will be partially solved by eventual deployment in a virtual computing environment (Windows Terminal Services, VMware).
The third problem is the restricted amount of information provided by the ActiveX objects and their poor implementation. Through the ActiveX interface, the players offer only a subset of the metadata stored in the multimedia files. In addition, some of the information passed is misleading, for example the data-stream information given by RealOne Player. It is possible to obtain only the total bit rate of all streams in a SureStream file, not the number of streams and their individual bit rates, even though an API for obtaining this information has been described in the documentation for several years. Both players offer a much richer C API; we therefore hope that a new version of the distiller (a console Win32 application written in C++) will fix the problem.
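The idea of processing a small amount of data at a time can be sketched as follows. The sketch is not the distiller's actual mechanism: it assumes that each URL is handled by a separate child process started from a command line (make_command is a hypothetical callback), so that a crash or hang of one extraction is contained by a timeout. It does not protect against a freeze of the whole operating system.

```python
import subprocess
import sys


def run_isolated(command, timeout_s):
    """Run one extraction command in a child process.

    Returns the child's standard output on success, or None when the child
    crashed or ran longer than `timeout_s` seconds (it is killed then).
    """
    try:
        done = subprocess.run(command, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
    return done.stdout if done.returncode == 0 else None


def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_all(urls, make_command, batch_size=100, timeout_s=60):
    """Process the URLs batch by batch and return (results, failed_urls)."""
    results, failed = {}, []
    for batch in batches(urls, batch_size):
        for url in batch:
            output = run_isolated(make_command(url), timeout_s)
            if output is None:
                failed.append(url)      # keep for a later retry or a report
            else:
                results[url] = output
        # A real deployment might restart the players or the host here.
    return results, failed
```

The list of failed URLs makes it possible to exclude problematic files from later runs instead of letting them stall the whole batch.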
7 Evaluation of the project
The result of the project is a functional search engine for multimedia data available on the Czech Internet, running at http://www.jyxo.cz. The result fully meets the planned aims.
The DTD of the documents passed from the distiller to the indexer is at http://prenosy.cesnet.cz/dtd/distiller-0-3.dtd.